
View Full Version : Collect Data via the Web: redux



stanl
09-14-2007, 02:15 AM
The original thread was getting a bit long:bug: To recap my .02: the web site itself is organized into a catalog with URL links set by product, i.e. "Gear", then page, then class/group, then subclass. These links list the items and, if all items do not fit on a single page, offer either a View All or a Next anchor. The original post was specific to snowboards, and the code below will obtain (1) the item name, (2) a URL link to the item itself, and (3) features for the item as extracted from that link. I chose to create an in-memory recordset and persist it to XML as Boards.xml. There is a commented section in the code having to do with navigating to View All - I think I have a handle on that. The code is specific to the IE browser, and should probably be done as HTTP requests to remove that restriction (a rough XMLHTTP sketch follows the code below). In terms of the original OP's overall goals, a lot more effort has to be put into the design. Stan

P.S. The OP wanted the data in Excel, but the XML recordset can easily be ported there; I just wanted to see if this was going in the right direction. I attached the recordset I created in testing - you can unzip it and view it in Notepad.


Sub createlookup()
    'variables used to create the link for specific catalog items
    'these are hard-coded, but could be derived from a lookup
    cItem = "Boards"
    cPage = "295"
    cCat = "7"
    cSub = "49"
    cRef = "/c" & cCat & "/s" & cSub & "/"
    'assign XML recordset file based on Item
    cXML = ActiveWorkbook.Path & "\" & cItem & ".xml"
    If Dir(cXML) <> "" Then Kill cXML
    'create in-memory recordset (200 = adVarChar, 202 = adLongVarWChar)
    Set oRS = CreateObject("Adodb.Recordset")
    oRS.Fields.Append "Item", 200, 100
    oRS.Fields.Append "Url", 200, 200
    oRS.Fields.Append "Info", 202, 536870910
    oRS.Open
    cURL = "http://www.backcountry.com/store/group/" & cPage & "/c" & cCat & "/s" & cSub & "/"
    Set oIE = CreateObject("InternetExplorer.Application")
    oIE.Visible = True
    oIE.navigate cURL
    cLinks = ""
    'navigate to first page and wait for it to load
    While oIE.Busy Or oIE.readyState <> 4
        DoEvents
    Wend
    While oIE.Document.readyState <> "complete"
        DoEvents
    Wend
    '///////////////////////////////////////////////////////////
    'cannot get this next section to work
    'it will click and reload the page
    'but the recordset will be blank
    'uncomment and see for yourself
    'For Each S In oIE.Document.Body.GetElementsByTagName("A")
    '    If UCase(S.innerText) = "VIEW ALL" Then
    '        S.Click
    '        Exit For
    '    End If
    'Next
    'While oIE.Busy Or oIE.readyState <> 4
    '    DoEvents
    'Wend
    'While oIE.Document.readyState <> "complete"
    '    DoEvents
    'Wend
    '/////////////////////////////////////////////////////////////////
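    'an added, untested guess at the blank recordset: after S.Click
    'the Busy flag may not be set yet when the wait loops start, so
    'they can fall straight through and the link loop below then runs
    'against the old page. A short pause before the waits, e.g.
    'Application.Wait Now + TimeValue("0:00:02")
    'followed by the same two wait loops, might cure it.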

    'add specific item urls to the recordset
    For Each S In oIE.Document.Body.GetElementsByTagName("A")
        If InStr(S.href, cRef) > 0 And S.GetAttribute("alt") <> "" Then
            If InStr(S.href, "#rev") = 0 Then
                If InStr(S.href, "id=") > 0 Then
                    If InStr(cLinks, S.href) = 0 Then
                        cLinks = cLinks & "/" & S.href
                        oRS.AddNew
                        oRS.Collect("Item") = S.GetAttribute("alt")
                        oRS.Collect("Url") = S.href
                        oRS.Update
                    End If
                End If
            End If
        End If
    Next
    'revisit each item url and capture its feature table
    If oRS.RecordCount > 0 Then
        oRS.MoveFirst
        While Not oRS.EOF
            oIE.navigate oRS.Collect("Url")
            While oIE.Busy Or oIE.readyState <> 4
                DoEvents
            Wend
            While oIE.Document.readyState <> "complete"
                DoEvents
            Wend
            oRS.Collect("Info") = oIE.Document.Body.GetElementsByTagName("TABLE").Item(2).innerText
            oRS.Update
            oRS.MoveNext
        Wend
    End If
    oIE.Quit
    Set oIE = Nothing
    'persist recordset as XML (1 = adPersistXML) and close it
    oRS.Save cXML, 1
    oRS.Close
    Set oRS = Nothing
End Sub
Sub webscrape()
    'reopen the persisted XML recordset and show its record count
    cItem = "Boards"
    cXML = ActiveWorkbook.Path & "\" & cItem & ".xml"
    If Dir(cXML) = "" Then Exit Sub
    Set oRS = CreateObject("Adodb.Recordset")
    '1 = adOpenKeyset, 4 = adLockBatchOptimistic, 256 = adCmdFile
    oRS.Open cXML, "Provider=MSPersist;", 1, 4, 256
    MsgBox oRS.RecordCount
    oRS.Close
    Set oRS = Nothing
End Sub
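
P.P.S. For anyone who wants to remove the IE dependence mentioned above, the HTTP-request version might start out something like this. This is a minimal sketch only, assuming MSXML2.XMLHTTP is available - the response text would still need parsing (e.g. with a regex or the HTMLFile object):

Sub httpfetch()
    'minimal sketch - fetch a catalog page without automating IE
    'assumes MSXML2.XMLHTTP is installed; untested against the site
    Dim oHTTP As Object, cHTML As String
    Set oHTTP = CreateObject("MSXML2.XMLHTTP")
    oHTTP.Open "GET", "http://www.backcountry.com/store/group/295/c7/s49/", False
    oHTTP.send
    If oHTTP.Status = 200 Then
        cHTML = oHTTP.responseText
        'parse cHTML here for the item links and features
    End If
    Set oHTTP = Nothing
End Sub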

stanl
09-14-2007, 08:39 AM
A couple of final notes: you can replace

For Each S In oIE.Document.Body.GetElementsByTagName("A")
    If InStr(S.href, cRef) > 0 And S.GetAttribute("alt") <> "" Then
        If InStr(S.href, "#rev") = 0 Then
            If InStr(S.href, "id=") > 0 Then
                If InStr(cLinks, S.href) = 0 Then
                    cLinks = cLinks & "/" & S.href
                    oRS.AddNew
                    oRS.Collect("Item") = S.GetAttribute("alt")
                    oRS.Collect("Url") = S.href
                    oRS.Update
                End If
            End If
        End If
    End If
Next


with


For Each S In oIE.Document.Body.GetElementsByTagName("A")
    If S.GetAttribute("alt") <> "" Then
        cAlt = Replace(S.GetAttribute("alt"), " ", "-")
        If InStr(S.href, cAlt) > 0 Then
            If InStr(cLinks, S.href) = 0 Then
                cLinks = cLinks & "/" & S.href
                oRS.AddNew
                oRS.Collect("Item") = Replace(cAlt, "-", " ")
                oRS.Collect("Url") = S.href
                oRS.Update
            End If
        End If
    End If
Next


in the createlookup code - it is better suited for processing more items than fit on a single page.

Also, I began thinking about drilling down to individual items via hrefs on specific pages. It appears a good starting place is

http://www.backcountry.com/store/

and what you want is a collection of all the \subcat\ links... which in turn reference other subcategories, all the way down to the individual items. Seems like a good place for a regex. The following code uses a regex to obtain the relevant subcat links. Not great, but it's a start {again, just creating an xml recordset}. Stan


Sub regexparse()
    'collect the /store/subcat/ links from the top catalog page
    Set regex = CreateObject("vbscript.regexp")
    'match subcat urls ending in one or two digits plus a slash
    regex.Pattern = "(http://www.backcountry.com/store/subcat/)(\d|\d\d)(/)"
    regex.IgnoreCase = False
    cURL = "http://www.backcountry.com/store/"
    Set oIE = CreateObject("InternetExplorer.Application")
    oIE.Visible = True
    oIE.navigate cURL
    cXML = ActiveWorkbook.Path & "\storelinks.xml"
    If Dir(cXML) <> "" Then Kill cXML
    Set oRS = CreateObject("Adodb.Recordset")
    oRS.Fields.Append "level", 200, 30
    oRS.Fields.Append "Url", 200, 200
    oRS.Fields.Append "Item", 200, 100
    oRS.Open
    cLevel = "store"
    While oIE.Busy Or oIE.readyState <> 4
        DoEvents
    Wend
    While oIE.Document.readyState <> "complete"
        DoEvents
    Wend
    For Each S In oIE.Document.Body.GetElementsByTagName("A")
        If regex.test(S.href) Then
            oRS.AddNew
            oRS.Collect("level") = cLevel
            oRS.Collect("Url") = S.href
            If S.innerText <> "" Then oRS.Collect("Item") = S.innerText
            oRS.Update
        End If
    Next
    oRS.Save cXML, 1
    oRS.Close
    Set oRS = Nothing
    oIE.Quit
    Set oIE = Nothing
    Set regex = Nothing
End Sub
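
The drill-down itself is not shown above; one way to go {a rough sketch only - it reuses the same regex, ignores paging, and leaves the append as homework} would be to re-walk each saved subcat link for the next level:

Sub drilldown()
    'rough sketch - revisit each saved subcat link and look for the
    'next level of subcat links; the AddNew is left as a placeholder
    Dim oRS As Object, oIE As Object, regex As Object, S, cParent, cXML As String
    cXML = ActiveWorkbook.Path & "\storelinks.xml"
    If Dir(cXML) = "" Then Exit Sub
    Set regex = CreateObject("vbscript.regexp")
    regex.Pattern = "(http://www.backcountry.com/store/subcat/)(\d|\d\d)(/)"
    Set oRS = CreateObject("Adodb.Recordset")
    oRS.Open cXML, "Provider=MSPersist;", 1, 4, 256
    Set oIE = CreateObject("InternetExplorer.Application")
    oRS.MoveFirst
    Do While Not oRS.EOF
        cParent = oRS.Collect("Url")
        oIE.navigate cParent
        While oIE.Busy Or oIE.readyState <> 4
            DoEvents
        Wend
        For Each S In oIE.Document.Body.GetElementsByTagName("A")
            If regex.test(S.href) And S.href <> cParent Then
                'AddNew here with level = cParent, then re-Save {homework}
            End If
        Next
        oRS.MoveNext
    Loop
    oRS.Close
    Set oRS = Nothing
    oIE.Quit
    Set oIE = Nothing
End Sub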

Norie
09-14-2007, 09:51 AM
Stan

Do you really need to open every link for this?

Here's an example of some (not particularly sophisticated) code for getting all the links from a page into a text file.

Sub Test()
    Dim lnk
    Set ie = CreateObject("InternetExplorer.Application")
    With ie
        .Visible = True
        .Navigate "http://www.nu.nl/"
        Do Until .ReadyState = 4: DoEvents: Loop

        Set doc = ie.Document

        Open "C:\TestHTML.txt" For Output As #1
        For Each lnk In doc.Links
            Print #1, lnk
        Next lnk
        Close #1
    End With
End Sub

stanl
09-14-2007, 10:35 AM
Stan
Do you really need to open every link for this?


I don't think that was the point. I only re-opened each valid item url to get the descriptive data, and to obtain 'valid' urls from the initial page I needed to massage the hrefs. You might want to look at the xml to get a better picture.

Norie
09-14-2007, 10:40 AM
Stan

I've tried looking at it but it just doesn't seem to be working for me.

Tried opening it in Notepad, Word, Excel 2007 etc.

In Excel 2007 all I seem to get is hyperlinks and the names of the boards.

Mind you it's probably me doing something wrong.:oops:

As to what the OP actually wanted to do, it's still not clear to me.

But then again that's probably just me.:)

stanl
09-14-2007, 12:08 PM
Stan
I've tried looking at it but it just doesn't seem to be working for me.


Norie;

It's just an ADO recordset, so it might look a little strange in Notepad due to the z:row configuration. Try placing the attached workbook in the same directory as the xml file, then run the show macro. My aim from the get-go was just to illustrate one methodology for web-scraping a site such as the one Doug referenced. As you can see, I use a lot of late binding and explicit constants... as I normally write my code outside the VBE and run it as separate compiled exes. Eventually Doug will need a relational structure of tables where

store
  ----> subcategory
    ----> sub-subcategory
      ----> Item

these will be a series of keys and urls, pointing to different sections, each sending different parameters to various code modules. A rough sketch of what that might look like follows.
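
Something like this, perhaps - a minimal sketch only; the table and field names are placeholders, and it assumes an empty catalog.mdb already exists in the workbook's path:

Sub createstructure()
    'minimal sketch - placeholder names, not a final design
    'assumes an empty catalog.mdb already exists in the workbook's path
    Dim oConn As Object
    Set oConn = CreateObject("ADODB.Connection")
    oConn.Open "Provider=MicroSoft.Jet.OLEDB.4.0;Data Source=" & _
               ActiveWorkbook.Path & "\catalog.mdb;"
    'one self-referencing table covers store/subcategory/sub-subcategory
    oConn.Execute "CREATE TABLE Categories (CatID COUNTER PRIMARY KEY, " & _
                  "ParentID LONG, Name TEXT(100), Url TEXT(200))"
    oConn.Execute "CREATE TABLE CatItems (ItemID COUNTER PRIMARY KEY, " & _
                  "CatID LONG, Item TEXT(100), Url TEXT(200), Info MEMO)"
    oConn.Close
    Set oConn = Nothing
End Sub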

Stan

Norie
09-14-2007, 12:17 PM
Stan

Seem to have got it now.

I actually thought all the Length etc stuff was something to do with the XML.:oops:

I can also see what you are getting at - the data needs to be parsed somehow.

Personally, in the past I've just relied on using the HTML/DHTML object model to, say, extract a particular table and place the data on the worksheet.

Perhaps not the best approach but it seems to work.
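
For example, something along these lines - just a rough sketch, with the document and table index passed in:

Sub TableToSheet(doc As Object, idx As Long)
    'rough sketch - copy one HTML table's cells onto the active sheet
    Dim tbl As Object, i As Long, j As Long
    Set tbl = doc.getElementsByTagName("TABLE").Item(idx)
    For i = 0 To tbl.Rows.Length - 1
        For j = 0 To tbl.Rows(i).Cells.Length - 1
            ActiveSheet.Cells(i + 1, j + 1).Value = tbl.Rows(i).Cells(j).innerText
        Next j
    Next i
End Sub

You'd call it with something like TableToSheet ie.Document, 2 once the page has loaded.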

Mind you, I've seen posts/threads elsewhere where it was recommended that the complete source code be read in and then parsed.

Doesn't seem like a good idea to me.:dunno

stanl
09-14-2007, 01:04 PM
Mind you, I've seen posts/threads elsewhere where it was recommended that the complete source code be read in and then parsed.

Doesn't seem like a good idea to me.:dunno

You'll have to tell that to the people who came up with RSS feeds:wink:

Norie
09-14-2007, 01:30 PM
Stan

If I knew what an RSS feed was I might.:)

Only joking, but all this new stuff is starting to leave me behind.

I mean I still get the arguments for InStr the wrong way round every time I use it.

stanl
09-14-2007, 01:52 PM
Stan

If I knew what an RSS feed was I might.:)

Only joking, but all this new stuff is starting to leave me behind.


Check out the SQL forum; I posted a Logparser script that does the Yahoo news RSS feed, and in another thread there is HTTP code in a workbook that performs a Google translate.

YellowLabPro
09-15-2007, 08:03 AM
Good day Stan,
I downloaded the two files, boards.zip and getboardsxml.xls, copied and placed the code in the getboardsxml.xls book, and ran it.
It loaded and cycled through several pages on the target address, but that is the only thing I see that occurred. Was I supposed to get any download info?
I did not uncomment or replace the section you remarked about - I wanted to follow along w/ your logic.

Thanks for delving and running w/ this project

Best Regards,

Doug

stanl
09-15-2007, 09:04 AM
I obviously got ahead of myself. You should run the createlookup() sub from a workbook, and it will create the xml file. A lot of the rest of the stuff was for Norie, or me just trying to answer the larger issue of parsing the entire site - i.e. the code as is gets snowboards... what if we wanted parkas, or women's underwear?

Createlookup() was intended to permit traversing the website, identifying items by their urls, and capturing that information in a recordset. How you want things ultimately displayed or parsed is up to you, and you need to provide a template.

Do you want to construct a workbook with a sheet for each item displaying the item's features [maybe a picture] - one per sheet... or an Excel database, or an Access table with features encoded as HTML and the picture as a binary/OLE field?

YellowLabPro
09-15-2007, 11:30 AM
Stan,
Just got back, will dive into this this afternoon. Parkas are good, but I don't wear women's underwear regardless of what anyone here says....

What I ultimately would like to do is provide a list of desired items and find size, spec information, and descriptions; retrieving pictures might actually be very helpful too.

stanl
09-16-2007, 07:13 AM
What I ultimately would like to do is provide a list of desired items and find size, spec information, and descriptions; retrieving pictures might actually be very helpful too.

Consider this the final cut. XML recordsets don't generate much interest in this forum, so I wrote a sub to parse the data into an Access table, and included pictures. I have attached the zipped .mdb with 1 row (the first snowboard), as the entire file is ~1.8 meg. The following sub will open the table and create an Excel worksheet based on the info in the data. You will note I made a comment about using the clipboard for two fields, since it would parse the data into separate rows instead of a text block... but consider that homework.

If this is in the ballpark of the type of output you want, I will post the sub to create the table for all snowboards. Again, this is just my method for web-scraping - collect all links, images, and detail in a database and generate worksheets locally as needed. Make sure you unzip the .mdb and place it in the same subdir as the Excel file you put the code into.

Stan



Sub createitem()
    cPath = ActiveWorkbook.Path
    cMDB = cPath & "\wtest.mdb"
    If Dir(cMDB) = "" Then Exit Sub
    cProv = "MicroSoft.Jet.OLEDB.4.0"
    cConn = "Provider=" & cProv & ";Data Source=" & cMDB & ";"
    Set oRS = CreateObject("Adodb.Recordset")
    '1 = adOpenKeyset, 3 = adLockOptimistic, 2 = adCmdTable
    oRS.Open "Items", cConn, 1, 3, 2
    oRS.MoveFirst
    'create a one-sheet workbook for the item
    nOld = Application.SheetsInNewWorkbook
    Application.SheetsInNewWorkbook = 1
    Application.Workbooks.Add
    Application.SheetsInNewWorkbook = nOld
    Set oWS = Application.ActiveWorkbook.Worksheets(1)
    'sheet names are limited in length, so truncate
    cName = oRS.Collect("Item")
    If Len(cName) > 30 Then cName = Mid(cName, 1, 30)
    oWS.Name = cName
    oWS.Activate
    Application.ActiveWindow.DisplayGridlines = False
    oWS.Range("A1").Value = oRS.Collect("Item")
    oWS.Range("A2").Value = oRS.Collect("Price")
    'with these next 2 it would be better to copy
    'the field data to the clipboard then paste
    'to the range
    oWS.Range("A4").Value = oRS.Collect("Details")
    oWS.Range("A16").Value = oRS.Collect("Descrip")
    oWS.Columns("A:A").Select
    Application.Selection.ColumnWidth = 54.14
    Application.Selection.WrapText = True
    'write the stored image out to a jpg (1 = adTypeBinary,
    '2 = adSaveCreateOverWrite) and insert it on the sheet
    Set oS = CreateObject("ADODB.Stream")
    oS.Type = 1
    oS.Open
    oS.Write oRS.Collect("img")
    oS.Position = 0
    cPict = cPath & "\" & "test.jpg"
    oS.SaveToFile cPict, 2
    oS.Close
    Set oS = Nothing
    oWS.Range("C1").Select
    oWS.Pictures.Insert(cPict).Select
    Application.ScreenUpdating = True
    Set oWS = Nothing
    oRS.Close
    Set oRS = Nothing
End Sub

YellowLabPro
09-16-2007, 09:41 AM
Thanks Stan,
I just downloaded and launched the db file in Access.

Coincidentally, I am reading about Access now; I started on it earlier this week for a different project. What I am saying is, "this is new to me, very new".

I installed the latest Sub createitem() in the "gear.xls" workbook and put the .mdb file in the WebScraping directory on the desktop.
Running the sub pulled the record that was in the .mdb file into a newly created worksheet, which loaded up the specs, description, and picture in a new workbook.
This looks Fantastic so far!

You write that I have some homework - the two text fields. I am thinking maybe you mean what I will want to do w/ it from here, but we can handle that a little later, I am thinking...

I see what you mean now regarding placing each item in its own sheet or collecting them together.

I would opt for all similar related products to be cataloged on the same sheet - all snowboards on one or many sheets, depending on how many we pull down.

Ok- now that I have done that... what's next? Do you need to send me the sub for all snowboards you wrote?

Stan- hope I don't come off as a complete rock- trying to follow along here....

Thanks for all your help!

Doug

stanl
09-16-2007, 11:02 AM
You write that I have some homework - the two text fields.

No; either you, or maybe one of the gurus here, should enhance the sub with API calls so the Details and Descrip fields can be cut/pasted via the clipboard rather than assigned from the table. As it is, the features are all one block of text in Range("A4")... if the data were pasted in, it would separate into rows based on the vbCrLfs in the text block. I find working w/ the clipboard in VBA challenging, and prefer using C functions from WinBatch. I'll get the routine to create the table posted in a bit. Stan
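
That said, a clipboard-free alternative would be to split the memo text on the vbCrLfs yourself - a minimal sketch only {the sub and parameter names are just placeholders}:

Sub MemoToRows(oWS As Object, cText As String, nStartRow As Long)
    'sketch - split a memo block on vbCrLf and write one line per row,
    'mimicking what a clipboard paste would do
    Dim aLines, i As Long
    aLines = Split(cText, vbCrLf)
    For i = 0 To UBound(aLines)
        oWS.Cells(nStartRow + i, 1).Value = aLines(i)
    Next i
End Sub

In createitem() that would be called as, e.g., MemoToRows oWS, oRS.Collect("Details"), 4 in place of the Range("A4") assignment.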

YellowLabPro
09-16-2007, 11:04 AM
Thanks Stan!

Norie
09-16-2007, 01:25 PM
Stan

Is API really necessary?

Couldn't the HTML/DHTML object model be exploited?

That's what I did in post 3, using the links collection.

Admittedly it was a fairly simplistic example.:)

stanl
09-16-2007, 04:09 PM
Stan
Is API really necessary?
Couldn't the HTML/DHTML object model be exploited?


That had nothing to do with DHTML - it was concerned with formatting text extracted from memo fields into Excel cells. But there are certainly other methods to get from point A to point B. Here is the output from my WB code, which used the clipboard for the Details and Descrip fields. Compare it to what is produced by my previous post. Stan

YellowLabPro
09-16-2007, 05:09 PM
Stan,
Both look great, really. The second one is nice w/out the dangling characters at the end of the text strings, and with the data in individual cells. It is also nice in that I can do an InStr search to extract data if need be, where the other one may not allow that, or possibly not as easily.
Should I be doing anything to help w/ this at this point?

stanl
09-17-2007, 03:59 AM
Should I be doing anything to help w/ this at this point?

Several things:

1. modify the sub to use clipboard functions for the two memo fields.

2. if you ran the code to build the Access table with all snowboards, the code to build the Excel files has to be modified to process a Do While Not oRS.EOF loop, adding a new sheet and formatting it for each snowboard {even better, use the first sheet as a Table of Contents} - see the sketch below

3. are you ultimately doing this for payment or for your own edification?