Consulting

Page 1 of 2 1 2 LastLast
Results 1 to 20 of 21

Thread: Collect Data via the Web: redux

  1. #1
    VBAX Master stanl's Avatar
    Joined
    Jan 2005
    Posts
    1,141
    Location

    Collect Data via the Web: redux

    The original thread was getting a bit long To recap my .02, the web site itself is organized into a catalog with url links set by Product, i.e. "Gear" then page, class/group, then subclass. These links then list the items and if all items do not fit on a single page offer either a View All or Next anchor. The original post was specific to snowboards and the code below will obtain (1) The item name (2) a url link to the item itself (3) features for the item as extracted from the url link. I chose to create an in-memory recordset and persist it to XML as Boards.xml. There is a commented section in the code having to do with navigating to View All - I think I have a handle on that. The code is specific to the IE Browser, and should probably be done as HTTP requests to remove that restriction. In terms of the original OPS overall goals a lot more effort has to be put into the design. Stan

    P.S. the OP wanted the data in Excel, but the XML recordset can easily be ported there, just wanted to see if this was going in the right direction. I attached the recordset I created in testing, you can unzip it and view in Notepad.

    [vba]
    Sub createlookup()
    'variables use to create link for specific catalog items
    'these are hard-coded, but could be derived from a lookup
    cItem = "Boards"
    cPage = "295"
    cCat = "7"
    cSub = "49"
    cRef = "/c" & cCat & "/s" & cSub & "/"
    'assign XML recordset based on Item
    cXML = ActiveWorkbook.Path & "\" & cItem & ".xml"
    If Dir(cXML) <> "" Then Kill (cXML)
    'create in-memory recordset
    Set oRS = CreateObject("Adodb.Recordset")
    oRS.Fields.Append "Item", 200, 100
    oRS.Fields.Append "Url", 200, 200
    oRS.Fields.Append "Info", 202, 536870910
    oRS.Open
    cURL = "http://www.backcountry.com/store/group/" & cPage & "/c" & cCat & "/s" & cSub & "/"
    Set oIE = CreateObject("InternetExplorer.Application")
    oIE.Visible = True
    oIE.navigate cURL
    cLinks = ""
    'Navigate to first page
    While oIE.Busy Or oIE.readyState <> 4
    DoEvents
    Wend
    While oIE.Document.readyState <> "complete"
    DoEvents
    Wend
    '///////////////////////////////////////////////////////////
    'cannot get this next section to work
    'it will click and reload the page
    'but the recordset will be blank
    'uncomment and see for yourself
    'For Each S In oIE.Document.Body.GetElementsByTagName("A")
    ' If UCase(S.innerText) = "VIEW ALL" Then
    ' S.Click
    ' Exit For
    ' End If
    'Next
    'While oIE.Busy Or oIE.readyState <> 4
    ' DoEvents
    'Wend
    'While oIE.Document.readyState <> "complete"
    ' DoEvents
    'Wend
    '/////////////////////////////////////////////////////////////////

    'add specific item urls to recordset
    For Each S In oIE.Document.Body.GetElementsByTagName("A")
    If InStr(S.href, cRef) And S.GetAttribute("alt") <> "" Then
    If InStr(S.href, "#rev") = 0 Then
    If InStr(S.href, "id=") Then
    If InStr(cLinks, S.href) = 0 Then
    cLinks = cLinks & "/" & S.href
    oRS.Addnew
    oRS.Collect("Item") = S.GetAttribute("alt")
    oRS.Collect("Url") = S.href
    oRS.Update
    End If
    End If
    End If
    End If
    Next
    If oRS.RecordCount > 0 Then
    oRS.MoveFirst
    While Not oRS.EOF
    oIE.navigate (oRS.Collect("Url"))
    While oIE.Busy Or oIE.readyState <> 4
    DoEvents
    Wend
    While oIE.Document.readyState <> "complete"
    DoEvents
    Wend
    oRS.Collect("Info") = oIE.Document.Body.GetElementsByTagName("TABLE").Item(2).innerText
    oRS.Update
    oRS.MoveNext
    Wend
    End If
    oIE.Quit
    Set oIE = Nothing
    'persist recordset as XML and close it
    oRS.Save cXML, 1
    oRS.Close
    Set oRS = Nothing
    End Sub
    Sub webscrape()
    cItem = "Boards"
    cXML = ActiveWorkbook.Path & "\" & cItem & ".xml"
    If Dir(cXML) = "" Then Exit Sub
    Set oRS = CreateObject("Adodb.Recordset")
    oRS.Open cXML, "Provider=MSPersist;", 1, 4, 256
    MsgBox oRS.RecordCount
    oRS.Close
    Set oRS = Nothing
    End Sub
    [/vba]

  2. #2
    VBAX Master stanl's Avatar
    Joined
    Jan 2005
    Posts
    1,141
    Location
    a couple of final notes, you can replace

    [vba]For Each S In oIE.Document.Body.GetElementsByTagName("A")
    If InStr(S.href, cRef) And S.GetAttribute("alt") <> "" Then
    If InStr(S.href, "#rev") = 0 Then
    If InStr(S.href, "id=") Then
    If InStr(cLinks, S.href) = 0 Then
    cLinks = cLinks & "/" & S.href
    oRS.Addnew
    oRS.Collect("Item") = S.GetAttribute("alt")
    oRS.Collect("Url") = S.href
    oRS.Update
    End If
    End If
    End If
    End If
    Next
    [/vba]

    with

    [vba]
    For Each S In oIE.Document.Body.GetElementsByTagName("A")
    If S.GetAttribute("alt") <> "" Then
    cAlt = Replace(S.GetAttribute("alt"), " ", "-")
    If InStr(S.href, cAlt) Then
    If InStr(cLinks, S.href) = 0 Then
    cLinks = cLinks & "/" & S.href
    oRS.Addnew
    oRS.Collect("Item") = Replace(cAlt, "-", " ")
    oRS.Collect("Url") = S.href
    oRS.Update
    End If
    End If
    End If
    Next
    [/vba]

    in the createlookup code - ir is better suited for handling processing more items than fit on a single page.

    Also, began thinking about drilling down to individual items via hrefs on specific pages. It appears a good starting place is

    http://www.backcountry.com/store/

    and what you want is a collection of all the \subcat\ links... which in turn reference other subcategories all the way down to the individual items. Seems like a good place for regex. The following code uses regex to obtain relevant subcat links. Not great, but it's a start {again just creating an xml recordset}. Stan

    [vba]
    Sub regexparse()
    Set regex = CreateObject("vbscript.regexp")
    regex.Pattern = "(http://www.backcountry.com/store/subcat/)(\d|\d\d)(/)"
    regex.IgnoreCase = False
    cURL = "http://www.backcountry.com/store/"
    Set oIE = CreateObject("InternetExplorer.Application")
    oIE.Visible = True
    oIE.navigate cURL
    cXML = ActiveWorkbook.Path & "\storelinks.xml"
    If Dir(cXML) <> "" Then Kill (cXML)
    Set oRS = CreateObject("Adodb.Recordset")
    oRS.Fields.Append "level", 200, 30
    oRS.Fields.Append "Url", 200, 200
    oRS.Fields.Append "Item", 200, 100
    oRS.Open
    cLevel = "store"
    While oIE.Busy Or oIE.readyState <> 4
    DoEvents
    Wend
    While oIE.Document.readyState <> "complete"
    DoEvents
    Wend
    For Each S In oIE.Document.Body.GetElementsByTagName("A")
    If regex.test(S.href) Then
    oRS.Addnew
    oRS.Collect("level") = cLevel
    oRS.Collect("Url") = S.href
    If S.GetAttribute("innertext") <> "" Then oRS.Collect("Item") = S.innerText
    oRS.Update
    End If
    Next
    oRS.Save cXML, 1
    oRS.Close
    Set oRS = Nothing
    oIE.Quit
    Set oIE = Nothing
    Set regex = Nothing
    End Sub

    [/vba]

  3. #3
    VBAX Master Norie's Avatar
    Joined
    Jan 2005
    Location
    Stirling, Scotland
    Posts
    1,831
    Location
    Stan

    Do you really need to open every link for this?

    Here's an example of, not particularly sophisticated, code for getting all the links from a page into a text file.
    [vba]
    Sub Test()
    Dim lnk
    Set ie = CreateObject("InternetExplorer.Application")
    With ie
    .Visible = True
    .Navigate "http://www.nu.nl/"
    Do Until .ReadyState = 4: DoEvents: Loop

    Set doc = ie.Document

    Open "C:\TestHTML.txt" For Output As #1
    For Each lnk In doc.Links
    Print #1, lnk
    Next lnk
    Close #1
    End With

    End Sub[/vba]

  4. #4
    VBAX Master stanl's Avatar
    Joined
    Jan 2005
    Posts
    1,141
    Location
    Quote Originally Posted by Norie
    Stan
    Do you really need to open every link for this?
    I don't think that was the point. I only re-opened each valid item url to get the descriptive data. And to obtain 'valid' urls from the initial page I needed to massage the hrefs. You might want to look at the xml to get a better picture.

  5. #5
    VBAX Master Norie's Avatar
    Joined
    Jan 2005
    Location
    Stirling, Scotland
    Posts
    1,831
    Location
    Stan

    I've tried looking at it but it just doesn't seem to be working for me.

    Tried opening in Notepad, Word, Excel 2007 etc

    In Excel 2007 all I seem to get is hyperlinks and the names of the boards.

    Mind you it's probably me doing something wrong.

    As to what the OP wanted to actually do, it's still not clear to me.

    But then again that's probably just me.

  6. #6
    VBAX Master stanl's Avatar
    Joined
    Jan 2005
    Posts
    1,141
    Location
    Quote Originally Posted by Norie
    Stan
    I've tried looking at it but it just doesn't seem to be working for me.
    Norie;

    It's just an ADO recordset, so it might look a little strange in Notepad due to the z:row configuration. Try placing the attached workbook in the same directory as the xml file, then run the show macro. My aim from the getgo was just to illustrate one methodology for web-scraping with a site such as Doug referenced. As you can see, I use a lot of late-binding and explicit constants... as normally I write my code outside the VBE, and run as separate compiled exes. Eventually Doug will need a relational structure of tables where

    store
    -----> subcategory
    ---->sub-subcategory
    -----> Item

    these will be a series of keys and urls, pointing to different sections, each sending different parameters to various code modules.

    Stan

  7. #7
    VBAX Master Norie's Avatar
    Joined
    Jan 2005
    Location
    Stirling, Scotland
    Posts
    1,831
    Location
    Stan

    Seem to have got it now.

    I actually thought all the Length etc stuff was something to do with the XML.

    I can also see what you are getting at - the data needs to be parsed somehow.

    Personally in the past I've just relied on using the HTML/DTHML object to say extract a particular table and place the data on the worksheet.

    Perhaps not the best approach but it seems to work.

    Mind you I've seen posts/threads elsewhere where it was recommended that the complete source code was read in to wherever and then you parsed that.

    Doesn't seem like a good idea to me.

  8. #8
    VBAX Master stanl's Avatar
    Joined
    Jan 2005
    Posts
    1,141
    Location
    Quote Originally Posted by Norie
    Mind you I've seen posts/threads elsewhere where it was recommended that the complete source code was read in to wherever and then you parsed that.

    Doesn't seem like a good idea to me.
    You'll have to tell that to the people who came up with RSS feeds

  9. #9
    VBAX Master Norie's Avatar
    Joined
    Jan 2005
    Location
    Stirling, Scotland
    Posts
    1,831
    Location
    Stan

    If I knew what an RSS feed was I might.

    Only joking, but all this new stuff is starting to leave me behind.

    I mean I still get the arguments for InStr the wrong way round every time I use it.

  10. #10
    VBAX Master stanl's Avatar
    Joined
    Jan 2005
    Posts
    1,141
    Location
    Quote Originally Posted by Norie
    Stan

    If I knew what an RSS feed was I might.

    Only joking, but all this new stuff is starting to leave me behind.
    check out the SQL forum; I posted a Logparser script that does the Yahoo news RSS feed; and in another thread there is HTTP code in a workbook that performs a google.translate.

  11. #11
    Good day Stan,
    I downloaded the two files, boards.zip and getboardsxml.xls, copied and placed the code in getboardsxml.xls book, ran it.
    It loaded and cycled through several pages on the target address. But that is the only thing I see that occurred. Was I supposed to get any download info?
    I did not uncomment, nor replace the section you remarked about- I wanted to follow along w/ your logic.

    Thanks for delving and running w/ this project

    Best Regards,

    Doug
    my site: www.ecboardco.com
    was built w/ a majority of the assistance from the board members here... thanks VBAX.

    Just because I see something, doesn't mean that what's actually happening is what I see.

    You don't get from 0-90 by standing still!

  12. #12
    VBAX Master stanl's Avatar
    Joined
    Jan 2005
    Posts
    1,141
    Location
    I obviously got ahead of myself. You should run the createlookup() sub from a workbook, and it will create the xml file. A lot of the rest of the stuff was for Norie, or me just trying to answer the larger issue of parsing the entire site, i.e. code as is gets snowboards... what if we wanted Parkas, or women's underwear????

    Createlookup() was intended to permit traversing the website, identifying items/by their urls and capturing that information in a recordset. How you want things ultimately displayed or parsed is up to you, and you need to provide a template.

    Do you want to construct a workbook with sheets for each items displaying the items features [maybe a picture] - 1 per sheet... or an Excel database, or an Access table with features encoded as HTML and picture as binary/ole field??????????

  13. #13
    Stan,
    Just got back, will dive into this, this afternoon, parkas are good, but I don't wear women's underwear regardless of what anyone here says....

    What I ultimately would like to do is provide a list of desired items and find size, spec information and descriptions, retrieving pictures might actually be very helpful too.
    my site: www.ecboardco.com
    was built w/ a majority of the assistance from the board members here... thanks VBAX.

    Just because I see something, doesn't mean that what's actually happening is what I see.

    You don't get from 0-90 by standing still!

  14. #14
    VBAX Master stanl's Avatar
    Joined
    Jan 2005
    Posts
    1,141
    Location
    Quote Originally Posted by YellowLabPro
    What I ultimately would like to do is provide a list of desired items and find size, spec information and descriptions, retrieving pictures might actually be very helpful too.
    Consider this the final cut. XML recordsets don't generate much interest in this forum, so I wrote a sub to parse data into an Access Table, and included pictures. I have attached the zipped .mdb with 1 row (the first snowboard) as the entire file is ~1.8 meg. The following sub will open the table and create an excel worksheet based on the info in the data. You will note I made a comment about using the clipboard for two fields since it would parse the data into separate rows instead of a text block... but consider that homework.

    If this is in the ballpark of the type of output you want, I will post the sub to create the table for all snowboards. Again, this is just my method for web-scraping - collect all links.images and detail in a database and generate worksheets locally as needed. Make sure you unzip the .mdb and place it in the same subdir as the excel file you put the code into.

    Stan


    [vba]
    Sub createitem()
    cPath = ActiveWorkbook.Path
    cMDB = cPath & "\wtest.mdb"
    If Dir(cMDB) = "" Then Exit Sub
    cProv = "MicroSoft.Jet.OLEDB.4.0"
    cConn = "Provider=" & cProv & ";Data Source=" & cMDB & ";"
    Set oRS = CreateObject("Adodb.Recordset")
    oRS.Open "Items", cConn, 1, 3, 2
    oRS.MoveFirst
    nOld = Application.SheetsInNewWorkbook
    Application.SheetsInNewWorkbook = 1
    Application.Workbooks.Add
    Application.SheetsInNewWorkbook = nOld
    Set oWS = Application.ActiveWorkbook.Worksheets(1)
    cName = oRS.Collect("Item")
    If Len(cName) > 30 Then cName = Mid(cName, 1, 30)
    oWS.Name = cName
    oWS.Activate
    Application.ActiveWindow.DisplayGridlines = False
    oWS.Range("A1").Value = oRS.Collect("Item")
    oWS.Range("A2").Value = oRS.Collect("Price")
    'with these next 2 it would be better to copy
    'the field data to the clipboard then paste
    'to the range.
    oWS.Range("A4").Value = oRS.Collect("Details")
    oWS.Range("A16").Value = oRS.Collect("Descrip")
    oWS.Columns("A:A").Select
    Application.Selection.ColumnWidth = 54.14
    Application.Selection.WrapText = True
    Set oS = CreateObject("ADODB.Stream")
    oS.Type = 1
    oS.Open
    oS.Write (oRS.Collect("img"))
    oS.Position = 0
    cPict = cPath & "\" & "test.jpg"
    oS.SaveToFile cPict, 2
    oS.Close
    oS = 0
    oWS.Range("C1").Select
    oWS.Pictures.Insert(cPict).Select
    Application.ScreenUpdating = 1
    Set oWS = Nothing
    oRS.Close
    Set oRS = Nothing
    End Sub
    [/vba]

  15. #15
    Thanks Stan,
    I just downloaded and launched the db file in Access.

    Coincidentially I am reading about Access now, started on it earlier this week for a different project. What I am saying is, "this is new to me, very new".

    I installed the latest Sub createitem() in the "gear.xls" workbook and put the .mdb file in the WebScraping directory on the desktop.
    Running the sub this pulled the record that was in the .mdb file into a new created worksheet which loaded up the specs, description and the picture in a new workbook.
    This looks Fantastic so far!

    You write that I have some homework- the two text fields- I am thinking maybe you mean what will I want to do w/ it from here, but we can handle that a little later, I am thinking...

    I see what you mean now regarding placing each item in its own sheet or collected together now.

    I would opt for all similar related products to be cataloged on the same sheet, all snowboards on one or many sheets depending how many we pull down.

    Ok- now that I have done that... what's next? Do you need to send me the sub for all snowboards you wrote?

    Stan- hope I don't come off as a complete rock- trying to follow along here....

    Thanks for all your help!

    Doug
    my site: www.ecboardco.com
    was built w/ a majority of the assistance from the board members here... thanks VBAX.

    Just because I see something, doesn't mean that what's actually happening is what I see.

    You don't get from 0-90 by standing still!

  16. #16
    VBAX Master stanl's Avatar
    Joined
    Jan 2005
    Posts
    1,141
    Location
    Quote Originally Posted by YellowLabPro
    You write that I have some homework- the two text fields-
    No, either you, or maybe one of the gurus here should enhance the sub with api calls so the details and descrip fields can be cut/pasted via the clipboard rather than assigned from the table. As it, the features are all a block of text in range("A4")... if the data were paste in it would separate into rows based on the vbcrlfs in the text block. I find working w/clipboard in VBA challenging, and prefer using C functions from Winbatch. I'll get the routine to create the table posted in a bit. Stan

  17. #17
    Thanks Stan!
    my site: www.ecboardco.com
    was built w/ a majority of the assistance from the board members here... thanks VBAX.

    Just because I see something, doesn't mean that what's actually happening is what I see.

    You don't get from 0-90 by standing still!

  18. #18
    VBAX Master Norie's Avatar
    Joined
    Jan 2005
    Location
    Stirling, Scotland
    Posts
    1,831
    Location
    Stan

    Is API really necessary?

    Couldn't the HTML/DTHML object model be exploited?

    That's what I did in post 3, using the links collection.

    Admittedly it was a fairly simplistic example.

  19. #19
    VBAX Master stanl's Avatar
    Joined
    Jan 2005
    Posts
    1,141
    Location
    Quote Originally Posted by Norie
    Stan
    Is API really necessary?
    Couldn't the HTML/DTHML object model be exploited?
    That had nothing to do with DHTML - it was concerned with formatting text extracted from memo fields into Excel cells. But there are certainly other methods to get from point A to point B. Here is the output from my WB code which used the clipboard for the details and descrip fields. Compare it to what is produce by my previous post. Stan

  20. #20
    Stan,
    Both look great, really. The second one is nice w/out the dangling characters at the end of the text strings and in individual cells. It is also nice in that I can do an InStr search to extract data if need be, where the other one may or may not allow that to happen or possibley not as easy.
    Should I be doing anything to help w/ this at this point?
    my site: www.ecboardco.com
    was built w/ a majority of the assistance from the board members here... thanks VBAX.

    Just because I see something, doesn't mean that what's actually happening is what I see.

    You don't get from 0-90 by standing still!

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •