Thread: Downloading many html web pages, converting to text, then parsing out

  1. #1

    Hello,

    I have about 100 web pages that I would like to download as text and then extract the 21st table from each one. My code already accesses all of the sites as web queries, but I do not know how to save each page as a text file.

    Thanks.

  2. #2
    Oorang (Knowledge Base Approver, VBAX Master) · Joined Jan 2007 · Posts 1,135
    Do you want the text of the webpages or the HTML?
    Cordially,
    Aaron



    Keep Our Board Clean!
    • Please Mark your thread "Solved" if you get an acceptable response (under thread tools).
    • Enclose your code in VBA tags then it will be formatted as per the VBIDE to improve readability.

  3. #3
    The text. I only need certain text and data from the tables.

    But it's OK if I get either the table or the text. I really just need the 21st table from each of the 100 pages, in ANY form. Thanks.
    Last edited by Oorang; 07-09-2008 at 10:08 AM. Reason: Merged concurrent posts by same user.

  4. #4
    Oorang
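    I'd be interested to see your web query code, but here is a very quick-and-dirty way. It reads URLs from column A of the active sheet and writes each page's text into column B: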

    [VBA]Public Sub GetWebText()
        'MAKE SURE YOU SET A REFERENCE TO:
        'shdocvw.dll (Microsoft Internet Controls)
        'mshtml.tlb (Microsoft HTML Object Library)
        Dim objIE As SHDocVw.InternetExplorer
        Dim ieDoc As MSHTML.HTMLDocument
        Dim ws As Excel.Worksheet
        Dim strURL As String
        Dim lngRow As Long
        Dim strText As String

        Set ws = ActiveSheet
        ws.Cells.WrapText = False
        'Create Internet Explorer Object
        Set objIE = New SHDocVw.InternetExplorer

        'Walk down column A until the first empty cell
        Do
            lngRow = lngRow + 1
            strURL = ws.Cells(lngRow, 1).Value
            If Not CBool(LenB(strURL)) Then
                Exit Do
            End If
            'Navigate the URL
            objIE.Navigate strURL
            'Wait for page to load (DoEvents keeps Excel responsive while polling)
            Do Until objIE.ReadyState = READYSTATE_COMPLETE
                DoEvents
            Loop
            'Get document object
            Set ieDoc = Nothing
            Do While ieDoc Is Nothing
                Set ieDoc = objIE.Document
                DoEvents
            Loop
            'Retry until the body text can be read
            strText = vbNullString
            On Error Resume Next
            Do
                Err.Clear
                strText = ieDoc.body.innerText
                DoEvents
            Loop While Err.Number
            On Error GoTo 0
            'Write the page text next to its URL
            ws.Cells(lngRow, 2).Value = strText
        Loop
        objIE.Quit
    End Sub
    [/VBA]
    Cordially,
    Aaron
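
    Since only the 21st table is needed and the goal was a text file per page, the loop above could be narrowed along these lines. This is only a rough sketch: GetTableText and SaveTextFile are hypothetical helper names, the output path is a placeholder, and it assumes the same two references as the code above. Note that getElementsByTagName counts nested tables in document order, so "21st" here may not match what you count by eye.

    [VBA]Public Function GetTableText(ByVal ieDoc As MSHTML.HTMLDocument, ByVal lngTable As Long) As String
        'Returns the text of the lngTable-th <table> (1-based),
        'or vbNullString if the page has fewer tables than that
        Dim objTables As MSHTML.IHTMLElementCollection

        Set objTables = ieDoc.getElementsByTagName("table")
        If lngTable >= 1 And lngTable <= objTables.Length Then
            'The collection itself is zero-based
            GetTableText = objTables.Item(lngTable - 1).innerText
        End If
    End Function

    Public Sub SaveTextFile(ByVal strPath As String, ByVal strText As String)
        'Writes strText to strPath, replacing any existing file
        Dim intFile As Integer

        intFile = FreeFile
        Open strPath For Output As #intFile
        Print #intFile, strText
        Close #intFile
    End Sub
    [/VBA]

    Inside the loop you would then replace the innerText line with something like strText = GetTableText(ieDoc, 21), followed by SaveTextFile "C:\Temp\page" & lngRow & ".txt", strText (the folder is an assumption and must already exist). Print # writes ANSI text, so characters outside the system code page may not survive.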

  5. #5
    Thanks. I currently can't set a reference to shdocvw.dll, but I will try the code above when I can and get back to you on it. I have another question.

    I am on a website where I enter a deal number, and the resulting page's address looks like this:
    HTML Code:
    http://zizizizizi.com/zizizizizi/cust/qcksearch/qcksearch_search_result.asp?searchident=qcksearch&startkey=0&search=2&searchquery=03l61fak5&redir_url=/zizizizizi/cust/qcksearch/qcksearch%5Fsearch%5Fresult.asp&bhcp=1
    where the 03761fab5 after searchquery= is the deal number. Once I get to the page after the query, I click on a different link, and the resulting address in the address bar is:
    HTML Code:
    http://zizizizizi.com/zizizizizi/cust/qcksearch/qckSearch_search_result.asp?n_id=400038356&searchQuery=03761fab5&search=2&
    .....

    The 400038356 is the site's deal ID, which I need in order to quickly query many tables. Is there a way to read the address from the address bar so I can parse out the 9 digits in ...search_result.asp?n_id=400038356&searchQuery...?
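
    For the parsing half of that question, a small string function may be enough. The sketch below is only an illustration: GetQueryValue is a hypothetical name, and it assumes the address is already available as a string (when automating Internet Explorer, the LocationURL property of the InternetExplorer object returns the current address).

    [VBA]Public Function GetQueryValue(ByVal strURL As String, ByVal strKey As String) As String
        'Returns the value of strKey from the URL's query string,
        'or vbNullString if the key is not present
        Dim lngPos As Long
        Dim lngEq As Long
        Dim strQuery As String
        Dim varPair As Variant

        'Everything after the first "?" is the query string
        lngPos = InStr(1, strURL, "?")
        If lngPos = 0 Then Exit Function
        strQuery = Mid$(strURL, lngPos + 1)

        'Drop a trailing #fragment, if any
        lngPos = InStr(1, strQuery, "#")
        If lngPos > 0 Then strQuery = Left$(strQuery, lngPos - 1)

        'Check each key=value pair; key names are compared case-insensitively
        For Each varPair In Split(strQuery, "&")
            lngEq = InStr(1, varPair, "=")
            If lngEq > 0 Then
                If StrComp(Left$(varPair, lngEq - 1), strKey, vbTextCompare) = 0 Then
                    GetQueryValue = Mid$(varPair, lngEq + 1)
                    Exit Function
                End If
            End If
        Next varPair
    End Function
    [/VBA]

    With the second address above, GetQueryValue("http://zizizizizi.com/zizizizizi/cust/qcksearch/qckSearch_search_result.asp?n_id=400038356&searchQuery=03761fab5&search=2&", "n_id") returns 400038356.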

  6. #6
    Oorang
    Why can't you use shdocvw? Are you running on something other than Windows?

    Please post your web query code, and I'll see if I can hit on an alternative.
    Cordially,
    Aaron
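
    If the obstacle is only that references cannot be set, one possible alternative is late binding, which needs no references at all. The sketch below swaps Internet Explorer for an MSXML2.XMLHTTP request plus an "htmlfile" document. GetTableTextLateBound is a hypothetical name, and the approach assumes the pages need no login and do not build their tables with script, since scripts are not run this way.

    [VBA]Public Function GetTableTextLateBound(ByVal strURL As String, ByVal lngTable As Long) As String
        'Returns the text of the lngTable-th <table> (1-based) at strURL,
        'or vbNullString if the request fails or the table does not exist.
        'No references required: both objects are created late-bound.
        Dim objHTTP As Object
        Dim objDoc As Object
        Dim objTables As Object

        'Fetch the raw HTML synchronously
        Set objHTTP = CreateObject("MSXML2.XMLHTTP")
        objHTTP.Open "GET", strURL, False
        objHTTP.send
        If objHTTP.Status <> 200 Then Exit Function

        'Load the HTML into an in-memory document so it can be queried
        Set objDoc = CreateObject("htmlfile")
        objDoc.body.innerHTML = objHTTP.responseText

        Set objTables = objDoc.getElementsByTagName("table")
        If lngTable >= 1 And lngTable <= objTables.Length Then
            'The collection itself is zero-based
            GetTableTextLateBound = objTables.Item(lngTable - 1).innerText
        End If
    End Function
    [/VBA]

    Called as GetTableTextLateBound(strURL, 21) for each URL in column A, this would stand in for the navigate-and-wait steps of the earlier routine. The table count can differ from what the browser shows, so check a page or two by hand first.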
