Downloading many HTML web pages, converting to text, then parsing out



MachaMacha
06-30-2008, 06:24 AM
Hello,

I have about 100 webpages that I would like to download as text, and then parse out the 21st table from. I have the code set up to access all the websites as web queries but do not know how to download each of them as a text file.

Thanks.

Oorang
07-09-2008, 07:57 AM
Do you want the text of the webpages or the HTML?

MachaMacha
07-09-2008, 08:21 AM
The text. I only need certain text and data from the tables.

But it's ok if I get the table or the text. I really just need the 21st table from the 100 pages in ANY form. thx.

Oorang
07-09-2008, 10:50 AM
I'd be interested to see your web query code, but here is a very quick-and-dirty way:

Public Sub GetWebText()
    'MAKE SURE YOU SET A REFERENCE TO:
    'shdocvw.dll (Microsoft Internet Controls)
    'mshtml.tlb (Microsoft HTML Object Library)
    Dim objIE As SHDocVw.InternetExplorer
    Dim ieDoc As MSHTML.HTMLDocument
    Dim ws As Excel.Worksheet
    Dim strURL As String
    Dim lngRow As Long
    Dim strText As String

    Set ws = ActiveSheet
    ws.Cells.WrapText = False
    'Create Internet Explorer Object
    Set objIE = New SHDocVw.InternetExplorer

    'URLs are read from column A, starting at row 1, until the first blank cell
    Do
        lngRow = lngRow + 1
        strURL = ws.Cells(lngRow, 1).Value
        If Not CBool(LenB(strURL)) Then
            Exit Do
        End If
        'Navigate the URL
        objIE.Navigate strURL
        'Wait for page to load (DoEvents keeps Excel responsive while waiting)
        Do Until objIE.ReadyState = READYSTATE_COMPLETE
            DoEvents
        Loop
        'Get document object
        Set ieDoc = Nothing
        Do While ieDoc Is Nothing
            Set ieDoc = objIE.Document
            DoEvents
        Loop
        'Read the page text, retrying until the body is available
        strText = vbNullString
        On Error Resume Next
        Do
            Err.Clear
            strText = ieDoc.body.innerText
            DoEvents
        Loop While Err.Number
        On Error GoTo 0
        'Write the text next to its URL, in column B
        ws.Cells(lngRow, 2).Value = strText
    Loop
    objIE.Quit
    Set objIE = Nothing
End Sub
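
Since you only need the 21st table, here is a rough sketch of pulling just that table's text out of the document instead of the whole body. GetTableText is a name I made up for illustration, and it assumes "21st" means the 21st <table> element in document order (nested tables are counted too), so the index may need adjusting for your pages:

```vba
'Hypothetical helper: returns the text of the Nth <table> in the page,
'or an empty string if the page has fewer than N tables.
Public Function GetTableText(ByVal ieDoc As MSHTML.HTMLDocument, _
                             ByVal lngIndex As Long) As String
    Dim objTables As Object

    'All <table> elements, in document order (includes nested tables)
    Set objTables = ieDoc.getElementsByTagName("table")

    'The collection is zero-based, so the 21st table is item 20
    If lngIndex >= 1 And lngIndex <= objTables.Length Then
        GetTableText = objTables.Item(lngIndex - 1).innerText
    End If
End Function
```

To try it, swap the line that reads ieDoc.body.innerText for strText = GetTableText(ieDoc, 21).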

MachaMacha
07-09-2008, 11:42 AM
Thanks. I currently can't set a reference to shdocvw.dll, but I will try the code above when I can and get back to you on it. I have another question.

I am on a website where I input a deal number, and the address of the resulting page looks like this:

http://zizizizizi.com/zizizizizi/cust/qcksearch/qcksearch_search_result.asp?searchident=qcksearch&startkey=0&search=2&searchquery=03l61fak5&redir_url=/zizizizizi/cust/qcksearch/qcksearch%5Fsearch%5Fresult.asp&bhcp=1
where the 03761fab5 after searchquery= is the deal number. Once I get to the page after the query, I click on a different link and the resulting address in the address bar is

http://zizizizizi.com/zizizizizi/cust/qcksearch/qckSearch_search_result.asp?n_id=400038356&searchQuery=03761fab5&search=2&.....

The 400038356 is the site's deal id which I need in order to quickly query many tables. Is there a way to extract the address in the address bar as a query so I can parse out the 9 digits in ...search_result.asp?n_id=400038356&searchQuery...?

Oorang
07-09-2008, 12:30 PM
Why can't you use shdocvw? Are you running something other than Windows?
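
If the only problem is that you can't tick the reference, late binding may get around it. This is just a sketch, assuming Internet Explorer is installed and can be started through CreateObject; GetPageTextLateBound is a made-up name, and the READYSTATE_COMPLETE constant is redefined locally because it normally comes from the shdocvw type library:

```vba
'Hypothetical late-bound version: no references to shdocvw.dll or mshtml.tlb needed.
Public Function GetPageTextLateBound(ByVal strURL As String) As String
    'Normally supplied by the shdocvw type library; its value is 4
    Const READYSTATE_COMPLETE As Long = 4
    Dim objIE As Object

    Set objIE = CreateObject("InternetExplorer.Application")
    objIE.Navigate strURL

    'Wait for the page to finish loading, keeping Excel responsive
    Do While objIE.Busy Or objIE.ReadyState <> READYSTATE_COMPLETE
        DoEvents
    Loop

    GetPageTextLateBound = objIE.Document.body.innerText

    objIE.Quit
    Set objIE = Nothing
End Function
```

This starts and closes a browser for every call, which is slow for 100 pages; for a real run you would create objIE once and reuse it across the loop.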

Please post your web query code, and I'll see if I can hit on an alternative.
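
On the n_id question: once the second page has loaded, the address bar contents should be available as a string from the browser object (objIE.LocationURL in the code I posted), so it is mostly string parsing. A minimal sketch, assuming the id always follows "n_id=" and runs up to the next "&"; ExtractQueryValue is a made-up helper name:

```vba
'Hypothetical helper: returns the value of one query-string parameter,
'or an empty string if the parameter is not present.
Public Function ExtractQueryValue(ByVal strURL As String, _
                                  ByVal strKey As String) As String
    Dim strToken As String
    Dim lngStart As Long
    Dim lngEnd As Long

    strToken = strKey & "="

    'The parameter is preceded by "?" (first) or "&" (any later position)
    lngStart = InStr(1, strURL, "?" & strToken, vbTextCompare)
    If lngStart = 0 Then
        lngStart = InStr(1, strURL, "&" & strToken, vbTextCompare)
    End If
    If lngStart = 0 Then Exit Function

    'Skip past the separator and "key=" to the first character of the value
    lngStart = lngStart + Len(strToken) + 1

    'The value ends at the next "&" or at the end of the URL
    lngEnd = InStr(lngStart, strURL, "&")
    If lngEnd = 0 Then lngEnd = Len(strURL) + 1

    ExtractQueryValue = Mid$(strURL, lngStart, lngEnd - lngStart)
End Function
```

With the address from your post, ExtractQueryValue(objIE.LocationURL, "n_id") should give back 400038356.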