PDA

View Full Version : Download multi page table to Excel.



mdmackillop
01-31-2007, 03:51 AM
Not my area of expertise, so would someone be kind enough to write me some code to copy the table in 1 or more parts shown here and paste it into Excel. Each days report would be pasted below the previous one.
Thanks
Malcolm
http://ted.europa.eu/Exec?DataFlow=ShowPage.dfl&TableName=Ted_EN&Template=TED/N_result_list.htm&Page=1&toconf=yes

Edit:
I've just come across a time-out problem. Maybe a bit more thought needed!

mvidas
02-08-2007, 09:35 AM
Still need help here?

mdmackillop
02-08-2007, 04:40 PM
Hi Matt,
Long time no see!
No urgency on this. I'll repost once I resolve the query parameters
Thanks
Malcolm

stanl
02-10-2007, 03:59 AM
Still need help here?

Could you give a desired navigation? Which language to select for display; which section to scrape tables from, etc...

This way I can see if HTTP transfer and local parsing of tables is a viable solution (it would avoid any timeout issues with the site). Stan

mdmackillop
02-10-2007, 04:07 AM
Hi Stan,
I thought I had closed this! but as you kindly ask:

Site Address - http://ted.europa.eu/
Select EN (8th item)
Country - GB
Business Opportunities - Contracts
Download resultant table and subsequent pages

There's no rush on this, but thanks for any help.
Regards
Malcolm

stanl
02-10-2007, 05:47 AM
You can avoid the timeout with an HTTP Post (I used the freeware xStandard http.dll since it includes Tidy to ensure well-formed html is returned. I was able to make post but the response file (attached)was not what I expected. I could navigate directly with the link below, but it comes up in French, not English [quelle merde]. However, you should be able to open the htm file I attached and navigate to the correct page. At that point you can run web queries for the tables you need, or write script code (I have a DHTML analyzer I compiled which can assist in this)


http://ted.europa.eu/Exec?Template=TED/homepage.htm&DataFlow=hRead.dfl&hpt=ALL&StatLang=EN

stanl
02-10-2007, 09:58 AM
Well, if this was an HTTPS site I would expect issues, but the design is either idiotic or brilliant. I've done something similar [web-scraping the yellow pages for address and phone info for all schools, churches, public offices by county/city in North Carolina]. The main point was handling tables with multiple pages.

I used xPath, gathered everything as a database table, then generated Excel workbooks as needed. Don't know what you expected for an answer, but I can probably help you through contracts[under business opportunities] for GB..... Stan

mdmackillop
02-10-2007, 10:31 AM
Hi Stan
Web Query just seems to return some page layout data in French.
I tried the Help example while I had the page open, but still no luck.

Sub EUTest()
Set shFirstQtr = Workbooks(1).Worksheets(1)
Set qtQtrResults = shFirstQtr.QueryTables _
.Add(Connection:="URL;http://ted.europa.eu/Exec?DataFlow=ShowPage.dfl&TableName=Ted_EN&Template=TED/N_result_list.htm&Page=1&toconf=yes", _
Destination:=shFirstQtr.Cells(1, 1))
With qtQtrResults
.WebFormatting = xlNone
.WebSelectionType = xlSpecifiedTables
.WebTables = "1,2"
.Refresh
End With
End Sub

stanl
02-10-2007, 03:36 PM
The url you used gets me nothing. Maybe I missed the point, but is the attached a table of interest; because if so it is multi-paged and I think requires building the data not just performing a web query. Are you needing the data in Excel formatted the same as the HTML? Stan

mdmackillop
02-10-2007, 03:45 PM
Hi Stan
According to FAQ there is some sort of IP identification which may prevent use of my connection from another PC. I hoped that Excel could get the data from the tables while I had the page open.
It is the table which is of interest. I would like, if it's in excel to carry out a search to determnine items of interest. I'm not too worried about formatting etc. I can handle that if I can get the data.
I would stress that this is of a low priority, but it will save our secretary a bit of copying and pasting on a daily basis!

stanl
02-11-2007, 05:40 AM
How about:
1. Navigate to Browse by Location (NUTS)
2. re-navigate to United Kingdom [all entries]
3. Parse multiple pages into recordset
4. Either copy recordset directly into Excel and use as basis for Pivot Table, or store/archive in Access and perform Jet 4.0 SELECT...INTO queries to create Excel sheets as needed.
:dunno Stan [sans login/pw]

mdmackillop
02-11-2007, 05:50 AM
That sound workable. I'd prefer Excel for simpler manipulation and subsequent processing.

stanl
02-11-2007, 01:36 PM
Well (I explained more in private mail) I can get to here, viz. converting table data to csv as a memvar. So concatenating multiple pages, saving the memvar as a csv file and loading into excel is doable. Stan

P.S. meant to use | as delimiter, instaed of ,

mdmackillop
02-11-2007, 02:07 PM
Thanks Stan

stanl
02-12-2007, 07:36 AM
Thanks Stan

maybe...maybe not... I have attached a | separated file which parsed all contracts for UK [324 as of last count]. I see why they call it (NUTS):doh: Anyway, if this is something you can work with in Excel, I think my script will work and avoid the Timeouts. I originally used HTTP transfer thinking the URL constructed a sessionid, but it seems [I tested 8 times in a row] that the IE Object will suffice.

1 caveat: I used Winbatch, since the task required parsing text, I'm not that jazzed w/the filesystem object, all the Dim's and Set's involved - however, the IE part is 99% compatible with VBA, so you can use your own parsing functions. If the csv works, I can set out the steps in 99% compatible pseudo-code. Stan

stanl
02-13-2007, 07:11 AM
Good thing I ran it again (as the site moved from week 6 to week 7, and that threw a curve at final parsing). I've attached week 7's data as csv and Excel - notice what happens to some dates, which I assume is an easy fix for you.

If any of this looks doable for you, I'll go over the web-scraping 101 code, try not to bore you with things you already know... otherwise I'll fade away from this thread for fear of being annoying.

Stan

mdmackillop
02-13-2007, 09:55 AM
Hi Stan,
I managed a quick look at what you did and played around with excel formatting. I can readily extract from the csv for 90%. The other 10% are multiline descriptions where the second line in effect deletes the following item. (see item 14 of your first results file). I'll try and look at your last post tonight.
No fear of your being annoying; I'm just a bit short of time just now.
Regards
Malcolm

stanl
02-16-2007, 07:09 AM
I'm just a bit short of time just now.
Malcolm

Well, like you said...low priority... the code below is the best I can do for brute force processing of your data. Since few have responded to this thread, I'm thinking there is no easy web-query or 10 lines of code. Usually, with web-scraping I like to run my scripts independent so I did my best to modify my 17 lines of script code as a sub. I'm sure it can be optimized, but I'd like to think at least I did the DHTML investigations to obtain the correct references. As in my last post, I get a file that imports to Excel [as I attached ion that post] Stan



Sub GetTed()
'Reference MS Internet Controls
Dim IE As InternetExplorer
Set IE = New InternetExplorer
'first naviagte to home page
IE.Visible = True
IE.Navigate "http://ted.europa.eu/"

'Wait until page is loaded. not recommended, probably need to
'exit after 30 seconds, I'd write it as a separate function
Do Until IE.ReadyState = READYSTATE_COMPLETE: Loop
'set up a var go which assumes failure for all navigations
go=0
'make sure it's in English - or choose whatever language by
'specifying part of the innertext
For i= 0 To IE.Document.GetElementsByTagName("A").length-1
s= IE.Document.GetElementsByTagName("A").Item(i)
If s.GetAttribute("innerText")<>"" AND Instr(s.InnerText,"Official Journal") Then
s.click()
go=1
Exit For
End If
Next

If go=0 Then
goto end
End If
Do Until IE.ReadyState = READYSTATE_COMPLETE: Loop
'Now move to NUTS
go=0
For i= 0 To IE.Document.GetElementsByTagName("A").length-1
s= IE.Document.GetElementsByTagName("A").Item(i)
If s.GetAttribute("innerText")<>"" AND Instr(s.InnerText,"(NUTS)") Then
s.click()
go=1
Exit For
End If
Next
End If
If go=0 Then
goto end
End If
Do Until IE.ReadyState = READYSTATE_COMPLETE: Loop
go=0
'this is the tricky part, the next page has links to tables for all countries
'but does not include an innertext to test with, I noticed that GB suddenly becomes
'UK - which is the last in the list and [hoprefully] always -10 from the last
'anchor
n = oIE.Document.GetElementsByTagName("A").length-10
If n>0 Then
oIE.Document.GetElementsByTagName("A").Item(n).click
go=1
Else
goto end
End If
'bring First Table
Do Until IE.ReadyState = READYSTATE_COMPLETE: Loop
go=0
cTable = "Ref,Details,Country,Start,End" & vbcrlf
If IE.Document.Body.GetElementsByTagName("TABLE").Item(39)
cTable = cTable & gettable(IE.Document.Body.GetElementsByTagName("TABLE").Item(39).InnerText)
go=1
x=2
'loop through subsequent tables if they exist
While go=1
n=oIE.Document.GetElementsByTagName("A").length-1
For i= 0 To n
If oIE.Document.GetElementsByTagName("A").Item(i) Then
s= oIE.Document.GetElementsByTagName("A").Item(i)
If s.GetAttribute("innerText")<>"" AND s.InnerText=x Then
s.click()
Do Until IE.ReadyState = READYSTATE_COMPLETE: Loop
cTable = cTable & gettable(IE.Document.Body.GetElementsByTagName("TABLE").Item(39).InnerText)
x=x+1
Else
go=0
End if
End if
Loop
End If
'at this point you should have a csv memvar which you can outpout to a file or
'position back into a worksheet... your choice
:end
Set IE = Nothing
End Sub
Public Function gettable(c AS String)
cTxt = cTxt=Mid(c,Instr(c,"Deadline",8))
cTxt=Replace(cTxt,",",";")
cTxt=Replace(cTxt," ",",")
cTxt=Replace(cTxt,",,",",")
cTxt=Replace(cTxt,", ",",")
Return cTxt
End Function

mdmackillop
02-18-2007, 05:05 AM
Hi Stan
Thanks for the code.
I've had to make a few amandments to get it to run in Excel. I've some glitches to work out with the additional pages. There's also a problem with Permissions. I can step through the code, but get permission denied errors when trying to run it. Debugging and continuing the code works OK, so maybe I just need to build in a Wait.

Sub GetTed()
'Reference MS Internet Controls
Dim IE As InternetExplorer
Dim s As Object
Set IE = New InternetExplorer
'first naviagte to home page
IE.Visible = True
Application.StatusBar = "Opening Internet Explorer..."
IE.Navigate "http://ted.europa.eu/"

'Wait until page is loaded. not recommended, probably need to
'exit after 30 seconds, I'd write it as a separate function
Do Until IE.ReadyState = READYSTATE_COMPLETE: Loop
'set up a var go which assumes failure for all navigations
go = 0

Application.StatusBar = "Opening English version..."
'make sure it's in English - or choose whatever language by
'specifying part of the innertext
For i = 0 To IE.Document.GetElementsByTagName("A").Length - 1
Set s = IE.Document.GetElementsByTagName("A").Item(i)
txt = s.GetAttribute("innerText")
If txt <> "" Then
If InStr(1, txt, "Official Journal") > 0 Then
'MsgBox txt
s.Click
go = 1
Exit For
End If
End If
Next
Set s = Nothing

Application.StatusBar = "Going to Location (NUTS)..."
If go = 0 Then
GoTo Endd
End If
Do Until IE.ReadyState = READYSTATE_COMPLETE: Loop
'Now move to NUTS
go = 0
For i = 0 To IE.Document.GetElementsByTagName("A").Length - 1
Set s = IE.Document.GetElementsByTagName("A").Item(i)
txt = IE.Document.GetElementsByTagName("A").Item(i).GetAttribute("innerText")
If txt <> "" Then
If InStr(1, txt, "(NUTS)") > 0 Then
'MsgBox txt
s.Click
go = 1
Exit For
End If
End If
Next
Set s = Nothing
If go = 0 Then
GoTo Endd
End If
Do Until IE.ReadyState = READYSTATE_COMPLETE: Loop
go = 0
'this is the tricky part, the next page has links to tables for all countries
'but does not include an innertext to test with, I noticed that GB suddenly becomes
'UK - which is the last in the list and [hopefully] always -10 from the last
'anchor

Application.StatusBar = "Going to UK page..."
n = IE.Document.GetElementsByTagName("A").Length - 10
If n > 0 Then
IE.Document.GetElementsByTagName("A").Item(n).Click
go = 1
Else
GoTo Endd
End If
'bring First Table
Do Until IE.ReadyState = READYSTATE_COMPLETE: Loop
go = 0

Application.StatusBar = "Create table from first page..."
ctable = "Ref,Details,Country,Start,End" & vbCrLf
'If IE.Document.Body.GetElementsByTagName("TABLE").Item(39)
ctable = ctable & IE.Document.Body.GetElementsByTagName("TABLE").Item(39).innertext
Debug.Print ctable
ctable = ctable & IE.Document.Body.GetElementsByTagName("TABLE").Item(39).innertext
go = 1
x = 2
'loop through subsequent tables if they exist

Application.StatusBar = "Getting subsequent tables..."
While go = 1
n = IE.Document.GetElementsByTagName("A").Length - 1
For i = 0 To n
'If IE.Document.GetElementsByTagName("A").Item(i) Then
Set s = IE.Document.GetElementsByTagName("A").Item(i)
If IE.Document.GetElementsByTagName("A").Item(i).GetAttribute("innerText") _
<> "" And IE.Document.GetElementsByTagName("A").Item(i).GetAttribute("innerText") = x Then
MsgBox s 's.click()
Do Until IE.ReadyState = READYSTATE_COMPLETE: Loop
ctable = ctable & GetTable(IE.Document.Body.GetElementsByTagName("TABLE").Item(39).innertext)
x = x + 1
Else
go = 0
End If
'End If
Next
Wend
'at this point you should have a csv memvar which you can outpout to a file or
'position back into a worksheet... your choice
Endd:
Application.StatusBar = ""
Set IE = Nothing
End Sub
Public Function GetTable(c As String)
cTxt = Mid(c, InStr(c, "Deadline", 8))
cTxt = Replace(cTxt, ",", ";")
cTxt = Replace(cTxt, " ", ",")
cTxt = Replace(cTxt, ",,", ",")
cTxt = Replace(cTxt, ", ", ",")
GetTable cTxt
End Function


I've also tried accessing Contracts (which is the bit required) from alternative controls. Still a problem with Permissions, but I'm getting there with the subsequent pages. I'm looking to remove some of the looping once I identify the proper controls.


Sub GetTed2()
'Reference MS Internet Controls
Dim IE As InternetExplorer, s As Object
'first naviagte to home page
Set IE = New InternetExplorer
IE.Visible = False
Application.StatusBar = "Opening Interner Explorer..."
IE.Navigate "http://ted.europa.eu/Exec?DataFlow=hRead.dfl&Template=TED/homepage.htm&hpt=GB&StatLang=EN"

Application.StatusBar = "Opening 'Contract Notice'..."
Do Until IE.ReadyState = READYSTATE_COMPLETE: Loop
tmp = IE.Document.GetElementsByTagName("A").Length - 1
For i = 0 To tmp 'IE.Document.GetElementsByTagName("A").Length - 1
Set s = IE.Document.GetElementsByTagName("A").Item(i)
'Select Contract list
If InStr(1, s, "Generatequery('GB', '3')") > 0 Then
s.Click
Exit For
End If
Next

Do Until IE.ReadyState = READYSTATE_COMPLETE: Loop

Application.StatusBar = "Getting first table..."
'Get first table
ctable = IE.Document.Body.GetElementsByTagName("TABLE").Item(38).innertext
'Split table into rows
tmp = Split(ctable, Chr(13))
Rw = 0

Application.StatusBar = "Split table and paste into rows..."
For Each t In tmp
Rw = Rw + 1
Sheets(1).Cells(Rw, 1) = t
Next

Application.StatusBar = "Get next page..."
tmp = IE.Document.GetElementsByTagName("A").Length - 1
k = 2
For i = 0 To tmp
Set s = IE.Document.GetElementsByTagName("A").Item(i)
If InStr(1, s, "gotopageres('" & k & "')") > 0 Then
s.Click
k = k + 1

Do Until IE.ReadyState = READYSTATE_COMPLETE: Loop
Application.StatusBar = "Getting next table..."
ctable = IE.Document.Body.GetElementsByTagName("TABLE").Item(38).innertext
tmp = Split(ctable, Chr(13))
Rw = Cells(Rows.Count, 1).End(xlUp).Row
Application.StatusBar = "Splitting table into rows..."
For Each t In tmp
Rw = Rw + 1
Sheets(1).Cells(Rw, 1) = t
Next
End If
Next
Set IE = Nothing
Application.StatusBar = ""
Exit Sub
End Sub


Thanks very much for the assistance, this is all new to me!
Regards
Malcolm

stanl
02-18-2007, 09:21 AM
I'm sure once you get this ironed out, future web-scrapes will be a breeze. I appreciate your patience with the way I kludged my original script. I have a feeling a careful use of execscript() can short-circuit some of the page-2-page navigation and looping. I have a DHTML analyzer exe I'm in the process of upgrading. You select the objects you want interrogated for a particular page, and it breaks them down into an output text file. Stan

mdmackillop
02-18-2007, 11:42 AM
I'll play around with it a bit more and check out exescript etc. Maybe one day I'll be able to answer a question on "scraping"!

stanl
02-18-2007, 01:55 PM
I'll play around with it a bit more and check out exescript etc. Maybe one day I'll be able to answer a question on "scraping"!

Really sorry it is giving you a time of it. I just ran my original script again and got 307 hits. I'll compile it to an EXE, and send you a link via Private Mail. If it works on your system, then it should work in VBA, and we both learn something. Stan

mdmackillop
02-18-2007, 02:58 PM
Thanks Stan. All part of the learning process.