
View Full Version : Collect Data via The Web



YellowLabPro
09-12-2007, 05:12 PM
I have a question/request for anyone who has experience w/ extracting data from the web.

I would like to pull data regarding snowboards, description of the board, size information, length, width, sidecut, etc....
This would expand to other items I sell.

My VBA skills are still newbie-like, so most likely someone would have to provide the major structure of the procedure.

For instance if we use the Burton Custom X 158

Burton is the Brand
Custom X is the model
158 is the size

I would like to pull the data down in such a manner that it could be placed in a table.

If any other information is needed let me know...
This is something I have been thinking about for a while but have not gotten to the point of knowing how to begin such a query.

thanks,

Doug

anandbohra
09-12-2007, 11:36 PM
Maybe this thread of mine will help you with pulling even a single piece of data from the web through custom functions:


http://www.vbaexpress.com/forum/showpost.php?p=113797&postcount=4

Charlize
09-13-2007, 12:27 AM
- Provide us with a link to the internet page and I'll try to give you a solution.
1. store the form in an object
2. loop through the object to find the values that need to be filled in
3. fill them in
4. submit the form
5. if the result is a table, read the table and move the data to Excel (Norie has some great samples of this one)
For the other 4 items there are some other guys/girls here who are amazingly good at this (I'm still trying to understand everything).

Give it a day or two and you never know what comes out.
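The five steps above can be sketched in VBA with IE automation. This is only a rough sketch: the URL, the form field name ("q") and the search text are placeholder assumptions, since no real page has been named yet. Step 5 (reading a result table into Excel) is the part Norie has samples for.

```vba
Sub FillAndSubmitForm()
    ' Steps 1-4 sketched with IE automation; URL and field name are placeholders
    Dim ie As Object, doc As Object, elem As Object
    Set ie = CreateObject("InternetExplorer.Application")
    ie.Visible = True
    ie.Navigate "http://www.example.com/search"          ' placeholder URL
    Do While ie.Busy Or ie.ReadyState <> 4: DoEvents: Loop
    Set doc = ie.Document
    ' 2. loop through the form's input elements to find the ones to fill in
    For Each elem In doc.getElementsByTagName("INPUT")
        ' 3. fill in the value ("q" is an assumed field name)
        If elem.Name = "q" Then elem.Value = "Burton Custom X"
    Next elem
    doc.Forms(0).submit                                  ' 4. submit the first form on the page
End Sub
```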

stanl
09-13-2007, 02:50 AM
I have a question/request for anyone who has experience w/ extracting data from the web.


Two choices: (1) go through a browser, use DHTML tags and either file I/O or a DOM parser; (2) use HTTP/HTTPS requests [I prefer the xStandard freeware since it incorporates Tidy and renders fully formed XML] - then use XPath.

#2 is preferred if you are making multiple attempts since it works behind the scenes.
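Choice (2) can also be tried without the xStandard control, using the MSXML library that ships with Windows. A minimal sketch, with a placeholder URL; note this returns raw HTML rather than the tidied XML Stan describes.

```vba
Sub HttpGetSketch()
    ' Minimal background HTTP request - no browser window involved
    Dim http As Object
    Set http = CreateObject("MSXML2.XMLHTTP")
    http.Open "GET", "http://www.example.com/", False    ' placeholder URL, synchronous
    http.send
    If http.Status = 200 Then
        Debug.Print Left$(http.responseText, 200)        ' first 200 chars of the raw HTML
    End If
End Sub
```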

As for putting data into a table; I prefer building an in-memory Recordset and adding to it while visiting a site. It can then be persisted as XML, placed into Excel or a DB.

.02 Stan
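The in-memory Recordset idea can be sketched with a fabricated (disconnected) ADO Recordset; the field names and save path below are illustrative assumptions, not part of any posted solution.

```vba
Sub BuildInMemoryRecordset()
    ' Fabricated ADO Recordset: built in memory, no database connection needed
    Dim rs As Object
    Set rs = CreateObject("ADODB.Recordset")
    With rs
        .Fields.Append "Brand", 200, 50      ' 200 = adVarChar
        .Fields.Append "Model", 200, 50
        .Fields.Append "Size", 200, 10
        .Open
        .AddNew Array("Brand", "Model", "Size"), _
                Array("Burton", "Custom X", "158")
        .Update
        .Save "C:\boards.xml", 1             ' 1 = adPersistXML; fails if the file exists
        .Close
    End With
End Sub
```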

YellowLabPro
09-13-2007, 03:10 AM
Thanks every one for your suggestions.
Charlize- I don't have a particular site. The reason is that this is not a formal thing. It's not like the stock market, where all the information is stored uniformly; there is no sort of clearing house of information. That is especially true when I start looking for information regarding smaller items, for instance shoes. I was hoping that a search could be programmed in to tell it to look through any site with the listed item, the top 5 or 10.
If that is not possible, then for this particular case: the first link is to the site, the second link is to the drill-down of the specific product. As you can imagine, having to go through the process of drilling down on each item will make the project unfeasible. I am only stating the obvious to avoid any misunderstandings of the project's demands.

http://www.burton.com/Gear/Default.aspx#/mens/boards/matrix/boardMatrix.swf
http://www.burton.com/Gear/Default.aspx#/mens/boards/productDetail/70010/B80171091

The Third link is a drill-down approach too.
http://www.backcountry.com/store/BUR1485/c7/s49/Burton-Custom-X-Snowboard.html

Stanl-
With what I have provided, does this give you enough information to setup a test case for me to see what and how you would intend to retrieve the desired data?

stanl
09-13-2007, 04:19 AM
Stanl-
With what I have provided, does this give you enough information to setup a test case for me to see what and how you would intend to retrieve the desired data?

Actually it is well organized:

home page: http://www.backcountry.com

then the store and group - /store/group/

then page, class and subclass

295/c7/s49/

so http://www.backcountry.com/store/group/295/c7/s49/

should navigate to your desired page [assuming IE]. Because the pages are JScript, you would use IE's fireEvent() method on the "View All" anchor, then collect the resultant hrefs.

This would be, more or less, prep work. I would perform this pre-parsing to build a database of links. Then as needed, use HTTP to collect relevant facts, images... which could be placed into Excel, or an Access Table.

Which gets the specific items you are interested in... the challenge is to parse the entire catalog by page, class and subclass:think:

...not sure this is the best place to start with web-scraping if, as you say, you are that new to VBA, but we can maybe knock something out unless someone else comes up with a better mousetrap. Stan
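A rough sketch of the href-grab Stan describes might look like the following; the "View All" anchor text and the "/store/" filter are assumptions about the page markup, not verified against the live site.

```vba
Sub CollectBoardLinks()
    ' Navigate to the listing page, fire the "View All" anchor, harvest the hrefs
    Dim ie As Object, a As Object, r As Long
    Set ie = CreateObject("InternetExplorer.Application")
    ie.Visible = True
    ie.Navigate "http://www.backcountry.com/store/group/295/c7/s49/"
    Do While ie.Busy Or ie.ReadyState <> 4: DoEvents: Loop
    For Each a In ie.Document.getElementsByTagName("A")
        If InStr(1, a.innerText, "View All", vbTextCompare) > 0 Then
            a.FireEvent "onclick"            ' the pages are JScript-driven
            Exit For
        End If
    Next a
    Do While ie.Busy Or ie.ReadyState <> 4: DoEvents: Loop
    r = 1
    For Each a In ie.Document.getElementsByTagName("A")
        If InStr(a.href, "/store/") > 0 Then ' keep the product-looking links (assumed pattern)
            ActiveSheet.Cells(r, 1).Value = a.href
            r = r + 1
        End If
    Next a
    ie.Quit
End Sub
```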

stanl
09-13-2007, 06:03 AM
as a P.S.

this link is something I compiled http://www.mindspring.com/~stanl/Download/TCOsetup.exe

it is basic web-scraping for the Edmunds.com site to obtain True Cost of Ownership for cars. Don't fear the exe; it's just an Inno Setup package and it includes an uninstaller. You can choose a year, model, make, and features for a car - it will use IE to navigate to the Edmunds site, enter the data and pull back results into a newly created workbook, which you have the option to save or ignore. It's not written in VBA, but if you saw the source, you'd probably say "just like VBA"... it just happens to use late binding and ( ) around methods.

To be honest, pre-parsing a site like this or the one you suggested is good exercise... as both involve background DB's... so you get a feel for design if ever asked to create a shopping-cart application.

P.P.S - for you Firefox folks... there is a Mozilla.ocx {project} that simulates the IE DHTML

YellowLabPro
09-13-2007, 06:50 AM
Stan,
That looks great, though I have no idea where to start on this. Is there a list of items you can help me assemble so I can do some legwork?
The one problem I encountered with your program was when trying to save the item found - it was a Lexus 470, 2006. I don't know if this is the site's problem, the item, or what?
Attached is an image.

stanl
09-13-2007, 07:43 AM
aha! I had thought the app could only be installed from C:\ .. the problem being spaces in directory names... but don't even bother with it; just uninstall, or move it to C:\TCOSD :banghead:

I was just using that as an illustration of web-scraping... how do you want to attack your issue? Where do you want your VBA code run from - Excel, and then populate one or more worksheets...

I'll put together a basic href-grab for the 'boards' as referenced in the URL I posted and post the xls w/code later this afternoon [EDT]....

P.S. thank you for the Beta Test:friends:

YellowLabPro
09-13-2007, 07:52 AM
Stan,
Thanks, ohhhhhhhh, have to laugh out loud about the beta testing, really....
OK, I will do that. I have to run out for a while, taking the pup to the vet. Then I will re-run setup. It really is an impressive piece you wrote.

To answer your questions-
1) Yes, run from Excel.
2) I have a workbook/sheet that I am building, TGSProductsAttribPrep.xls, which contains all the attributes of the listed products. So saving to that workbook seems a logical answer. However, I will have multiple tabs, one for each category: Snowboards, Bindings, Boots, and other related items.... Then there will be Shoes, Sandals, then Clothing items...
So depending on your thoughts on the grand picture, if another workbook makes better use for specific product specifications and descriptions, then that route is OK with me. We can just shove all of the related workbooks into one folder.

Thanks Stan,

Doug

YellowLabPro
09-13-2007, 07:53 AM
And again, I have no experience here yet; Access may be a course to take.....

Norie
09-13-2007, 09:23 AM
Doug

I really don't see how we can give specific help here without a URL(s).

Doing this sort of thing is really specific to the particular site, how it's designed and what data you want to extract.

malik641
09-13-2007, 11:47 AM
I really don't see how we can give specific help here without a URL(s).
Doug gave 2 specific websites to work off of. What do you mean?

YellowLabPro
09-13-2007, 11:54 AM
Stan,
Not rushing you at all, but I am back and whenever you need anything feel free to shoot it my way.

Cheers,

Doug

Norie
09-13-2007, 12:00 PM
Joseph

Have you looked at those links?:)

Perhaps what I meant is that we need to know a little bit more about what data needs to be extracted and what needs to be input.

malik641
09-13-2007, 12:40 PM
Yes, I have looked at those links, and I don't know what's wrong with them.

I don't do much development with extracting data from websites/URLs, so what makes a "good" URL versus a "bad" one? :dunno

Norie
09-13-2007, 12:52 PM
Joseph

The first two links, for me anyway, just lead to a page asking me to make a language selection.

The third leads to a page containing information on a particular snowboard.

And it looks to me that to manually reach that page you would have to follow various links. eg Snowboard>Snowboards>All Mountain Snowboards...

What links/data is actually needed to get to that particular page?

And what data needs to be extracted from it?

Is it the Technical Features? The Description?

I'm not quite sure what you mean by a good/bad URL; the point is that to do this kind of thing it's important to know how the site/page is designed.

I've taken a look at the source for the 3rd URL and I'll post back later with some code that will extract the technical specification table to Excel.:)

Here goes.


Sub Test()

    Dim ie As Object
    Dim doc As Object

    Set ie = CreateObject("InternetExplorer.Application")
    With ie
        .Visible = True
        .Navigate "http://www.backcountry.com/store/BUR1485/c7/s49/Burton-Custom-X-Snowboard.html"
        Do Until .ReadyState = 4: DoEvents: Loop
        Do While .Busy: DoEvents: Loop
        Set doc = .Document
        GetOneTable doc, 3
        .Quit
    End With

End Sub

Sub GetOneTable(d As Object, n As Long)
    ' d is the document
    ' n is the table to extract (counting TABLE elements from 1)
    Dim e As Object ' the elements of the document
    Dim t As Object ' the table required
    Dim r As Object ' the rows of the table
    Dim c As Object ' the cells of the rows
    Dim rng As Range
    Dim I As Long
    Dim J As Long
    Dim tabno As Long
    Dim nextrow As Long
    For Each e In d.all
        If e.nodeName = "TABLE" Then
            J = J + 1
        End If
        If J = n Then
            Set t = e

            tabno = tabno + 1   ' running count of tables written out
            nextrow = nextrow + 1
            Set rng = Range("A" & nextrow)
            For Each r In t.Rows
                For Each c In r.Cells
                    rng.Value = c.innerText
                    Set rng = rng.Offset(, 1)
                    I = I + 1
                Next c
                nextrow = nextrow + 1
                Set rng = rng.Offset(1, -I)
                I = 0
            Next r
            Exit For
        End If
    Next e

End Sub

stanl
09-13-2007, 12:57 PM
Stan,
Not rushing you at all, but I am back and whenever you need anything feel free to shoot it my way.


Like I said, the best work is the pre-parsing. The attached workbook has a simple {uncommented} sub that will select the links to the boards [at least the first 33 :devil2: ] and place them as hyperlinks [I double-spaced them]. What remains is to write a loop to process them and extract the necessary data from each page. Again, how the original URL is even chosen, or how the links are processed, involves more pre-pre-parsing... but that is the logic in my previous post.

(1) you could treat each hyperlinked page via WebQuery [messy and not recommended, but workable]
(2) Use HTTP and xPath to get both the description and the image from each link, stored in a DB table

.02

stanl
09-13-2007, 12:58 PM
..oops here is the xls

malik641
09-13-2007, 01:04 PM
I'm not quite sure what you mean by a good/bad URL; the point is that to do this kind of thing it's important to know how the site/page is designed. I couldn't tell what you were getting at, at first. When you said "Have you looked at those links?" I just figured there was something terribly wrong with them (like they didn't exist or something... because I didn't look at them until you asked). When I found out that they were fully functional, I wasn't sure what you were implying.

Ok. I see now. Thanks Norie, that was a much clearer explanation for me :)

Thanks for the code. I will look at it later.

YellowLabPro
09-13-2007, 02:19 PM
Stan,
Works nice. Builds a nice list as you said.

Then as needed, use HTTP to collect relevant facts, images... which could be placed into Excel, or an Access Table.

Which gets the specific items you are interested in... the challenge is to parse the entire catalog by page, class and subclass
I am at your mercy.... you will need to tell me what is needed along the way.
If I provide a list of specific items that I would like to collect data on, and what data to collect, i.e. size, description, etc., will this be of any help?

YellowLabPro
09-13-2007, 02:35 PM
Norie,
For the record, I did list a few of the specific data pieces I want to extract in the initial post: product size specs, description....

If there are details I need to be more specific on please feel free to let me know.

Your code did a very good job of pulling the size for that particular item. In reading through the code, it appears to use tabs to indicate what data to pull. In addition to the specs of the item, which your code pulls, I would want the description. In your first post you describe not having links to the item. That is a question I have too- how to assemble and locate the links for what we want to extract against.

Best Regards,

Doug

Norie
09-13-2007, 02:59 PM
Doug

Sorry but you've lost me again.:bug:

The code I posted gets data from a specific table, nothing to do with tabs.:)

And to find that table I had to look at the source of the page.

By the way Stan's code didn't actually seem to work for me.

I was probably doing something wrong though.:doh:

YellowLabPro
09-13-2007, 03:01 PM
From your code: I was guessing this was a tab on the page, and that is why the item size specs were extracted and not the description one column over to the right on the page.

tabno = tabno + 1

YellowLabPro
09-13-2007, 03:03 PM
Stan's code, from what I understood, does nothing more at this point than retrieve a list of items. From there we would need to build a desired extraction list; I think this is what he meant by the pre-pre-parsing.

Norie
09-13-2007, 03:09 PM
Doug

Just retried (is that a word or should I just be retired) Stan's code and it does extract URLs for the various snowboards.

Is that what you need/want?

Probably my impatience causing the problem.:)

YellowLabPro
09-13-2007, 03:13 PM
Norie,
Are you asking if the URL is what I want? If so, then no. I want the product information. Your code currently extracts a very important portion of what I want, but not all. I still need the description.
Also- was tabno for a Tab on the page?

YellowLabPro
09-13-2007, 03:14 PM
And no- not ready for you to retire (too much for me to learn still)- put out to pasture as we say around here.... :)

stanl
09-13-2007, 03:14 PM
Doug

Just retried (is that a word or should I just be retired) Stan's code and it does extract URLs for the various snowboards.

Is that what you need/want?


Prolly a little of both. (1) a routine like mine to obtain a list of valid URLs for a given item, (2) for each URL in the list, call a routine like yours ?????:dunno

YellowLabPro
09-13-2007, 03:16 PM
Stan,
Great- that is an issue- discovering valid URLs...

Norie
09-13-2007, 03:23 PM
Stan

Not quite sure what you mean.:)

The code you posted does extract URLs but it doesn't actually return any data, other than the links.

I'm sure that either your code or mine could be adapted to do so.

Quickbeam
10-10-2007, 12:13 PM
Norie,

Thanks for the great code to use Excel to extract table data from a webpage. I have been looking for this for a long time. I'm glad Google finally led me here.

Digita
10-10-2007, 07:03 PM
Hi Norie,

Thanks for the cool scripts you shared in this post. This is exactly what I am after. You mentioned that you looked at the source code of the particular webpage to find out which table to extract the data from. I tried to view the source code of that page but couldn't make heads or tails of it. Would you please elaborate more on this point?

Thanks in advance.

Regards


KP

Quickbeam
10-15-2007, 02:25 PM
You need to find which "<table>" element contains the data that you want. In Norie's code you then need to pass the number of that table to the function.

If a web page has a lot of tables, I view source, find the table I want, and then copy all of the code from that point up to the start of the file. I then paste it into Word and do a Find All for "<table>". That tells you the number of the table you are looking for.

Digita
10-15-2007, 11:20 PM
Thanks Quickbeam. From the VBA code Norie provided, the info is extracted from table #3 (my guess, which could be wrong). On perusal of the source code of the page, my guess is that the standard coding for each table starts with:


<table width = ......
and ends with:


</table>

Going through the source code, the details appear to come from table 2. I'm still confused. Would some expert provide some hints on how to work out the table number?

Thanks in advance for your kind input.

Regards


KP

stanl
10-16-2007, 03:29 AM
Would some expert provide some hints on how to work out the table number?


Tables in the DHTML hierarchy are zero-based (confusing, since they are 1-based when performing an Excel Web Query). Assuming oIE is your InternetExplorer object, something like

Set nTables = oIE.Document.Body.GetElementsByTagName("TABLE")

which can be used in a logical test:


'pseudo-code
If nTables.length = 0 Then
    'no tables on the page
Else
    For i = 0 To nTables.length - 1
        'process each table
    Next
End If


note: length is used, not Count. You can then iterate individual sub-objects such as <TR> or <TD> and check for attributes, i.e.

If nTables.Item(i).GetAttribute("name") <> "" Then

.02 Stan

Brandtrock
10-16-2007, 04:11 AM
Nate Oliver does a nice job explaining some basics for this kind of thing here. (http://www.mrexcel.com/board2/viewtopic.php?t=180837&postdays=0&postorder=asc&&start=0)

Regards,

Quickbeam
10-16-2007, 02:59 PM
The data that is being pulled is from the 3rd table, which has the features.

Below is the first line of that table from the link (I can't post live links yet):

backcountry.com/store/BUR1485/c7/s49/Burton-Custom-X-Snowboard.html

<table class="features" cellspacing="0">