DOM: Searching only the tags embedded within one particular div tag



vlina
02-04-2011, 12:10 PM
Hello,

My code goes to a website, does a search, and returns a title, abstract, and reference of each of the 5 results on the page.

Is there a way to search tags (with particular classNames) ONLY within each of the 5 "result" div tags?

The html goes something like:

<div class="all_results">

<div class="result">

<p class="reference">reference goes here</p>
<h1 class="title">title goes here</h1>
<p class="auth_list">authors go here</p>


<div class="abstract">
<p>abstract goes here</p>
</div>

</div>

<div class="result"> (this would be the 2nd result)

....
For each "div" tag with className "result", search only in that tag for the "p" tag with className "reference", the "h1" tag with className "title", and the sub-"div" tag with className "abstract".

I tried grabbing the innerHTML of each result and assigning it to a variable (called resultCode) and making that variable an HTMLDocument, string, etc. This way I could use that as the "ie.document" in "ie.document.all.tags("p")" and search only there. But it didn't work (wrong variable type).

The problem with my current code (below) is that it searches for "p", "h1" and "div" (with className abstract) in the entire document, not the current "result" div tag. So, it returns the first title, abstract, and reference 5 times instead of each of the 5 once.



' Each variable needs its own As clause; an untyped name defaults to Variant
Dim varTagP As Variant, varTagsP As Variant
Dim varTagH As Variant, varTagsH As Variant
Dim varTagDIV As Variant, varTagsDIV As Variant

Dim numDIVtags As Long, i As Long, m As Long
Dim theReference As String, theTitle As String, theAbstract As String
Dim resultCode As String 'also tried HTMLDocument, etc.

numDIVtags = ie.document.all.tags("DIV").Length

' The collection is zero-based, so the last valid index is Length - 1
For i = 0 To numDIVtags - 1

    If ie.document.all.tags("DIV")(i).className = "result" Then

        Debug.Print "There is a result on the " & i & "th div tag"

        'get reference from result
        '(this searches the whole document, not just the current result div)
        Set varTagsP = ie.document.all.tags("P")
        For Each varTagP In varTagsP
            If varTagP.className = "reference" Then
                theReference = varTagP.innerText
                Debug.Print theReference
                Exit For
            End If
        Next

        'get title from result
        Set varTagsH = ie.document.all.tags("H1")
        For Each varTagH In varTagsH
            If varTagH.className = "title" Then
                theTitle = varTagH.innerText
                Debug.Print theTitle
                Exit For
            End If
        Next

        'get abstract from result
        Set varTagsDIV = ie.document.all.tags("DIV")
        For Each varTagDIV In varTagsDIV
            If varTagDIV.className = "abstract_text" Then
                theAbstract = varTagDIV.innerText
                Debug.Print theAbstract
                Exit For
            End If
        Next
    End If
Next



I've looked for posts similar to this for many hours without luck. Any advice would be very much appreciated.

Thank you for your time.

JP2112
02-04-2011, 12:43 PM
As a start, check out the document.getElementsByClassName Method. According to MSDN, that will return "a collection of objects with the same CLASS attribute value."

I don't see how you are grabbing the webpage, but once I've scraped the page I would do something like this:


Dim html As Object ' MSHTML.HTMLDocument
Dim resultClasses As Object ' MSHTML.IHTMLElementCollection
Dim resultClass As Object ' MSHTML.IHTMLElement
Set html = CreateObject("htmlfile") ' New MSHTML.HTMLDocument
html.body.innerHTML = result ' result: String holding the scraped page HTML
Set resultClasses = html.getElementsByClassName("result")
For Each resultClass In resultClasses
    ' do what you want here
Next resultClass


But maybe if you explain what your end goal is (instead of the particular step you're stuck on), someone can suggest better code to accomplish it.

JP2112
02-04-2011, 12:47 PM
I forgot to mention that document.getElementsByClassName is available in IE9 / HTML5, and my understanding is that neither has been released yet.
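
In the meantime, one possible fallback for older IE versions is to filter a tag collection by className. This is only a minimal sketch: ElementsByClass is a hypothetical helper name, not an MSHTML method, and it assumes the object passed as root (the document or any element) supports getElementsByTagName.

```vba
' Sketch only: ElementsByClass is a hypothetical helper, not part of MSHTML.
' Returns a Collection of the <tagName> elements under root whose class
' attribute contains cls as one of its space-separated names.
Function ElementsByClass(ByVal root As Object, ByVal tagName As String, _
                         ByVal cls As String) As Collection
    Dim matches As New Collection
    Dim el As Object

    For Each el In root.getElementsByTagName(tagName)
        ' Pad both sides with spaces so "result" does not match "all_results"
        If InStr(1, " " & el.className & " ", " " & cls & " ", vbBinaryCompare) > 0 Then
            matches.Add el
        End If
    Next el

    Set ElementsByClass = matches
End Function
```

It could then be used in place of the missing method, for example `For Each resultClass In ElementsByClass(html, "div", "result")`.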

vlina
02-04-2011, 01:41 PM
Thanks for your quick reply, JP! It's too bad getElementsByClassName doesn't work in IE7; it sounds like the perfect solution.

The goal is: I have a list of authors, and I need to download the title, abstract, and reference of the last 5 works they published. The code searches a database for their name, changes some settings online so that it displays just the last 5 works, and now I want to scrape it and fill in the spreadsheet with the data (column 3 is the title, 4 is the abstract, etc.).

I grab the website simply by doing:


Dim ie As InternetExplorer ' needs a reference to Microsoft Internet Controls
Dim sUrl As String ' holds the address of the search page

Set ie = New InternetExplorer
ie.navigate sUrl


The rest is just entering the search term and changing some search options.

JP2112
02-04-2011, 02:04 PM
You can use document.getElementsByTagName("div") to grab all the div elements, then set an object reference where class="all_results". Something like


Dim html As Object ' MSHTML.HTMLDocument
Dim resultClasses As Object ' MSHTML.IHTMLElementCollection
Dim resultClass As Object ' MSHTML.IHTMLElement
Dim allResultsDiv As Object ' MSHTML.IHTMLElement
Set html = CreateObject("htmlfile") ' New MSHTML.HTMLDocument
html.body.innerHTML = pageSource ' pageSource: placeholder name for the String variable containing your result
Set resultClasses = html.getElementsByTagName("div")
For Each resultClass In resultClasses
    ' className is read directly; older IE versions do not resolve getAttribute("class")
    If resultClass.className = "all_results" Then
        Set allResultsDiv = resultClass
        Exit For
    End If
Next resultClass
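
To carry the same idea down to the individual results, getElementsByTagName can also be called on an element, in which case it returns only that element's descendants. The sketch below is an illustration under assumptions, not tested against the real page: FirstTextByClass and PrintResults are hypothetical names, and the class names ("result", "reference", "title", "abstract") are taken from the simplified markup in the first post, so they may differ on the live site.

```vba
' Sketch only: FirstTextByClass and PrintResults are hypothetical helpers.

' Returns the innerText of the first <tagName> element inside container
' whose className equals cls, or "" if there is no such element.
Function FirstTextByClass(ByVal container As Object, ByVal tagName As String, _
                          ByVal cls As String) As String
    Dim el As Object

    ' Called on an element rather than the document, getElementsByTagName
    ' only looks at that element's descendants.
    For Each el In container.getElementsByTagName(tagName)
        If el.className = cls Then
            FirstTextByClass = el.innerText
            Exit Function
        End If
    Next el
End Function

' Prints the reference, title and abstract of every "result" div in doc.
Sub PrintResults(ByVal doc As Object)
    Dim resultDiv As Object

    For Each resultDiv In doc.getElementsByTagName("div")
        If resultDiv.className = "result" Then
            Debug.Print FirstTextByClass(resultDiv, "p", "reference")
            Debug.Print FirstTextByClass(resultDiv, "h1", "title")
            Debug.Print FirstTextByClass(resultDiv, "div", "abstract")
        End If
    Next resultDiv
End Sub
```

With a loaded page this would be called as `PrintResults ie.document` (or with the htmlfile document above). Because each lookup starts from the current result div, every result should yield its own values rather than the first result's values repeated.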


This is a long shot, but does the website have an API? If they have an API it's usually easier to extract this sort of information. Can you share the URL?

vlina
02-07-2011, 11:58 AM
Thanks, JP. The website is ncbi.nlm.nih.gov/pubmed. I'm searching for a person's name (say "smith j[author]") and changing the Display Settings to show just the latest 5 results, and the format = Abstract.

The above code doesn't seem to be working. And in playing around with it, I'm finding that getElementsByTagName("div") isn't grabbing the results divs - it only gets the top and bottom of the page, oddly enough.

vlina
02-07-2011, 12:35 PM
Hi JP. I think I'm on to something, using your getElementsByTagName. Will reply back soon if it works/doesn't work.

Thanks again,
Natasha