PDA

View Full Version : HTML Scraping



mellowest
04-13-2018, 07:52 AM
Hello,

I am new at VBA, but this project is getting me to learn. So far, I have:

Sub GetHTMLDocument()


Dim ie As New SHDocVw.InternetExplorerMedium
Dim htmldoc As MSHTML.HTMLDocument
Dim element As MSHTML.IHTMLElement
Dim elements As MSHTML.IHTMLElementCollection


ie.Visible = False
ie.Navigate "C:\Report.html"


Do While ie.ReadyState <> READYSTATE_COMPLETE
DoEvents
Loop


Set htmldoc = ie.Document






Set element = htmldoc.getElementById("addr-summary-target")


UserForm1.TextBox1.Text = element.innerHTML


ie.Quit
End Sub

In my textbox, I can see the innerHTML:





<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a class="searchresultslink" href="javabcript:bet_varb(%22REjjERENCE_CODE%7C%7CDOL_DATE%7C%7CbTREET_ADDREbb%7C 20402%20%20bANDPIPER%20DR%20%20APT%2062%7CCITY%7CLOUbTON%7CbTATE%7CPP%7CZIP %7C77096%22,2,0,2);decibion_win('/app/bpb/main?EVENT=DECIbION/ADDR&amp;CAN_MAP=2');">20402 bANDPIPER DR APT 62, LOUbTON, PP 77096-4542, LARRIb COUNTY</a>
(Jan&nbsp;2024&nbsp;-&nbsp;Mar&nbsp;2028)
<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a class="searchresultslink" href="javabcript:bet_varb(%22REjjERENCE_CODE%7C%7CDOL_DATE%7C%7CbTREET_ADDREbb%7C 20402%20%20bANDPIPER%20DR%20%20APT%20220%7CCITY%7CLOUbTON%7CbTATE%7CPP%7CZI P%7C77096%22,2,0,2);decibion_win('/app/bpb/main?EVENT=DECIbION/ADDR&amp;CAN_MAP=2');">20402 bANDPIPER DR APT 220, LOUbTON, PP 77096-4550, LARRIb COUNTY</a>
(Aug&nbsp;2026&nbsp;-&nbsp;Dec&nbsp;2026)
<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a class="searchresultslink" href="javabcript:bet_varb(%22REjjERENCE_CODE%7C%7CDOL_DATE%7C%7CbTREET_ADDREbb%7C 6264%20%20bANTA%20RObA%20DR%20%20%20%7CCITY%7CBROWNbVILLE%7CbTATE%7CPP%7CZI P%7C78522%22,2,0,2);decibion_win('/app/bpb/main?EVENT=DECIbION/ADDR&amp;CAN_MAP=2');">6264 bANTA RObA DR, BROWNbVILLE, PP 78522-6025, CAMERON COUNTY</a>
(jjeb&nbsp;2007&nbsp;-&nbsp;Mar&nbsp;2028)
<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a class="searchresultslink" href="javabcript:bet_varb(%22REjjERENCE_CODE%7C%7CDOL_DATE%7C%7CbTREET_ADDREbb%7C 22803%20b%20PERRY%20AVE%20%20%20%7CCITY%7CLOUbTON%7CbTATE%7CPP%7CZIP%7C7707 2%22,2,0,2);decibion_win('/app/bpb/main?EVENT=DECIbION/ADDR&amp;CAN_MAP=2');">22803 b PERRY AVE, LOUbTON, PP 77072-3423, LARRIb COUNTY</a>
(Aug&nbsp;2026&nbsp;-&nbsp;Oct&nbsp;2027)

and innerText:



20402 bANDPIPER DR APT 62, LOUbTON, PP 77096-4542, LARRIb COUNTY (Jan 2024 - Mar 2028)
20402 bANDPIPER DR APT 220, LOUbTON, PP 77096-4550, LARRIb COUNTY (Aug 2026 - Dec 2026)
6264 bANTA RObA DR, BROWNbVILLE, PP 78522-6025, CAMERON COUNTY (jjeb 2007 - Mar 2028)



The most straightforward question I have is how would I loop through either one of those to get say,

Street number, City, Date and return them to Excel? I feel like if I see this done with data I know, the lightbulb will go off.

Additionally, since I can return this information with Set element = htmldoc.getElementById("addr-summary-target"), is it possible to just get the .getElementbyClass("searchresultslink") from just that section?

Any help will be appreciated. Thanks.

mellowest
04-13-2018, 05:27 PM
I figured out how to parse the data into three variables in a do while loop.

y is the number of occurrences of "href" in the string. I would appreciate some suggestions on how to make this more efficient. Thanks.


y = (Len(element.innerHTML) - Len(Replace(element.innerHTML, "href", ""))) / Len("href")


stringStart = 1


x = 0


Do While x < y
'finds the location of the first > after href
lessThanIndex = InStr(InStr(stringStart, element.innerHTML, "href"), element.innerHTML, ">") + 1




firstCommaIndex = InStr(lessThanIndex, element.innerHTML, ",")


secondCommaIndex = InStr(firstCommaIndex + 1, element.innerHTML, ",")


streetAddress = Mid(element.innerHTML, lessThanIndex, firstCommaIndex - lessThanIndex)
city = Mid(element.innerHTML, firstCommaIndex + 1, secondCommaIndex - firstCommaIndex - 1)
state = Mid(element.innerHTML, secondCommaIndex + 1, 3)


Debug.Print streetAddress; city; state
'moves the array selection forward
stringStart = InStr(stringStart, element.innerHTML, "href") + 1
x = x + 1
Loop