PDA

View Full Version : Problems grabbing hrefs in HTML docs



mlarsen
05-15-2008, 09:28 AM
Hello,

I have code here that works perfectly fine. It searches through a folder and subfolders for any html or htm file, then searches each file it finds for any hrefs and prints them to the spreadsheet. I am having issues with some though. It seems to find the hrefs that are on 1 line in the source code, but the ones where there are multiple hrefs within a <P></P> tage its not grabbing. Not sure what the problem is considering in the code I am searching for ANY instance of href="

I have included my code below...

Function GetURLAddress(ByVal TextToSearch) As String
Dim Start, Length
Start = InStr(1, TextToSearch, "href=", vbTextCompare) + 6
Length = InStr(1, TextToSearch, """ ", vbTextCompare) - Start
If Length < 0 Then
Length = InStr(1, TextToSearch, "'>", vbTextCompare) - Start
End If
If Length < 0 Then
GetURLAddress = ""
Else
GetURLAddress = LTrim(RTrim(Mid(TextToSearch, Start, Length)))
End If
End Function

Private Function GetHrefs(ByVal html, ByVal strFilex, ByVal strTitlex)
Dim re, matches, match, d, uri, name, r As Long, c As Range, Lrow As Long
Dim saveLink
Dim iRet
saveLink = False

Set re = CreateObject("vbscript.regexp")
re.Pattern = "<a\s+.*?href=[""\']?([^""\' >]*)[""\']?[^>]*>(.*?)<\/a>"
*do you think this is where the issue is?*
re.IgnoreCase = True
re.MultiLine = True
re.Global = True
Set matches = re.Execute(html)
For Each match In matches
iRet = InspectLink(GetURLAddress(match))
If (iRet > 0) Then
Cells(Globalindx, 2) = strFilex
Cells(Globalindx, 3) = strTitlex
Cells(Globalindx, 4) = GetURLTitle(match)
Cells(Globalindx, 5) = GetURLAddress(match)
Cells(Globalindx, 6) = GetType(iRet)

Globalindx = Globalindx + 1
End If


Next
Set matches = Nothing
Set re = Nothing

End Function

RonMcK
05-15-2008, 01:05 PM
m,

When you want to show us vba code please click the VBA button to create a pair of vba tags, then, and only then, paste your code between the tags; this will preserve the readability of your code. Any way, here is your code, cleaned up:
Option Explicit

Function GetURLAddress(ByVal TextToSearch) As String
Dim Start, Length
Start = InStr(1, TextToSearch, "href=", vbTextCompare) + 6
Length = InStr(1, TextToSearch, """ ", vbTextCompare) - Start
If Length < 0 Then
Length = InStr(1, TextToSearch, "'>", vbTextCompare) - Start
End If
If Length < 0 Then
GetURLAddress = ""
Else
GetURLAddress = LTrim(RTrim(Mid(TextToSearch, Start, Length)))
End If
End Function

Private Function GetHrefs(ByVal html, ByVal strFilex, ByVal strTitlex)
Dim re, matches, match, d, uri, name, r As Long, c As Range, Lrow As Long
Dim saveLink
Dim iRet
saveLink = False

Set re = CreateObject("vbscript.regexp")
re.Pattern = "<a\s+.*?href=[""\']?([^""\' >]*)[""\']?[^>]*>(.*?)<\/a>"
' *do you think this is where the issue is?* (umm, no.)
re.IgnoreCase = True
re.MultiLine = True
re.Global = True
Set matches = re.Execute(rehtml)
For Each match In matches
iRet = InspectLink(GetURLAddress(match))
If (iRet > 0) Then
Cells(Globalindx, 2) = strFilex
Cells(Globalindx, 3) = strTitlex
Cells(Globalindx, 4) = GetURLTitle(match)
Cells(Globalindx, 5) = GetURLAddress(match)
Cells(Globalindx, 6) = GetType(iRet)
Globalindx = Globalindx + 1
End If
Next
Set matches = Nothing
Set re = Nothing

End Function

I believe your problem is contained with in the loop: For Each match in matches ... Next. Each 'match' is evaluated only once so a 2nd, 3rd, ... nth href= are not handled. It would be good when GetURLAddress() evaluates 'match' if it would return a remainder sring with all text to the right of the URL addy that it parsed out. You could then loop back, test the string and process it again if it still has an href= in it, repeating the process until all href=s on the line have been added to the array.

HTH,

T-J
05-15-2008, 01:40 PM
Can also be done by parsing the HTML using the HTML Object Library:

Use Tools > References and set a reference to Microsoft Internet Controls and Microsoft HTML Object Library


Sub Get_hRefs()
Dim myIE As InternetExplorer 'Microsoft Internet Controls
Dim myHTMLPage As HTMLDocument 'Microsoft HTML Object Library
Dim hRef As HTMLAnchorElement

Set myIE = New InternetExplorer
myIE.Navigate URL:="your_html_file.html" 'a web page --CHANGE THIS LINE
myIE.Visible = True
Do
DoEvents
Loop While myIE.Busy Or myIE.readyState <> READYSTATE_COMPLETE

Set myHTMLPage = myIE.document
For Each hRef In myHTMLPage.getElementsByTagName("A")
Debug.Print hRef, hRef.innerText
Next hRef

myIE.Quit
End Sub