PDA

View Full Version : Get data from HTML files



swaggerbox
09-30-2014, 02:51 AM
I have over a hundred HTML files in a folder.

What I need to do is get the following data: Data1, Data2 and Data3 from the different sections in an HTML file.

Data2 is the value between the span class="dictionaryOovHighlight"> and </span>, Data3 is the value of class="frmInput200px jTranslated", while Data1 is the value between <div> and </div> which includes <span class="dictionaryOovHighlight">.

I need to place these data into columns A, B, and C in an Excel sheet (named Sheet1).

Column A:
<div>Data1<span class="dictionaryOovHighlight">Data2</span> Data1</div>

Column B:
<span class="dictionaryOovHighlight">Data2</span>

Column C:
class="frmInput200px jTranslated" value="Data3">

See attached screenshot of the Excel file and a sample html file.

How do I do this via VBA?

westconn1
09-30-2014, 06:16 AM
you can try this to see if it returns what you require

Set wb = CreateObject("internetexplorer.application")
wb.navigate2 "c:\temp\2014-08-29.htm"
Set doc = wb.document
Set spans = doc.getElementsByTagName("span")
rw = 1
For Each s In spans
If s.className = "dictionaryOovHighlight" Then
ActiveSheet.Cells(rw, 1).Value = s.parentElement.innerText
ActiveSheet.Cells(rw, 2).Value = s.innerText
Set trans = s.parentElement.parentElement.parentElement
ActiveSheet.Cells(rw, 3).Value = trans.all(7).Value
rw = rw + 1
End If

Nexti tested very briefly, but you should check results carefully

snb
09-30-2014, 06:21 AM
Sub M_snb()
c00 = CreateObject("scripting.filesystemobject").opentextfile("G:\OF\20140829.htm").readall

sn = Filter(Split(Replace(c00, "<div>", "</div>"), "</div>"), " class=""dictionaryOovHighlight"">")
ReDim sq(UBound(sn), 2)

For j = 0 To UBound(sn)
sp = Split(sn(j), "<span class=""dictionaryOovHighlight"">")
sq(j, 0) = sp(0)
sq(j, 1) = Split(sp(1), "</span>")(0)
sq(j, 2) = Split(sp(1), "</span>")(1)
Next

Cells(1).Resize(UBound(sq) + 1, 3) = sq
End Sub

or


Sub M_snb()
On Error Resume Next
j = 0

With CreateObject("htmlfile")
.body.innerHTML = CreateObject("scripting.filesystemobject").opentextfile("G:\OF\20140829.htm").readall
ReDim sq(.getElementsByTagName("span").Length, 2)
For Each it In .getElementsByTagName("span")
If it.className = "dictionaryOovHighlight" Then
sq(j, 0) = it.parentElement.innerText
sq(j, 1) = it.innerText
sq(j, 2) = it.parentElement.parentElement.parentElement.all(7).Value
j = j + 1
End If
Next
End With

Cells(1).Resize(UBound(sq) + 1, 3) = sq
End Sub

swaggerbox
10-01-2014, 12:02 AM
you can try this to see if it returns what you require

Set wb = CreateObject("internetexplorer.application")
wb.navigate2 "c:\temp\2014-08-29.htm"
Set doc = wb.document
Set spans = doc.getElementsByTagName("span")
rw = 1
For Each s In spans
If s.className = "dictionaryOovHighlight" Then
ActiveSheet.Cells(rw, 1).Value = s.parentElement.innerText
ActiveSheet.Cells(rw, 2).Value = s.innerText
Set trans = s.parentElement.parentElement.parentElement
ActiveSheet.Cells(rw, 3).Value = trans.all(7).Value
rw = rw + 1
End If

Nexti tested very briefly, but you should check results carefully

Hi Westconn,

I'm receiving a "Runtime error 438: Object doesn't support this property or method on the line"
ActiveSheet.Cells(rw, 3).Value = trans.all(7).Value
I have recently installed IE9 (previously used IE8) but still experienced this error.

swaggerbox
10-01-2014, 12:15 AM
Sub M_snb()
c00 = CreateObject("scripting.filesystemobject").opentextfile("G:\OF\20140829.htm").readall

sn = Filter(Split(Replace(c00, "<div>", "</div>"), "</div>"), " class=""dictionaryOovHighlight"">")
ReDim sq(UBound(sn), 2)

For j = 0 To UBound(sn)
sp = Split(sn(j), "<span class=""dictionaryOovHighlight"">")
sq(j, 0) = sp(0)
sq(j, 1) = Split(sp(1), "</span>")(0)
sq(j, 2) = Split(sp(1), "</span>")(1)
Next

Cells(1).Resize(UBound(sq) + 1, 3) = sq
End Sub

or


Sub M_snb()
On Error Resume Next
j = 0

With CreateObject("htmlfile")
.body.innerHTML = CreateObject("scripting.filesystemobject").opentextfile("G:\OF\20140829.htm").readall
ReDim sq(.getElementsByTagName("span").Length, 2)
For Each it In .getElementsByTagName("span")
If it.className = "dictionaryOovHighlight" Then
sq(j, 0) = it.parentElement.innerText
sq(j, 1) = it.innerText
sq(j, 2) = it.parentElement.parentElement.parentElement.all(7).Value
j = j + 1
End If
Next
End With

Cells(1).Resize(UBound(sq) + 1, 3) = sq
End Sub


Hi snb:

Data2 is correctly captured but Data1 is concatenated. Data1 includes Data2 as indicated above: <div>Data1<span class="dictionaryOovHighlight">Data2</span> Data1</div>

Data3 is missing: class="frmInput200px jTranslated" value="Data3">

swaggerbox
10-01-2014, 12:20 AM
12332

hi snb:

see snapshot out the output above. Column A is Data1+Data2 (concatenated, rest appearing in Column B and Column C), Column B is Data2 (correct), Column C is the value of the textbox (Data3)

snb
10-01-2014, 12:56 AM
Please do not quote.

If you analyze the code you must be able, using the same technique, to get what you want.

swaggerbox
10-01-2014, 02:15 AM
Hi snb:

sorry about that. I am still a newbie on VBA. Don't really know how to do it by myself.