Not able to get HTML text including formatting



musicgold
12-20-2010, 09:30 AM
Hi,

I am trying to use the following VBA code to get specific text from a webpage.
I am able to get the text, but not the formatting information. The formatting used on the webpage is essential, e.g. the bolded part of the word. How can I achieve that?


...
IE.navigate "http://dictionary.reference.com/browse/principle"
Do While IE.Busy: DoEvents: Loop

Set htmlDoc = IE.document
Set htmlColl = htmlDoc.getElementsByTagName("SPAN")

For Each hinput In htmlColl

If hinput.className = "pron" Then 'the pronunciation of the word

ocell.Offset(0, 1).Value = hinput.innerText
...

Thanks,

MG.

Shred Dude
12-20-2010, 10:31 AM
After reviewing the HTML, I'd say you'll have to iterate through the pieces of the SPAN, grab the format of each one, and convert that to something Excel can understand.

So after you find the Span with the pronunciation, break it down and loop it:

If hinput.className = "pron" Then 'the pronunciation of the word

    s = Split(hinput, "</span>")
    ReDim data(0)
    For sp = LBound(s) To UBound(s)
        ReDim Preserve data(sp)
        'build array of innerText pieces and their corresponding format
        data(sp) = s(sp).className & "|" & s(sp).innerText
    Next sp

    'write the array data to the sheet
    'use split columns to separate into two pieces
    'build routine to format pieces accordingly
    'concatenate formatted pieces..

    'etc.

musicgold
12-21-2010, 08:24 PM
Shred Dude,

Thanks. I tried your suggestions, but I am getting an error at the red line in the following code. This is my complete code. I have never used the Split function before. I also observed a strange thing in this subroutine:
I couldn't use the 'Run to Cursor' command to go directly to the line If hinput.className = "pron". The program just terminates without any error.

At the red line, the value of sp stays 0. That means no array is being created by the Split function.


Public Sub Dictionary()
    Dim ocell As Range
    Dim IE As New SHDocVw.InternetExplorer
    Dim Ticker As String
    Dim htmlDoc As MSHTML.HTMLDocument
    Dim htmlInput As MSHTML.HTMLInputElement
    Dim htmlColl As MSHTML.IHTMLElementCollection
    Dim i, sp As Integer
    Dim s

    Set IE = CreateObject("InternetExplorer.Application")
    IE.Visible = 1

    For Each ocell In Selection

        IE.navigate "http://dictionary.reference.com/browse/" & ocell.Value
        Do While IE.Busy: DoEvents: Loop

        Set htmlDoc = IE.document
        Set htmlColl = htmlDoc.getElementsByTagName("SPAN")

        For Each hinput In htmlColl

            If hinput.className = "pron" Then

                s = Split(hinput.outerHTML, "</span>") ' I tried just hinput, as well as innerText, outerText, and innerHTML here.

                ReDim data(0)
                For sp = LBound(s) To UBound(s)
                    ReDim Preserve data(sp)
                    'build array of innerText pieces and their corresponding format
                    data(sp) = s(sp).className & "|" & s(sp).innerText ' <-- the red line: the error occurs here
                Next sp

                i = 0

                For sp = LBound(data) To UBound(data)
                    ocell.Offset(0, i).Value = data(sp)
                    i = i + 1
                Next sp

                GoTo Loopback

            End If

        Next

Loopback:
    Next

End Sub

Shred Dude
12-22-2010, 09:09 AM
Try all caps on the delimiter in the Split function. Your HTML may not contain a lowercase "span".

s = split(hinput.outerHTML, "</SPAN>")
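
If the case of the tags is uncertain, Split also takes an optional compare argument, so the delimiter does not have to be guessed. The following is only a minimal sketch: the Sub name and the sample markup are made up for illustration, and in the real macro the string would come from hinput.outerHTML.

```vba
' Hypothetical demo: split an HTML string on "</span>" regardless of letter case.
Public Sub SplitSpansIgnoringCase()
    Dim html As String
    Dim parts As Variant
    Dim n As Long

    ' Illustrative markup only (an assumption, not copied from the real page).
    html = "<SPAN class=pron>prin<SPAN class=boldface>suh</SPAN>puhl</SPAN>"

    ' vbTextCompare makes the delimiter match "</span>" and "</SPAN>" alike.
    parts = Split(html, "</span>", -1, vbTextCompare)

    For n = LBound(parts) To UBound(parts)
        ' Each element is a plain String, so it has no properties to read.
        Debug.Print n, parts(n)
    Next n
End Sub
```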

Shred Dude
12-22-2010, 09:14 AM
Just took a closer look. This isn't going to work. s(sp) is a String, not an object, so it has no className or innerText property. You'll need to take the Span object with class name "pron", get its child Spans as another HTML element collection, and then iterate through that collection.


The example of your HTML I saw had multiple Spans within the Span you were getting to.

Then pull the classname property of each span etc.

For Each hinput In htmlColl

    If hinput.className = "pron" Then

        'Set is required because the collection is an object
        Set newColl = hinput.getElementsByTagName("SPAN")
        For Each s In newColl
            'examine the pieces, etc.
        Next s

    End If

Next hinput
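
Once the child Spans are available, one way to carry the bolding over to Excel is to write the full text to the cell and then bold the matching characters with Range.Characters. This is a rough sketch rather than a tested solution: the Sub name WritePronWithBold is hypothetical, and the class name "boldface" for the bolded part is an assumption that should be checked against the page's actual HTML.

```vba
' Sketch only: writes the pronunciation to a cell and bolds the emphasized part.
' Assumption: the bolded text sits in child SPANs whose class is "boldface".
Public Sub WritePronWithBold(ByVal pronSpan As Object, ByVal target As Range)
    Dim child As Object
    Dim fullText As String
    Dim piece As String
    Dim searchFrom As Long
    Dim pos As Long

    ' Write the plain text first and clear any previous bolding.
    fullText = pronSpan.innerText
    target.Value = fullText
    target.Font.Bold = False

    searchFrom = 1
    For Each child In pronSpan.getElementsByTagName("SPAN")
        If child.className = "boldface" Then
            piece = child.innerText
            If Len(piece) > 0 Then
                ' Locate this piece in the cell text, searching left to right.
                pos = InStr(searchFrom, fullText, piece, vbBinaryCompare)
                If pos > 0 Then
                    target.Characters(pos, Len(piece)).Font.Bold = True
                    searchFrom = pos + Len(piece)
                End If
            End If
        End If
    Next child
End Sub
```

Inside the loop over htmlColl it could be called as WritePronWithBold hinput, ocell.Offset(0, 1). Other formats (italics, superscript) would need their own class-name checks along the same lines.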