PDA

View Full Version : [SOLVED:] Improving Code to Convert HTML to Text



Paleo
02-24-2005, 03:05 PM
Hi I have a code to convert html to text into an excel cell and wanna know if there is any better way of doing this.

This is my dumn code:



Sub Texto()
Dim Texto As String, Texto1 As String, Texto2 As String, Texto3 As String
Dim Pos As Integer, Pos1 As Integer, Pos2 As Integer
Dim bolNegr As Boolean, intCor As Integer
Texto = Range("A1")
Pos = InStr(1, Texto, ">", vbTextCompare)
Pos1 = InStr(1, Texto, "=", vbTextCompare)
Texto1 = Mid(Texto, Pos1 + 1, Pos - (Pos1 + 1))
Texto2 = Mid(Texto, Pos + 1, Len(Texto) - Pos)
Texto2 = Replace(Texto2, "</font>", "")
Texto2 = Replace(Texto2, "</b>", "")
Pos2 = InStr(1, Texto2, "<b>", vbTextCompare)
If Pos2 > 0 Then
bolNegr = True
Else
bolNegr = False
End If
Select Case Texto1
Case "black"
intCor = 1
Case "white"
intCor = 2
Case "red"
intCor = 3
Case "navy"
intCor = 5
Case "yellow"
intCor = 6
Case "pink"
intCor = 7
Case "cyan"
intCor = 8
Case "brown"
intCor = 9
Case "green"
intCor = 10
Case Else
intCor = 4
End Select
Texto3 = Mid(Texto2, Pos2 + 3, Len(Texto2) - Pos2 - 3)
With Range("A2")
.Value = Texto3
.Font.Bold = bolNegr
.Font.ColorIndex = intCor
End With
End Sub


I have at cell A1 the following:
<font color=navy><b>This is my first example!</b></font>

Sure its not a complete code but I think its enough for a start. If someone knows a better approach to that Select Case I would appreciate too.

Jacob Hilderbrand
02-24-2005, 04:14 PM
Looks good to me. I would just change the Integer data types to Long. Long is native to PCs and will run faster (though the speed boost would not be much).

Paleo
02-24-2005, 05:05 PM
Hi Jake,

doesnt Long consumes more memory than Integer? In a database the larger is the variable the bigger is the amount of memory it requires. Isnt it in excel too?

Jacob Hilderbrand
02-24-2005, 05:08 PM
Well Long is already used by your PC and Integer has to be custom created. So really it takes more work to make an Integer.

Long is faster and can holder larger values. Integer is basically worthless unless you are using another function that requires an Integer value as an Argument.

Long does use more memory 4 bits versus 2 bits for Integer, but since Integer must be created so your PC can understand it, and Long does not, the 2 bits you would save is not worth it.

Paleo
02-24-2005, 05:11 PM
Ok, great then, thanks. I will change it.

Len Piwowar
02-24-2005, 08:08 PM
Alternative method: Additions are in red text


Option Base 1
Sub TextoV2()
'Rem Option Base 1
Dim Texto As String, Texto1 As String, Texto2 As String, Texto3 As String
Dim Pos As Integer, Pos1 As Integer, Pos2 As Integer
Dim bolNegr As Boolean, intCor, x As Integer
Texto = Range("A1")
Pos = InStr(1, Texto, ">", vbTextCompare)
Pos1 = InStr(1, Texto, "=", vbTextCompare)
Texto1 = Mid(Texto, Pos1 + 1, Pos - (Pos1 + 1))
Texto2 = Mid(Texto, Pos + 1, Len(Texto) - Pos)
Texto2 = Replace(Texto2, "</font>", "")
Texto2 = Replace(Texto2, "</b>", "")
Pos2 = InStr(1, Texto2, "<b>", vbTextCompare)
If Pos2 > 0 Then
bolNegr = True
Else
bolNegr = False
End If
intCors = Array("black", "white", "red", "NoColor", "navy", "pink", "cyan", "brown", "Green")
For x = 1 To 10
If Texto1 Like intCors(x) Then
intCor = x
Exit For
Else
intCor = 4
End If
Next
Texto3 = Mid(Texto2, Pos2 + 3, Len(Texto2) - Pos2 - 3)
With Range("A2")
.Value = Texto3
.Font.Bold = bolNegr
.Font.ColorIndex = intCor
End With
End Sub

Paleo
02-24-2005, 09:04 PM
Hi,

interesting approach, thanks. I will test it for performance and then let you know if it has any difference.

mvidas
02-25-2005, 09:44 AM
I had no idea about the Long/memory thing, thanks Jake. Learning new things every day!

As for the question at hand, there are much easier ways to do what you need, Carlos.
What is your final goal for this?
Are you always going to be cleaning up single lines of code like your example above? Or are you going to be downloading websites and converting them to text? Or even opening an html file from your hard drive and stripping out the html tags that way?

If you explain what it is you're doing, I can change the code to suit your needs (or you can take it out of my example below), but here is something to strip the html out of an incoming string:


Private Declare Sub Sleep Lib "kernel32" (ByVal dwMilliseconds As Long)
Sub CarlosExample()
Range("A1") = "<font color=navy><b>This is my first example!</b></font>"
Msgbox StripHtml(Range("A1").Text)
End Sub

Function StripHtml(incomingText As String) As String
Dim ie As Object, fso As Object, txtF As Object
'Create temporary file on hard drive to hold incoming string
Set fso = CreateObject("Scripting.FileSystemObject")
Set txtF = fso.CreateTextFile("C:\vTEMPv.html")
txtF.WriteLine incomingText
txtF.Close
'Open the temporary file in internet explorer, and only take the text from it
Set ie = CreateObject("internetexplorer.application")
ie.Navigate "C:\vtempv.html"
Do Until ie.readystate = 4
Call Sleep(10)
Loop
StripHtml = ie.Document.body.innerText
ie.Quit
Set ie = Nothing
'Delete temporary file
fso.DeleteFile "C:\vtempv.html"
End Function

If you want just the text of an existing website (or existing html/htm file), you could use something like:


Private Declare Sub Sleep Lib "kernel32" (ByVal dwMilliseconds As Long)
Sub CarlosExample()
Range("A2") = StripHtmlURL("http://www.vbaexpress.com/forum/showthread.php?t=2009")
End Sub
Function StripHtmlURL(incomingURL As String) As String
Dim ie As Object
Set ie = CreateObject("internetexplorer.application")
ie.Navigate incomingURL
Do Until ie.readystate = 4
Call Sleep(10)
Loop
StripHtml = ie.Document.body.innerText
ie.Quit
Set ie = Nothing
End Function


Matt

Len Piwowar
02-25-2005, 11:39 AM
Workbook Instructions:
1) Select Html text run procedure TextoV3, the code has some error checking .
2) Converted Html Text is placed in cell to the right of the selected cell.

Note: Create AllColorsRange named range on ColorIndx Sheet Range("C 1:C44") this allows user to type in color names of their choice next to the color.

Module1:


Sub TextoV3()
Dim Texto, HtmlTxtColor, CurColorName, HtmlBold As String
Dim TextLenght, Pos, Pos1, intCor, MyClrIndx, x As Integer
Dim bolNegr, ColorFound As Boolean
Dim CurrentCell, AllMyColorsRng As Range
Dim cell As Variant
'Get Current Cell Html text
Set CurrentCell = Application.ActiveCell
Texto = CurrentCell.Value
If Texto = "" Then
MsgBox "Nothing in cell"
GoTo TheEnd
End If
TextLenght = Len(Texto)
'Get Color Setting from Html text
Pos = InStr(1, Texto, ">", vbTextCompare)
Pos1 = InStr(1, Texto, "=", vbTextCompare)
HtmlTxtColor = Mid(Texto, Pos1 + 1, Pos - (Pos1 + 1))
Set AllMyColorsRng = ThisWorkbook.Worksheets("ColorIdx").Range("AllColorsRange").Cells
For Each cell In AllMyColorsRng
CurColorName = cell.Value
If UCase(HtmlTxtColor) Like UCase(CurColorName) Then
ColorFound = True
intCor = cell.Row 'Html Color Name is Typed in Col(3) Using Same Row Number as the color's number
Exit For
End If
Next
If intCor = "" Then
MsgBox HtmlTxtColor & " ColorIndex name Not Found, Default color will be used!"
intCor = 4
End If
'Remove Color Settings from Html text string
Texto = Right(Texto, TextLenght - Pos)
TextLenght = Len(Texto)
'Get Bold Settings from Html text and check if exists
Pos = InStr(1, Texto, ">", vbTextCompare)
HtmlBold = Left(Texto, Pos)
Pos = InStr(1, HtmlBold, "b>", vbTextCompare)
If Not (Pos = 0) And (Pos = 2) Then
If Pos = 2 Then
bolNegr = True
ElseIf Not (Pos = 0) And (Pos = 3) Then
bolNegr = False
End If
Else
MsgBox "Bold settings not found !"
Exit Sub
End If
'Remove Bold Settings from Html text string
Texto = Mid(Texto, (Pos + 2), Len(Texto))
TextLenght = Len(Texto)
'Get Text From Html
Pos = InStr(1, Texto, "<", vbTextCompare)
HtmlText = Left(Texto, Pos - 1)
'Convert and Place in next cell over
With ThisWorkbook.Worksheets("Convert").Range(CurrentCell.Address).Offset(0, 1) 'Place in next cell
.Value = HtmlText
.Font.Bold = bolNegr
.Font.ColorIndex = intCor
End With
TheEnd:
End Sub

Paleo
02-25-2005, 12:13 PM
Hi Matt,

sounds like you got it done. I will teste it. The EMS creates a xls file with that code, in cells so I need a macro to strip then out. I think your is gonna do the job. Will test it.

mvidas
02-25-2005, 12:50 PM
The only thing mine doesnt do is color the text like you have in your original sub, all it does is remove all the tags.

Something you could try is programmatically exporting the EMS results to a text file, running the HTML stripper on the exported file, then reopen the txt file into excel. Let me know if you would need any help doing that
Matt

Paleo
02-25-2005, 12:54 PM
Hi Matt,

actually I just need to put the colors in it now, but a little worse. I need to create a macro that runs for every color. Any idea better than an array?

mvidas
02-25-2005, 01:11 PM
Run this CellColors sub before running the html stripping macro, should take care of the color thing for you. I added a portion to bold the <b> cells as well, but commented that out in case you already took care of that.


Sub CellColors()
Dim vColors, vColor, vRange As Range, usdRg As Range
Set usdRg = ActiveSheet.UsedRange
vColors = Array("", "black", "white", "red", "", "navy", "yellow", "pink", _
"cyan", "brown", "green")
Application.ScreenUpdating = False
usdRg.Font.ColorIndex = 4
For vColor = LBound(vColors) To UBound(vColors)
If Not vColors(vColor) = "" Then
Set vRange = FoundRange(usdRg, "color=" & vColors(vColor))
If Not vRange Is Nothing Then vRange.Font.ColorIndex = vColor
End If
Next vColor
' usdRg.Font.Bold = False
' Set vRange = FoundRange(usdRg, "<b>")
' If Not vRange Is Nothing Then vRange.Font.Bold = True
Application.ScreenUpdating = True
End Sub
Function FoundRange(vRG As Range, vVal) As Range
Dim FND As Range
Dim FND1 As Range
Set FND = vRG.Find(vVal, LookIn:=xlValues, LookAt:=xlPart, MatchCase:=False)
If Not FND Is Nothing Then
Set FoundRange = FND
Set FND1 = FND
Set FND = vRG.FindNext(FND)
Do Until FND.Address = FND1.Address
Set FoundRange = Union(FoundRange, FND)
Set FND = vRG.FindNext(FND)
Loop
End If
End Function

Matt

TonyJollans
02-26-2005, 04:47 AM
Re: the discussion earlier about Integer/Long data types, you might be interested in this:

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/modcore/html/decontheintegerdatatypes.asp

There really is no benefit in using Integer any more unless it is needed to compile.

mvidas
02-26-2005, 06:46 AM
Thanks, Tony!

Paleo
02-26-2005, 08:59 AM
Hi Tony,

many thanks!

brettdj
02-27-2005, 01:03 AM
Paleo,

This link may be of interest
http://www.dicks-blog.com/archives/2005/02/23/html-in-cells-ii/

Cheers

Dave

Paleo
02-27-2005, 01:29 AM
Hi Dave,

it looked just like what I need but am getting an error at:


Dim objData As DataObject


Looks like it dont recognize DataObject.

Any clues?

Ivan F Moala
02-27-2005, 06:07 AM
DataObject is part of the FM20.dll = Microsoft Forms 2 object lib

Paleo
02-27-2005, 08:07 AM
Hi Ivan,

thanks for the tip, now it works perfectly and is a lot better than my original code.