PDA

View Full Version : Solved: Remove characters from text function



JimS
12-10-2009, 08:25 AM
Below is a function that is used to remove all the special HTML characters from the text in a column of cells.

Can the RegEx.Pattern be modified so that it can remove other patterns of characters, in addition to the <(.|\n)+?> that it already removes?

For example, can the RegEx.Pattern have an "and" in it or can there be multiple RegEx.Pattern lines (ie: RegEx.Pattern1, RegEx.Pattern2, RegEx.Pattern3) and
then have the RegEx.Replace(inval, "") reference all of the RegEx.Pattern lines?


Thanks for any ideas or help...

JimS



Public Function StripHtML(inval)

Dim RegEx As Object

Set RegEx = CreateObject("VBScript.RegExp")
RegEx.Global = True
RegEx.Pattern = "<(.|\n)+?>"
StripHtML = RegEx.Replace(inval, "")

End Function

Bob Phillips
12-10-2009, 09:03 AM
Have you tried just adding the extra characters to that pattern?

JimS
12-10-2009, 09:42 AM
I'll try, but the new pattern that I need to remove is something like this, ":&nbsp;" (without the quotes) so I wasn't sure if it would memove all the n, b, s and p's.

Any idea how it should be added to this?

RegEx.Pattern = "<(.|\n)+?>"

I think the issue is that the new pattern (:&nbsp;) is not between the <> in the text of the cell.

Oorang
12-10-2009, 11:22 AM
Hello Jim:)
Although appropriate in some rare cases, parsing HTML via regex is generally considered to be a questionable practice (http://www.codinghorror.com/blog/archives/001311.html) that can expose you to the slings and arrows (http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) of your peers. It's very hard to get right, and even when you do, it's hard to maintain and understand for those who maintain your code in the future.

If I may, here is a slightly less painful approach. For this to work you must first set a reference to Microsoft HTML Library (if you can't find it on the list, just browse to MSHTML.tlb).

Option Explicit

Public Sub Example()
Const csDocText As String = _
"<!DOCTYPE html PUBLIC ""-//W3C//DTD XHTML 1.0 Transitional//EN""" & _
"""http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"">" & _
"<html xmlns=""http://www.w3.org/1999/xhtml""><head>" & _
"<meta name=""generator"" content=" & _
"""HTML Tidy for Windows (vers 14 February 2006), see www.w3.org"" />" & _
"<title>I am a document</title></head><body>" & _
"<p><b>1. </b>In the beginning god created the heaven and the earth.</p>" & _
"<p><b>2. </b>And the earth was without form, and void; and darkness " & _
"<i>was</i> upon the face of the waters.</p>" & _
"<p><b>3.</b> And God said, Let there be light: and there was " & _
"light.</p></body></html>"

MsgBox GetTextFromHTML(csDocText)

End Sub

Public Function GetTextFromHTML(ByVal html As String) As String
Dim doc As MSHTML.HTMLDocument
Set doc = New MSHTML.HTMLDocument
Do Until doc.readyState = "complete"
DoEvents
Loop
doc.body.innerHTML = html
GetTextFromHTML = doc.body.innerText
Set doc = Nothing
End Function

JimS
12-10-2009, 12:37 PM
Aaron,

I'm all for anything less painfull...

Could you give me a little more detail on how to use this code and what it actually does, Thanks...

Basically I copy a file that is an export out of a database and paste it on to a worksheet. I copy each cell in a column to a blank column using the "StripHTML Function" to remove all the HTML Characters and then pasting the cleaned text back into the original cell. The code performs this routine on 3 different columns of data.

Is there a way to modify Aaron's code so that it will loop through all the used cells in a column and then paste the non-HTML text back into the same cell?

Oorang
12-15-2009, 09:29 AM
Hi Jim,
Sorry for the short answer: I can think of several ways to do this. First try looping through the cells as you normally do, and just do something to this effect:

Worksheets("Sheet2").Cells(1,1).Value = GetTextFromHTML(Worksheets("Sheet1").Cells(1,1).Value)


Not sure what the performance will be on that, let me know. If it's slow we might try a different approach.

Other questions:
Why aren't you just opening the directly in Excel and resaving it as xls file? Excel can open html files.
Can you get the export in a format other than HTML?
How well formed is the html? DocType? XML? XHTML? etc:)

JimS
12-16-2009, 08:39 AM
Your suggest is what I needed, Thanks...