View Full Version : Solved: Remove characters from text function
Below is a function that is used to remove all the special HTML characters from the text in a column of cells.
Can the RegEx.Pattern be modified so that it can remove other patterns of characters, in addition to the <(.|\n)+?> that it already removes?
For example, can the RegEx.Pattern have an "and" in it or can there be multiple RegEx.Pattern lines (ie: RegEx.Pattern1, RegEx.Pattern2, RegEx.Pattern3) and
then have the RegEx.Replace(inval, "") reference all of the RegEx.Pattern lines?
Thanks for any ideas or help...
JimS
Public Function StripHtML(inval)
Dim RegEx As Object
Set RegEx = CreateObject("VBScript.RegExp")
RegEx.Global = True
RegEx.Pattern = "<(.|\n)+?>"
StripHtML = RegEx.Replace(inval, "")
End Function
Bob Phillips
12-10-2009, 09:03 AM
Have you tried just adding the extra characters to that pattern?
I'll try, but the new pattern that I need to remove is something like this, ": " (without the quotes) so I wasn't sure if it would memove all the n, b, s and p's.
Any idea how it should be added to this?
RegEx.Pattern = "<(.|\n)+?>"
I think the issue is that the new pattern (: ) is not between the <> in the text of the cell.
Oorang
12-10-2009, 11:22 AM
Hello Jim:)
Although appropriate in some rare cases, parsing HTML via regex is generally considered to be a questionable practice (http://www.codinghorror.com/blog/archives/001311.html) that can expose you to the slings and arrows (http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) of your peers. It's very hard to get right, and even when you do, it's hard to maintain and understand for those who maintain your code in the future.
If I may, here is a slightly less painful approach. For this to work you must first set a reference to Microsoft HTML Library (if you can't find it on the list, just browse to MSHTML.tlb).
Option Explicit
Public Sub Example()
Const csDocText As String = _
"<!DOCTYPE html PUBLIC ""-//W3C//DTD XHTML 1.0 Transitional//EN""" & _
"""http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"">" & _
"<html xmlns=""http://www.w3.org/1999/xhtml""><head>" & _
"<meta name=""generator"" content=" & _
"""HTML Tidy for Windows (vers 14 February 2006), see www.w3.org"" />" & _
"<title>I am a document</title></head><body>" & _
"<p><b>1. </b>In the beginning god created the heaven and the earth.</p>" & _
"<p><b>2. </b>And the earth was without form, and void; and darkness " & _
"<i>was</i> upon the face of the waters.</p>" & _
"<p><b>3.</b> And God said, Let there be light: and there was " & _
"light.</p></body></html>"
MsgBox GetTextFromHTML(csDocText)
End Sub
Public Function GetTextFromHTML(ByVal html As String) As String
Dim doc As MSHTML.HTMLDocument
Set doc = New MSHTML.HTMLDocument
Do Until doc.readyState = "complete"
DoEvents
Loop
doc.body.innerHTML = html
GetTextFromHTML = doc.body.innerText
Set doc = Nothing
End Function
Aaron,
I'm all for anything less painfull...
Could you give me a little more detail on how to use this code and what it actually does, Thanks...
Basically I copy a file that is an export out of a database and paste it on to a worksheet. I copy each cell in a column to a blank column using the "StripHTML Function" to remove all the HTML Characters and then pasting the cleaned text back into the original cell. The code performs this routine on 3 different columns of data.
Is there a way to modify Aaron's code so that it will loop through all the used cells in a column and then paste the non-HTML text back into the same cell?
Oorang
12-15-2009, 09:29 AM
Hi Jim,
Sorry for the short answer: I can think of several ways to do this. First try looping through the cells as you normally do, and just do something to this effect:
Worksheets("Sheet2").Cells(1,1).Value = GetTextFromHTML(Worksheets("Sheet1").Cells(1,1).Value)
Not sure what the performance will be on that, let me know. If it's slow we might try a different approach.
Other questions:
Why aren't you just opening the directly in Excel and resaving it as xls file? Excel can open html files.
Can you get the export in a format other than HTML?
How well formed is the html? DocType? XML? XHTML? etc:)
Your suggest is what I needed, Thanks...
Powered by vBulletin® Version 4.2.5 Copyright © 2024 vBulletin Solutions Inc. All rights reserved.