Solved: Automated editing of HTML files

08-25-2006, 04:55 AM
Hi all!

I have a lot of HTML files of a similar format that I wish to cut data from and save to a comma delimited file.

All files have the same format, with a title, image, price, stock figure, weight, part code and URL for the manufacturer. I need to cut the price, stock figure, part code and URL from the file (so they don't show) and add them into a csv file along with the name of the file being processed.

As I am familiar with Excel VBA, I thought I could dive straight into FrontPage VBA and code this very easily... not so. Either I'm being really dumb, or this is not as easy as it sounds...

I have attached 2 sample files in the .zip attachment.

I would appreciate ANY help whatsoever on this...


08-25-2006, 05:26 AM
I'd do it this way...

1. In the Excel file, create a web query on one of your HTML pages (recording it with the macro recorder).
2. Adapt the recorded code so that it loops over all the HTML files.
3. At the end of the sub, insert some code to write the data from each file to a CSV file.

What do you think?
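
The three steps above could be sketched roughly as follows. This is only an outline under stated assumptions, not recorded code: the folder C:\samples\, the output file C:\rob.csv, the use of the first worksheet as a scratch sheet, and cell A1 as the value to export are all placeholders, and the connection string should be replaced by whatever your own macro recording shows.

```vba
Sub ImportAllPages()
    Const SRC_FOLDER As String = "C:\samples\"   ' assumed folder
    Const CSV_FILE As String = "C:\rob.csv"      ' assumed output file
    Dim ws As Worksheet, qt As QueryTable
    Dim fileName As String, ff As Long

    Set ws = ThisWorkbook.Worksheets(1)          ' scratch sheet, contents are overwritten
    ff = FreeFile
    Open CSV_FILE For Output As #ff

    fileName = Dir(SRC_FOLDER & "*.html")
    Do Until Len(fileName) = 0
        ws.Cells.Clear
        ' Steps 1 and 2: the web query, with the file path made variable
        Set qt = ws.QueryTables.Add( _
            Connection:="URL;file:///" & Replace(SRC_FOLDER & fileName, "\", "/"), _
            Destination:=ws.Range("A1"))
        qt.WebSelectionType = xlEntirePage
        qt.WebFormatting = xlWebFormattingNone
        qt.Refresh BackgroundQuery:=False

        ' Step 3: A1 is a placeholder for whichever cells hold the values you want
        Print #ff, ws.Range("A1").Text & "," & SRC_FOLDER & fileName

        qt.Delete                                ' remove the query so they do not pile up
        fileName = Dir
    Loop
    Close #ff
End Sub
```

Note that a web query only reads the pages; it does not remove anything from the HTML files themselves.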

08-26-2006, 09:49 AM
Thanks for the reply, ALe. I will give your suggestion a try. I'll let you know how I get on!

08-26-2006, 01:44 PM
If the data is in tables you might take a look at this kb entry:

08-28-2006, 06:41 AM
Hi Rob,

Give the following a try, should do what you need:

Sub fatbaldbob()
    Dim FileArray() As String, CSVData() As String
    Dim Cnt As Long, i As Long, vFF As Long
    Dim tempStr As String, CSVFile As String, vFile As String
    Dim RegEx As Object

    CSVFile = "C:\rob.csv"

    Cnt = 0
    ReDim CSVData(4, 0) '0=price,1=stock figure,2=part code,3=url,4=path\filename
    ReDim FileArray(1, 0) '0=path,1=filename
    vFileSearch "C:\samples\", FileArray
    ' vFileSearch "C:\samples2\", FileArray 'if you want to look in more than one directory
    If Len(FileArray(1, 0)) = 0 Then
        MsgBox "No files found"
        Exit Sub
    End If

    Set RegEx = CreateObject("vbscript.regexp")
    With RegEx
        .Global = True
        .IgnoreCase = True
        .MultiLine = True
    End With

    For i = 0 To UBound(FileArray, 2)
        'read the whole file into tempStr
        vFF = FreeFile
        vFile = FileArray(0, i) & FileArray(1, i)
        Open vFile For Binary As #vFF
        tempStr = Space$(LOF(vFF))
        Get #vFF, , tempStr
        Close #vFF

        'check that the file has the expected layout before changing anything
        'NOTE: the currency symbol was garbled to "?" in the archived post;
        '"£" is assumed here, change it to match your files
        RegEx.Pattern = "£[\d\.]+[^\x00]*?\d+ in Stock[^\x00]*?Part Code[^\x00]" & _
            "*?<b>[^\x00]*?<\/b>[^\x00]*?<a href=""http[^\x00]*?""[^\x00]*?<\/a>"
        If Not RegEx.Test(tempStr) Then
            MsgBox FileArray(0, i) & FileArray(1, i) & vbCrLf & _
                "File pattern not met, skipping file"
        Else
            ReDim Preserve CSVData(4, Cnt)

            'for each field: store the value, then blank it out in the html
            RegEx.Pattern = "(£[\d\.]+)"
            CSVData(0, Cnt) = RegEx.Execute(tempStr).Item(0).SubMatches(0)
            tempStr = RegEx.Replace(tempStr, "&nbsp;")

            RegEx.Pattern = "(\d+ in Stock)"
            CSVData(1, Cnt) = RegEx.Execute(tempStr).Item(0).SubMatches(0)
            tempStr = RegEx.Replace(tempStr, "&nbsp;")

            RegEx.Pattern = "(Part Code[^\x00]*?<b>)([^\x00]*?)(<\/b>)"
            CSVData(2, Cnt) = RegEx.Execute(tempStr).Item(0).SubMatches(1)
            tempStr = RegEx.Replace(tempStr, "&nbsp;")

            RegEx.Pattern = "<a href=""(http[^\x00]*?)""[^\x00]*?<\/a>"
            CSVData(3, Cnt) = RegEx.Execute(tempStr).Item(0).SubMatches(0)
            tempStr = RegEx.Replace(tempStr, "&nbsp;")

            CSVData(4, Cnt) = vFile
            Cnt = Cnt + 1

            'write the edited html back over the original file
            vFF = FreeFile
            Open vFile For Output As #vFF
            Print #vFF, tempStr;
            Close #vFF
        End If
    Next i

    'write one csv line per processed file
    vFF = FreeFile
    Open CSVFile For Output As #vFF
    For i = 0 To Cnt - 1
        Print #vFF, Join(Array(CSVData(0, i), CSVData(1, i), CSVData(2, i), _
            CSVData(3, i), CSVData(4, i)), ",")
    Next i
    Close #vFF
    Set RegEx = Nothing
End Sub

Function vFileSearch(ByVal vPath As String, ByRef FileArray() As String, _
    Optional ByVal vExtension As String = "html") As Boolean
    Dim tempStr As String, vCnt As Long
    'start at the first slot if the array is still empty, otherwise append
    If Len(FileArray(0, LBound(FileArray, 2))) = 0 Then
        vCnt = LBound(FileArray, 2)
    Else
        vCnt = UBound(FileArray, 2) + 1
    End If
    If Right(vPath, 1) <> "\" Then vPath = vPath & "\"
    On Error Resume Next 'in case no 'read' rights to directory
    tempStr = Dir(vPath & "*." & vExtension)
    On Error GoTo 0
    Do Until Len(tempStr) = 0
        ReDim Preserve FileArray(1, vCnt)
        FileArray(0, vCnt) = vPath
        FileArray(1, vCnt) = tempStr
        vCnt = vCnt + 1
        tempStr = Dir
    Loop
End Function

Please don't hesitate to ask any questions!
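
One thing to watch with the CSV output: the lines are built with a plain Join on ",", so a value that itself contains a comma (a file path or part code, say) would shift the columns. A minimal sketch of a quoting helper follows; CsvField is a hypothetical name, not part of the code above, and it follows the usual CSV convention of wrapping such values in double quotes and doubling any embedded quotes.

```vba
' Quote a value for CSV output when it contains a comma, a double quote
' or a line break; other values are returned unchanged.
Function CsvField(ByVal s As String) As String
    If InStr(s, ",") > 0 Or InStr(s, """") > 0 _
        Or InStr(s, vbCr) > 0 Or InStr(s, vbLf) > 0 Then
        CsvField = """" & Replace(s, """", """""") & """"
    Else
        CsvField = s
    End If
End Function
```

To use it, wrap each element passed to Join, for example CsvField(CSVData(4, i)) in place of CSVData(4, i).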

09-01-2006, 06:42 AM
Wow, thanks ALe, Lucas and mvidas - can't thank you enough!
I am finally getting somewhere with this now!

Will try and finally nail this problem this weekend now...


09-03-2006, 01:38 AM
Matt, your code works perfectly! Thanks for your time, you are a star!
How do I extract the weight, and delete the dashes? (I've tried to understand your code, but it's a bit beyond me I'm afraid!)
I'm sure it's real easy when you know how...


09-03-2006, 03:51 AM
Sorted! Couldn't get my head around the regular expression patterns (someone should create an online syntax checker!) But have finally done what I needed to.
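
The modified code was not posted, so the following is only a sketch of one way it might be done. It assumes the weight is laid out like the part code (the word "Weight" followed by the value in bold tags) and that the dashes are runs of three or more hyphens; neither is confirmed by the thread, and ExtractWeight and StripDashes are hypothetical names.

```vba
' Return the text inside the first <b>...</b> after the word "Weight",
' or an empty string if no such text is found.
Function ExtractWeight(ByVal html As String) As String
    Dim RegEx As Object
    Set RegEx = CreateObject("vbscript.regexp")
    RegEx.Global = True
    RegEx.IgnoreCase = True
    RegEx.Pattern = "Weight[^\x00]*?<b>([^\x00]*?)<\/b>"   ' assumed layout
    If RegEx.Test(html) Then
        ExtractWeight = RegEx.Execute(html).Item(0).SubMatches(0)
    End If
End Function

' Remove every run of three or more hyphens from the text.
Function StripDashes(ByVal html As String) As String
    Dim RegEx As Object
    Set RegEx = CreateObject("vbscript.regexp")
    RegEx.Global = True
    RegEx.Pattern = "-{3,}"
    StripDashes = RegEx.Replace(html, "")
End Function
```

The three-hyphen minimum is there so that single hyphens in ordinary text and the two-hyphen markers of HTML comments are left alone.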


09-03-2006, 05:49 PM
Well, we're here if you do have any questions :) There is a KB entry by brettdj that explains the syntax (though it sounds like you know it), and there are some online checkers out there (can't think of any at the moment, RegexBuddy maybe?). Feel free to post your modified code here. If you don't want that information there at all (the weights or the dotted lines), there might be an easier way of doing it.
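
In the absence of an online checker, a pattern can also be tried out from the VBA Immediate window. The helper below is a minimal sketch (TryPattern is a hypothetical name) that uses the same vbscript.regexp object and settings as the code earlier in the thread, prints each match with its submatches, and reports a pattern the engine rejects.

```vba
' Print every match of pat in txt, with its submatches, to the Immediate
' window. Returns the number of matches, or -1 if the pattern is rejected.
Function TryPattern(ByVal pat As String, ByVal txt As String) As Long
    Dim RegEx As Object, ms As Object, m As Object, i As Long
    Set RegEx = CreateObject("vbscript.regexp")
    RegEx.Global = True
    RegEx.IgnoreCase = True
    RegEx.MultiLine = True

    On Error GoTo BadPattern          ' an invalid pattern raises a run-time error
    RegEx.Pattern = pat
    Set ms = RegEx.Execute(txt)
    On Error GoTo 0

    For Each m In ms
        Debug.Print "Match: [" & m.Value & "] at position " & m.FirstIndex
        For i = 0 To m.SubMatches.Count - 1
            Debug.Print "   SubMatch " & i & ": [" & m.SubMatches(i) & "]"
        Next i
    Next m
    TryPattern = ms.Count
    Exit Function

BadPattern:
    Debug.Print "Pattern rejected: " & Err.Description
    TryPattern = -1
End Function
```

For example, typing ? TryPattern("(\d+) in Stock", "12 in Stock") in the Immediate window lists one match with the submatch 12. It also shows why the patterns in this thread are written with [^\x00] rather than a dot: TryPattern("a.b", "a" & vbLf & "b") finds nothing because the dot does not match a line feed, while TryPattern("a[^\x00]b", "a" & vbLf & "b") finds one match.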

Also, after thinking a little more about it, we could modify the code to convert it to VBScript, so you can just right-click the .html files and go to Send To to edit them. Just ideas though now, as I'm not on my computer at the moment, but let me know if anything sounds good.

Glad to help though!