Log in

View Full Version : [SOLVED:] Speeding up a replace macro running through a long text



_frudo
01-15-2023, 03:51 AM
Hello,
I had a go at writing a macro that will look for a string in a long xml code (650,000 lines), will duplicate the paragraph the string is found in and will replace the string with another string in the original paragraph, and will do this till the end of the document.

This is fine with shorter pieces of xml to go through, but with the 650,000 lines, my Word 365 simply freezes, the macro seemingly not even starting. If I try it on 350,000 lines, the macro starts fine, but gets slower and slower, and like after 20 min Word freezes, too.

Is there a way the macro can be written to run faster? This is why I am asking for your help. Or would you have another idea how I get this done?
Thanks!



Public Const sTEXT = "k=""ABCD"""
Public Const sREPLACETEXT = "k=""EFGH"""

Sub Duplicate_And_Replace()
'Duplicates every paragraph (= line) with sTEXT and replaces sTEXT with sREPLACETEXT in the duplicate
Dim lngSafety As Long
Dim blnIsFound As Boolean
On Error GoTo Error
Selection.HomeKey Unit:=wdStory

'Setting initial value for blnIsFound to make sure sTEXT does show up at all
Selection.Find.ClearFormatting
Selection.Find.Replacement.ClearFormatting
With Selection.Find
.TEXT = sTEXT
.Forward = True
.Wrap = wdFindStop
.Format = False
.MatchCase = False
.MatchWholeWord = False
.MatchWildcards = False
.MatchSoundsLike = False
.MatchAllWordForms = False
End With
blnIsFound = Selection.Find.Execute 'executes Find and sets blnIsFound to true or false

lngSafety = 0
Selection.HomeKey Unit:=wdStory
While blnIsFound
Selection.Find.ClearFormatting
Selection.Find.Replacement.ClearFormatting
blnIsFound = Selection.Find.Execute
' MsgBox blnIsFound
lngSafety = lngSafety + 1
' If lngSafety > 8 Then
' MsgBox lngSafety & " loops, probably infinite loop. Quitting."
' Exit Sub
' End If
If blnIsFound Then
CopyCurrentParagraph
Selection.Collapse Direction:=wdCollapseStart
Selection.PasteAndFormat (wdFormatOriginalFormatting)
ReplaceTextInCurrentParagraph
End If
Wend
MsgBox lngSafety & " loops"
Err.Clear
Error:
If Err.Number <> 0 Then MsgBox "Error: " & _
Err.Number & vbLf & Err.Description: Err.Clear
End Sub

Sub CopyCurrentParagraph()
Selection.StartOf Unit:=wdParagraph
Selection.MoveEnd Unit:=wdParagraph
Selection.Copy
End Sub

Sub ReplaceTextInCurrentParagraph()
Selection.StartOf Unit:=wdParagraph
Selection.MoveEnd Unit:=wdParagraph
Selection.Find.ClearFormatting
Selection.Find.Replacement.ClearFormatting
With Selection.Find
.TEXT = sTEXT
.Replacement.TEXT = sREPLACETEXT
.Forward = True
.Wrap = wdFindStop
.Format = False
.MatchCase = False
.MatchWholeWord = False
.MatchWildcards = False
.MatchSoundsLike = False
.MatchAllWordForms = False
End With
With Selection
If .Find.Forward = True Then
.Collapse Direction:=wdCollapseStart
Else
.Collapse Direction:=wdCollapseEnd
End If
.Find.Execute Replace:=wdReplaceOne
If .Find.Forward = True Then
.Collapse Direction:=wdCollapseEnd
Else
.Collapse Direction:=wdCollapseStart
End If
End With
End Sub

vtito.site
01-15-2023, 07:33 AM
Hello,
Or would you have another idea how I get this done?

When you need to process a huge number of lines, I think the first problem that the algorithm and Word itself will face is lack of memory.
To solve this problem, I would start with the fact that an xml file is just a text file, and you need to work with it as with a text file. That is, without using the Word object model.
Then the question arises, how to work with it? – whether to load it into memory in full or in parts. The answer to this question depends on the power of your computer.
In any case, the task is quite interesting, and if you provide the file itself, as well as what sTEXT and what sREPLACETEXT needs to be changed, I could solve this problem.
At one time, I had to process an xml file with a size of 50 GB. I did it in multithreaded mode in c#. However, it would be interesting to solve the same task on VBA.

_frudo
01-15-2023, 12:09 PM
if you provide the file itself, as well as what sTEXT and what sREPLACETEXT needs to be changed, I could solve this problem.
Thanks for your interest! You can find an example xml file here (2,7 MB ZIP):
https://upload.disroot.org/r/ofHPFsDR#CxR+zdZlEzpu9BWrMg4pHtAw8hebj15nAw5oSeqt8Gg=

For this file, the first two lines in the macro would be
Public Const sTEXT = "k=""archaeological_site"""
Public Const sREPLACETEXT = "k=""site_type"""

_frudo
01-16-2023, 04:57 AM
Now it is running, thanks!
Something is not right, however. It takes 3:20 h for the macro to run, and I can see the last page with correct replacements, but then Word just shuts down before I can save anything. I tried twice already.

On restarting, Word offers a restored file that seems to be the result of the macro, but is corrupted: The first two lines of the xml file now are

<

k="site_type"k="archaeological_site"?xml version="1.0" encoding="UTF-8"?>



when it was (before the macro)

<?xml version="1.0" encoding="UTF-8"?>

k="site_type" being the printout of sTEXT and k="archaeological_site" of sREPLACETEXT, in this case.

This also happens with a short xml file of only two pages when Word does not shut down, so appartently is not related to the problem of Word shutting down.

vtito.site
01-16-2023, 05:30 AM
Thanks for your interest! You can find an example xml file here (2,7 MB ZIP):

Thanks for the source file. The task is really interesting. I had to resort to using Scripting.FileSystemObject, since the source your file has unix line endings (vbLF only)
The presented algorithm works for me not to say that fast. VBA is still a slow language. I have a processing speed of 1000 lines per 6 seconds. That is, 650 thousand will be processed in more than an hour. Frankly speaking, in c# this task would have been completed faster.
But you can always see at what stage of processing the algorithm is (due to Application.StatusBar = "Processing File " & myCounter & " lines processed...").
And you won't need to record the output file, it will be saved automatically.



Sub Duplicate_And_Replace_From_VS()
Dim OutputFileNum As Integer
Dim XmlLine As String
Dim XmlNewLine As String
Dim currentBarStatus As Boolean, myCounter As Long
Dim startTime As Date, endTime As Date


startTime = Now


currentBarStatus = Application.DisplayStatusBar
Application.DisplayStatusBar = True


Dim fso As Scripting.FileSystemObject
Set fso = New Scripting.FileSystemObject
Dim fInputFile As Scripting.TextStream
Set fInputFile = fso.OpenTextFile("arch NOT site 2023-01-14.osm", ForReading, False, TristateMixed)


OutputFileNum = FreeFile()
Open "arch NOT site 2023-01-14New.osm" For Append As #OutputFileNum


While Not fInputFile.AtEndOfStream
XmlLine = fInputFile.ReadLine()
Print #OutputFileNum, XmlLine
myCounter = myCounter + 1
If InStr(1, XmlLine, sTEXT) > 0 Then
XmlNewLine = Replace(XmlLine, sTEXT, sREPLACETEXT)
Print #OutputFileNum, XmlNewLine
End If
Application.StatusBar = "Processing File " & myCounter & " lines processed..."
DoEvents
Wend
fInputFile.Close
Close #OutputFileNum
Application.DisplayStatusBar = currentBarStatus
endTime = Now
Dim elapsedTime As Integer
elapsedTime = DateDiff("s", startTime, endTime)
MsgBox "Done for " & elapsedTime & "sec."
End Sub

_frudo
01-16-2023, 01:11 PM
Thank you so much, vitito.site!

I am getting a "user defined type not defined" error on the following line, both in Word and in Excel:

Dim fso As Scripting.FileSystemObject
I suppose I have to set a reference to an object library somewhere, but where to?

vtito.site
01-18-2023, 12:59 AM
Yes, you should go to reference and the check Microsoft Scripting Runtime