Consulting

Results 1 to 11 of 11

Thread: Speeding up a replace macro running through a long text

  1. #1
    VBAX Newbie
    Joined
    Jan 2023
    Posts
    5
    Location

    Question Speeding up a replace macro running through a long text

    Hello,
    I had a go at writing a macro that will look for a string in a long xml code (650,000 lines), will duplicate the paragraph the string is found in and will replace the string with another string in the original paragraph, and will do this till the end of the document.

    This is fine with shorter pieces of xml to go through, but with the 650,000 lines, my Word 365 simply freezes, the macro seemingly not even starting. If I try it on 350,000 lines, the macro starts fine, but gets slower and slower, and like after 20 min Word freezes, too.

    Is there a way the macro can be written to run faster? This is why I am asking for your help. Or would you have another idea how I get this done?
    Thanks!


    Public Const sTEXT = "k=""ABCD"""
    Public Const sREPLACETEXT = "k=""EFGH"""
    
    Sub Duplicate_And_Replace()
    'Duplicates every paragraph (= line) with sTEXT and replaces sTEXT with sREPLACETEXT in the duplicate
        Dim lngSafety As Long
        Dim blnIsFound As Boolean
        On Error GoTo Error
        Selection.HomeKey Unit:=wdStory
        
        'Setting initial value for blnIsFound to make sure sTEXT does show up at all
        Selection.Find.ClearFormatting
        Selection.Find.Replacement.ClearFormatting
        With Selection.Find
            .TEXT = sTEXT
            .Forward = True
            .Wrap = wdFindStop
            .Format = False
            .MatchCase = False
            .MatchWholeWord = False
            .MatchWildcards = False
            .MatchSoundsLike = False
            .MatchAllWordForms = False
        End With
        blnIsFound = Selection.Find.Execute                            'executes Find and sets blnIsFound to true or false
        
        lngSafety = 0
        Selection.HomeKey Unit:=wdStory
        While blnIsFound
            Selection.Find.ClearFormatting
            Selection.Find.Replacement.ClearFormatting
            blnIsFound = Selection.Find.Execute
    '        MsgBox blnIsFound
            lngSafety = lngSafety + 1
    '        If lngSafety > 8 Then
    '            MsgBox lngSafety & " loops, probably infinite loop. Quitting."
    '            Exit Sub
    '        End If
            If blnIsFound Then
                CopyCurrentParagraph
                Selection.Collapse Direction:=wdCollapseStart
                Selection.PasteAndFormat (wdFormatOriginalFormatting)
                ReplaceTextInCurrentParagraph
            End If
        Wend
        MsgBox lngSafety & " loops"
        Err.Clear
    Error:
        If Err.Number <> 0 Then MsgBox "Error: " & _
            Err.Number & vbLf & Err.Description: Err.Clear
    End Sub
    
    Sub CopyCurrentParagraph()
        Selection.StartOf Unit:=wdParagraph
        Selection.MoveEnd Unit:=wdParagraph
        Selection.Copy
    End Sub
    
    Sub ReplaceTextInCurrentParagraph()
        Selection.StartOf Unit:=wdParagraph
        Selection.MoveEnd Unit:=wdParagraph
        Selection.Find.ClearFormatting
        Selection.Find.Replacement.ClearFormatting
        With Selection.Find
            .TEXT = sTEXT
            .Replacement.TEXT = sREPLACETEXT
            .Forward = True
            .Wrap = wdFindStop
            .Format = False
            .MatchCase = False
            .MatchWholeWord = False
            .MatchWildcards = False
            .MatchSoundsLike = False
            .MatchAllWordForms = False
        End With
        With Selection
            If .Find.Forward = True Then
                .Collapse Direction:=wdCollapseStart
            Else
                .Collapse Direction:=wdCollapseEnd
            End If
            .Find.Execute Replace:=wdReplaceOne
            If .Find.Forward = True Then
                .Collapse Direction:=wdCollapseEnd
            Else
                .Collapse Direction:=wdCollapseStart
            End If
        End With
    End Sub

  2. #2
    Quote Originally Posted by _frudo View Post
    Hello,
    Or would you have another idea how I get this done?
    When you need to process a huge number of lines, I think the first problem that the algorithm and Word itself will face is lack of memory.
    To solve this problem, I would start with the fact that an xml file is just a text file, and you need to work with it as with a text file. That is, without using the Word object model.
    Then the question arises, how to work with it? – whether to load it into memory in full or in parts. The answer to this question depends on the power of your computer.
    In any case, the task is quite interesting, and if you provide the file itself, as well as what sTEXT and what sREPLACETEXT needs to be changed, I could solve this problem.
    At one time, I had to process an xml file with a size of 50 GB. I did it in multithreaded mode in c#. However, it would be interesting to solve the same task on VBA.

  3. #3
    VBAX Newbie
    Joined
    Jan 2023
    Posts
    5
    Location
    Quote Originally Posted by vtito.site View Post
    if you provide the file itself, as well as what sTEXT and what sREPLACETEXT needs to be changed, I could solve this problem.
    Thanks for your interest! You can find an example xml file here (2,7 MB ZIP):
    https://upload.disroot.org/r/ofHPFsD...5nAw5oSeqt8Gg=

    For this file, the first two lines in the macro would be
    Public Const sTEXT = "k=""archaeological_site"""
    Public Const sREPLACETEXT = "k=""site_type"""

  4. #4
    VBAX Guru macropod's Avatar
    Joined
    Jul 2008
    Posts
    4,433
    Location
    From your description, it seems to me all you need is:
    Sub Duplicate_And_Replace()
    Application.ScreenUpdating = False
    Const sTEXT = "k=""ABCD"""
    Const sREPLACETEXT = "k=""EFGH"""
    With ActiveDocument.Range.Find
      .ClearFormatting
      .Replacement.ClearFormatting
      .Text = "([!^13]@)" & sTEXT & "([!^13]@^13)"
      .Replacement.Text = "\1" & sREPLACETEXT & "\2\1" & sTEXT & "\2"
      .Forward = True
      .Wrap = wdFindContinue
      .Format = False
      .MatchWildcards = True
      .Execute Replace:=wdReplaceAll
    End With
    Application.ScreenUpdating = True
    End Sub
    Last edited by macropod; 01-15-2023 at 05:50 PM.
    Cheers
    Paul Edstein
    [Fmr MS MVP - Word]

  5. #5
    VBAX Newbie
    Joined
    Jan 2023
    Posts
    5
    Location
    That seems very elegant and lean - thanks, macropod!
    Somehow, though, I get a message saying the argument in the line ".Execute Replace = wdReplaceAll" is not optimal. While I would be fine with suboptimally having the job done, my Word 365 apparently isn't.
    Would you know what is bothering Word?

  6. #6
    Moderator VBAX Wizard Aussiebear's Avatar
    Joined
    Dec 2005
    Location
    Queensland
    Posts
    5,132
    Location
    Would it simply be that we are missing a colon in front of the equal sign ".Execute Replace:=wdReplaceAll" ?
    Remember To Do the Following....
    Use [Code].... [/Code] tags when posting code to the thread.
    Mark your thread as Solved if satisfied by using the Thread Tools options.
    If posting the same issue to another forum please show the link

  7. #7
    VBAX Guru macropod's Avatar
    Joined
    Jul 2008
    Posts
    4,433
    Location
    Quote Originally Posted by Aussiebear View Post
    Would it simply be that we are missing a colon in front of the equal sign ".Execute Replace:=wdReplaceAll" ?
    Correct - and code corrected.
    Cheers
    Paul Edstein
    [Fmr MS MVP - Word]

  8. #8
    VBAX Newbie
    Joined
    Jan 2023
    Posts
    5
    Location
    Now it is running, thanks!
    Something is not right, however. It takes 3:20 h for the macro to run, and I can see the last page with correct replacements, but then Word just shuts down before I can save anything. I tried twice already.

    On restarting, Word offers a restored file that seems to be the result of the macro, but is corrupted: The first two lines of the xml file now are
    < 
    
      k="site_type"k="archaeological_site"?xml version="1.0" encoding="UTF-8"?>
    when it was (before the macro)
    <?xml version="1.0" encoding="UTF-8"?>
    k="site_type" being the printout of sTEXT and k="archaeological_site" of sREPLACETEXT, in this case.

    This also happens with a short xml file of only two pages when Word does not shut down, so appartently is not related to the problem of Word shutting down.
    Last edited by _frudo; 01-16-2023 at 05:15 AM.

  9. #9
    Quote Originally Posted by _frudo View Post
    Thanks for your interest! You can find an example xml file here (2,7 MB ZIP):
    Thanks for the source file. The task is really interesting. I had to resort to using Scripting.FileSystemObject, since the source your file has unix line endings (vbLF only)
    The presented algorithm works for me not to say that fast. VBA is still a slow language. I have a processing speed of 1000 lines per 6 seconds. That is, 650 thousand will be processed in more than an hour. Frankly speaking, in c# this task would have been completed faster.
    But you can always see at what stage of processing the algorithm is (due to Application.StatusBar = "Processing File " & myCounter & " lines processed...").
    And you won't need to record the output file, it will be saved automatically.

    Sub Duplicate_And_Replace_From_VS()
    Dim OutputFileNum As Integer
    Dim XmlLine As String
    Dim XmlNewLine As String
    Dim currentBarStatus As Boolean, myCounter As Long
    Dim startTime As Date, endTime As Date
    
    
    startTime = Now
    
    
    currentBarStatus = Application.DisplayStatusBar
    Application.DisplayStatusBar = True
    
    
    Dim fso As Scripting.FileSystemObject
    Set fso = New Scripting.FileSystemObject
    Dim fInputFile As Scripting.TextStream
    Set fInputFile = fso.OpenTextFile("arch NOT site 2023-01-14.osm", ForReading, False, TristateMixed)
    
    
    OutputFileNum = FreeFile()
    Open "arch NOT site 2023-01-14New.osm" For Append As #OutputFileNum
    
    
    While Not fInputFile.AtEndOfStream
      XmlLine = fInputFile.ReadLine()
      Print #OutputFileNum, XmlLine
      myCounter = myCounter + 1
      If InStr(1, XmlLine, sTEXT) > 0 Then
        XmlNewLine = Replace(XmlLine, sTEXT, sREPLACETEXT)
        Print #OutputFileNum, XmlNewLine
      End If
      Application.StatusBar = "Processing File " & myCounter & " lines processed..."
      DoEvents
    Wend
    fInputFile.Close
    Close #OutputFileNum
    Application.DisplayStatusBar = currentBarStatus
    endTime = Now
    Dim elapsedTime As Integer
    elapsedTime = DateDiff("s", startTime, endTime)
    MsgBox "Done for " & elapsedTime & "sec."
    End Sub

  10. #10
    VBAX Newbie
    Joined
    Jan 2023
    Posts
    5
    Location
    Thank you so much, vitito.site!

    I am getting a "user defined type not defined" error on the following line, both in Word and in Excel:
    Dim fso As Scripting.FileSystemObject
    I suppose I have to set a reference to an object library somewhere, but where to?

  11. #11
    Yes, you should go to reference and the check Microsoft Scripting Runtime

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •