Consulting

Page 1 of 2 1 2 LastLast
Results 1 to 20 of 22

Thread: Finding heading of chapters in .docx file and copying paragraphs to new .docx

  1. #1
    VBAX Regular
    Joined
    Dec 2017
    Posts
    13
    Location

    Finding heading of chapters in .docx file and copying paragraphs to new .docx

    Hello community,

    I have a large amount of .docx files, which I would like to reduce to a more readable size each. In each file, there is one heading that appears several times, always formated as a 'Heading 2' style. I look for this specific heading which as I said occurs several times in a document and I want to copy all the text parts in just these chapters with the respective heading to a new word document (so I have the same amount of files in the end, however with only the content that I want in them).

    To start with I wrote a code in Python that copies the full list of all files with their directory to my excel file in column A. Now in column B, I have defined the different headings of the paragraphs that I need to extract from the files (see example below).

    Excel.jpg

    The word files look something like in the picture below and I have indicated the text sections that I would need to copy to the new reduced file and also some hints regarding formatting.

    Word example file.jpg

    Now to approach this problem I have written the following VBA, which I created also using some of the input provided to others on this platform before.


    Sub SelectData()
        
        Application.ScreenUpdating = False
        
        Dim WdApp As Word.Application
        Set WdApp = CreateObject("Word.Application")
        
        Dim Doc As Word.Document
        Dim NewDoc As Word.Document
            
        Dim HeadingToFind As String
        Dim ChapterToFind As String
        Dim StartRange As Long
        Dim EndRange As Long
        
        Dim WkSht As Worksheet
        
        Dim LRow As Long
        Dim i As Long
        
        Set WkSht = ThisWorkbook.Sheets("Sheet1")
        LRow = WkSht.UsedRange.Cells.SpecialCells(xlCellTypeLastCell).Row
                 
        With WkSht
            For i = 1 To LRow
                If Dir(.Cells(i, 1).Text, vbNormal) = "" Then
                    .Cells(i, 3).Value = "Please check File Location"
                Else
                    Set Doc = WdApp.Documents.Open(Filename:=.Cells(i, 1).Text, _
                    AddToRecentFiles:=False, Visible:=False, ReadOnly:=False)
                                    
                    Set NewDoc = Documents.Add
                                             
                    ChapterToFind = LCase(.Cells(i, 2).Text)
                        
                        With Doc
                        
                        Selection.HomeKey Unit:=wdStory
                        
                            With Selection
                            
                                With .Find
                                    .ClearFormatting
                                    .Text = ChapterToFind
                                    .MatchWildcards = False
                                    .MatchCase = True
                                    .Execute
                                End With
                                
                                If .Find.Found Then
                                    .Collapse wdCollapseStart
                                    With .Find
                                        .Text = ""
                                        .Style = "Heading 2"
                                        .Forward = False
                                        .Execute
                                    End With
                                    
                                    .MoveDown Count:=1
                                    .HomeKey Unit:=wdLine
                                    StartRange = .Start
                                    
                                    
                                    .Find.Forward = True
                                    .Find.Execute
                                    .Collapse wdCollapseStart
                                    .MoveUp Count:=1
                                    .EndKey Unit:=wdLine
                                    EndRange = .End
                                    
                                    Doc.Range(StartRange, EndRange).Copy
                                    NewDoc.Content.Paste
                                    NewDoc.SaveAs2 Doc.Path & "Clean" & ".docx", wdFormatFlatXML
                                Else
                                    WkSht.Cells(i, 4).Value = "Error Chapter Not Found"
                                End If
                                
                            End With
        
                    End With
                    WdApp.Quit
                    Set Doc = Nothing: Set NewDoc = Nothing: Set WdApp = Nothing: Set WkSht = Nothing
                    Application.ScreenUpdating = True
                
                End If
            
            Next
        
        End With
    
    
    End Sub
    Now I struggle with getting this to work. Particularly I get a runtime error 438, where the issue is with the "Selection.HomeKey".
    Also I have not been able to find any idea or solution for the problem that the Chapters that I try to copy the content to the new files may occur several times within one document. As a result, I would really appreciate any help.

    I want to let you know that I have first posted this problem on a different platform (stackoverflow) before, when I tried to approach the problem using Python (I cannot post a link but if you search for "Finding a heading in word file and copying entire paragraph..." you will find it). Subsequently I have posted there also asking for help with the VBA solution (again, I cannot post links but if you go there and search for "Finding heading of chapters in word file and copying individual paragraphs..." you will see it). However, I have not been able to progress. I have understood that the ".Selection" needs to be linked to the word application, but have not succeeded in implementing this. Overall, I am in dire need of a solution as I am under time pressure, therefore I am now posting here with high hopes.

    Thank you for any help in advance!

  2. #2
    VBAX Regular
    Joined
    Dec 2017
    Posts
    13
    Location
    Update:

    I got it to work to a certain extent using the following code:
    Sub ExtractData()
        Application.ScreenUpdating = False
        Dim wdApp As Object, wdDoc As Object, wdRng As Object, nwdDoc
        Dim WkSht As Worksheet, LRow As Long, i As Long
        Set WkSht = ThisWorkbook.Sheets("Sheet1")
        LRow = WkSht.UsedRange.Cells.SpecialCells(xlCellTypeLastCell).Row
        Set wdApp = CreateObject("Word.Application")
        If wdApp Is Nothing Then
            MsgBox "Can't start Word.", vbExclamation
            Exit Sub
        End If
        With WkSht
            For i = 1 To LRow
                If LCase(.Cells(i, 1).Text) = "true" Then
                    If Dir(.Cells(i, 2).Text, vbNormal) = "" Then
                        .Cells(i, 3).Value = "Please check File Location"
                    Else
                        Set wdDoc = wdApp.Documents.Open(Filename:=.Cells(i, 2).Text, _
                        AddToRecentFiles:=False, Visible:=False, ReadOnly:=False)
                        With wdDoc
                            With .Range
                                With .Find
                                    .ClearFormatting
                                    .Replacement.ClearFormatting
                                    .Forward = True
                                    .Wrap = 0 'wdFindStop
                                    .Format = True
                                    .Style = "Heading 2"
                                    .MatchWildcards = False
                                    .MatchCase = False
                                    .Text = WkSht.Cells(i, 4).Value
                                    .Replacement.Text = ""
                                    .Execute
                                End With
                                If .Find.Found Then
                                    Set wdRng = .Duplicate
                                    wdRng.Collapse 0 'wdCollapseEnd
                                End If
                                .Start = wdRng.End
                                With .Find
                                    .Style = "Heading 2"
                                    .Text = ""
                                    .Execute
                                End With
                                If .Find.Found Then
                                    wdRng.End = .Duplicate.Start - 1
                                End If
                                If Not wdRng Is Nothing Then
                                    With wdRng
                                        While .Tables.Count > 0
                                            .Tables(1).Delete
                                        Wend
                                        With .Find
                                            .ClearFormatting
                                            .Replacement.ClearFormatting
                                            .Forward = True
                                            .Wrap = 0 'wdFindStop
                                            .Format = False
                                            .MatchWildcards = True
                                            .Text = "[^13^l]{1,}"
                                            .Replacement.Text = Chr(182)
                                            .Execute Replace:=2 'wdReplaceAll
                                        End With
                                        
                                        Set nwdDoc = wdApp.Documents.Add
                                        
                                        If Len(.Text) > 1 Then
                                            .Copy
                                            
                                            With nwdDoc
                                            nwdDoc.Content.Paste
                                            nwdDoc.SaveAs2 wdDoc.Path & "_Extract_" & wdDoc.Name
                                            End With
                                        Else
                                            WkSht.Cells(i, 3).Value = "No Data"
                                        End If
                                    End With
                                Else
                                    WkSht.Cells(i, 3).Value = "Not Found"
                                End If
                            End With
                            .Close SaveChanges:=False
                        End With
                        Set wdRng = Nothing
                    End If
                End If
            Next
        End With
        wdApp.Quit
        Set wdDoc = Nothing: Set wdApp = Nothing: Set WkSht = Nothing: Set nwdDoc = Nothing
        Application.ScreenUpdating = True
    End Sub
    It perfectly does the job, however only for the first time the chapter I seek to extract. As I mentioned before, the chapter I try to get occurs several times in one source document. Any suggestions on how to change the code to make it copy all the sections that I need?

    Thank you for any help!

  3. #3
    Knowledge Base Approver VBAX Guru macropod's Avatar
    Joined
    Jul 2008
    Posts
    4,435
    Location
    You seem to be making this far harder than it needs to be:
    Sub Demo()
    Application.ScreenUpdating = False
    Dim DocSrc As Document, DocTgt As Document, Rng As Range
    Set DocSrc = ActiveDocument: Set DocTgt = Documents.Add
    With DocSrc.Range
      With .Find
        .ClearFormatting
        .Replacement.ClearFormatting
        .Text = ""
        .Replacement.Text = ""
        .Forward = True
        .Format = True
        .Style = wdStyleHeading2
        .Wrap = wdFindStop
        .MatchCase = False
        .MatchWholeWord = False
        .MatchWildcards = False
        .MatchSoundsLike = False
        .MatchAllWordForms = False
        .Execute
      End With
      Do While .Find.Found
        Set Rng = .Duplicate.GoTo(What:=wdGoToBookmark, Name:="\HeadingLevel")
        DocTgt.Characters.Last.FormattedText = Rng.FormattedText
        .End = Rng.End
        If .End = ActiveDocument.Range.End Then Exit Do
        .Collapse wdCollapseEnd
        .Find.Execute
      Loop
    End With
    Application.ScreenUpdating = True
    End Sub
    Cheers
    Paul Edstein
    [Fmr MS MVP - Word]

  4. #4
    Knowledge Base Approver VBAX Wizard
    Joined
    Apr 2012
    Posts
    5,642
    I wouldn't mind if you posted a sample Word Dcoument + indication which text in the document you need.

  5. #5
    VBAX Regular
    Joined
    Dec 2017
    Posts
    13
    Location
    Thanks macropod, I am working on implementing your solution. So far no success, I keep getting an error about the following line:

    .End = Rng.End
    It says, "Argument is not optional". Will keep trying to make it work.

  6. #6
    VBAX Regular
    Joined
    Dec 2017
    Posts
    13
    Location
    snb, I uploaded a picture of how the word files look above. Maybe that helps for now, I will try to upload a file later today, need to figure out how to first.

    The parts I try to extract and copy into a new document are the ones marked in yellow. Where in the sample "China" is the chapter I define (from which the text is to be extracted for this .docx) in the excel file from which I try to steer the process. In another file it could for example be that I need to extract the text content of the chapters titled "North America".

    Hope this helps in making my task clear.

    Thank you very much all, your help is highly appreciated!

  7. #7
    VBAX Regular
    Joined
    Dec 2017
    Posts
    13
    Location
    macropod, I have changed my code according to your input. It looks as written below now.

    I keep getting an error still with ".End = Rng.End": Argument is not optional.

    Am I forgetting something?

    Sub ExtractData()
        Application.ScreenUpdating = False
        Dim wdApp As Object
        Dim wdDoc As Document, nwDoc As Document
        Dim Rng As Range
        Dim WkSht As Worksheet
        Dim LRow As Long, i As Long
        
        Set WkSht = ThisWorkbook.Sheets("Sheet1")
        LRow = WkSht.UsedRange.Cells.SpecialCells(xlCellTypeLastCell).Row
        
        Set wdApp = CreateObject("Word.Application")
        
        If wdApp Is Nothing Then
            MsgBox "Can't start Word.", vbExclamation
            Exit Sub
        End If
        
        With WkSht
            For i = 1 To LRow
                If LCase(.Cells(i, 1).Text) = "true" Then
                    If Dir(.Cells(i, 2).Text, vbNormal) = "" Then
                        .Cells(i, 3).Value = "Please check File Location"
                    Else
                        Set wdDoc = wdApp.Documents.Open(Filename:=.Cells(i, 2).Text, _
                        AddToRecentFiles:=False, Visible:=False, ReadOnly:=False)
                        With wdDoc.Range
                                With .Find
                                    .ClearFormatting
                                    .Replacement.ClearFormatting
                                    .Text = WkSht.Cells(i, 4).Value
                                    .Replacement.Text = ""
                                    .Forward = True
                                    .Format = True
                                    .Style = wdStyleHeading2
                                    .Wrap = wdFindStop
                                    .MatchCase = False
                                    .MatchWholeWord = False
                                    .MatchWildcards = False
                                    .MatchSoundsLike = False
                                    .MatchAllWordForms = False
                                    .Execute
                                End With
                                Do While .Find.Found
                                    Set Rng = .Duplicate.GoTo(What:=wdGoToBookmark, Name:="\HeadingLevel")
                                    wdDoc.Characters.Last.FormattedText = Rng.FormattedText
                                    .End = Rng.End
                                    If .End = wdDoc.Range.End Then Exit Do
                                    .Collapse wdCollapseEnd
                                    .Find.Execute
                                Loop
                                    Set nwdDoc = wdApp.Documents.Add
                                    If Len(.Text) > 1 Then
                                    .Copy
                                    With nwdDoc
                                    nwdDoc.Content.Paste
                                    nwdDoc.SaveAs2 wdDoc.Path & "_Extract_" & wdDoc.Name
                                    End With
                                    Else
                                    WkSht.Cells(i, 3).Value = "No Data"
                                    End If
                            End With
                            wdDoc.Close SaveChanges:=False
                        End With
                    End If
                End If
            Next
        End With
        wdApp.Quit
        Set wdDoc = Nothing: Set wdApp = Nothing: Set WkSht = Nothing: Set nwdDoc = Nothing: Set Rng = Nothing
        Application.ScreenUpdating = True
    End Sub
    Thank you for your help!
    Last edited by StefKa; 12-13-2017 at 09:42 AM.

  8. #8
    VBAX Regular
    Joined
    Dec 2017
    Posts
    13
    Location
    snb, I uploaded the sample word document and it can be downloaded using the following link:

    https://ufile.io/97jbh

    I was not able to figure out how to attach files here. Hope that is okay to post like this.

    Thank you again!

  9. #9
    Knowledge Base Approver VBAX Guru macropod's Avatar
    Joined
    Jul 2008
    Posts
    4,435
    Location
    You could delete .End = Rng.End, but I do have to ask why you've messed around with the code instead of using it as-is.
    Cheers
    Paul Edstein
    [Fmr MS MVP - Word]

  10. #10
    VBAX Regular
    Joined
    Dec 2017
    Posts
    13
    Location
    Thank you macropod. I have changed this but keep now getting a Run-time error 13 - Type Mismatch on the following line:

    Set Rng = .Duplicate.GoTo(What:=wdGoToBookmark, Name:="\HeadingLevel")
    The entire code right now is given below. Also please let me try to explain again why I made some changes. As I have shown in the initial description, I have created a list in excel that contains all the path names of the untreated .docx files (about 2700 .docx files) and the specific chapters that need to be extracted from these into new documents (defined absolutely individually, so not the same chapters from each document). Therefore I have included in the code this "opening mechanism" so that the macro opens the files as detailed in the excel (first column) and then searches for the chapters as detailed in the second column. Besides this, I have only changed the saving process of the new documents so that for each old file one new file (containing only the text content of the selected chapters) is saved.

    Am I doing any of this wrong? I've done some research and it says that this error occurs due to mismatch of string and integer, however I am not able to identify any. To be honest, in my opinion the code should now work, but it doesn't.

    I hope this helps. By the way, sorry if any of my English is hard to read, I am originally from Germany and just reside in the US on a working student exchange; so I am a non-native speaker, but try hard to make it clear.

    Again, thank you and I really appreciate your help.

    Sub ExtractData()
    
    
        Application.ScreenUpdating = False
        
        Dim WdApp As Object
        Dim DocSrc As Document, DocTgt As Document
        Dim Rng As Range
        Dim WkSht As Worksheet
        Dim LRow As Long, i As Long
        
        Set WkSht = ThisWorkbook.Sheets("Sheet1")
        LRow = WkSht.UsedRange.Cells.SpecialCells(xlCellTypeLastCell).Row
        
        Set WdApp = CreateObject("Word.Application")
        
        If WdApp Is Nothing Then
            MsgBox "Can't start Word.", vbExclamation
            Exit Sub
        End If
        
        With WkSht
            For i = 1 To LRow
                If LCase(.Cells(i, 1).Text) = "true" Then
                    If Dir(.Cells(i, 2).Text, vbNormal) = "" Then
                        .Cells(i, 3).Value = "Please check File Location"
                    Else
                        
                        Set DocSrc = WdApp.Documents.Open(Filename:=.Cells(i, 2).Text, _
                        AddToRecentFiles:=False, Visible:=False, ReadOnly:=False)
                        Set DocTgt = WdApp.Documents.Add
                        
                        With DocSrc.Range
                                With .Find
                                    .ClearFormatting
                                    .Replacement.ClearFormatting
                                    .Text = WkSht.Cells(i, 4).Value
                                    .Replacement.Text = ""
                                    .Forward = True
                                    .Format = True
                                    .Style = wdStyleHeading2
                                    .Wrap = wdFindStop
                                    .MatchCase = False
                                    .MatchWholeWord = False
                                    .MatchWildcards = False
                                    .MatchSoundsLike = False
                                    .MatchAllWordForms = False
                                    .Execute
                                End With
                                Do While .Find.Found
                                    Set Rng = .Duplicate.GoTo(What:=wdGoToBookmark, Name:="\HeadingLevel")
                                    DocTgt.Characters.Last.FormattedText = Rng.FormattedText
                                    
                                    If .End = DocSrc.Range.End Then Exit Do
                                    .Collapse wdCollapseEnd
                                    .Find.Execute
                                Loop
                                    
                                    If Len(.Text) > 1 Then
                                    .Copy
                                    
                                    With DocTgt
                                    DocTgt.Content.Paste
                                    DocTgt.SaveAs2 DocSrc.Path & "_Extract_" & DocSrc.Name
                                    End With
                                    
                                    Else
                                    WkSht.Cells(i, 3).Value = "No Data"
                                    End If
                            
                            End With
                            DocSrc.Close SaveChanges:=False
                        
                    End If
                End If
            Next
        End With
        WdApp.Quit
        Set DocSrc = Nothing: Set WdApp = Nothing: Set WkSht = Nothing: Set DocTgt = Nothing: Set Rng = Nothing
        Application.ScreenUpdating = True
    End Sub

  11. #11
    VBAX Regular
    Joined
    Dec 2017
    Posts
    13
    Location
    This is what the excel file looks like right now. I slightly changed it to add some check in column 1, then the path to the original .docx files in column 2, column 3 is empty to print errors and column 4 is the title of the chapters of which I seek to extract the content from the .docx files (chapters might appear several times in one .docx). In reality this file contains roughly 2700 lines.

    Excel12.jpg

  12. #12
    Knowledge Base Approver VBAX Guru macropod's Avatar
    Joined
    Jul 2008
    Posts
    4,435
    Location
    Try the following.
    Sub Demo()
        Application.ScreenUpdating = False
        Dim DocSrc As Document, DocTgt As Document, Rng As Range, ArrCountries(), i As Long
        Set DocSrc = ActiveDocument: ArrCountries = Array("China", "North America", "South America", "EMEA")
        For i = 0 To UBound(ArrCountries)
          Set DocTgt = Documents.Add
          With DocSrc.Range
            With .Find
                .ClearFormatting
                .Replacement.ClearFormatting
                .Text = ArrCountries(i)
                .Replacement.Text = ""
                .Forward = True
                .Format = True
                .Style = wdStyleHeading2
                .Wrap = wdFindStop
                .MatchCase = False
                .MatchWholeWord = False
                .MatchWildcards = False
                .MatchSoundsLike = False
                .MatchAllWordForms = False
                .Execute
            End With
            Do While .Find.Found
                Set Rng = .Duplicate.GoTo(What:=wdGoToBookmark, Name:="\HeadingLevel")
                Rng.Start = Rng.Paragraphs.First.Range.End
                DocTgt.Characters.Last.FormattedText = Rng.FormattedText
                .End = Rng.End
                If .End = ActiveDocument.Range.End Then Exit Do
                .Collapse wdCollapseEnd
                .Find.Execute
            Loop
          End With
          With DocTgt
            If Len(.Range.Text) = 1 Then
              .Close False
            Else
              .Range.InsertBefore ArrCountries(i) & vbCr
              .Range.Paragraphs.First.Style = wdStyleHeading2
            End If
          End With
        Next
        Application.ScreenUpdating = True
    End Sub
    Note: There is no need to involve Excel at all. If you want to process all files in a folder, that too can be done.
    Cheers
    Paul Edstein
    [Fmr MS MVP - Word]

  13. #13
    VBAX Regular
    Joined
    Dec 2017
    Posts
    13
    Location
    macropod,

    thank you very much! Your code does exactly what I wanted. I have implemented it in word and the results are perfect, with the exception that some of the text is separated by many blanks (i.e. from page breaks). However I found that if I treated my original word documents with a small macro and removed all "^b" with find and replace "", then the text looks good except for maybe a few empty paragraphs sometimes, which are however also a result from the source document. I just read that apparently those blank paragraphs can also be removed by replacing "^13{2,}" with "^p", but have not found the time yet to implement (will do later).

    Now I do have however the problem that I cannot figure out how to open all files in the folder through the word macro. As stated above, I thought this had to be done by steering the process through excel. Over the course of the day I have tried to find solutions to open all files in the directory, run the macro and then save the extracted contents to one new .docx (i.e. one new file per source file).

    I have tried the follwing options:

    1. First, I wrote another macro which should then run the actual extraction macro on all files. This didn't work because I kept getting run-time errors in the code of your suggested macro then.

    2. Second, I tried implementing an approach using the following:

    vDirectory = "C:\Users\Stef\Desktop\Test\Documents\"
        vFile = Dir(vDirectory & "*.docx")
        
        Do While vFile <> ""
    
    ....
    
    Loop
    Again, I keep struggling because the parts I implement take the functionality of your code. In particular, the files I get in the end do not contain any text.

    3.Third, I tried an approach like this, yet no matter what without success.

    Set FSO = CreateObject("Scripting.FileSystemObject")
    Set Files = FSO.GetFolder("C:\Users\Stef\Desktop\Test\Documents\").Files
    Set WordApp = CreateObject("Word.Application")
    For Each Item In Files
    
    ...
    
    Loop
    Would you please be so kind and provide further help? I would really appreciate that.

    Thank you and best regards

  14. #14
    Knowledge Base Approver VBAX Guru macropod's Avatar
    Joined
    Jul 2008
    Posts
    4,435
    Location
    Try:
    Sub CreateCountryDocuments()
    Application.ScreenUpdating = False
    Dim strFolder As String, strFile As String, strDocNm As String, i As Long
    Dim DocSrc As Document, DocTgt As Document, Rng As Range, ArrCountries()
    ArrCountries = Array("China", "North America", "South America", "EMEA")
    strDocNm = ActiveDocument.FullName: strFolder = GetFolder: If strFolder = "" Then Exit Sub
    strFile = Dir(strFolder & "\*.doc", vbNormal)
    'Create the output documents
    For i = 0 To UBound(ArrCountries)
      Set DocTgt = Documents.Add
      With DocTgt
        .Range.InsertBefore ArrCountries(i) & vbCr
        .Range.Paragraphs.First.Style = wdStyleHeading2
        .SaveAs2 FileName:=strFolder & "\" & ArrCountries(i) & ".docx", Fileformat:=wdFormatXMLDocument, AddToRecentFiles:=False
      End With
    Next
    'Process all documents in the source folder
    While strFile <> ""
      If strFolder & "\" & strFile <> strDocNm Then
        Set DocSrc = Documents.Open(FileName:=strFolder & "\" & strFile, AddToRecentFiles:=False, Visible:=False)
        For i = 0 To UBound(ArrCountries)
          'Point to the appropriate output document
          Set DocTgt = Documents(ArrCountries(i) & ".docx")
          'Locate the relevant headings in the current source document
          With DocSrc
            With .Range
              With .Find
                    .ClearFormatting
                    .Replacement.ClearFormatting
                    .Text = ArrCountries(i)
                    .Replacement.Text = ""
                    .Forward = True
                    .Format = True
                    .Style = wdStyleHeading2
                    .Wrap = wdFindStop
                    .MatchCase = False
                    .MatchWholeWord = False
                    .MatchWildcards = False
                    .MatchSoundsLike = False
                    .MatchAllWordForms = False
                    .Execute
              End With
              Do While .Find.Found
                'Point to the found content
                Set Rng = .Duplicate.GoTo(What:=wdGoToBookmark, Name:="\HeadingLevel")
                'Get the range after the actual heading
                Rng.Start = Rng.Paragraphs.First.Range.End
                'Update the output document
                DocTgt.Characters.Last.FormattedText = Rng.FormattedText
                .End = Rng.End
                If .End = ActiveDocument.Range.End Then Exit Do
                .Collapse wdCollapseEnd
                .Find.Execute
              Loop
            End With
          End With
        Next
        'close the current source document
        DocSrc.Close SaveChanges:=False
      End If
      strFile = Dir()
    Wend
    'Clean-up & save the output documents
    For i = 0 To UBound(ArrCountries)
    'Point to the appropriate output document
      Set DocTgt = Documents(ArrCountries(i) & ".docx")
      With DocTgt
        'Clean up the output
        With .Range.Find
          .ClearFormatting
          .Replacement.ClearFormatting
          .Forward = True
          .Wrap = wdFindContinue
          .Format = False
          .MatchWildcards = True
          .Text = "^m"
          .Replacement.Text = "^p"
          .Execute Replace:=wdReplaceAll
          .Text = "[^13]{2,}"
          .Execute Replace:=wdReplaceAll
        End With
        'Save & close the output document
        .Close SaveChanges:=True
      End With
    Next
    Set Rng = Nothing: Set DocSrc = Nothing: Set DocTgt = Nothing
    Application.ScreenUpdating = True
    End Sub
    
    
    Function GetFolder() As String
    Dim oFolder As Object
    GetFolder = ""
    Set oFolder = CreateObject("Shell.Application").BrowseForFolder(0, "Choose a folder", 0)
    If (Not oFolder Is Nothing) Then GetFolder = oFolder.Items.Item.Path
    Set oFolder = Nothing
    End Function
    Cheers
    Paul Edstein
    [Fmr MS MVP - Word]

  15. #15
    VBAX Regular
    Joined
    Dec 2017
    Posts
    13
    Location
    macropod,

    thank you again, I appreciate your help. This is very kind.

    I have tried your code and there are two problems that are probably connected. First, the code goes through all the documents that are in the directory and copies the text from all documents into only one file. In other words, the text that is written under the chapters "China" in e.g. four documents is gathered in one document with the title China. I do however need the number of files to stay the same, i.e. for each old file, I need one new "filtered" file, preferably with the same file name as the old one, maybe with the extention "_Clean.docx". This should be possible I believe because from each file, only the text under one heading (note: which may of course occur several times in that document) needs to be extracted. Also, and maybe this is connected, the code has only done the aforementioned job, when I have only one country in the array. Once I add more, than one, it provided all new target files empty, i.e. without any text. I think the problem is, that it tries to access the target file through several instances of word at the same time and write to it, which then causes an error. I have tried making some changes to the code but without success.

    Is there a solution, or is my problem simply not solvable using VBA?

    Thank you and best regards

  16. #16
    Knowledge Base Approver VBAX Guru macropod's Avatar
    Joined
    Jul 2008
    Posts
    4,435
    Location
    If you're creating a new output file for each country in the input files, how are those output files to be named? Since you can't have two files with the same name, you might nave to do something like:
    Input filename = File1.docx
    Output filenames = File1-China.docx, File1-North America.docx, File1-South America.docx, File1-EMEA.docx
    Input filename = File2.docx
    Output filenames = File2-China.docx, File2-North America.docx, File2-South America.docx, File2-EMEA.docx
    etc.
    Cheers
    Paul Edstein
    [Fmr MS MVP - Word]

  17. #17
    VBAX Regular
    Joined
    Dec 2017
    Posts
    13
    Location
    Is there no option to leave the files with their old name, so for instance before "Global Company Report Full Century.docx" and after "Global Company Report Full Century_Clean.docx" (or "Client Report 2014.docx" and "Client Report 2014_Clean.docx") where I only have filtered the "Clean" file for e.g. China?

    The thing is, the "filters" (= keywords) I will use are also very diverse and basically unique for the (unfiltered) .docx files. E.g. some documents I need customer names instead of countries, some I use client names, etc. So in each file only one of my keywords (which I specify in the array) will match and this is then the content that I want to have left in that file (i.e. in the new "Clean" version).

    Does that make sense?

    So for example keyword "John Doe" from the array would e.g. only be found once in "Customer Reports 2001.docx", "Customer Reports 2002.docx", "Customer Reports 2003.docx", "Customer Reports 2004.docx" and then I would get four new files with "Customer Reports 2001_Clean.docx", "Customer Reports 2002_Clean.docx", "Customer Reports 2003_Clean.docx", "Customer Reports 2004_Clean.docx" which all contain only the Chapters titled "John Doe" before. In the array, there will be no other client name, which could be found in any of the reports (just other stuff which will not match in these four .docx files; this is what I mean by "the keywords are specific"). I hope this makes sense, I just made this up to maybe help make it clear.

    Thank you so much!

  18. #18
    Knowledge Base Approver VBAX Guru macropod's Avatar
    Joined
    Jul 2008
    Posts
    4,435
    Location
    Quote Originally Posted by StefKa View Post
    Is there no option to leave the files with their old name, so for instance before "Global Company Report Full Century.docx" and after "Global Company Report Full Century_Clean.docx" (or "Client Report 2014.docx" and "Client Report 2014_Clean.docx") where I only have filtered the "Clean" file for e.g. China?

    The thing is, the "filters" (= keywords) I will use are also very diverse and basically unique for the (unfiltered) .docx files. E.g. some documents I need customer names instead of countries, some I use client names, etc. So in each file only one of my keywords (which I specify in the array) will match and this is then the content that I want to have left in that file (i.e. in the new "Clean" version).
    Sure, but all your discussion to date has given the impression you're wanting to create separate files for a list of countries, not just extracting data for one country - or a single customer name. Even then, if you extract the details for one country or customer to a new 'Clean' document, then do a second one, the second one will simply overwrite the first. Using your "John Doe" example, if you then did the same for "Jane Doe", you'd lose some/all of the "John Doe" files.
    Cheers
    Paul Edstein
    [Fmr MS MVP - Word]

  19. #19
    VBAX Regular
    Joined
    Dec 2017
    Posts
    13
    Location
    I am sorry if I left a wrong impression of what I was trying to do. I thought it would make sense to post the problem rather abstract and simple to make it easier to understand. Maybe I can briefly try again.

    The files would be for example like this (again abstract example):

    File 1: Names from A to D (Year 1)
    File 2: Names from A to D (Year 2)
    File 3: Names from A to D (Year 3)
    File 4: Names from A to D (Year 4)

    File 5: Names from E to J (Year 1)
    File 6: Names from E to J (Year 2)
    File 7: Names from E to J (Year 3)
    File 8: Names from E to J (Year 4)

    File 9: Names from K to T (Year 1)
    File 10: Names from K to T (Year 2)
    File 11: Names from K to T (Year 3)
    File 12: Names from K to T (Year 4)

    File 13: Names from U to Z (Year 1)
    File 14: Names from U to Z (Year 2)
    File 15: Names from U to Z (Year 3)
    File 16: Names from U to Z (Year 4)

    Now if I enter in the array the names { Albert, Frank, Tom, Will }. Wouldn't it be able to give back specific clean files, since the name Albert can not be in any other files but Files 1 to 4, etc. This example could also be constructed with countries etc. (as is in my case to be done but with actually many variables and differing ones).

    So once the clean files are extracted, with the specification, they will be removed from the directory and then a new array is inserted by the user and a new set of clean files is created. This is how the process is thought to be implemented by me and I think then there wouldn't be a problem with loosing data through overwrite, or am I wrong?

    Thank you

    Stef

  20. #20
    Knowledge Base Approver VBAX Guru macropod's Avatar
    Joined
    Jul 2008
    Posts
    4,435
    Location
    Quote Originally Posted by StefKa View Post
    So once the clean files are extracted, with the specification, they will be removed from the directory and then a new array is inserted by the user and a new set of clean files is created. This is how the process is thought to be implemented by me and I think then there wouldn't be a problem with loosing data through overwrite, or am I wrong?
    Provided you can be sure that's how it will work in practice, it should be OK. Still, I wouldn't want to have files with the same names in multiple folders...

    Try:
    Sub CreateExtracts()
        Application.ScreenUpdating = False
        Dim strFolder As String, strFile As String, strDocNm As String, i As Long
        Dim DocSrc As Document, DocTgt As Document, Rng As Range, StrFilter As String
        'get the source folder
        strFolder = GetFolder: If strFolder = "" Then Exit Sub
        'Get the filtering criterion
        StrFilter = Trim(InputBox("Please input the required filter (e.g. country, company)", "Extract Filter"))
        strDocNm = ActiveDocument.FullName: If StrFilter = "" Then Exit Sub
        'Process all documents in the source folder
        strFile = Dir(strFolder & "\*.doc", vbNormal)
        While strFile <> ""
            If strFolder & "\" & strFile <> strDocNm Then
                Set DocSrc = Documents.Open(FileName:=strFolder & "\" & strFile, AddToRecentFiles:=False, Visible:=False)
                'Create the output document
                Set DocTgt = Documents.Add '(Visible:=False)
                With DocTgt
                  .Range.InsertBefore StrFilter & vbCr
                  .Range.Paragraphs.First.Style = wdStyleHeading2
                  .SaveAs2 FileName:=Split(DocSrc.FullName, ".doc")(0) & "_Clean.docx", Fileformat:=wdFormatXMLDocument, _
                    AddToRecentFiles:=False
                End With
                'Locate the relevant headings in the current source document
                With DocSrc
                  With .Range
                    With .Find
                      .ClearFormatting
                      .Replacement.ClearFormatting
                      .Text = StrFilter
                      .Replacement.Text = ""
                      .Forward = True
                      .Format = True
                      .Style = wdStyleHeading2
                      .Wrap = wdFindStop
                      .MatchCase = False
                      .MatchWholeWord = False
                      .MatchWildcards = False
                      .MatchSoundsLike = False
                      .MatchAllWordForms = False
                      .Execute
                    End With
                    Do While .Find.Found
                      'Point to the found content
                      Set Rng = .Duplicate.GoTo(What:=wdGoToBookmark, Name:="\HeadingLevel")
                      'Get the range after the actual heading
                      Rng.Start = Rng.Paragraphs.First.Range.End
                      'Update the output document
                      DocTgt.Characters.Last.FormattedText = Rng.FormattedText
                      .End = Rng.End
                      If .End = ActiveDocument.Range.End Then Exit Do
                      .Collapse wdCollapseEnd
                      .Find.Execute
                    Loop
                  End With
                  'close the current source document
                  .Close SaveChanges:=False
                End With
            End If
            With DocTgt
                 'Clean up the output
                With .Range.Find
                    .ClearFormatting
                    .Replacement.ClearFormatting
                    .Forward = True
                    .Wrap = wdFindContinue
                    .Format = False
                    .MatchWildcards = True
                    .Text = "^m"
                    .Replacement.Text = "^p"
                    .Execute Replace:=wdReplaceAll
                    .Text = "[^13]{2,}"
                    .Execute Replace:=wdReplaceAll
                End With
                 'Save & close the output document
                .Close SaveChanges:=True
            End With
            strFile = Dir()
        Wend
        Set Rng = Nothing: Set DocSrc = Nothing: Set DocTgt = Nothing
        Application.ScreenUpdating = True
    End Sub
     
    Function GetFolder() As String
        Dim oFolder As Object
        GetFolder = ""
        Set oFolder = CreateObject("Shell.Application").BrowseForFolder(0, "Choose a folder", 0)
        If (Not oFolder Is Nothing) Then GetFolder = oFolder.Items.Item.Path
        Set oFolder = Nothing
    End Function
    Cheers
    Paul Edstein
    [Fmr MS MVP - Word]

Tags for this Thread

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •