PDA

View Full Version : [SOLVED:] Finding heading of chapters in .docx file and copying paragraphs to new .docx



StefKa
12-10-2017, 09:39 AM
Hello community,

I have a large amount of .docx files, which I would like to reduce to a more readable size each. In each file, there is one heading that appears several times, always formated as a 'Heading 2' style. I look for this specific heading which as I said occurs several times in a document and I want to copy all the text parts in just these chapters with the respective heading to a new word document (so I have the same amount of files in the end, however with only the content that I want in them).

To start with I wrote a code in Python that copies the full list of all files with their directory to my excel file in column A. Now in column B, I have defined the different headings of the paragraphs that I need to extract from the files (see example below).

21143
The word files look something like in the picture below and I have indicated the text sections that I would need to copy to the new reduced file and also some hints regarding formatting.

21144

Now to approach this problem I have written the following VBA, which I created also using some of the input provided to others on this platform before.



Sub SelectData()

Application.ScreenUpdating = False

Dim WdApp As Word.Application
Set WdApp = CreateObject("Word.Application")

Dim Doc As Word.Document
Dim NewDoc As Word.Document

Dim HeadingToFind As String
Dim ChapterToFind As String
Dim StartRange As Long
Dim EndRange As Long

Dim WkSht As Worksheet

Dim LRow As Long
Dim i As Long

Set WkSht = ThisWorkbook.Sheets("Sheet1")
LRow = WkSht.UsedRange.Cells.SpecialCells(xlCellTypeLastCell).Row

With WkSht
For i = 1 To LRow
If Dir(.Cells(i, 1).Text, vbNormal) = "" Then
.Cells(i, 3).Value = "Please check File Location"
Else
Set Doc = WdApp.Documents.Open(Filename:=.Cells(i, 1).Text, _
AddToRecentFiles:=False, Visible:=False, ReadOnly:=False)

Set NewDoc = Documents.Add

ChapterToFind = LCase(.Cells(i, 2).Text)

With Doc

Selection.HomeKey Unit:=wdStory

With Selection

With .Find
.ClearFormatting
.Text = ChapterToFind
.MatchWildcards = False
.MatchCase = True
.Execute
End With

If .Find.Found Then
.Collapse wdCollapseStart
With .Find
.Text = ""
.Style = "Heading 2"
.Forward = False
.Execute
End With

.MoveDown Count:=1
.HomeKey Unit:=wdLine
StartRange = .Start


.Find.Forward = True
.Find.Execute
.Collapse wdCollapseStart
.MoveUp Count:=1
.EndKey Unit:=wdLine
EndRange = .End

Doc.Range(StartRange, EndRange).Copy
NewDoc.Content.Paste
NewDoc.SaveAs2 Doc.Path & "Clean" & ".docx", wdFormatFlatXML
Else
WkSht.Cells(i, 4).Value = "Error Chapter Not Found"
End If

End With

End With
WdApp.Quit
Set Doc = Nothing: Set NewDoc = Nothing: Set WdApp = Nothing: Set WkSht = Nothing
Application.ScreenUpdating = True

End If

Next

End With


End Sub


Now I struggle with getting this to work. Particularly I get a runtime error 438, where the issue is with the "Selection.HomeKey".
Also I have not been able to find any idea or solution for the problem that the Chapters that I try to copy the content to the new files may occur several times within one document. As a result, I would really appreciate any help.

I want to let you know that I have first posted this problem on a different platform (stackoverflow) before, when I tried to approach the problem using Python (I cannot post a link but if you search for "Finding a heading in word file and copying entire paragraph..." you will find it). Subsequently I have posted there also asking for help with the VBA solution (again, I cannot post links but if you go there and search for "Finding heading of chapters in word file and copying individual paragraphs..." you will see it). However, I have not been able to progress. I have understood that the ".Selection" needs to be linked to the word application, but have not succeeded in implementing this. Overall, I am in dire need of a solution as I am under time pressure, therefore I am now posting here with high hopes.

Thank you for any help in advance!

StefKa
12-12-2017, 08:21 AM
Update:

I got it to work to a certain extent using the following code:


Sub ExtractData()
Application.ScreenUpdating = False
Dim wdApp As Object, wdDoc As Object, wdRng As Object, nwdDoc
Dim WkSht As Worksheet, LRow As Long, i As Long
Set WkSht = ThisWorkbook.Sheets("Sheet1")
LRow = WkSht.UsedRange.Cells.SpecialCells(xlCellTypeLastCell).Row
Set wdApp = CreateObject("Word.Application")
If wdApp Is Nothing Then
MsgBox "Can't start Word.", vbExclamation
Exit Sub
End If
With WkSht
For i = 1 To LRow
If LCase(.Cells(i, 1).Text) = "true" Then
If Dir(.Cells(i, 2).Text, vbNormal) = "" Then
.Cells(i, 3).Value = "Please check File Location"
Else
Set wdDoc = wdApp.Documents.Open(Filename:=.Cells(i, 2).Text, _
AddToRecentFiles:=False, Visible:=False, ReadOnly:=False)
With wdDoc
With .Range
With .Find
.ClearFormatting
.Replacement.ClearFormatting
.Forward = True
.Wrap = 0 'wdFindStop
.Format = True
.Style = "Heading 2"
.MatchWildcards = False
.MatchCase = False
.Text = WkSht.Cells(i, 4).Value
.Replacement.Text = ""
.Execute
End With
If .Find.Found Then
Set wdRng = .Duplicate
wdRng.Collapse 0 'wdCollapseEnd
End If
.Start = wdRng.End
With .Find
.Style = "Heading 2"
.Text = ""
.Execute
End With
If .Find.Found Then
wdRng.End = .Duplicate.Start - 1
End If
If Not wdRng Is Nothing Then
With wdRng
While .Tables.Count > 0
.Tables(1).Delete
Wend
With .Find
.ClearFormatting
.Replacement.ClearFormatting
.Forward = True
.Wrap = 0 'wdFindStop
.Format = False
.MatchWildcards = True
.Text = "[^13^l]{1,}"
.Replacement.Text = Chr(182)
.Execute Replace:=2 'wdReplaceAll
End With

Set nwdDoc = wdApp.Documents.Add

If Len(.Text) > 1 Then
.Copy

With nwdDoc
nwdDoc.Content.Paste
nwdDoc.SaveAs2 wdDoc.Path & "_Extract_" & wdDoc.Name
End With
Else
WkSht.Cells(i, 3).Value = "No Data"
End If
End With
Else
WkSht.Cells(i, 3).Value = "Not Found"
End If
End With
.Close SaveChanges:=False
End With
Set wdRng = Nothing
End If
End If
Next
End With
wdApp.Quit
Set wdDoc = Nothing: Set wdApp = Nothing: Set WkSht = Nothing: Set nwdDoc = Nothing
Application.ScreenUpdating = True
End Sub

It perfectly does the job, however only for the first time the chapter I seek to extract. As I mentioned before, the chapter I try to get occurs several times in one source document. Any suggestions on how to change the code to make it copy all the sections that I need?

Thank you for any help!

macropod
12-12-2017, 06:21 PM
You seem to be making this far harder than it needs to be:

Sub Demo()
Application.ScreenUpdating = False
Dim DocSrc As Document, DocTgt As Document, Rng As Range
Set DocSrc = ActiveDocument: Set DocTgt = Documents.Add
With DocSrc.Range
With .Find
.ClearFormatting
.Replacement.ClearFormatting
.Text = ""
.Replacement.Text = ""
.Forward = True
.Format = True
.Style = wdStyleHeading2
.Wrap = wdFindStop
.MatchCase = False
.MatchWholeWord = False
.MatchWildcards = False
.MatchSoundsLike = False
.MatchAllWordForms = False
.Execute
End With
Do While .Find.Found
Set Rng = .Duplicate.GoTo(What:=wdGoToBookmark, Name:="\HeadingLevel")
DocTgt.Characters.Last.FormattedText = Rng.FormattedText
.End = Rng.End
If .End = ActiveDocument.Range.End Then Exit Do
.Collapse wdCollapseEnd
.Find.Execute
Loop
End With
Application.ScreenUpdating = True
End Sub

snb
12-13-2017, 03:29 AM
I wouldn't mind if you posted a sample Word Dcoument + indication which text in the document you need.

StefKa
12-13-2017, 08:13 AM
Thanks macropod, I am working on implementing your solution. So far no success, I keep getting an error about the following line:


.End = Rng.End

It says, "Argument is not optional". Will keep trying to make it work.

StefKa
12-13-2017, 08:17 AM
snb, I uploaded a picture of how the word files look above. Maybe that helps for now, I will try to upload a file later today, need to figure out how to first.

The parts I try to extract and copy into a new document are the ones marked in yellow. Where in the sample "China" is the chapter I define (from which the text is to be extracted for this .docx) in the excel file from which I try to steer the process. In another file it could for example be that I need to extract the text content of the chapters titled "North America".

Hope this helps in making my task clear.

Thank you very much all, your help is highly appreciated!

StefKa
12-13-2017, 09:26 AM
macropod, I have changed my code according to your input. It looks as written below now.

I keep getting an error still with ".End = Rng.End": Argument is not optional.

Am I forgetting something?


Sub ExtractData()
Application.ScreenUpdating = False
Dim wdApp As Object
Dim wdDoc As Document, nwDoc As Document
Dim Rng As Range
Dim WkSht As Worksheet
Dim LRow As Long, i As Long

Set WkSht = ThisWorkbook.Sheets("Sheet1")
LRow = WkSht.UsedRange.Cells.SpecialCells(xlCellTypeLastCell).Row

Set wdApp = CreateObject("Word.Application")

If wdApp Is Nothing Then
MsgBox "Can't start Word.", vbExclamation
Exit Sub
End If

With WkSht
For i = 1 To LRow
If LCase(.Cells(i, 1).Text) = "true" Then
If Dir(.Cells(i, 2).Text, vbNormal) = "" Then
.Cells(i, 3).Value = "Please check File Location"
Else
Set wdDoc = wdApp.Documents.Open(Filename:=.Cells(i, 2).Text, _
AddToRecentFiles:=False, Visible:=False, ReadOnly:=False)
With wdDoc.Range
With .Find
.ClearFormatting
.Replacement.ClearFormatting
.Text = WkSht.Cells(i, 4).Value
.Replacement.Text = ""
.Forward = True
.Format = True
.Style = wdStyleHeading2
.Wrap = wdFindStop
.MatchCase = False
.MatchWholeWord = False
.MatchWildcards = False
.MatchSoundsLike = False
.MatchAllWordForms = False
.Execute
End With
Do While .Find.Found
Set Rng = .Duplicate.GoTo(What:=wdGoToBookmark, Name:="\HeadingLevel")
wdDoc.Characters.Last.FormattedText = Rng.FormattedText
.End = Rng.End
If .End = wdDoc.Range.End Then Exit Do
.Collapse wdCollapseEnd
.Find.Execute
Loop
Set nwdDoc = wdApp.Documents.Add
If Len(.Text) > 1 Then
.Copy
With nwdDoc
nwdDoc.Content.Paste
nwdDoc.SaveAs2 wdDoc.Path & "_Extract_" & wdDoc.Name
End With
Else
WkSht.Cells(i, 3).Value = "No Data"
End If
End With
wdDoc.Close SaveChanges:=False
End With
End If
End If
Next
End With
wdApp.Quit
Set wdDoc = Nothing: Set wdApp = Nothing: Set WkSht = Nothing: Set nwdDoc = Nothing: Set Rng = Nothing
Application.ScreenUpdating = True
End Sub


Thank you for your help!

StefKa
12-13-2017, 09:27 AM
snb, I uploaded the sample word document and it can be downloaded using the following link:

https://ufile.io/97jbh

I was not able to figure out how to attach files here. Hope that is okay to post like this.

Thank you again!

macropod
12-13-2017, 12:14 PM
You could delete .End = Rng.End, but I do have to ask why you've messed around with the code instead of using it as-is.

StefKa
12-13-2017, 07:42 PM
Thank you macropod. I have changed this but keep now getting a Run-time error 13 - Type Mismatch on the following line:


Set Rng = .Duplicate.GoTo(What:=wdGoToBookmark, Name:="\HeadingLevel")

The entire code right now is given below. Also please let me try to explain again why I made some changes. As I have shown in the initial description, I have created a list in excel that contains all the path names of the untreated .docx files (about 2700 .docx files) and the specific chapters that need to be extracted from these into new documents (defined absolutely individually, so not the same chapters from each document). Therefore I have included in the code this "opening mechanism" so that the macro opens the files as detailed in the excel (first column) and then searches for the chapters as detailed in the second column. Besides this, I have only changed the saving process of the new documents so that for each old file one new file (containing only the text content of the selected chapters) is saved.

Am I doing any of this wrong? I've done some research and it says that this error occurs due to mismatch of string and integer, however I am not able to identify any. To be honest, in my opinion the code should now work, but it doesn't.

I hope this helps. By the way, sorry if any of my English is hard to read, I am originally from Germany and just reside in the US on a working student exchange; so I am a non-native speaker, but try hard to make it clear.

Again, thank you and I really appreciate your help.


Sub ExtractData()


Application.ScreenUpdating = False

Dim WdApp As Object
Dim DocSrc As Document, DocTgt As Document
Dim Rng As Range
Dim WkSht As Worksheet
Dim LRow As Long, i As Long

Set WkSht = ThisWorkbook.Sheets("Sheet1")
LRow = WkSht.UsedRange.Cells.SpecialCells(xlCellTypeLastCell).Row

Set WdApp = CreateObject("Word.Application")

If WdApp Is Nothing Then
MsgBox "Can't start Word.", vbExclamation
Exit Sub
End If

With WkSht
For i = 1 To LRow
If LCase(.Cells(i, 1).Text) = "true" Then
If Dir(.Cells(i, 2).Text, vbNormal) = "" Then
.Cells(i, 3).Value = "Please check File Location"
Else

Set DocSrc = WdApp.Documents.Open(Filename:=.Cells(i, 2).Text, _
AddToRecentFiles:=False, Visible:=False, ReadOnly:=False)
Set DocTgt = WdApp.Documents.Add

With DocSrc.Range
With .Find
.ClearFormatting
.Replacement.ClearFormatting
.Text = WkSht.Cells(i, 4).Value
.Replacement.Text = ""
.Forward = True
.Format = True
.Style = wdStyleHeading2
.Wrap = wdFindStop
.MatchCase = False
.MatchWholeWord = False
.MatchWildcards = False
.MatchSoundsLike = False
.MatchAllWordForms = False
.Execute
End With
Do While .Find.Found
Set Rng = .Duplicate.GoTo(What:=wdGoToBookmark, Name:="\HeadingLevel")
DocTgt.Characters.Last.FormattedText = Rng.FormattedText

If .End = DocSrc.Range.End Then Exit Do
.Collapse wdCollapseEnd
.Find.Execute
Loop

If Len(.Text) > 1 Then
.Copy

With DocTgt
DocTgt.Content.Paste
DocTgt.SaveAs2 DocSrc.Path & "_Extract_" & DocSrc.Name
End With

Else
WkSht.Cells(i, 3).Value = "No Data"
End If

End With
DocSrc.Close SaveChanges:=False

End If
End If
Next
End With
WdApp.Quit
Set DocSrc = Nothing: Set WdApp = Nothing: Set WkSht = Nothing: Set DocTgt = Nothing: Set Rng = Nothing
Application.ScreenUpdating = True
End Sub

StefKa
12-13-2017, 07:53 PM
This is what the excel file looks like right now. I slightly changed it to add some check in column 1, then the path to the original .docx files in column 2, column 3 is empty to print errors and column 4 is the title of the chapters of which I seek to extract the content from the .docx files (chapters might appear several times in one .docx). In reality this file contains roughly 2700 lines.

21169

macropod
12-13-2017, 10:10 PM
Try the following.

Sub Demo()
Application.ScreenUpdating = False
Dim DocSrc As Document, DocTgt As Document, Rng As Range, ArrCountries(), i As Long
Set DocSrc = ActiveDocument: ArrCountries = Array("China", "North America", "South America", "EMEA")
For i = 0 To UBound(ArrCountries)
Set DocTgt = Documents.Add
With DocSrc.Range
With .Find
.ClearFormatting
.Replacement.ClearFormatting
.Text = ArrCountries(i)
.Replacement.Text = ""
.Forward = True
.Format = True
.Style = wdStyleHeading2
.Wrap = wdFindStop
.MatchCase = False
.MatchWholeWord = False
.MatchWildcards = False
.MatchSoundsLike = False
.MatchAllWordForms = False
.Execute
End With
Do While .Find.Found
Set Rng = .Duplicate.GoTo(What:=wdGoToBookmark, Name:="\HeadingLevel")
Rng.Start = Rng.Paragraphs.First.Range.End
DocTgt.Characters.Last.FormattedText = Rng.FormattedText
.End = Rng.End
If .End = ActiveDocument.Range.End Then Exit Do
.Collapse wdCollapseEnd
.Find.Execute
Loop
End With
With DocTgt
If Len(.Range.Text) = 1 Then
.Close False
Else
.Range.InsertBefore ArrCountries(i) & vbCr
.Range.Paragraphs.First.Style = wdStyleHeading2
End If
End With
Next
Application.ScreenUpdating = True
End Sub
Note: There is no need to involve Excel at all. If you want to process all files in a folder, that too can be done.

StefKa
12-14-2017, 05:36 PM
macropod,

thank you very much! Your code does exactly what I wanted. I have implemented it in word and the results are perfect, with the exception that some of the text is separated by many blanks (i.e. from page breaks). However I found that if I treated my original word documents with a small macro and removed all "^b" with find and replace "", then the text looks good except for maybe a few empty paragraphs sometimes, which are however also a result from the source document. I just read that apparently those blank paragraphs can also be removed by replacing "^13{2,}" with "^p", but have not found the time yet to implement (will do later).

Now I do have however the problem that I cannot figure out how to open all files in the folder through the word macro. As stated above, I thought this had to be done by steering the process through excel. Over the course of the day I have tried to find solutions to open all files in the directory, run the macro and then save the extracted contents to one new .docx (i.e. one new file per source file).

I have tried the follwing options:

1. First, I wrote another macro which should then run the actual extraction macro on all files. This didn't work because I kept getting run-time errors in the code of your suggested macro then.

2. Second, I tried implementing an approach using the following:



vDirectory = "C:\Users\Stef\Desktop\Test\Documents\"
vFile = Dir(vDirectory & "*.docx")

Do While vFile <> ""

....

Loop


Again, I keep struggling because the parts I implement take the functionality of your code. In particular, the files I get in the end do not contain any text.

3.Third, I tried an approach like this, yet no matter what without success.



Set FSO = CreateObject("Scripting.FileSystemObject")
Set Files = FSO.GetFolder("C:\Users\Stef\Desktop\Test\Documents\").Files
Set WordApp = CreateObject("Word.Application")
For Each Item In Files

...

Loop



Would you please be so kind and provide further help? I would really appreciate that.

Thank you and best regards

macropod
12-14-2017, 10:00 PM
Try:

Sub CreateCountryDocuments()
Application.ScreenUpdating = False
Dim strFolder As String, strFile As String, strDocNm As String, i As Long
Dim DocSrc As Document, DocTgt As Document, Rng As Range, ArrCountries()
ArrCountries = Array("China", "North America", "South America", "EMEA")
strDocNm = ActiveDocument.FullName: strFolder = GetFolder: If strFolder = "" Then Exit Sub
strFile = Dir(strFolder & "\*.doc", vbNormal)
'Create the output documents
For i = 0 To UBound(ArrCountries)
Set DocTgt = Documents.Add
With DocTgt
.Range.InsertBefore ArrCountries(i) & vbCr
.Range.Paragraphs.First.Style = wdStyleHeading2
.SaveAs2 FileName:=strFolder & "\" & ArrCountries(i) & ".docx", Fileformat:=wdFormatXMLDocument, AddToRecentFiles:=False
End With
Next
'Process all documents in the source folder
While strFile <> ""
If strFolder & "\" & strFile <> strDocNm Then
Set DocSrc = Documents.Open(FileName:=strFolder & "\" & strFile, AddToRecentFiles:=False, Visible:=False)
For i = 0 To UBound(ArrCountries)
'Point to the appropriate output document
Set DocTgt = Documents(ArrCountries(i) & ".docx")
'Locate the relevant headings in the current source document
With DocSrc
With .Range
With .Find
.ClearFormatting
.Replacement.ClearFormatting
.Text = ArrCountries(i)
.Replacement.Text = ""
.Forward = True
.Format = True
.Style = wdStyleHeading2
.Wrap = wdFindStop
.MatchCase = False
.MatchWholeWord = False
.MatchWildcards = False
.MatchSoundsLike = False
.MatchAllWordForms = False
.Execute
End With
Do While .Find.Found
'Point to the found content
Set Rng = .Duplicate.GoTo(What:=wdGoToBookmark, Name:="\HeadingLevel")
'Get the range after the actual heading
Rng.Start = Rng.Paragraphs.First.Range.End
'Update the output document
DocTgt.Characters.Last.FormattedText = Rng.FormattedText
.End = Rng.End
If .End = ActiveDocument.Range.End Then Exit Do
.Collapse wdCollapseEnd
.Find.Execute
Loop
End With
End With
Next
'close the current source document
DocSrc.Close SaveChanges:=False
End If
strFile = Dir()
Wend
'Clean-up & save the output documents
For i = 0 To UBound(ArrCountries)
'Point to the appropriate output document
Set DocTgt = Documents(ArrCountries(i) & ".docx")
With DocTgt
'Clean up the output
With .Range.Find
.ClearFormatting
.Replacement.ClearFormatting
.Forward = True
.Wrap = wdFindContinue
.Format = False
.MatchWildcards = True
.Text = "^m"
.Replacement.Text = "^p"
.Execute Replace:=wdReplaceAll
.Text = "[^13]{2,}"
.Execute Replace:=wdReplaceAll
End With
'Save & close the output document
.Close SaveChanges:=True
End With
Next
Set Rng = Nothing: Set DocSrc = Nothing: Set DocTgt = Nothing
Application.ScreenUpdating = True
End Sub


Function GetFolder() As String
Dim oFolder As Object
GetFolder = ""
Set oFolder = CreateObject("Shell.Application").BrowseForFolder(0, "Choose a folder", 0)
If (Not oFolder Is Nothing) Then GetFolder = oFolder.Items.Item.Path
Set oFolder = Nothing
End Function

StefKa
12-15-2017, 03:37 PM
macropod,

thank you again, I appreciate your help. This is very kind.

I have tried your code and there are two problems that are probably connected. First, the code goes through all the documents that are in the directory and copies the text from all documents into only one file. In other words, the text that is written under the chapters "China" in e.g. four documents is gathered in one document with the title China. I do however need the number of files to stay the same, i.e. for each old file, I need one new "filtered" file, preferably with the same file name as the old one, maybe with the extention "_Clean.docx". This should be possible I believe because from each file, only the text under one heading (note: which may of course occur several times in that document) needs to be extracted. Also, and maybe this is connected, the code has only done the aforementioned job, when I have only one country in the array. Once I add more, than one, it provided all new target files empty, i.e. without any text. I think the problem is, that it tries to access the target file through several instances of word at the same time and write to it, which then causes an error. I have tried making some changes to the code but without success.

Is there a solution, or is my problem simply not solvable using VBA?

Thank you and best regards

macropod
12-15-2017, 03:53 PM
If you're creating a new output file for each country in the input files, how are those output files to be named? Since you can't have two files with the same name, you might nave to do something like:
Input filename = File1.docx
Output filenames = File1-China.docx, File1-North America.docx, File1-South America.docx, File1-EMEA.docx
Input filename = File2.docx
Output filenames = File2-China.docx, File2-North America.docx, File2-South America.docx, File2-EMEA.docx
etc.

StefKa
12-15-2017, 05:43 PM
Is there no option to leave the files with their old name, so for instance before "Global Company Report Full Century.docx" and after "Global Company Report Full Century_Clean.docx" (or "Client Report 2014.docx" and "Client Report 2014_Clean.docx") where I only have filtered the "Clean" file for e.g. China?

The thing is, the "filters" (= keywords) I will use are also very diverse and basically unique for the (unfiltered) .docx files. E.g. some documents I need customer names instead of countries, some I use client names, etc. So in each file only one of my keywords (which I specify in the array) will match and this is then the content that I want to have left in that file (i.e. in the new "Clean" version).

Does that make sense?

So for example keyword "John Doe" from the array would e.g. only be found once in "Customer Reports 2001.docx", "Customer Reports 2002.docx", "Customer Reports 2003.docx", "Customer Reports 2004.docx" and then I would get four new files with "Customer Reports 2001_Clean.docx", "Customer Reports 2002_Clean.docx", "Customer Reports 2003_Clean.docx", "Customer Reports 2004_Clean.docx" which all contain only the Chapters titled "John Doe" before. In the array, there will be no other client name, which could be found in any of the reports (just other stuff which will not match in these four .docx files; this is what I mean by "the keywords are specific"). I hope this makes sense, I just made this up to maybe help make it clear.

Thank you so much!

macropod
12-15-2017, 06:07 PM
Is there no option to leave the files with their old name, so for instance before "Global Company Report Full Century.docx" and after "Global Company Report Full Century_Clean.docx" (or "Client Report 2014.docx" and "Client Report 2014_Clean.docx") where I only have filtered the "Clean" file for e.g. China?

The thing is, the "filters" (= keywords) I will use are also very diverse and basically unique for the (unfiltered) .docx files. E.g. some documents I need customer names instead of countries, some I use client names, etc. So in each file only one of my keywords (which I specify in the array) will match and this is then the content that I want to have left in that file (i.e. in the new "Clean" version).
Sure, but all your discussion to date has given the impression you're wanting to create separate files for a list of countries, not just extracting data for one country - or a single customer name. Even then, if you extract the details for one country or customer to a new 'Clean' document, then do a second one, the second one will simply overwrite the first. Using your "John Doe" example, if you then did the same for "Jane Doe", you'd lose some/all of the "John Doe" files.

StefKa
12-15-2017, 07:06 PM
I am sorry if I left a wrong impression of what I was trying to do. I thought it would make sense to post the problem rather abstract and simple to make it easier to understand. Maybe I can briefly try again.

The files would be for example like this (again abstract example):

File 1: Names from A to D (Year 1)
File 2: Names from A to D (Year 2)
File 3: Names from A to D (Year 3)
File 4: Names from A to D (Year 4)

File 5: Names from E to J (Year 1)
File 6: Names from E to J (Year 2)
File 7: Names from E to J (Year 3)
File 8: Names from E to J (Year 4)

File 9: Names from K to T (Year 1)
File 10: Names from K to T (Year 2)
File 11: Names from K to T (Year 3)
File 12: Names from K to T (Year 4)

File 13: Names from U to Z (Year 1)
File 14: Names from U to Z (Year 2)
File 15: Names from U to Z (Year 3)
File 16: Names from U to Z (Year 4)

Now if I enter in the array the names { Albert, Frank, Tom, Will }. Wouldn't it be able to give back specific clean files, since the name Albert can not be in any other files but Files 1 to 4, etc. This example could also be constructed with countries etc. (as is in my case to be done but with actually many variables and differing ones).

So once the clean files are extracted, with the specification, they will be removed from the directory and then a new array is inserted by the user and a new set of clean files is created. This is how the process is thought to be implemented by me and I think then there wouldn't be a problem with loosing data through overwrite, or am I wrong?

Thank you

Stef

macropod
12-15-2017, 07:34 PM
So once the clean files are extracted, with the specification, they will be removed from the directory and then a new array is inserted by the user and a new set of clean files is created. This is how the process is thought to be implemented by me and I think then there wouldn't be a problem with loosing data through overwrite, or am I wrong?
Provided you can be sure that's how it will work in practice, it should be OK. Still, I wouldn't want to have files with the same names in multiple folders...

Try:

Sub CreateExtracts()
Application.ScreenUpdating = False
Dim strFolder As String, strFile As String, strDocNm As String, i As Long
Dim DocSrc As Document, DocTgt As Document, Rng As Range, StrFilter As String
'get the source folder
strFolder = GetFolder: If strFolder = "" Then Exit Sub
'Get the filtering criterion
StrFilter = Trim(InputBox("Please input the required filter (e.g. country, company)", "Extract Filter"))
strDocNm = ActiveDocument.FullName: If StrFilter = "" Then Exit Sub
'Process all documents in the source folder
strFile = Dir(strFolder & "\*.doc", vbNormal)
While strFile <> ""
If strFolder & "\" & strFile <> strDocNm Then
Set DocSrc = Documents.Open(FileName:=strFolder & "\" & strFile, AddToRecentFiles:=False, Visible:=False)
'Create the output document
Set DocTgt = Documents.Add '(Visible:=False)
With DocTgt
.Range.InsertBefore StrFilter & vbCr
.Range.Paragraphs.First.Style = wdStyleHeading2
.SaveAs2 FileName:=Split(DocSrc.FullName, ".doc")(0) & "_Clean.docx", Fileformat:=wdFormatXMLDocument, _
AddToRecentFiles:=False
End With
'Locate the relevant headings in the current source document
With DocSrc
With .Range
With .Find
.ClearFormatting
.Replacement.ClearFormatting
.Text = StrFilter
.Replacement.Text = ""
.Forward = True
.Format = True
.Style = wdStyleHeading2
.Wrap = wdFindStop
.MatchCase = False
.MatchWholeWord = False
.MatchWildcards = False
.MatchSoundsLike = False
.MatchAllWordForms = False
.Execute
End With
Do While .Find.Found
'Point to the found content
Set Rng = .Duplicate.GoTo(What:=wdGoToBookmark, Name:="\HeadingLevel")
'Get the range after the actual heading
Rng.Start = Rng.Paragraphs.First.Range.End
'Update the output document
DocTgt.Characters.Last.FormattedText = Rng.FormattedText
.End = Rng.End
If .End = ActiveDocument.Range.End Then Exit Do
.Collapse wdCollapseEnd
.Find.Execute
Loop
End With
'close the current source document
.Close SaveChanges:=False
End With
End If
With DocTgt
'Clean up the output
With .Range.Find
.ClearFormatting
.Replacement.ClearFormatting
.Forward = True
.Wrap = wdFindContinue
.Format = False
.MatchWildcards = True
.Text = "^m"
.Replacement.Text = "^p"
.Execute Replace:=wdReplaceAll
.Text = "[^13]{2,}"
.Execute Replace:=wdReplaceAll
End With
'Save & close the output document
.Close SaveChanges:=True
End With
strFile = Dir()
Wend
Set Rng = Nothing: Set DocSrc = Nothing: Set DocTgt = Nothing
Application.ScreenUpdating = True
End Sub

Function GetFolder() As String
Dim oFolder As Object
GetFolder = ""
Set oFolder = CreateObject("Shell.Application").BrowseForFolder(0, "Choose a folder", 0)
If (Not oFolder Is Nothing) Then GetFolder = oFolder.Items.Item.Path
Set oFolder = Nothing
End Function

StefKa
12-17-2017, 02:29 PM
Macropod,

I tried your code several times today and it works perfectly! The solution with the box where I can enter my criteria is even better than I thought was possible! You are a VBA genius for real and I am beyond grateful for your patience and help! Thank you so much!

Cheers

Stef

macropod
12-17-2017, 03:45 PM
You're welcome.