PDA

View Full Version : Help With VBA For A Large Search and Replace Operation, Problem with .MatchWholeWord



Orge
11-15-2010, 09:03 AM
Hi guys,

I'm new to this forum and have attempted to do a search for this information but can't find anything quite relevant - so I apologise if it's been posted before!

Basically I have a hundred or so documents at work, each about 50 pages which need to be anonymised - removing references to he/she, dates, names, places etc. so that they effectively can't be traced back to who they are referring to.

I did 50 or so manually and decided that was a waste of time, 100 more to go and it's time to make a macro to do it for me. So far I've modified someone else's 'find and highlight' code to find and replace a list of names with [REMOVED] in red, this has worked to an extent. The list of names is about 10,000 long and is in Word format, on name per line e.g.

John
Bill
Rob
James

and so on. Obviously I only want it to find the whole and exact word and replace that, for example I don't want the name 'Mee' to cause the word 'Meeting' to become '[REMOVED]ting'. It does this quite well for the first few entries on the list and then after that starts to pick them out in a string of characters. I'm totally confused by this as I can't work out what changes after 30 or so lines in the list!

Code is as follows:



Sub Anonymouse()
Dim sCheckDoc As String
Dim docRef As Document
Dim docCurrent As Document
Dim wrdRef As String
Dim wrdPara As Paragraph


sCheckDoc = "d:\checklist.doc"
Set docCurrent = Selection.Document
Set docRef = Documents.Open(sCheckDoc)
docCurrent.Activate

With Selection.Find
.ClearFormatting
.Replacement.ClearFormatting
.Replacement.Font.Color = wdColorRed
.Replacement.Text = "[REMOVED]"
.Forward = True
.Format = True
.MatchWholeWord = True
.MatchCase = True
.MatchWildcards = False
End With

For Each wrdPara In docRef.Paragraphs
wrdRef = wrdPara.Range.Text
' remove the paragraph mark:
wrdRef = Left(wrdRef, Len(wrdRef) - 1)
If Asc(Left(wrdRef, 1)) > 32 Then
With Selection.Find
.Wrap = wdFindContinue
.Text = wrdRef
.Execute Replace:=wdReplaceAll
End With
End If
Next wrdPara

docRef.Close
docCurrent.Activate
End Sub


Any help would be massively appreciated.

Cheers guys.

fumei
11-15-2010, 11:35 AM
Option Explicit
Function ReplaceWord(inDoc As Document, j As Long) As String
Dim wrdPara As Paragraph
Dim wrdRef As String

wrdPara = inDoc.Paragraphs(j)
wrdRef = wrdPara.Range.Text
' remove the paragraph mark:
wrdRef = Left(wrdRef, Len(wrdRef) - 1)
ReplaceWord = wrdRef
End Function

Sub Anonymouse()
Dim docCurrent As Document
Dim docRef As Document
Dim j As Long
Dim r As Range
Dim wrdRef As String

Set docCurrent = ActiveDocument
Set docRef = Documents.Open("d:\checklist.doc")

For j = 1 To docRef.Paragraphs.Count
Set r = docCurrent.Range
' gets the next word from reference doc
wrdRef = ReplaceWord(docRef, j)
With r.Find
.ClearFormatting
.Replacement.ClearFormatting
.Replacement.Font.Color = wdColorRed
.Replacement.Text = "[REMOVED]"
.MatchWholeWord = True
.MatchCase = True
Do While .Execute(Findtext:=wrdRef, _
Forward:=True) = True
If Asc(r.Text) > 32 Then
r.Text = wrdRef
r.collapse 0
Loop
End With
Next
docRef.Close
Set docRef = Nothing
End Sub
You do not need to activate anything. You are setting the reference document as a document object. You can action it whether it is activae, or not. The same for the docCurrent object.

So the code above gets the reference doc, sets it as an object.

The getting each paragraph as a string is a Function, with the reference document passed in as a parameter, and a counter - the Paragraph count of the reference document.
For j = 1 To docRef.Paragraphs.Count

NOTE: Important!!!! This assumes the reference document is a list of paragraphs with no "empty" paragraphs.

So, for each paragraph in the reference document, the Find gets the next word from the reference doc, strips the paragraph mark, and returns the clean word to the current iteration of the Find.

If Asc(r.text) - the current .Found - is not 32, replace the .Found with the current word from the reference doc.

Orge
11-16-2010, 03:05 AM
Hi Gerry,

First up thanks very much for the rapid response.

When pasting your VBA into Word I get an error on running it: Compile error: Loop without Do, which referes to the 'Loop' instruction 6 lines from the bottom.

Any idea how to get around this or have I done something wrong?

Cheers.

gmaxey
11-16-2010, 12:10 PM
Seems to me that something like this would work:

Option Explicit
Function ReplaceWord(inDoc As Document, j As Long) As String
Dim wrdPara As Paragraph
Dim wrdRef As String
Set wrdPara = inDoc.Paragraphs(j)
wrdRef = wrdPara.Range.Text
'remove the paragraph mark:
wrdRef = Left(wrdRef, Len(wrdRef) - 1)
ReplaceWord = wrdRef
End Function
Sub Anonymouse()
Dim docCurrent As Document
Dim docRef As Document
Dim j As Long
Dim r As Range
Dim wrdRef As String
Set docCurrent = ActiveDocument
Set docRef = Documents.Open("d:\checklist.doc")
For j = 1 To docRef.Paragraphs.Count
Set r = docCurrent.Range
' gets the next word from reference doc
wrdRef = ReplaceWord(docRef, j)
With r.Find
.ClearFormatting
.Replacement.ClearFormatting
.Text = wrdRef
.Replacement.Font.Color = wdColorRed
.Replacement.Text = "[REMOVED]"
.MatchWholeWord = True
.MatchCase = True
.Execute Replace:=wdReplaceAll
End With
Next
docRef.Close
Set docRef = Nothing
End Sub

fumei
11-16-2010, 12:26 PM
I am missing the close to ther IF statement.
With r.Find
.ClearFormatting
.Replacement.ClearFormatting
.Replacement.Font.Color = wdColorRed
.Replacement.Text = "[REMOVED]"
.MatchWholeWord = True
.MatchCase = True
Do While .Execute(Findtext:=wrdRef, _
Forward:=True) = True
If Asc(r.Text) > 32 Then
r.Text = wrdRef
r.collapse 0
Loop
End With


should be:

With r.Find
.ClearFormatting
.Replacement.ClearFormatting
.Replacement.Font.Color = wdColorRed
.Replacement.Text = "[REMOVED]"
.MatchWholeWord = True
.MatchCase = True
Do While .Execute(Findtext:=wrdRef, _
Forward:=True) = True
If Asc(r.Text) > 32 Then
r.Text = wrdRef
r.collapse 0
End if
Loop
End With

Orge
11-16-2010, 02:01 PM
Thanks again guys!

Gerry - I thought this was the case so I added the End If bit in earlier, unfortunately it brings up another problem which is:

Compile error:
Invalid use of property

Referencing the 4th line of code and can be seen in the attached image.

As before, any help would be greatly appreciated.

Cheers.

fumei
11-16-2010, 02:57 PM
Ooops!

wrdPara is an object

SET oPara = inDoc.Paragraphs(j)

My bad.

fumei
11-16-2010, 02:59 PM
Notice that Greg, being better than I, has the correct syntax! He used SET.

That is what I get for typing in here directly.

fumei
11-16-2010, 03:08 PM
Also, if your objective is to replace all in one swoop, then Greg's code is better, as it does precisely that. Mine actions each individually.

The difference is that Greg's code does NOT have the additional test of:

If Asc(r.Text) > 32 Then

His replaces all, with no testing the ASCII of the first character. If the situation is that some of the .Found may not pass that test, and some will, then obviously you need the test. In which case, you need to action each individually in order to DO that test.

That is why I added it.

Although I have to say that it seems unlikely that you are going to have a word in the reference doc that DOES have Asc(first)character) as < 32. If it does, it seems to me that your reference doc is in error.

macropod
11-16-2010, 03:11 PM
Hi Orge,

Does your 'doclist' include posessives (eg Bill's)? If not, a simple change to the Find/Replace can avoid ending up with [REMOVED]'s. Taking Gerry's code as an example, you could change:

With r.Find
.ClearFormatting
.Replacement.ClearFormatting
.Replacement.Font.Color = wdColorRed
.Replacement.Text = "[REMOVED]"
.MatchWholeWord = True
.MatchCase = True
Do While .Execute(Findtext:=wrdRef, _
Forward:=True) = True
If Asc(r.Text) > 32 Then
r.Text = wrdRef
r.collapse 0
End If
Loop
End With
to:

With r.Find
.ClearFormatting
.Replacement.ClearFormatting
.Replacement.Font.Color = wdColorRed
.Replacement.Text = "[REMOVED]"
.MatchWholeWord = True
.MatchCase = True
Do While .Execute(Findtext:=wrdRef & "'s", _
Forward:=True) = True
If Asc(r.Text) > 32 Then
r.Text = wrdRef
r.collapse 0
End If
Loop
Do While .Execute(Findtext:=wrdRef, _
Forward:=True) = True
If Asc(r.Text) > 32 Then
r.Text = wrdRef
r.collapse 0
End If
Loop
End With

Orge
11-16-2010, 03:49 PM
Ooops!

wrdPara is an object

SET oPara = inDoc.Paragraphs(j)

My bad.

Gerry - presumably I need to remove the references to wrdPara in other lines as well, so:

Option Explicit
Function ReplaceWord(inDoc As Document, j As Long) As String
Dim wrdPara As Paragraph
Dim wrdRef As String

wrdPara = inDoc.Paragraphs(j)
wrdRef = wrdPara.Range.Text
' remove the paragraph mark:
wrdRef = Left(wrdRef, Len(wrdRef) - 1)
ReplaceWord = wrdRef
End Function

becomes:

Option Explicit
Function ReplaceWord(inDoc As Document, j As Long) As String
Dim oPara As Paragraph
Dim wrdRef As String

SET oPara = inDoc.Paragraphs(j)
wrdRef = oPara.Range.Text
' remove the paragraph mark:
wrdRef = Left(wrdRef, Len(wrdRef) - 1)
ReplaceWord = wrdRef
End Function

Am I correct? When I try to run this code it hangs Word 2007, which is unfortunate.

Macropod - my doclist contains straight names only, such as 'James' 'Jim' etc, etc - no possessives, so thankyou for that!

GMaxey - I seem to be unable to get your code to work properly either, it also hangs the computer. Is this just me or does it work for anybody else?

Thanks again guys, if I can get this to work it's going to be fantastic.

Cheers.

gmaxey
11-16-2010, 03:57 PM
Orge,

It doesn't hang my PC but I am using a very small list in the reference document. You might try with a shorter list. If that works then you might need to split your list into smaller parts.

Orge
11-16-2010, 04:35 PM
Orge,

It doesn't hang my PC but I am using a very small list in the reference document. You might try with a shorter list. If that works then you might need to split your list into smaller parts.

Aha! You are correct, must be the size of my list then. I will try splitting it down into several smaller lists and see if that fixes it.

Cheers all!

fumei
11-17-2010, 11:25 AM
Crikey. I am really not paying attention.

Yes, whatever variable name you use, be consistent!

If you use wrdPara
Dim wrdPara As Paragraph
then use wrdPara, in your Set instruction, and where ever you use wrdPara.

If you use oPara, then use oPara.

I can't believe I did that.

Orge
11-18-2010, 04:43 AM
Right, she works! List reduced to ~3000 names and people who are definitely going to be encountered has helped. It takes about 7-10 minutes to process on each document.


Option Explicit
Function ReplaceWord(inDoc As Document, j As Long) As String
Dim wrdPara As Paragraph
Dim wrdRef As String
Set wrdPara = inDoc.Paragraphs(j)
wrdRef = wrdPara.Range.Text
'remove the paragraph mark:
wrdRef = Left(wrdRef, Len(wrdRef) - 1)
ReplaceWord = wrdRef
End Function
Sub Names()
Dim docCurrent As Document
Dim docRef As Document
Dim j As Long
Dim r As Range
Dim wrdRef As String
Set docCurrent = ActiveDocument
Set docRef = Documents.Open("d:\Anonymouse\Checklists\checklist.doc")
For j = 1 To docRef.Paragraphs.Count
Set r = docCurrent.Range
' gets the next word from reference doc
wrdRef = ReplaceWord(docRef, j)
With r.Find
.ClearFormatting
.Replacement.ClearFormatting
.Text = wrdRef
.Replacement.Font.Color = wdColorRed
.Replacement.Text = "[NAME REMOVED]"
.MatchWholeWord = True
.MatchCase = True
.Execute Replace:=wdReplaceAll
End With
Next
docRef.Close
Set docRef = Nothing
End Sub


Only issue at the moment, is I want to bath run it on ~100 documents overnight which I have a macro for, however when I ran it last night it did the first 19 of them and then I got a windows error saying memory was low and it was going to increase my virtual memory size! I'm not sure why it would do this as it should be closing each file after it's finished.

Still, even if I can run them in batches of 15 that's fantastic compared to the alternative of going through them manually.

Orge.

fumei
11-18-2010, 12:00 PM
Post your code that is doing th eprocessing.

Word is famous, unfortunately, for its memory handling (poor at best, terrible at worst), although it is better that it used to be. Word 6 was so bad, that if you did NOTHING - just let the computer sit there - Word would eventually steal ALL the system memory, so much that Wiindows itself would die.

So. Depending of precisely what you are doing, the code could likely be tweaked enough to handle things.

"I'm not sure why it would do this as it should be closing each file after it's finished."

Post your code. Let's see.

macropod
11-18-2010, 02:33 PM
Hi Orge,

Re:
It takes about 7-10 minutes to process on each document.That really does seem to be an excessive lenght of time. One certain way of speeding things up - and reducing resource issues - would be to read all the strings from the 'checklist' document into an array, then close that document. From then on, you would open each of the documents to be processed and run the Find/Replace loops via the strings stored in the array.

I also haven't seen anything in the code you've posted so far to indicate whether you're turning off screenupdating while the code runs. That too should increase processing speed.

Orge
11-18-2010, 04:20 PM
Once again - thanks for the tips.

This is all of the code processing the checklist:

Option Explicit
Function ReplaceWord(inDoc As Document, j As Long) As String
Dim wrdPara As Paragraph
Dim wrdRef As String
Set wrdPara = inDoc.Paragraphs(j)
wrdRef = wrdPara.Range.Text
'remove the paragraph mark:
wrdRef = Left(wrdRef, Len(wrdRef) - 1)
ReplaceWord = wrdRef
End Function
Sub Names()
Dim docCurrent As Document
Dim docRef As Document
Dim j As Long
Dim r As Range
Dim wrdRef As String
Set docCurrent = ActiveDocument
Set docRef = Documents.Open("d:\Anonymouse\Checklists\checklist.doc")
For j = 1 To docRef.Paragraphs.Count
Set r = docCurrent.Range
' gets the next word from reference doc
wrdRef = ReplaceWord(docRef, j)
With r.Find
.ClearFormatting
.Replacement.ClearFormatting
.Text = wrdRef
.Replacement.Font.Color = wdColorRed
.Replacement.Text = "[NAME REMOVED]"
.MatchWholeWord = True
.MatchCase = True
.Execute Replace:=wdReplaceAll
End With
Next
docRef.Close
Set docRef = Nothing
End Sub

Pretty much exactly what gmaxey wrote earlier with my filename substituted. There are a couple of other sections doing things like removing dates 'he' 'she' 'his' 'hers' etc. but I don't believe any of these are causing an issue, all run instantly. It's this part that reads from the checklist that takes the time.

Cheers, Orge.

macropod
11-18-2010, 08:07 PM
Hi Orge,

Try this code:

Sub Anonymiser()
Application.ScreenUpdating = False
Dim FileList As Variant, ChkDoc As Document, TestDoc As Document, FilePath As String, ChkList As String, j As Long
'Load the strings from the reference doc into a text string to be used as an array.
Set ChkDoc = Documents.Open("D:\Anonymouse\Checklists\Checklist.doc")
ChkList = ChkDoc.Range.Text: ChkDoc.Close False: Set ChkDoc = Nothing
'Get the path to the documents to process
FilePath = InputBox("Please input the path to the documents to process", "Path to Files", ActiveDocument.Path)
'Exit if the filepath is empty
If FilePath = "" Then GoTo Done
'Ensure the filepath ends with "\"
If Right(FilePath, 1) <> "\" Then FilePath = FilePath & "\"
'Get a list of all documents in the target folder
FileList = Dir(FilePath & "*.doc", vbNormal)
'Process each found file
While FileList <> ""
Set TestDoc = Documents.Open(FilePath & FileList)
With TestDoc.Range.Find
.ClearFormatting
.Replacement.ClearFormatting
.Replacement.Font.Color = wdColorRed
.Replacement.Text = "[NAME REMOVED]"
.MatchWholeWord = True
.MatchCase = True
'Process each word from the Check List
For j = 0 To UBound(Split(ChkList, vbCr))
.Text = Split(ChkList, vbCr)(j) & "'s"
.Execute Replace:=wdReplaceAll
.Text = Split(ChkList, vbCr)(j)
.Execute Replace:=wdReplaceAll
Next
End With
TestDoc.Close True
FileList = Dir()
Wend
'Clean up and exit
Done:
Set TestDoc = Nothing
Application.ScreenUpdating = True
End Sub
Notes:
1. Screen updating is turned off, but you'll probably see some flickering as each new document is loaded.
2. The checklist document is opened, data gathered, then closed.
3. You're asked to nominate the folder to process. The code then processes all files in that folder without further intervention.
4. The Find/Replace loop has been optimised to eliminate the unnecessary resetting of variables on each iteration
5. Possessive cases are catered for.

fumei
11-19-2010, 10:12 AM
I agree, I can not see anything here that would remotely cause processing of a single iteration 7-10 minutes. Mind you, we have no idea how large these are, but still, see strange.

Let us know how it turns out running macropod's code, as it is optimized well. If it is still taking such a long time, OR you seem to be still having memory issues, something is wrong.