PDA

View Full Version : Solved: Comparing Duplicate Paragraphs or Text Strings



MDY
01-12-2009, 06:31 PM
Hi All,
I was wondering if someone would be able to help me out here, I don't think that this one will be too difficult.
I was wondering how I could highlight duplicate paragraphs or text strings of 5 or more words, 6 or more words etc or a desired value within a document? It would be great if I could highlight the duplicate Red and the first instance yellow or something along those lines. Essentially I was hoping that this could be done using VB/macro with just a simple button loaded on the toolbar. I hope that my explanation is easy enough to follow. I'll apologise in advance for my limited knowledge of VB.

Thanks everyone.

Mdy

MDY
01-13-2009, 04:31 PM
G'day,
Does anyone have any Ideas for this question? Maybe its a little bit harder than I first thought.

Thanks
MDY

fumei
01-14-2009, 10:13 AM
Please clarify what you mean by "I could highlight the duplicate Red and the first instance yellow". Also, what does that have to do with "duplicate"?

MDY
01-14-2009, 03:30 PM
Hi Fumei,
Thanks for the reply. I was hoping that there was some VB code that was able to find duplicate strings of text/paragraphs in a document, identify where a string/paragraph has been repeated/duplicated and program the macro so that the code would highlight, not delete, the repeated strings of text/paragraphs. In doing this I was hoping that the first string could be highlighted yellow and the second string red and that these same colours could be repeated for the 1st, 2nd, 3rd etc identification of repeated/duplicated strings of text/paragraphs.

I hope that this clarifies your question Fumei and thanks for helping me out!

Cheers
MDY

TonyJollans
01-14-2009, 04:46 PM
There is nothing of this kind that exists (as far as I know) and it would be fairly complex to write something general purpose that didn't take forever to run on non-trivial documents. Can you not be a bit more specific about what you are looking for?

MDY
01-14-2009, 08:03 PM
Gday Tony,
Thanks a lot for taking this question on board! I'm not after any specific lines of text. To give you a bit of background I'm editing a number of documents which are about 100 pages in length each and unfourtunately have a number of repeated sentences and statements in them. I only want the macro to run over individual documents not compare 2 seperate documents. I was hoping to find a macro which identifies any repeated statements and highlights them so i can asses which ones we keep/delete or move to another section of the document. I'm not at all worried about how long it takes for the macro to run but i'm just trying to avoid having to do this manually, as you could imagine this would take quite some time!

Any help with this one would be much appreciated, thanks for your time Tony!

Cheers
MDY

TonyJollans
01-15-2009, 07:11 AM
This is very rough and ready and has had minimal testing (just enough to make it not fall over with artificial data) ...

Sub VBAXTest()

Const NMin As Long = 5
Dim R As Range, W As Range
Dim C As Range, C2 As Range, N As Long

Set R = ActiveDocument.Content

For Each W In R.Words
If W.HighlightColorIndex = wdNoHighlight Then
N = NMin

Do
Set C = W.Duplicate
C.MoveEnd wdWord, N
If C.End = ActiveDocument.Range.End Then Exit Sub
Set C2 = ActiveDocument.Range(C.End, ActiveDocument.Range.End)

Select Case True
Case Len(C.Text) > 256, _
C.HighlightColorIndex = 9999999, _
C2.Find.Execute(FindText:=C.Text, Wrap:=wdFindStop) = False

If N > NMin Then DoHighLight C
Exit Do

Case Else
N = N + 1

End Select
Loop

End If
Next
End Sub

Sub DoHighLight(C As Range)

Dim C2 As Range

C.MoveEnd wdWord, -1
C.HighlightColorIndex = wdYellow
Set C2 = ActiveDocument.Range(C.End, ActiveDocument.Range.End)
While C2.Find.Execute(FindText:=C.Text, Wrap:=wdFindStop)
C2.HighlightColorIndex = wdRed
Wend

End Sub

fumei
01-15-2009, 11:27 AM
Interesting Tony. What did you use for your minimal testing? With the text from =rand(1,4) you ONE paragraphs with FOUR sentences, as in:

The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog.

The result of your code is the first TWO sentences are both highlighted yellow. The second TWO sentences are both highlighted red. As the OP asked:

"I was hoping to find a macro which identifies any repeated statements and highlights them "

are they not aiming to have (in this case), the four repeated statements (i.e. text) highlighted differently.

The second sentence is a repetition of the first, but it is highlighted the same. Ditto, the second sentence of the next two.

If you do =rand(12,4) - 12 paragraphs of 4 sentences, the result is:

The first TWO sentences are highlighted yellow, and the next 10 paragraphs are all highlighted red.

TonyJollans
01-15-2009, 01:43 PM
The problem with the =rand text is that it repeats a bit too much.

I'm sure you can read the code but all it really does is run through the document a word at a time and consider strings of a number of words (minimum 5) starting from that word. If another copy of the string is found, the starting string is widened a word at a time until a duplicate isn't found; the string is then narrowed by a word and highlighted yellow with all duplicates highlighted red. The search for duplicate strings is cut short as soon as it trips over an already highlighted word. Also (and here is the problem with the quick foxes), the search has to be broken off when the text is longer than 256 characters.

A full solution would need a different approach. If paragraphs were wanted to be checked as opposed to arbitrary strings a different approach could probably be used quite easily, but more generally I didn't feel inclined to go that far - at least until MDY has some feedback.

MDY
01-15-2009, 04:22 PM
Hi Tony and Fumei,
Tony, there is some brilliant coding in there and the macro works pretty well. It does take a little bit of time to run through but that has to be expected. You've done a brilliant job! The only problem I have is with sections of the document which contain tables of contents and glossaries etc which should not be picked up by the code. Is it possible to tell the code to ignore certain pages and skip over them? For example pages 1,2,3,69,70,71 etc?

I guess another way of drawing comparisons would be as fumei suggested would be to compare whole Sentences with one another, and I guess it would be great to have this as an option if possible.

I suppose other functionality that would be useful and its not what I first intended would be to have the ability to compare 2 different documents against each other, I'm tipping that this would require some pretty difficult programming though!

This is really helping me out guys! Thankyou!

Cheers MDY

fumei
01-16-2009, 11:01 AM
Hey I was not dissing you for the code, and yes I suppose =rand makes a bit...ummmm..."too much" repetition.

Although one has to wonder who defines (and how) the amount.

If ANY string is repeated, does it make a difference if that is repeated three times....or 16 times? In my first example, there is only four sentences. Why is four times "too much"?

It does not matter, no big deal, and yes I understand.

TonyJollans
01-17-2009, 12:32 PM
I'm sure there are many things my code won't handle; if that's all you've got we've fallen lucky! Trying to skip pages would, no doubt, be possible but it would be very specific to a document. I could amend the code fairly easily to ignore Word Fields which would take care of TOCs. I'm not sure what else you have, or how your glossary may be set up to be recognised as duplicate. I haven't tried this but could you perhaps format anything you want to be ignored as hidden text, which should then be ignored by default.

Comparing sentences is pretty straightforward and I think this will do it:

Dim S1 As Range, S2 As Range
For Each S1 In ActiveDocument.Content.Sentences
If S1.HighlightColorIndex = wdNoHighlight Then
If S1.End < ActiveDocument.Range.End Then
For Each S2 In ActiveDocument.Range(S1.End, ActiveDocument.Range.End).Sentences
If Trim(S1.Text) = Trim(S2.Text) Then
S1.HighlightColorIndex = wdYellow
S2.HighlightColorIndex = wdRed
End If
Next S2
End If
End If
Next S1

Comparing two documents is a diffferent animal altogether. Take a look at Word's compare features for that one.

MDY
01-18-2009, 06:06 PM
Hi Tony,
Thanks once again for your help, its been fantastic! The Comparison of sentences works exeptionally well, possibly even better than the number of characters option but both will be extremely useful. The only error I have noticed is with graphics/pictures embedded in the document which unusually are highlighted.
In response to the code being able to skip certain pages i'm mostly worried about the glossary. The glossary in these documents is taken from a standard document which is only normal text. Due to the structure of the documents which I'm checking the glossary is required to be pasted into different areas of the documents. These are technical documents which I am working with so inevitably the glossary contains text which will be repeated. If it is possible to include a section within the code that you are able to program to skip certain pages that would be fantastic and provide a final solution. Its pretty easy to check the documents before running the macro to check which pages need to be skipped.

Thanks for all your time and help in providing a solution! You've been a fantastic help and a real time saver!

MDY

MDY
01-18-2009, 09:03 PM
.

MDY
01-18-2009, 09:03 PM
Hi Tony,
Sorry I should have said, your not wrong about the compare function in word being a different animal! This is one of the first tools I tried but there are not a lot of functions available and the merge function completed a heap of edits etc, ones I definitely was not expecting. Not to worry about this one mate its easy enough to copy and paste the documents one under the other and then run the applicable code.

Thanks Tony

Mdy.

TonyJollans
01-19-2009, 10:27 AM
Having thought a little about this, I think the quickest solution would be to highlight the glossary (in a different colour perhaps) before beginning and unhighlight it afterwards - the existing code would then ignore it. You could do this manually or add a little bit of code to prompt for the page numbers and do it in the code. Let me know if you want help to code it.

MDY
01-19-2009, 06:00 PM
.

MDY
01-19-2009, 06:03 PM
Hi Tony,
It would be fantastic if I was able to tell the code to ignore or skip certain pages. I'm not sure how exactly this would be programmed but would you mind helping me out?

Once again thanks for all your help!

Cheers
Mdy

TonyJollans
01-20-2009, 02:44 AM
I've done this rather quickly, but here is an extra routine which will highlight pages you specify:

Static Sub HighlightIgnorePages(Optional Remove As Boolean = False)

Dim PageNosIn As String
Dim PageRanges, Rx As Long
Dim PageNos, PageA As Long, PageB As Long, Px As Long
Dim PagesInDoc As Long
Dim IgnoreStart As Long, IgnoreEnd As Long, IgnoreColour As Long

If Not Remove Then
PagesInDoc = ActiveDocument.Range.Information(wdNumberOfPagesInDocument)
PageNos = InputBox("Please Enter Page numbers to be ignored" & vbNewLine & _
"Example of format: 33-34, 38 , 41 - 43")
PageRanges = Split(PageNos, ",")
End If

Application.ScreenUpdating = False
For Rx = LBound(PageRanges) To UBound(PageRanges)
PageNos = Split(Trim(PageRanges(Rx)), "-", 2)
PageA = CLng(Trim(PageNos(0)))
If UBound(PageNos) = 0 Then
PageB = PageA
Else
PageB = CLng(Trim(PageNos(1)))
End If
If PageA < 1 Then PageA = 1
If PageB < PageA Then PageB = PageA
If PageB > PagesInDoc Then PageB = PagesInDoc
If PageA <= PagesInDoc Then
With Selection
.GoTo wdGoToPage, wdGoToAbsolute, PageA
IgnoreStart = .Start
If PageB = PagesInDoc Then
IgnoreEnd = ActiveDocument.Range.End
Else
.GoTo wdGoToPage, wdGoToAbsolute, PageB + 1
IgnoreEnd = .End
End If
End With
If Remove Then
IgnoreColour = wdNoHighlight
Else
IgnoreColour = wdGray25
End If
ActiveDocument.Range(IgnoreStart, IgnoreEnd).HighlightColorIndex = IgnoreColour
End If
Next

End Sub

In the existing routines, put this line at the start:

HighlightIgnorePages

and this line at the end:

HighlightIgnorePages Remove:=True

MDY
01-20-2009, 06:47 PM
Hi tony,
that works brilliantly! thankyou. Their is only one remaining problem, unfourtunately the code highlights headings and pictures, the headings are set as headings in the documents. Is there any way that this can be avoided?

Thanks for all your time mate!

Cheers
Mdy

TonyJollans
01-21-2009, 05:35 AM
This gets rougher and readier with each change. Given what we've got so far, highlighting the headings is perhaps the easiest way. Here is an amendment to the glossary routine that also addresses headings:

Static Sub HighlightIgnorePages(Optional Remove As Boolean = False)

Dim PageNosIn As String
Dim PageRanges, Rx As Long
Dim PageNos, PageA As Long, PageB As Long, Px As Long
Dim PagesInDoc As Long
Dim IgnoreStart As Long, IgnoreEnd As Long, IgnoreColour As Long

If Not Remove Then
PagesInDoc = ActiveDocument.Range.Information(wdNumberOfPagesInDocument)
PageNos = InputBox("Please Enter Page numbers to be ignored" & vbNewLine & _
"Example of format: 33-34, 38 , 41 - 43")
PageRanges = Split(PageNos, ",")
End If

If Remove Then
IgnoreColour = wdNoHighlight
Else
IgnoreColour = wdGray25
End If

Application.ScreenUpdating = False
For Rx = 1 To 0 ' Bound(PageRanges) To UBound(PageRanges)
PageNos = Split(Trim(PageRanges(Rx)), "-", 2)
PageA = CLng(Trim(PageNos(0)))
If UBound(PageNos) = 0 Then
PageB = PageA
Else
PageB = CLng(Trim(PageNos(1)))
End If
If PageA < 1 Then PageA = 1
If PageB < PageA Then PageB = PageA
If PageB > PagesInDoc Then PageB = PagesInDoc
If PageA <= PagesInDoc Then
With Selection
.GoTo wdGoToPage, wdGoToAbsolute, PageA
IgnoreStart = .Start
If PageB = PagesInDoc Then
IgnoreEnd = ActiveDocument.Range.End
Else
.GoTo wdGoToPage, wdGoToAbsolute, PageB + 1
IgnoreEnd = .End
End If
End With
ActiveDocument.Range(IgnoreStart, IgnoreEnd).HighlightColorIndex = IgnoreColour
End If
Next

Dim Hx As WdBuiltinStyle
With ActiveDocument.Content.Find
Options.DefaultHighlightColorIndex = IgnoreColour
.Replacement.Highlight = True
For Hx = wdStyleHeading9 To wdStyleHeading1
.Style = ActiveDocument.Styles(Hx)
.Execute Replace:=wdReplaceAll
Next
End With

End Sub

MDY
02-08-2009, 09:56 PM
Hi Tony,
thanks for all your help insolving this problem. your solution was fantastic and worked really well.

Cheers MDY