PDA

View Full Version : Search for similar paragraphs in several docs



Prpikachu
12-06-2016, 02:41 AM
Hi,

I am quite new to VBA and I would like to ask for some tips with this.

I have a Word doc with a single paragraph. I would like to search for similar paragraphs in all the docs in a directory and copy them in the first doc below the original paragraph.

The difficulty is searching for not exactly the same paragraph, but similar ones. Ideally, I would like to establish a match percentage. There may be some different words within the other paragraphs. Infact, if the paragraphs are identical, I do not need to copy them.

Any idea on how to approach this, please?

Thank you in advance!

gmaxey
12-06-2016, 05:22 AM
It would be a lengthy process even for VBA IMHO. First your would need to establish what the process should consider similar and then start culling out from the easiest to the most difficult. For example if a tested paragraph is < 5 words or >5 words the standard paragraph, regardless of other similarities, then it is not similar so skip it an move to the next. How does word choice determine similarity How does word order determine similarity?

Prpikachu
12-08-2016, 02:22 AM
You are right. I am starting to think about another process: maybe, once an occurrence is found, searching only within the relevant paragraph for sets of two words and calculating the match percentage.

But, in the meantime, I am trying it in a more primitive way: I select a relevant part of the paragraph and search for it. The problem is that it seems that I am struggling with the basics here. I have this code, which is a mix of others I have found. It work well, expect for the fact that the search seems to be wrongly performed: it just pastes blank paragraphs. I am not even sure if it is really searching for the selected sentence (it seems that it is not searching through the whole document and just expends the initial selection (nothing) and pastes it in the other document.

Any idea of what I am doing wrong, please?

Thank you in advance!


Sub DocsGenerator()

Dim InitialDoc As String
InitialDoc = Documents(ActiveDocument.FullName)

Dim FolderName As String
With Application.FileDialog(msoFileDialogFolderPicker)
.AllowMultiSelect = False
.Show
On Error Resume Next
FolderName = .SelectedItems(1)
Err.Clear
On Error GoTo 0
End With

Static SelectedSentence As String
SelectedSentence = Selection

Dim oWordApp As Object, oWordDoc As Object, rngStory As Object
Dim sFolder As String, strFilePattern As String
Dim strFileName As String, sFileName As String

'~~> Change this to the folder which has the files
sFolder = FolderName & "\"
'~~> This is the extention you want to go in for
strFilePattern = "*.doc*"

'~~> Establish an Word application object
On Error Resume Next
Set oWordApp = GetObject(, "Word.Application")

If Err.Number <> 0 Then
Set oWordApp = CreateObject("Word.Application")
End If
Err.Clear
On Error GoTo 0

oWordApp.Visible = True

'~~> Loop through the folder to get the word files
strFileName = Dir(sFolder & strFilePattern)
Do Until strFileName = ""
sFileName = sFolder & strFileName

'~~> Open the word doc
Set oWordDoc = oWordApp.Documents.Open(sFileName)

'~~> Do Find and Replace
Dim c As Range
Set c = ActiveDocument.Content
c.Find.ClearFormatting
c.Find.Replacement.ClearFormatting
With c.Find
.Text = SelectedSentence
.Forward = True
.Wrap = wdFindStop
.Format = False
.MatchCase = False
.MatchWholeWord = False
.MatchSoundsLike = False
.MatchAllWordForms = False
End With

c.Find.Execute
While c.Find.Found
Selection.Expand wdParagraph 'Expands your selection to current paragraph
Selection.Copy 'Copy your selection
Documents(InitialDoc).Activate 'Activate the other document
Selection.EndKey wdStory 'Move to end of document
Selection.PasteAndFormat wdPasteDefault 'Pastes in the content
Selection.TypeText vbCr & "(" & strFileName & ")" & vbCr
Documents(sFileName).Activate
c.Find.Execute
Wend

Documents(sFileName).Activate
'~~> Close the file after saving
oWordDoc.Close SaveChanges:=True

'~~> Find next file
strFileName = Dir()
Loop

'~~> Quit and clean up
oWordApp.Quit

Set oWordDoc = Nothing
Set oWordApp = Nothing
End Sub

gmaxey
12-08-2016, 07:51 AM
Not sure exactly what you are trying to do, but this seems like it should get you close:


Sub DocsGenerator()
Dim strFind As String
Dim oThisDoc As Document, oDoc As Document
Dim strFolder As String, strFileName As String
Dim oRng As Range
Set oThisDoc = ActiveDocument
With Application.FileDialog(msoFileDialogFolderPicker)
.AllowMultiSelect = False
If .Show = -1 Then strFolder = .SelectedItems(1)
End With
strFind = Selection.Range.Text
strFileName = strFolder & "\" & Dir(strFolder & "\*.doc*")
Do Until strFileName = ""
Set oDoc = Documents.Open(strFileName)
Set oRng = oDoc.Range
With oRng.Find
.Text = strFind
While .Execute
oRng.Expand wdParagraph
oThisDoc.Activate
oThisDoc.Range.InsertAfter vbCr & oRng.FormattedText & vbCr & "(" & strFileName & ")"
oRng.Collapse wdCollapseEnd
Wend
End With
oDoc.Close SaveChanges:=True
strFileName = Dir()
Loop
Set oDoc = Nothing: Set ThisDoc = Nothing
End Sub

Prpikachu
12-08-2016, 12:58 PM
Wow! Thank you, Greg!

Far simpler than my attempt. I am getting an error in this line:

Set oDoc = Documents.Open(strFileName)

It works with the first document, but with the second one I have a runtime error 5174, file not found.
Maybe it is because strFilename is defined out of the loop?

gmaxey
12-08-2016, 01:16 PM
Try this:

Sub DocsGenerator()
Dim strFind As String
Dim oThisDoc As Document, oDoc As Document
Dim strFolder As String, strFileName As String
Dim oRng As Range
Set oThisDoc = ActiveDocument
With Application.FileDialog(msoFileDialogFolderPicker)
.AllowMultiSelect = False
If .Show = -1 Then strFolder = .SelectedItems(1)
End With
strFind = Selection.Range.Text
strFileName = Dir(strFolder & "\*.doc*")
Do Until strFileName = ""
Set oDoc = Documents.Open(strFolder & "\" & strFileName)
Set oRng = oDoc.Range
With oRng.Find
.Text = strFind
While .Execute
oRng.Expand wdParagraph
oThisDoc.Activate
oThisDoc.Range.InsertAfter vbCr & oRng.FormattedText & vbCr & "(" & strFileName & ")"
oRng.Collapse wdCollapseEnd
Wend
End With
oDoc.Close SaveChanges:=True
strFileName = Dir()
Loop
Set oDoc = Nothing: Set ThisDoc = Nothing
End Sub

Prpikachu
12-08-2016, 02:37 PM
Thanks A LOT, Greg! Now it is perfect! I have added a new line (oRng.Font.ColorIndex = wdRed), after (oRng.Expand wdParagraph) so that it changes the color of the extracted paragraph. This way I can see which paragraphs have been already used by the macro. I have to review your code to be sure that I understand it all, so maybe I will come back with some questions, if you do not mind. This would be the "primitive" way to get part of what I am searching for. I will think how to deal with the "similarity" issue.

Prpikachu
12-10-2016, 02:31 PM
I am including some lines so that if the paragraph is identical, it is not copied (just a reference):


Sub DocsGenerator()
Dim strFind As String
Dim oThisDoc As Document, oDoc As Document
Dim strFolder As String, strFileName As String
Dim oRng As Range
Dim strRange As Range
Set oThisDoc = ActiveDocument
With Application.FileDialog(msoFileDialogFolderPicker)
.AllowMultiSelect = False
If .Show = -1 Then strFolder = .SelectedItems(1)
End With
strFind = Selection.Range.Text
Set strRange = oThisDoc.Range
strRange.Expand wdParagraph
strFileName = Dir(strFolder & "\*.doc*")
Do Until strFileName = ""
Set oDoc = Documents.Open(strFolder & "\" & strFileName)
Set oRng = oDoc.Range
With oRng.Find
.Text = strFind
While .Execute
oRng.Expand wdParagraph
oRng.Font.ColorIndex = wdRed
oThisDoc.Activate
If oRng.Text = strRange.Text Then
oThisDoc.Range.InsertAfter vbCr & "Exact paragraph in (" & strFileName & ")"
Else
oThisDoc.Range.InsertAfter vbCr & oRng.FormattedText & vbCr & "(" & strFileName & ")"
End If
oRng.Collapse wdCollapseEnd
Wend
End With
oDoc.Close SaveChanges:=True
strFileName = Dir()
Loop
Set oDoc = Nothing: Set ThisDoc = Nothing
End Sub

It is working when the paragraph appears alone in a document, but not when the same one is within a document with other paragraphs (then, it is copied as usual).

Any idea of what may be the reason? I understand that the range may be different, but I do not understand why.

Thanks!

gmaxey
12-11-2016, 10:48 AM
I have no idea what you are trying to do, but setting a range equal to an entire document range then expanding it to the end of a paragraph doesn't make much sense:

Set strRange = oThisDoc.Range
strRange.Expand wdParagraph

Prpikachu
12-11-2016, 12:12 PM
Ooops! You are right. Still learning to use ranges... I have replaced both lines by Set strRange = Selection.Paragraphs(1).Range But it is still working the same way: it detects that the whole paragraph is equal in one document, but not in the others (I have tried to cut and paste exactly the same paragraph, so I do not know which may be the difference) And it is not because in the first one to compare the paragraph is alone. I have added several paragraphs around and it still detects that it is the same one.

Prpikachu
12-11-2016, 12:43 PM
I have been thinking about a possible way to check the similarity between paragraphs.

The idea would be as follows (if it makes sense at all!):

- First, splitting the original paragraph into single words. Probably, storing each of them as a variable in an array.
- Second (or simultaneously), skipping and not storing those words which are defined terms (words starting with a capital letter which do not follow a period nor start a paragraph; some expressions have to be also considered defined terms and skipped, like two defined terms separated by certain connectors ("of", "of the", etc.))
- Third, the search should be carried out this way:
a) Parameters IgnorePunct and MatchAllWordForms will be set to true.
b) It will search for the first word. If it is found, it will check, only within the current paragraph, if the next word matches the next word in the original paragraph (again, discarding defined terms, which may be different from those in the original paragraph), and will continue until one word is different.
c) Once a word is different, the length of the string which is equal should be stored, and the fact that there is one string matching.
d) It will search for the next word, again within the same paragraph and starting after the previous string, until all the words have been searched for or the paragraph ends.
e) Once the search has ended, we should have the number of similar strings found and their length.
f) If there are, let's say, more than 2 strings longer than 4 words, or a single one longer than 8 words (these numbers may change), the document is considered similar and then it will be copied as in the current macro.

Does it make sense? I have been checking with some paragraphs and it should work, but I do not know if it is possible to do it this way or if there is any better way to have the same result.
The key is keeping the search restricted within the same paragraph, discarding defined terms, and counting the results in terms of matching words in a row and number of matching strings. Maybe a way to simplify the search is discarding also some common words, like articles and prepositions.

Thank you!

Prpikachu
12-17-2016, 12:56 PM
If have been trying and it seems that a collection may be better than an array for what I am trying to do.

I am trying separately and I have manage to prepare this (probably with a lot of mistakes!) which creates a collection skipping certain common words like articles and prepositions (sorry, they are in Spanish), as well as colons, periods, etc. and defined terms (those starting with a capital letter not following a period).

So with this code, steps First and Second should be done, converting the paragraph in the original file into a collection including only the relevant words. The next step would searching for the words in the collection within the target paragraphs as explained above.


Sub TestCollection()

Dim rngPar As Range
Dim strwords As New Collection
Dim strInd As String
Dim StrLen As Integer

Set rngPar = Selection.Paragraphs(1).Range
For i = 1 To rngPar.Words.Count
StrLen = Len(rngPar.Words(i).Text)
If Right(rngPar.Words(i).Text, 1) = " " Then
strInd = Left(rngPar.Words(i).Text, StrLen - 1)
Else
strInd = rngPar.Words(i).Text
End If
With rngPar.Words(i)
Select Case strInd
Case "y", "o", "por", "el", "los", "la", "las", "de", "con", "u", "e", "en", "a", "del", "su", "sus", "él", "ella", "este", "éste", "esta", "ésta", "estos", "éstos", "estas", "éstas", "solo", "sólo", "aquel", "aquél", "ese", "esa", "esas", "esos", "aquellas", "aquéllas", "aquellos", "aquéllos", "él", "El", "La", "Los", "Las", "mismo", "misma", "mismos", "mismas", ",", ".", "-", "—", ";", vbCr
Case Else
If i > 1 Then
If UCase(Left(rngPar.Words(i).Text, 1)) = Left(rngPar.Words(i).Text, 1) And rngPar.Words(i - 1).Text <> "." Then
GoTo Final
End If
End If
strwords.Add (strInd)
Debug.Print strInd
Final:
End Select
End With
Next
End Sub

gmaxey
12-17-2016, 01:36 PM
Would you have to create similar collections of the other paragraph and compare the two?


Sub TestCollection()
Dim i As Long
Dim rngPar As Range
Dim strwords As New Collection, strComp1 As New Collection
Dim strInd As String
Dim lngMatchCount As Long

Set rngPar = Selection.Paragraphs(1).Range
For i = 1 To rngPar.Words.Count
strInd = Trim(rngPar.Words(i).Text)
With rngPar.Words(i)
Select Case strInd
Case "y", "o", "por", "el", "los", "la", "las", "de", "con", "u", "e", "en", "a", "del", "su", "sus", "él", "ella", "este", "éste", "esta", "ésta", "estos", "éstos", "estas", "éstas", "solo", "sólo", "aquel", "aquél", "ese", "esa", "esas", "esos", "aquellas", "aquéllas", "aquellos", "aquéllos", "él", "El", "La", "Los", "Las", "mismo", "misma", "mismos", "mismas", ",", ".", "-", "—", ";", vbCr
Case Else
If i > 1 Then
If UCase(Left(rngPar.Words(i).Text, 1)) = Left(rngPar.Words(i).Text, 1) And rngPar.Words(i - 1).Text <> "." Then
GoTo Final
End If
End If
strwords.Add (strInd)
Final:
End Select
End With
Next
Set rngPar = Selection.Paragraphs(1).Next.Range 'Compare to the next paragraph.
For i = 1 To rngPar.Words.Count
strInd = Trim(rngPar.Words(i).Text)
With rngPar.Words(i)
Select Case strInd
Case "y", "o", "por", "el", "los", "la", "las", "de", "con", "u", "e", "en", "a", "del", "su", "sus", "él", "ella", "este", "éste", "esta", "ésta", "estos", "éstos", "estas", "éstas", "solo", "sólo", "aquel", "aquél", "ese", "esa", "esas", "esos", "aquellas", "aquéllas", "aquellos", "aquéllos", "él", "El", "La", "Los", "Las", "mismo", "misma", "mismos", "mismas", ",", ".", "-", "—", ";", vbCr
Case Else
If i > 1 Then
If UCase(Left(rngPar.Words(i).Text, 1)) = Left(rngPar.Words(i).Text, 1) And rngPar.Words(i - 1).Text <> "." Then
GoTo Final2
End If
End If
strComp1.Add (strInd)
Final2:
End Select
End With
Next
If strwords.Count = strComp1.Count Then MsgBox "Same number of filtered words."
lngMatchCount = 0
For i = 1 To strwords.Count
If strwords.Item(i) = strComp1.Item(i) Then
lngMatchCount = lngMatchCount + 1
End If
Next
MsgBox lngMatchCount / strwords.Count

End Sub

Prpikachu
12-17-2016, 01:50 PM
Thanks, Greg!

I am not sure, yet.

The target documents are normal documents with hundreds of paragraphs, so we would have to loop through all of them.

One idea could be searching for the first word, the normal way, and once it is found, create the second collection and compare both as you propose.

But I am not sure if it is needed or if it could be a long process (repeating it for each paragraph in each of the target documents).

I am starting to think about simply searching for the first word in the collection. If found, restrict the search within this paragraph and search for the second one, if found, search for the third, and simply count the number of matches, since now we just have the "core" words.

I understand that your method would only compare word in the same position within each collection, right?

At the end I would have to integrate this within the original macro, so that, instead of simply searching for the selection, it performs this "similarity" check.

Prpikachu
12-18-2016, 03:13 AM
Step by step: building the collection now works perfectly.
I have slightly changed it, just one line, so that it detects the words irrespective of the capital letters:

Select Case LCase(strInd)

gmaxey
12-18-2016, 03:59 AM
Yes, I only offered two simple comparison. Did the filtered word count match and did the filtered words match position for position. A score of 1 is a perfect match.

Prpikachu
12-18-2016, 11:08 AM
Hmmmm. I am slightly lost.

Now I am integrating both codes, but the search seems to be working wrongly. I am not sure, but it seems it is not searching through all the paragraphs in each document (still learning to use ranges...):


Sub TestParExtractor()

Dim i As Long
Dim j As Long
Dim k As Long
Dim rngPar As Range
Dim strwords As New Collection
Dim strInd As String
Dim lngMatchCount As Long
Dim MatchPerc As Long
Dim oThisDoc As Document, oDoc As Document
Dim strFolder As String, strFileName As String
Dim oRng As Range

Set oThisDoc = ActiveDocument

With Application.FileDialog(msoFileDialogFolderPicker)
.AllowMultiSelect = False
If .Show = -1 Then strFolder = .SelectedItems(1)
End With

'Creates collection of words from initial paragraph
Set rngPar = Selection.Paragraphs(1).Range
For i = 1 To rngPar.Words.Count
strInd = Trim(rngPar.Words(i).Text)
With rngPar.Words(i)
Select Case LCase(strInd)
Case "y", "o", "por", "el", "los", "la", "las", "de", "con", "u", "e", "en", "a", "del", "su", "sus", "él", "ella", "este", "éste", "esta", "ésta", "estos", "éstos", "estas", "éstas", "solo", "sólo", "aquel", "aquél", "ese", "esa", "esas", "esos", "aquellas", "aquéllas", "aquellos", "aquéllos", "él", "ello", "dicha", "dicho", "dichas", "dichos", "mismo", "misma", "mismos", "mismas", ",", ".", "-", "—", ";", vbCr, "al", "que", "tal", "dentro", "tales", "como", "para", "se", "le", "un", "una", "unos", "unas", "lo", "si", "no"
Case Else
If i > 1 Then
If UCase(Left(rngPar.Words(i).Text, 1)) = Left(rngPar.Words(i).Text, 1) And rngPar.Words(i - 1).Text <> "." Then
GoTo Final
End If
End If
strwords.Add (strInd)
'Debug.Print strInd
Final:
End Select
End With
Next

strFileName = Dir(strFolder & "\*.doc*")
Do Until strFileName = ""
Set oDoc = Documents.Open(strFolder & "\" & strFileName)
Set oRng = oDoc.Range
With oRng.Find

For j = 1 To strwords.Count
.Text = strwords(j)
.MatchAllWordForms = True
.MatchCase = False
While .Execute
lngMatchCount = 1
oRng.Expand wdParagraph
With oRng.Find
For k = 2 To strwords.Count
.Text = strwords(k)
.MatchAllWordForms = True
.MatchCase = False
.Forward = True
While .Execute
lngMatchCount = lngMatchCount + 1
MatchPerc = (lngMatchCount / strwords.Count) * 100
Wend
Next
End With

If MatchPerc > 50 Then
oRng.Expand wdParagraph
MsgBox ("Paragraph similarity= " & MatchPerc & "% " & vbCr & oRng.Text)
End If
' oRng.Font.ColorIndex = wdRed
' oThisDoc.Activate
' If oRng.Text = rngPar.Text Then
' oThisDoc.Range.InsertAfter vbCr & "Exact paragraph in (" & strFileName & ")"
' Else
' oThisDoc.Range.InsertAfter vbCr & oRng.FormattedText & vbCr & "(" & strFileName & ")"
' End If
oRng.Collapse wdCollapseEnd
Wend
Next

End With
oDoc.Close SaveChanges:=True
strFileName = Dir()
Loop
Set oDoc = Nothing: Set ThisDoc = Nothing
End Sub

The code is supposed to search for the first word in the collection. When it is found, it searches for the following ones but only within the current paragraph. I do not know if it is possible just to search forward (it is what I am trying to do). The fact is that I suspect that the ranges (oRng) are not properly defined and that is the reason why the search does not work properly. In fact, after some searches, oRng.text is just a word instead of the whole paragraph.

There is a counter that counts the number of matches within a paragraph. For the time being, it is supposed to display a message when similarity is over 50%. I have been checking it and it is never over 13% or so (2 matches in my case) even when I have exactly the same paragraph in the target document.

Help, please!

Prpikachu
12-19-2016, 02:31 AM
Just thinking if there is a more efficient way than surfing through each and every paragraph. Although I may be losing the information about the order of the words, it may work.

Would it be possible to collect the information of the paragraph number where and occurrence appears?

Should this be the case, we could search through all the document for each of the words, as in the first versions of the code. If it appears in paragraph X, we would increase a variable called Paragraph & X. After searching the first word, we have a bunch of variables (Paragraph13, Paragraph22, etc.) with a value of 1. We repeat the search with the next word. At the end, we will have a number of variables called Paragraph & X with different values. We could search the highest of them and copy the corresponding paragraph to the main document, as usual. I understand that we could do this looping through all of them (and we could take the opportunity to delete their value so that they start at 0 for the next document).

Would it work? Could it be an easier/more efficient way?

Thanks!