PDA

View Full Version : Please help me with these code (macro to find repeated words)



Juleshg
10-03-2016, 07:11 PM
I'm writing this macro to find repeated words in text. My goal is to find them, then highlight all the occurrences and then underline the nearest repetitions (10 words, 20 words etc). Something like 1 to 10 - heavy double line, 11 to 50, double line, etc.).

It works until highlighting the words, but there is some trouble with calculating distances. Please check my code and tell me what is failing and how can I fix it, or if I need another approach.


Sub FindWords()
Const maxwords = 9000 'Maximum words allowed

Dim SingleWord As String 'Raw word pulled from doc

Dim Words(maxwords) As String 'It holds unique words
Dim Freq(maxwords) As Integer 'Frequences of the words

Dim RWords(maxwords) As String 'It holds repeated words
Dim RWordsPositions(maxwords) As Integer 'It holds the position of those words

Dim WordNum As Integer
Dim rwnum As Integer
Dim worddistance As Integer
Dim countWord As Integer

Dim thisWord As String 'Current word to search
Dim ttlwds As Long 'Total words in the document
Dim Excludes As String 'Words to be excluded
Dim Includes As String 'To solve the inclusion of foreign characters
Dim Found As Boolean 'Temporary flag
Dim j, k As Integer 'Temporary variables


' Set up excluded words
'Excludes = "[a][an][and][as][at][for][from][he][her][his][in][of][on][she][the][to][was][with]"
Excludes = "[a][al][as][con][de][del][el][en][es][lo][los][la][las][le][me][mi][no][ni][o][por][que][qué][se][si][sí][sin][su][sus][te][tu][un][uno][una][y][ya]"
Includes = "[á][é][í][ó][ú][ñ]"


Selection.HomeKey Unit:=wdStory
System.Cursor = wdCursorWait
ttlwds = ActiveDocument.Words.Count

' Control the repeat

rwnum = 0
countWord = 0



For Each aWord In ActiveDocument.Words

SingleWord = Trim(LCase(aWord))
'Out of range?

If InStr(Includes, Left(SingleWord, 1)) Then
ElseIf SingleWord < "a" Or SingleWord > "z" Then
SingleWord = ""
End If

'On exclude list?
If InStr(Excludes, "[" & SingleWord & "]") Then
SingleWord = ""
countWord = countWord + 1
End If

If Len(SingleWord) > 0 Then
countWord = countWord + 1
Found = False
For j = 1 To WordNum
If Words(j) = SingleWord Then
Freq(j) = Freq(j) + 1
Found = True
rwnum = rwnum + 1
RWords(rwnum) = SingleWord
RWordsPositions(rwnum) = countWord
'Debug.Print Str(rwnum) + "-" + SingleWord + " - " + Str(Freq(j)) + " " + RWords(countWord) + " " + Str(RWordsPositions(countWord))
Exit For
End If
Next j
If Not Found Then
WordNum = WordNum + 1
Words(WordNum) = SingleWord
Freq(WordNum) = 1
End If
If WordNum > maxwords - 1 Then
j = MsgBox("Too many words.", vbOKOnly)
Exit For
End If

End If
ttlwds = ttlwds - 1
StatusBar = "Remaining: " & ttlwds
Next aWord

'For k = 1 To rwnum
' Debug.Print Str(RWordsPositions(k)) + " - "; RWords(k)
'Next k


'COMPARES AND CHECKS DISTANCES

countWord = 0

For Each aWord In ActiveDocument.Words

thisWord = Trim(LCase(aWord))
'Out of range?
If InStr(Includes, Left(thisWord, 1)) Then
ElseIf thisWord < "a" Or thisWord > "z" Then
countWord = countWord + 1
thisWord = ""
End If
'On exclude list?
If InStr(Excludes, "[" & thisWord & "]") Then
thisWord = ""
countWord = countWord + 1
End If

If Len(thisWord) > 0 Then
countWord = countWord + 1

For j = 1 To rwnum
If thisWord = RWords(j) Then

ActiveDocument.Words(countWord).Select
Selection.Expand Unit:=wdWord
If Selection.Characters(Selection.Characters.Count) = " " Then
Selection.MoveEnd Unit:=wdCharacter, Count:=-1
End If
Selection.Range.HighlightColorIndex = wdTurquoise

worddistance = Abs(RWordsPositions(j) - countWord)

'HERE IT WOULD BE MORE IF-THEN TO CHECK OTHER DISTANCES
If worddistance > -1 And worddistance < 10 Then
Selection.Font.Underline = wdUnderlineDouble
End If

End If
Next j
End If
Next aWord

For j = 1 To rwnum
Debug.Print Str(j) + "- " + RWords(j) + " pos:" + Str(RWordsPositions(j))
Next j

System.Cursor = wdCursorNormal
End Sub



My language is spanish, that's why I'm inserting the option to include foreign characters.

Here is some text wich I've tried the code on. If you want to test it with a text in english, please uncomment the corresponding Excludes list:


Laura abrió los ojos en la mañana y por una fracción de segundo no recordó dónde había pasado la noche. La claridad del sol le daba en la cara, se sentía ya un poco de calor y sudaba ligeramente. Y eso fue lo que le dio la pista. Ya. Había dormido en la casa de Daniel, su novio desde dos años atrás. En su propia casa y su propia cama, nunca habría despertado sudando. Pero él tenía la mala costumbre de apagar el aire acondicionado en la mitad de la noche, y las persianas quedaban abiertas. Su novio era un ángel. El sol daba precisamente en la cabecera y él jamás se daba cuenta. Así que ella tenía que levantarse a encender de nuevo el interruptor en la pared. ¿Cuándo pensaba cambiar a un aparato con control remoto? En unos cuantos años llegarían al cambio de siglo, con todas las comodidades que traería la nueva era. Daniel era un poco descuidado en eso, pensaba hablar muy seriamente con él. Cuando ya viviesen juntos, todo debía ajustarse a la mayor comodidad para ambos. Y no era que a él no le gustara la tecnología, ya que vivía de ello. Simplemente no se ocupaba de los detalles caseros hasta que Laura se lo sugería. Ese era su ángel y era un ñoño pero no le preocupaba que fuera muy ñoño.


I've included there some words starting with á, é and ñ to test the code.

gmaxey
10-04-2016, 03:48 AM
What is that text supposed to look like when finished?

Juleshg
10-04-2016, 12:30 PM
When finished, the text will have highlighted all repeated words, heavily underlined all same words with a distance of 1 to 10 words (these parameters may change) and softly underlined more distanced words. Of course, another macro has to remove the highlights and underlines. This is only for revision purposes. It's easier than getting a "text statistics report".

Juleshg
10-04-2016, 12:33 PM
Let me explain: the reason what I'm doing this is because I'm writing in spanish, and I've observed this is a frequent problem with writing. I have to check and recheck my text and sometimes is difficult to spot repeated words. Some repetitions are intentional, some are not. I've tried some software: Scrivener only reports text statistics but I have to close the report and manually search all the occurrences. MS-Word only counts words. ProWritingAid does a very good job but still the text has to be pasted in their box. Notepad++ has a nice feature: when you click any word, it highlights all the same words in the same text, but you still need to make the corrections in the original text, or losing all formatting. My macro helps me to work directly on the text in Word, like a combo of ProWritingAid and Notepad++.