Use VBA to extract all instances of a specific string



Edyas
01-31-2013, 11:28 AM
Hi, I'm looking for some help to perform the following action, as I am really new to the VBA world, but very eager to learn.

I am working with word documents of judgments, which usually contain countless references to other judgments, and I wish to extract these references.

A positive factor is that these references are always formatted in the same way: they start with "Case ..." and end with "paragraph ###" (the paragraph number could be 1, 2 or 3 digits).

Is there any way to select all these references through a macro, and then extract them to another document?

Many thanks for your kind help.

macropod
01-31-2013, 10:20 PM
Are these references all in one file, or in multiple files (in which case, you presumably also need the file names)? Do you need the source document's page numbers as well?

Edyas
02-01-2013, 02:25 AM
Dear Paul,

Thank you very much for your reply.

The references are in one file only, and there is no need for the source document's page numbers. I attach a word document which includes one such judgment: as you can see, at the end of paragraph 15, there are references to other cases. In short, it would be fabulous to extract all such references to another document:
- Case C‑414/07 Magoora [2008] ECR I‑10921, paragraph 22
- Joined Cases C‑316/07, C‑358/07 to C‑360/07, C‑409/07 and C‑410/07 Stoß and Others [2010] ECR I‑0000, paragraph 51
- Case C‑45/09 Rosenbladt [2010] ECR I‑0000, paragraph 32

Sometimes, judgments can be hundreds of pages long, so that copy-pasting all references is one serious hassle.
Yet they always end with ", paragraph ##"
And they always start either with "Case" or "Joined Cases"

I must say I would be eternally grateful for such a macro, and so would be countless law students throughout the world!

Many thanks again,
Edyas

macropod
02-01-2013, 04:56 AM
Hi Edyas,

Your sample file contains some examples that don't fit your specifications. See, for example, paragraphs 42, 54, 75, 76, 81 and 84. These variously lack the 'Case' prefix and/or a paragraph reference and/or refer to more than one case, joined by ', and '. If you want all of these extracted as well, it rather complicates the project.

Edyas
02-01-2013, 06:34 AM
Dear Paul,

Thank you very much for your reply once again.

You may ignore these other references, as they concern cases already mentioned beforehand. Focusing on the "clean" first references would be more than enough.
As a matter of fact, the macro could be targeted to a string starting with "Case" and ending with "paragraph ##". If it gets too complicated, do ignore the other possibilities (cases starting with "Joined Cases" and ending with "paragraph #" or "paragraph ###"): I'll just copy-paste and adapt the code to take the other possibilities into account!

macropod
02-01-2013, 05:12 PM
Try the following:
Sub TabulateCases()
    Application.ScreenUpdating = False
    Dim InDoc As Document, OutDoc As Document, Rng As Range
    Dim i As Long, j As Long, bFnd As Boolean
    Set InDoc = ActiveDocument
    Set OutDoc = Documents.Add
    'Search backwards from the end of the document for each ", paragraph ##" ending.
    With InDoc.Range
        .Collapse wdCollapseEnd
        With .Find
            .ClearFormatting
            .Replacement.ClearFormatting
            .Text = ", paragraph [0-9]{1,}>"
            .Format = False
            .Forward = False
            .Wrap = wdFindStop
            .MatchWildcards = True
            .Execute
        End With
        Do While .Find.Found
            bFnd = False
            Set Rng = .Duplicate
            With Rng
                'Extend back to the start of the paragraph, then trim forward
                'to the last "Joined Cases" or "Case " prefix before the match.
                .MoveStart wdParagraph, -1
                If InStrRev(.Text, "Joined Cases") > 0 Then
                    bFnd = True
                    .MoveStart wdCharacter, InStrRev(.Text, "Joined Cases") - 1
                End If
                If InStrRev(.Text, "Case ") > 0 Then
                    bFnd = True
                    .MoveStart wdCharacter, InStrRev(.Text, "Case ") - 1
                End If
                If bFnd = True Then
                    .Copy
                    'Output the found content to the output document
                    With OutDoc.Range
                        .Characters.First.Paste
                        .InsertBefore vbCr & vbCr
                    End With
                End If
            End With
            'Continue the backward search from the start of this reference.
            .End = Rng.Start
            .Find.Execute
        Loop
    End With
    Application.ScreenUpdating = True
End Sub
The output to the new document lists the cases and leaves two empty paragraphs at the top of the document. I could have deleted those, but figured you might want to put some relevant material there (eg the source document's case details).
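If the empty paragraphs at the top are not wanted, they could be removed afterwards. A minimal sketch, assuming the output document is the active document when it runs; the name TrimLeadingEmptyParagraphs is made up for illustration and is not part of the macro above:

```vba
'Hypothetical helper: deletes empty paragraphs at the top of the active document.
Sub TrimLeadingEmptyParagraphs()
    With ActiveDocument
        'An empty paragraph consists of its paragraph mark only (length 1).
        'The Count check leaves the document's final paragraph mark alone.
        Do While .Paragraphs.Count > 1 And Len(.Paragraphs.First.Range.Text) = 1
            .Paragraphs.First.Range.Delete
        Loop
    End With
End Sub
```

Run it with the new document in front once TabulateCases has finished.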

Edyas
02-03-2013, 08:03 AM
Dear Paul,
I can hardly overstate my gratitude for your time and energy, thank you so much! Unfortunately, the macro fails at some point; I believe I have been overly ambitious.
Do you think it would be easier to limit the "extraction" to the number of the cases referred to (for instance "C-312/12")? Their format is either C-#/##, C-##/## or C-###/##...

I already managed to assign a style (named "Strong") to each and every reference, but haven't figured out how to select all such references and copy them to another document.

Of course, I would perfectly understand if you did not have any more time to waste on such a trifle. In any event, thanks very much for your help!
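
For reference, collecting every run of text formatted with a given style can be sketched along the following lines. This is an untested outline, assuming the style is the built-in character style named "Strong" (the name may differ in non-English versions of Word); the sub name ExtractStrongText is made up for illustration:

```vba
'Hypothetical sketch: copies the text of every "Strong"-styled run to a new document.
Sub ExtractStrongText()
    Dim InDoc As Document, OutDoc As Document, Rng As Range
    Set InDoc = ActiveDocument
    Set OutDoc = Documents.Add
    Set Rng = InDoc.Range
    With Rng.Find
        .ClearFormatting
        .Text = ""
        .Style = "Strong"   'assumed style name
        .Format = True
        .Forward = True
        .Wrap = wdFindStop
    End With
    Do While Rng.Find.Execute
        'Append the found text as its own paragraph in the output document.
        OutDoc.Content.InsertAfter Rng.Text & vbCr
        'Stop if the match reaches the end of the source document.
        If Rng.End >= InDoc.Content.End - 1 Then Exit Do
        Rng.Collapse wdCollapseEnd
    Loop
End Sub
```

Only the plain text is copied, so any formatting in the source is not carried over.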

macropod
02-03-2013, 03:00 PM
Hi Edyas,

Telling me that "the macro unfortunately bugs at some point" isn't exactly helpful. The macro works fine for me with your sample document. What is the nature of the bug (i.e. what is the error message)?

If you comment out:
Application.ScreenUpdating = False
ie change it to:
'Application.ScreenUpdating = False
and add:
.Select
after:
If bFnd = True Then
then, when an error occurs:
• what gets selected in the source document?
• what does the error message say?
• does the error message give you a 'Debug' option and, if so, what code line is highlighted when you click on that?

Edyas
02-04-2013, 02:13 AM
I've got this error message: "Run-time error 5560 - the Find What text contains a Pattern Match expression which is not valid".
When I "Debug", the code line highlighted is the ".Execute" at the end of the following section:

Sub TabulateCases()
    Application.ScreenUpdating = False
    Dim InDoc As Document, OutDoc As Document, Rng As Range
    Dim i As Long, j As Long, bFnd As Boolean
    Set InDoc = ActiveDocument
    Set OutDoc = Documents.Add
    'Search backwards from the end of the document for each ", paragraph ##" ending.
    With InDoc.Range
        .Collapse wdCollapseEnd
        With .Find
            .ClearFormatting
            .Replacement.ClearFormatting
            .Text = ", paragraph [0-9]{1,}>"
            .Format = False
            .Forward = False
            .Wrap = wdFindStop
            .MatchWildcards = True
            .Execute

Again, sorry for the trouble!

macropod
02-04-2013, 05:56 AM
The most likely reason for that error message is that your PC is using non-English regional settings. Try changing:
.Text = ", paragraph [0-9]{1,}>"
to:
.Text = ", paragraph [0-9]{1;}>"
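
The change works because Word's wildcard syntax appears to expect the system list separator inside the {n,m} repeat count, and that separator is a semicolon under many non-English regional settings. A minimal sketch of a version that needs no hand-editing, assuming Application.International(wdListSeparator) returns the separator the wildcard search expects on the machine in question; the names ParagraphRefPattern and TestParagraphRefPattern are made up for illustration:

```vba
'Hypothetical helper: builds the wildcard pattern with the local list separator.
Function ParagraphRefPattern() As String
    Dim Sep As String
    'Assumption: this is the separator Word expects inside {n,m}.
    Sep = Application.International(wdListSeparator)
    ParagraphRefPattern = ", paragraph [0-9]{1" & Sep & "}>"
End Function

'Quick check that the pattern is accepted on this machine.
Sub TestParagraphRefPattern()
    With ActiveDocument.Range.Find
        .ClearFormatting
        .Text = ParagraphRefPattern()
        .MatchWildcards = True
        .Forward = True
        .Wrap = wdFindStop
        If .Execute Then
            MsgBox "Pattern accepted; a match was found."
        Else
            MsgBox "Pattern accepted; no match in this document."
        End If
    End With
End Sub
```

In TabulateCases, the .Text line would then read .Text = ParagraphRefPattern(). Another option that sidesteps the separator is the @ quantifier (one or more of the preceding item), i.e. ", paragraph [0-9]@>".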

Edyas
02-06-2013, 03:03 AM
Dear Paul, your macro is pure magic, thank you so much indeed!!!