Solved: Regular Expression in Word



r054
11-21-2007, 12:27 PM
Hi,

I've just started using VBA and need to work with regular expressions.
Does anybody know what the symbols are in VBA regex for ? " or !
I have this regular expression

regEx.Pattern = "([?!.,]+)(\d+)([\s\b\w*\b]?)"
This regular expression should find:
blablabla.78 blablabla >> .78
blablabla?"129 >> ?"129
blablabla!1 >> !1

but when I run it, it only recognizes . and , but not ? or !

And for the " case I used the ASCII code, and it doesn't find that either.
Does anybody know how to do this?

TonyJollans
11-24-2007, 09:37 AM
From my understanding of regular expressions I agree with you, so how are you running it? Show us some more code.

matthewspatrick
11-24-2007, 12:40 PM
The ? is a special character, so you need to escape it. You also
need to double up on the double-quotes to keep VBA from thinking
that it's just a string qualifier. This is working on the sample data:


regEx.Pattern = "([\?!.,""]+)(\d+)([\s\b\w*\b]?)"
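
To check the pattern against the sample data, something along these lines can be run from the VBA editor. It is only a sketch: the sub name, the late-bound CreateObject call, and the sample array are assumptions made for illustration, not part of the original code. Results go to the Immediate window.

```vb
Sub TestFootnotePattern()
    ' Late binding, so no reference to the VBScript Regular Expressions library is needed
    Dim regEx As Object
    Dim samples As Variant
    Dim s As Variant
    Dim m As Object

    Set regEx = CreateObject("VBScript.RegExp")
    ' The doubled "" inside the VBA string is a single " in the pattern
    regEx.Pattern = "([\?!.,""]+)(\d+)([\s\b\w*\b]?)"
    regEx.Global = True

    ' Sample strings from the first post
    samples = Array("blablabla.78 blablabla", "blablabla?""129", "blablabla!1")

    For Each s In samples
        If regEx.Test(CStr(s)) Then
            Set m = regEx.Execute(CStr(s))(0)
            ' SubMatches(0) is the punctuation, SubMatches(1) is the number
            Debug.Print s & " >> [" & m.SubMatches(0) & "] [" & m.SubMatches(1) & "]"
        Else
            Debug.Print s & " >> no match"
        End If
    Next s
End Sub
```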

r054
11-25-2007, 05:59 PM
Hmm...

After checking it carefully, I think my problem might not be in the regular expression. I should probably explain it more clearly.

I'm trying to reformat a document which has many footnotes. In the unedited document, the footnote references are just plain numbers at the end of a sentence, with no space before them. For example, this paragraph:

Chief among the modelers and, likewise, a great favorite of Napoleon, was the Marquis de Laplace, Senator of France and prince of the world’s physiciens geometres. Napoleon showered blessings on Laplace and once, on the theory that mathematicians with money can do everything, appointed him Minister of the Interior. Laplace lasted six weeks. Later Napoleon explained why: “Laplace did not look at any question from the proper point of view; he looked for subtleties everywhere, had only problematic ideas, and carried into administration the spirit of the infinitely small.”2 Having set down the burdens of office, Laplace could devote himself entirely to standard modeling, or, as he put it, to making physics as perfect as astronomy by importing into it the mathematics and the method of the theory of gravitation.3

The program should recognize footnote numbers 2 and 3 and reformat them into:
infinitely small ( 2 )." and gravitation ( 3 ).

The code that I have so far looks like this (including the regex changes recommended by Matthew):


Sub FindFootnote()
    Dim regEx, Match, Matches

    Set myRange = Selection.Range
    myRange.WholeStory ' the documents main story range
    Dim aSent As Object ' a sentence
    Dim aPara As Paragraph ' a paragraph
    Dim formattedFootnote As String

    Selection.GoTo wdGoToLine, wdGoToFirst

    For Each aPara In myRange.Paragraphs

        For Each aSent In aPara.Range.Sentences
            Set regEx = New RegExp ' Create a regular expression.
            regEx.Pattern = "([/?!.,""]+)(\d+)([\s\b\w*\b]?)" ' Set pattern.
            regEx.IgnoreCase = False ' Set case insensitivity.
            regEx.Global = True ' Set global applicability.

            Set Matches = regEx.Execute(aSent.Text) ' Execute search.
            For Each Match In Matches ' Iterate Matches collection.
                Footnote = MsgBox("Is " & Match.Value & " in " & Chr(34) & aSent & Chr(34) & " a footnote?", vbYesNoCancel + vbQuestion, "Footnote")

                If Footnote = 2 Then
                    Exit Sub
                ElseIf Footnote = 6 Then
                    formattedFootnote = regEx.Replace(Match.Value, " ( $2 )$1 $3")
                    DoRegularReplaceOne what:=Match.Value, repl:=formattedFootnote, textBold:=False
                End If
            Next
        Next
    Next
End Sub
With my code it finds footnote 3 but not footnote 2.
I think the problem is that I'm reading sentence by sentence, and I'm still not sure how I should change it. If you can give me any suggestions I'd really appreciate it.

Thanks

TonyJollans
11-26-2007, 12:54 AM
You can do this with Word's Find and Replace. It has a unique form of regex but should do the job.

Find: ([\?\!,."]{1,})([0-9]{1,})
Replace: ( \2 )\1
Check Use wildcards
Hit Replace all

(record a macro if you want basic code)
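
A macro for this Find and Replace could look roughly like the following. It is a minimal sketch rather than actual recorder output, and the sub name is made up. The replacement puts group 1 (the punctuation) after the bracketed number, since the Find pattern defines only two groups, and it starts with a space so that the result reads "small ( 2 )." rather than "small( 2 ).".

```vb
Sub ReformatFootnoteNumbers()
    With ActiveDocument.Content.Find
        .ClearFormatting
        .Replacement.ClearFormatting
        ' Group 1: one or more of ? ! , . "   Group 2: one or more digits
        ' (the doubled "" is a single straight quote inside a VBA string)
        .Text = "([\?\!,.""]{1,})([0-9]{1,})"
        ' Leading space, bracketed number, then the original punctuation
        .Replacement.Text = " ( \2 )\1"
        .MatchWildcards = True
        .Forward = True
        .Wrap = wdFindStop
        .Execute Replace:=wdReplaceAll
    End With
End Sub
```

Be aware that Replace All is indiscriminate: ordinary numbers such as 1,000 or 3.14 also fit the pattern, so it is worth trying this on a copy of the document first.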

r054
11-26-2007, 12:14 PM
Hi Tony,

Thank you for the suggestion. Your way is easier and simpler than mine :). Unfortunately it might not be enough, since reformatting the text from blablabla.123 to blablabla ( 123 ). is just part of a bigger problem.

The whole problem that I need to solve is to find text in the format I described before and change it to ( ### ). Then I need to search the same document, or another specified document (we ask the user which one), for a paragraph in the format ###(tab)blablabla, and move that paragraph to become a new paragraph below the paragraph where we found blabla.###

If we find more than one blabla.### in a paragraph, the paragraphs that we move below it have to be sorted by number

for example:

blablabla... problematic ideas, and carried into administration the spirit of the infinitely small.”2 Having set down the burdens of office, Laplace could devote himself entirely to standard modeling, or, as he put it, to making physics as perfect as astronomy by importing into it the mathematics and the method of the theory of gravitation.3

and at the end of the document we have list of footnotes such as:

2 Notes on conversation on St. Helena, quoted ibid., p. 110; cf. Maurice Crosland, The Society of Arcueil (Cambridge, Mass.: Harvard University Press, 1967), pp. 63-4.


3 Details about the Napoleonic standard model can be found in
J. L. Heilbron, Weighing Imponderables and other Quantitative Science Around 1800 (Berkeley: University of California Press, 1992).

The end result after the macro should look like:

blablabla... problematic ideas, and carried into administration the spirit of the infinitely small ( 2 )." Having set down the burdens of office, Laplace could devote himself entirely to standard modeling, or, as he put it, to making physics as perfect as astronomy by importing into it the mathematics and the method of the theory of gravitation ( 3 ).


2 Notes on conversation on St. Helena, quoted ibid., p. 110; cf. Maurice Crosland, The Society of Arcueil (Cambridge, Mass.: Harvard University Press, 1967), pp. 63-4.

3 Details about the Napoleonic standard model can be found in
J. L. Heilbron, Weighing Imponderables and other Quantitative Science Around 1800 (Berkeley: University of California Press, 1992).

so the two paragraphs (2 and 3) are moved to below the blablabla paragraph.

My plan for this problem is:
1. Reformat the footnote number to ( ### ).
2. Each time a footnote number is reformatted, add a bookmark with the same footnote number at the end of that particular paragraph.
3. Find the paragraph that starts with the same footnote number and replace the bookmark with that paragraph.

This is the best way I could think of, but as always... logically it sounds easy but the implementation is driving me nuts. And again, any suggestion is greatly appreciated.

Btw, when I implemented the find and replace, weird things still happened. The regular expression in Find does not recognize:
blablabla?.123 but it does recognize blablabla.?123
It also still doesn't catch the double quote sign (")

TonyJollans
11-26-2007, 12:31 PM
Any chance you could post a sample document?

You seem to have worked out the logic and just need the best way to code it. Is each 'footnote' (and I presume these are just normal text as far as Word is concerned) only referenced once? If not, are you not going to create duplicate bookmarks? Also how big are the documents - might it be feasible to build an array of the footnotes before you begin?

I know how regexes work but not the fine detail - one thought, which may not be relevant, is do the characters (?, ", etc.) need to be in ascii sequence in the pattern? Another possibility for the quotes is, are they actually straight quotes, or are they curly ones?
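
One way to answer the straight-versus-curly question is to select a single quote character in the document and print its character code. The snippet below is just an illustrative sketch (the sub name is made up): a straight double quote is code 34, while the curly opening and closing double quotes are 8220 and 8221.

```vb
Sub ShowSelectedCharCode()
    ' Select exactly one character in the document before running this
    Dim ch As String
    ch = Left$(Selection.Text, 1)
    ' 34 = straight ", 8220 = left curly quote, 8221 = right curly quote
    Debug.Print "Character code: " & AscW(ch)
End Sub
```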

r054
11-26-2007, 12:56 PM
Hi Tony,

I tried to upload the file here but it didn't let me since the file is in RTF format. I saved it as doc, but then it's too big.


Is each 'footnote' (and I presume these are just normal text as far as Word is concerned) only referenced once? If not, are you not going to create duplicate bookmarks?

Each footnote is only referred to once, so it won't create duplicate bookmarks.


Also how big are the documents - might it be feasible to build an array of the footnotes before you begin?

Each document has a different size; it can be 10 pages or 30 pages. The number of footnotes is between 5 and 30, so I think it's feasible to build an array.

TonyJollans
11-26-2007, 01:05 PM
Zip the document - then you'll be able to post it.

I think the threshold is 4 posts so you should be able to upload now - if not post again and then try (maybe posts in the Test forum count, I'm not sure)

r054
11-26-2007, 01:47 PM
the document is attached

TonyJollans
11-26-2007, 02:41 PM
Thanks.

I presume the numbers in the fifties followed by text "J L Heilbron" are meant to be page headers that somehow haven't translated from rtf properly.

I don't think I'll be able to look at it tonight but my gut feeling is that this can be done fairly easily in a single pass - well, two passes, one to locate the notes, one to embed them.

r054
11-26-2007, 03:33 PM
Hi Tony,

The numbers (45-54) followed by J.L. Heilbron are page numbers. This document is basically from a textbook that was scanned, read by OCR, and pasted (Paste Special) as unformatted text. So the final document will only have body text (no header, no footnote, etc).

TonyJollans
11-27-2007, 06:38 AM
I deleted the page headings. And then ran this:

Sub VBAX()

    Dim FindRange As Word.Range

    Dim NotesStart As Long
    Dim NotesPara As Word.Paragraph
    Dim NoteNumber As Long
    Dim NotesRange() As Word.Range

    Set FindRange = ActiveDocument.Content.Duplicate
    FindRange.Find.Execute FindText:="Notes^p^p", Forward:=False

    NotesStart = FindRange.Start
    For Each NotesPara In ActiveDocument. _
            Range(NotesStart, ActiveDocument.Range.End).Paragraphs

        If IsNumeric(NotesPara.Range.Words(1)) Then
            NoteNumber = CLng(NotesPara.Range.Words(1))
            If (Not Not NotesRange) = 0 Then
                ReDim NotesRange(1 To NoteNumber)
            Else
                If NoteNumber > UBound(NotesRange) Then
                    ReDim Preserve NotesRange(1 To NoteNumber)
                End If
            End If
            Set NotesRange(NoteNumber) = NotesPara.Range.Duplicate
        End If
    Next

    Set FindRange = ActiveDocument. _
            Range(ActiveDocument.Range.Start, NotesStart)

    With FindRange
        Do While FindRange.Find.Execute( _
                FindText:="([\?\!,.;" & ChrW(8221) & "]{1,})([0-9]{1,})", _
                MatchWildcards:=True, _
                ReplaceWith:=" ( \2 )\1", _
                Replace:=wdReplaceOne, _
                Forward:=False)

            NoteNumber = ActiveDocument.Range(.Start + 3, .Start + 3).Words(1)
            With .Paragraphs(1).Range
                .InsertParagraphAfter
                .InsertAfter NotesRange(NoteNumber).FormattedText
            End With
        Loop
    End With

End Sub


Note that I look for ChrW(8221) (a curly right double quote) rather than a straight quote - and I also had to add a semicolon to the list of characters.

Other than that it appears to work on the sample doc.

r054
12-17-2007, 10:20 PM
Hi Tony,

Thank you for your answer... The macro you wrote works really well for that particular document, but it doesn't work for the general case.

In most of the documents there won't be any "Notes" line, and the footnotes are not always at the end of the document. They can be anywhere (end of document or end of page). Do you have any suggestions for the general case?

I actually came up with a slightly different logic:

After we are done reformatting the footnote numbers, we will ask the user whether the referenced footnotes are in the same document. If they are in a different document (which means it is a bibliography), we will ignore them and nothing will be done. If they are in the same document, we will move each referenced footnote to a new paragraph below the paragraph that refers to it.

The process starts by searching for footnote references in the text, i.e. any number in the format ( ### ). The search runs from the end of the document to the beginning. Once we find one, there are a few steps to be done:
1. Create a unique bookmark # 1 at the place where we found the footnote reference.
2. Create another unique bookmark # 2 at the end of the paragraph.
3. Extract the number from the brackets.
4. Look for a paragraph in the text that starts with the same number as the extracted number.
5. Copy the paragraph.
6. Paste the paragraph at bookmark # 2.
7. Remove bookmark # 2.
8. Place the cursor back at bookmark # 1.
9. Continue the search upward until the beginning of the paragraph.

But I'm not sure how to search upward one match at a time. I'm not an expert at using find and replace.

And I'm not sure if it's the best way to do it.

TonyJollans
12-18-2007, 01:38 AM
I think the general case is too complex to deal with.

And, to be honest, I'm not sure it's entirely legal to scan and manipulate other people's printed texts in this way, so I'm not really inclined to pursue it further.

To look upwards in a Find operation, use

.Forward = False
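
As a rough illustration of stepping upward one match at a time, the loop below searches backward for the reformatted ( ### ) markers and prints each one. It is a sketch only: the sub name and the wildcard pattern are assumptions based on the format described in this thread, and the Debug.Print line is a placeholder for whatever per-match work is needed.

```vb
Sub ListFootnoteMarkersBackward()
    Dim rng As Word.Range
    Set rng = ActiveDocument.Content

    With rng.Find
        .ClearFormatting
        ' Parentheses are escaped because they group in wildcard searches
        .Text = "\( [0-9]{1,} \)"
        .MatchWildcards = True
        .Forward = False      ' search from the end toward the start
        .Wrap = wdFindStop    ' stop at the beginning instead of wrapping around
    End With

    Do While rng.Find.Execute
        ' rng now covers the match that was just found
        Debug.Print rng.Text, rng.Start
        ' Collapse to the start so the next search continues above this match
        rng.Collapse wdCollapseStart
    Loop
End Sub
```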