
Thread: Solved: Regular Expression in Word

  1. #1

    Solved: Regular Expression in Word

    Hi,

    I've just started using VBA and need to work with regular expressions.
    Does anybody know how to write the characters ?, " and ! in a VBA regex pattern?
    I have this regular expression

    [vba]regEx.Pattern = "([?!.,]+)(\d+)([\s\b\w*\b]?)"[/vba]
    This regular expression should find:
    blablabla.78 blablabla >> .78
    blablabla?"129 >> ?"129
    blablabla!1 >> !1

    but when I run it, it only recognizes . and , but not ? or !

    For the " case I used the ASCII code, and it doesn't find that either.
    Does anybody know how to do this?

  2. #2
    TonyJollans (VBAX Master, joined May 2004, Norfolk, England, 2,290 posts)
    From my understanding of regular expressions I agree with you, so how are you running it? Show us some more code.
    Enjoy,
    Tony

    ---------------------------------------------------------------
    Give a man a fish and he'll eat for a day.
    Teach him how to fish and he'll sit in a boat and drink beer all day.

    I'm (slowly) building my own site: www.WordArticles.com

  3. #3
    The ? is a special character in regex, so I escaped it (inside a character
    class like [...] the backslash is optional, but it does no harm). You also
    need to double up on the double quotes so that VBA doesn't treat
    one as the end of the string literal. This is working on the sample data:

    [VBA]
    regEx.Pattern = "([\?!.,""]+)(\d+)([\s\b\w*\b]?)"
    [/VBA]
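    A quick way to check this pattern against the three sample strings from the first post is a throwaway test macro. This is only a sketch: the macro name is made up, it uses late binding (CreateObject) so no library reference is needed, and it leaves out the third group to keep things short.

    [vba]
    Sub TestFootnotePattern()
        Dim re As Object
        Dim samples As Variant, s As Variant

        Set re = CreateObject("VBScript.RegExp")
        ' "" inside a VBA string literal stands for one " character.
        re.Pattern = "([\?!.,""]+)(\d+)"

        samples = Array("blablabla.78 blablabla", "blablabla?""129", "blablabla!1")
        For Each s In samples
            If re.Test(s) Then
                ' Print the first match found in this sample.
                Debug.Print re.Execute(s)(0).Value
            Else
                Debug.Print "no match: " & s
            End If
        Next s
    End Sub
    [/vba]
    The output goes to the Immediate window (Ctrl+G in the VBA editor).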
    Regards,

    Patrick

    I wept for myself because I had no PivotTable.

    Then I met a man who had no AutoFilter.

    Microsoft MVP for Excel, 2007 & 2008

  4. #4
    Hmm...

    After checking it carefully, I think my problem might not be in the regular expression. I should probably explain it more clearly.

    I'm trying to reformat a document that has many footnotes. In the unedited document, the footnote references are just plain numbers at the end of a sentence, with no space before them. For example, take this paragraph:

    Chief among the modelers and, likewise, a great favorite of Napoleon, was the Marquis de Laplace, Senator of France and prince of the world’s physiciens geometres. Napoleon showered blessings on Laplace and once, on the theory that mathematicians with money can do everything, appointed him Minister of the Interior. Laplace lasted six weeks. Later Napoleon explained why: “Laplace did not look at any question from the proper point of view; he looked for subtleties everywhere, had only problematic ideas, and carried into administration the spirit of the infinitely small.”2 Having set down the burdens of office, Laplace could devote himself entirely to standard modeling, or, as he put it, to making physics as perfect as astronomy by importing into it the mathematics and the method of the theory of gravitation.3

    The program should recognize footnote numbers 2 and 3 and reformat them into:
    infinitely small ( 2 )." and gravitation ( 3 ).

    The code that I have so far looks like this (including the regex changes recommended by Matthew):

    [vba]
    Sub FindFootnote()
        ' "New RegExp" needs a reference to Microsoft VBScript Regular Expressions 5.5.
        Dim regEx As RegExp, Match As Object, Matches As Object
        Dim myRange As Range
        Dim aSent As Range ' a sentence
        Dim aPara As Paragraph ' a paragraph
        Dim formattedFootnote As String
        Dim Footnote As VbMsgBoxResult

        Set myRange = Selection.Range
        myRange.WholeStory ' the document's main story range

        Selection.GoTo wdGoToLine, wdGoToFirst

        Set regEx = New RegExp ' Create the regular expression once.
        regEx.Pattern = "([\?!.,""]+)(\d+)([\s\b\w*\b]?)" ' Set pattern.
        regEx.IgnoreCase = False ' Match case-sensitively.
        regEx.Global = True ' Find every match, not just the first.

        For Each aPara In myRange.Paragraphs

            For Each aSent In aPara.Range.Sentences
                Set Matches = regEx.Execute(aSent.Text) ' Execute search.
                For Each Match In Matches ' Iterate Matches collection.
                    Footnote = MsgBox("Is " & Match.Value & " in " & Chr(34) & aSent.Text & Chr(34) & " a footnote?", vbYesNoCancel + vbQuestion, "Footnote")

                    If Footnote = vbCancel Then
                        Exit Sub
                    ElseIf Footnote = vbYes Then
                        formattedFootnote = regEx.Replace(Match.Value, " ( $2 )$1 $3")
                        ' DoRegularReplaceOne is my own helper (not shown here).
                        DoRegularReplaceOne what:=Match.Value, repl:=formattedFootnote, textBold:=False
                    End If
                Next
            Next
        Next
    End Sub[/vba]
    With this code it finds footnote 3 but not footnote 2.
    I think the problem is that I'm reading the text sentence by sentence, but I'm not sure how I should change that. Any suggestions would be much appreciated.
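    One possible explanation, offered as an assumption and not something confirmed in this thread: Word may end the sentence right after the closing quotation mark, so the 2 becomes the first character of the next sentence and has no punctuation in front of it in that string. A minimal sketch that sidesteps the question by running the regex over whole paragraphs (the macro name is made up, and late binding is used so no library reference is needed):

    [vba]
    Sub ListFootnoteCandidates()
        Dim re As Object, m As Object
        Dim para As Paragraph

        Set re = CreateObject("VBScript.RegExp")
        re.Global = True
        re.Pattern = "([?!.,""]+)(\d+)"

        For Each para In ActiveDocument.Paragraphs
            ' Search the whole paragraph text, not one sentence at a time.
            For Each m In re.Execute(para.Range.Text)
                ' SubMatches is zero-based: (0) is the punctuation, (1) the number.
                Debug.Print m.SubMatches(1), m.Value
            Next m
        Next para
    End Sub
    [/vba]
    This only lists candidates in the Immediate window. It does not change the document.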

    Thanks

  5. #5
    TonyJollans
    You can do this with Word's Find and Replace. It has its own flavor of regex (wildcards), but it should do the job.

    Find: ([\?\!,."]{1,})([0-9]{1,})
    Replace: ( \2 )\1 (with a space in front of the opening parenthesis)
    Check Use wildcards
    Hit Replace all

    (record a macro if you want basic code)
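    For reference, a recorded macro boils down to something like the sketch below. The macro name is made up, and the replacement text assumes the punctuation should end up after the bracketed number.

    [vba]
    Sub ReformatFootnoteNumbers()
        With ActiveDocument.Content.Find
            .ClearFormatting
            .Replacement.ClearFormatting
            ' Group 1 = punctuation run, group 2 = digits ("" is one " in VBA).
            .Text = "([\?\!,.""]{1,})([0-9]{1,})"
            .Replacement.Text = " ( \2 )\1"
            .MatchWildcards = True
            .Forward = True
            .Wrap = wdFindStop
            .Execute Replace:=wdReplaceAll
        End With
    End Sub
    [/vba]
    Be aware that the pattern also matches ordinary numbers such as 3.14, so stepping through the matches one at a time may be safer than Replace All on real text.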
    Enjoy,
    Tony


  6. #6
    Hi Tony,

    Thank you for the suggestion. Your way is simpler than mine. Unfortunately it might not be enough, since reformatting the text from blablabla.123 to blablabla ( 123 ). is only part of a bigger problem.

    The whole task is to find text in the format I described before and change it to ( ### ). Then, in the same document or another document that the user specifies, find the paragraph with the format ###(tab)blablabla and move it to a new paragraph below the paragraph where blabla.### was found.

    If a paragraph contains more than one blabla.###, the paragraphs moved below it have to be sorted by number.

    For example:

    blablabla... problematic ideas, and carried into administration the spirit of the infinitely small.”2 Having set down the burdens of office, Laplace could devote himself entirely to standard modeling, or, as he put it, to making physics as perfect as astronomy by importing into it the mathematics and the method of the theory of gravitation.3

    and at the end of the document we have a list of footnotes such as:

    2 Notes on conversation on St. Helena, quoted ibid., p. 110; cf. Maurice Crosland, The Society of Arcueil (Cambridge, Mass.: Harvard University Press, 1967), pp. 63-4.


    3 Details about the Napoleonic standard model can be found in
    J. L. Heilbron, Weighing Imponderables and other Quantitative Science Around 1800 (Berkeley: University of California Press, 1992).

    The end result after the macro should look like:

    blablabla... problematic ideas, and carried into administration the spirit of the infinitely small ( 2 )." Having set down the burdens of office, Laplace could devote himself entirely to standard modeling, or, as he put it, to making physics as perfect as astronomy by importing into it the mathematics and the method of the theory of gravitation ( 3 ).


    2 Notes on conversation on St. Helena, quoted ibid., p. 110; cf. Maurice Crosland, The Society of Arcueil (Cambridge, Mass.: Harvard University Press, 1967), pp. 63-4.

    3 Details about the Napoleonic standard model can be found in
    J. L. Heilbron, Weighing Imponderables and other Quantitative Science Around 1800 (Berkeley: University of California Press, 1992).

    so the two paragraphs (2 and 3) are moved to below the blablabla paragraph.

    My plan for this problem is:
    1. Reformat the footnote number to ( ### ).
    2. Each time a footnote number is reformatted, add a bookmark carrying the same footnote number at the end of that paragraph.
    3. Find the paragraph that starts with the same footnote number and replace the bookmark with that paragraph.

    This is the best way I could think of, but as always... logically it sounds easy but the implementation is driving me nuts. And again, any suggestion is greatly appreciated.
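    Step 2 of the plan could be sketched like this. The helper name and the "fn" prefix are made up for illustration. The prefix is needed because a Word bookmark name has to start with a letter, so a bare number will not work.

    [vba]
    Sub MarkParagraphEnd(ByVal para As Paragraph, ByVal noteNumber As Long)
        Dim marker As Range

        Set marker = para.Range.Duplicate
        ' Step back over the paragraph mark, then collapse to that position.
        marker.MoveEnd Unit:=wdCharacter, Count:=-1
        marker.Collapse Direction:=wdCollapseEnd

        ' Creates a bookmark such as "fn2" just before the paragraph mark.
        marker.Document.Bookmarks.Add Name:="fn" & noteNumber, Range:=marker
    End Sub
    [/vba]
    It could be called as MarkParagraphEnd Selection.Paragraphs(1), 2 and the position found again later through ActiveDocument.Bookmarks("fn2").Range.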

    By the way, when I tried the Find and Replace, weird things still happen. The Find pattern does not recognize
    blablabla?.123, but it does recognize blablabla.?123
    It also still doesn't catch the double quote sign (")

  7. #7
    TonyJollans
    Any chance you could post a sample document?

    You seem to have worked out the logic and just need the best way to code it. Is each 'footnote' (and I presume these are just normal text as far as Word is concerned) only referenced once? If not, are you not going to create duplicate bookmarks? Also how big are the documents - might it be feasible to build an array of the footnotes before you begin?

    I know how regexes work but not the fine detail - one thought, which may not be relevant, is do the characters (?, ", etc.) need to be in ascii sequence in the pattern? Another possibility for the quotes is, are they actually straight quotes, or are they curly ones?
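    To check which kind of quote the document really contains, select the character and print its code. A small sketch (the macro name is made up): 34 is a straight double quote, while 8220 and 8221 are the curly opening and closing double quotes.

    [vba]
    Sub ShowSelectedCharCode()
        ' Prints the Unicode code of the first selected character
        ' to the Immediate window.
        If Len(Selection.Text) > 0 Then
            Debug.Print AscW(Left$(Selection.Text, 1))
        End If
    End Sub
    [/vba]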
    Enjoy,
    Tony


  8. #8
    Hi Tony,

    I tried to upload the file here but it didn't let me since the file is in RTF format. I saved it as doc, but then it's too big.

    Is each 'footnote' (and I presume these are just normal text as far as Word is concerned) only referenced once? If not, are you not going to create duplicate bookmarks?
    Each footnote is only referred to once, so it won't create duplicate bookmarks.

    Also how big are the documents - might it be feasible to build an array of the footnotes before you begin?
    Each document has a different size; it can be 10 pages or 30 pages, with between 5 and 30 footnotes. So I think it's feasible to build an array.

  9. #9
    TonyJollans
    Zip the document - then you'll be able to post it.

    I think the threshold is 4 posts so you should be able to upload now - if not post again and then try (maybe posts in the Test forum count, I'm not sure)
    Enjoy,
    Tony


  10. #10
    The document is attached.

  11. #11
    TonyJollans
    Thanks.

    I presume the numbers in the fifties followed by text "J L Heilbron" are meant to be page headers that somehow haven't translated from rtf properly.

    I don't think I'll be able to look at it tonight but my gut feeling is that this can be done fairly easily in a single pass - well, two passes, one to locate the notes, one to embed them.
    Enjoy,
    Tony


  12. #12
    Hi Tony,

    The numbers (45-54) followed by J.L. Heilbron are page numbers. The document comes from a scanned textbook that was run through OCR and then pasted with Paste Special as unformatted text. So the final document will only have a body (no headers, no footnotes, etc.).

  13. #13
    TonyJollans
    I deleted the page headings. And then ran this:
    [vba]
    Sub VBAX()

        Dim FindRange As Word.Range

        Dim NotesStart As Long
        Dim NotesPara As Word.Paragraph
        Dim NoteNumber As Long
        Dim NotesRange() As Word.Range

        ' Locate the "Notes" heading by searching backward from the end.
        Set FindRange = ActiveDocument.Content.Duplicate
        FindRange.Find.Execute FindText:="Notes^p^p", Forward:=False

        ' Pass 1: store each numbered note paragraph under its note number.
        NotesStart = FindRange.Start
        For Each NotesPara In ActiveDocument. _
                Range(NotesStart, ActiveDocument.Range.End).Paragraphs

            If IsNumeric(NotesPara.Range.Words(1)) Then
                NoteNumber = CLng(NotesPara.Range.Words(1))
                ' (Not Not array) = 0 is a trick to detect an array
                ' that has not been dimensioned yet.
                If (Not Not NotesRange) = 0 Then
                    ReDim NotesRange(1 To NoteNumber)
                Else
                    If NoteNumber > UBound(NotesRange) Then
                        ReDim Preserve NotesRange(1 To NoteNumber)
                    End If
                End If
                Set NotesRange(NoteNumber) = NotesPara.Range.Duplicate
            End If
        Next

        ' Pass 2: work backward through the body text, reformat each
        ' reference and insert the matching note after its paragraph.
        Set FindRange = ActiveDocument. _
                Range(ActiveDocument.Range.Start, NotesStart)

        With FindRange
            Do While FindRange.Find.Execute( _
                    FindText:="([\?\!,.;" & ChrW(8221) & "]{1,})([0-9]{1,})", _
                    MatchWildcards:=True, _
                    ReplaceWith:=" ( \2 )\1", _
                    Replace:=wdReplaceOne, _
                    Forward:=False)

                ' The reformatted text starts with " ( ", so the number
                ' begins 3 characters in.
                NoteNumber = ActiveDocument.Range(.Start + 3, .Start + 3).Words(1)
                With .Paragraphs(1).Range
                    .InsertParagraphAfter
                    .InsertAfter NotesRange(NoteNumber).FormattedText
                End With
            Loop
        End With

    End Sub

    [/vba]
    Note that I look for ChrW(8221) (a curly right double quote) rather than a straight quote - and I also had to add a semicolon to the list of characters.

    Other than that it appears to work on the sample doc.
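    If the (Not Not NotesRange) = 0 check on the array feels too obscure, the notes could instead be collected in a Scripting.Dictionary keyed by note number. This is only a sketch of that alternative, not what the macro above does. The function name CollectNotes is made up, and it assumes the Windows scripting runtime is available.

    [vba]
    Function CollectNotes(ByVal notesArea As Range) As Object
        Dim notes As Object
        Dim para As Paragraph
        Dim firstWord As String

        Set notes = CreateObject("Scripting.Dictionary")
        For Each para In notesArea.Paragraphs
            firstWord = Trim$(para.Range.Words(1).Text)
            If IsNumeric(firstWord) Then
                ' Keep the first paragraph seen for each note number.
                If Not notes.Exists(CLng(firstWord)) Then
                    notes.Add CLng(firstWord), para.Range.Duplicate
                End If
            End If
        Next para

        Set CollectNotes = notes
    End Function
    [/vba]
    With this, notes.Exists(NoteNumber) can be tested before a note is inserted, so a reference without a matching note does not stop the macro.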
    Enjoy,
    Tony


  14. #14
    Hi Tony,

    Thank you for your answer. Your macro works really well for that particular document, but it doesn't work for the general case.

    Most of the documents won't have a "Notes" line, and the footnotes are not always at the end of the document. They can be anywhere (end of the document or end of a page). Do you have any suggestions for the general case?

    I actually came up with a slightly different logic:

    After we are done reformatting the footnote references, we ask the user whether the referenced footnotes are in the same document. If they are in a different document (which means it is a bibliography), we ignore them and do nothing. If they are in the same document, we move each referenced footnote to a new paragraph below the paragraph that refers to it.

    The process starts by searching the text for footnote references, i.e. any number in the format ( ### ). The search runs from the end of the document to the beginning. Once a reference is found, there are a few steps to carry out:
    1. Create a unique bookmark #1 at the place where we find the footnote.
    2. Create another unique bookmark #2 at the end of the paragraph.
    3. Extract the number from the brackets.
    4. Look for a paragraph in the text that starts with the same number as the extracted number.
    5. Copy the paragraph.
    6. Paste the paragraph at bookmark #2.
    7. Remove bookmark #2.
    8. Place the cursor back at bookmark #1.
    9. Continue the search upward until the beginning of the paragraph.

    But I'm not sure how to search upward one match at a time. I'm not an expert with Find and Replace.

    And I'm not sure if it's the best way to do it.

  15. #15
    TonyJollans
    I think the general case is too complex to deal with.

    And, to be honest, I'm not sure it's entirely legal to scan and manipulate other people's printed texts in this way, so I'm not really inclined to pursue it further.

    To look upwards in a Find operation, use
    [VBA]
    .Forward = False
    [/VBA]
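    As a rough sketch of how that fits into a loop that visits the ( ### ) markers one at a time from the end of the document (the macro name is made up, and the pattern assumes exactly one space on each side of the number):

    [vba]
    Sub WalkFootnoteMarkersBackward()
        Dim hit As Range

        Set hit = ActiveDocument.Content
        With hit.Find
            .ClearFormatting
            .Text = "\( [0-9]{1,} \)"
            .MatchWildcards = True
            .Forward = False
            .Wrap = wdFindStop
        End With

        Do While hit.Find.Execute
            ' hit now covers one marker; Val reads the digits after "( ".
            Debug.Print hit.Start, Val(Mid$(hit.Text, 3))
            ' Collapse to the start so the next search continues above it.
            hit.Collapse wdCollapseStart
        Loop
    End Sub
    [/vba]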
    Enjoy,
    Tony

