Correction of wrong usage of spaces



translator_
10-02-2012, 07:03 AM
It would be great if someone could come up with a macro that corrects the wrong usage of spaces.
(At the end of the attached file there is a sample Word text and the ideal outcome of the macro.)

It should work for English and Greek (both lowercase and capitals). I think the Greek range is defined with [;-ώ].

Before these characters there should be no space and there should always be one space after them.
. [period]
, [comma]

Exception 1: not if a number precedes and follows, e.g. 100,000.34 should not become 100, 000. 34
Exception 2 (period only): not if it is part of an abbreviation, e.g. U.S.A., i.e., π.χ., κ.τ.λ., should not become U. S. A. , i. e. , π. χ. , κ. τ. λ. ,

Before these characters there should be no space and there should always be one space after them.

... [three dots]
… [three dots character, i.e. ellipsis]
· [Greek ano teleia character]
• [middle dot character]
:
;
?
!

Before these characters there should be no space and there should always be one space after them (exception: no space if a comma, period, exclamation mark, semicolon follows, for example [I think], should not become [I think] ,).

” [closing (right) smart quote, not rendered properly, see in attached file]

»
]
)
}

Before these characters there should be one space and no space afterwards.

“ [opening (left) smart quote, not rendered properly, see in attached file]

[
(
{

macropod
10-03-2012, 10:59 PM
Looking for and deleting errant spaces in numbers and around punctuation marks can be done without too much difficulty, but acronym handling would require the construction of a truly comprehensive list. Here's a macro that will handle the basic comma/period/space issues for standard Latin text.
Sub CleanUpText()
    ' Turn Off Screen Updating
    Application.ScreenUpdating = False
    With ActiveDocument.Content.Find
        .ClearFormatting
        .Replacement.ClearFormatting
        .Forward = True
        .Wrap = wdFindContinue
        .Format = False
        .MatchAllWordForms = False
        .MatchSoundsLike = False
        .MatchWildcards = True
        'Replace all multiple spaces with single spaces
        .Text = "[ ]{2,}"
        .Replacement.Text = " "
        .Execute Replace:=wdReplaceAll
        'Swap spaces followed by periods or commas
        .Text = "( )([.,])"
        .Replacement.Text = "\2\1"
        .Execute Replace:=wdReplaceAll
        'Insert a space between all lower-case letters followed by
        ' a period or comma then an upper-case letter or number.
        .Text = "([a-z][.,])((A-Z0-9])"
        .Replacement.Text = "\1 \2"
        .Execute Replace:=wdReplaceAll
        'Replace all multiple spaces with single spaces
        .Text = "[ ]{2,}"
        .Replacement.Text = " "
        .Execute Replace:=wdReplaceAll
        'Close up spaces between numbers with periods or commas
        .Text = "([0-9][.,]) ([0-9])"
        .Replacement.Text = "\1\2"
        .Execute Replace:=wdReplaceAll
        'Fix double quotes
        .Text = """"
        .Replacement.Text = "^&"
        .Execute Replace:=wdReplaceAll
    End With
    ' Restore Screen Updating
    Application.ScreenUpdating = True
End Sub
As for the mal-formatted quote characters, the macro has a go but, if the problem is that there's a space character on the wrong side of the double-quote, it would be quite a chore to code for, and even then the results might be flaky.

Note: Depending on your regional settings, you may need to change the ',' in the '{2,}' expressions to ';'.
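Rather than editing the separator by hand, the pattern could be built from whatever list separator Word reports for the current regional settings. The following is a minimal sketch, not part of the macro above; the procedure name CollapseMultipleSpaces is made up for illustration, and it assumes that the separator Word reports is the one the wildcard engine expects.

```vba
Sub CollapseMultipleSpaces()
    ' Hypothetical helper: ask Word for the list separator of the
    ' current regional settings (typically "," or ";").
    Dim sep As String
    sep = Application.International(wdListSeparator)

    With ActiveDocument.Content.Find
        .ClearFormatting
        .Replacement.ClearFormatting
        .Forward = True
        .Wrap = wdFindContinue
        .Format = False
        .MatchWildcards = True
        ' Builds "[ ]{2,}" or "[ ]{2;}" depending on the locale
        .Text = "[ ]{2" & sep & "}"
        .Replacement.Text = " "
        .Execute Replace:=wdReplaceAll
    End With
End Sub
```

The same sep variable could be spliced into every other {n,} expression in the macro, so that one copy of the code runs under either setting.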

translator_
10-04-2012, 05:39 AM
Many thanks. I get the error "The find what expression contains a pattern match expression which is not valid." When debugging, it highlights:

.Text = "[ ]{2,}"
.Replacement.Text = " "
.Execute Replace:=wdReplaceAll

I tried {2;} in both instances too, and it gives the error "The find what text contains a range that is not valid", highlighting:

.Text = "([a-z][.,])((A-Z0-9])"
.Replacement.Text = "\1 \2"
.Execute Replace:=wdReplaceAll

which I fixed by changing the second group to:

([A-Z0-9])

When running the macro on the test text, I get:

Mr Berlusconi is on trial in two corruption cases.Βut legislation being discussed in “ parliament ” ;would in effect stop him going to court ?
The protesters accuse the PM of seeking : to undermine the legal system! He says he is i.e. the ( victim ) of political [persecution ], by the judiciary, which he recently compared to the Taliban.

The instances marked in red are spacing issues that were not fixed. The results improved somewhat when I changed the expression to:

[.,\!\?:\]\)]

Wouldn't it be possible to handle the initialisms issue with an expression that would exclude matches like [anyletter.anyletter.anyletter.] and [anyletter.anyletter.] instead of using a comprehensive list?
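One way to picture the pattern-based idea in the question is a rule that only inserts a space when the period or comma follows at least two letters, so that single-letter runs such as U.S.A., i.e. or κ.τ.λ. are never touched. The sketch below is an assumption-laden illustration, not the macro from this thread: it works on a plain string through the VBScript.RegExp object (Windows only) instead of Word's wildcard search, the function name is hypothetical, and the two-letter rule is a heuristic that would still split abbreviations with longer parts (Ph.D.) or web addresses.

```vba
' Hypothetical helper, not part of the macros posted in this thread.
Function SpaceAfterSentencePunctuation(ByVal s As String) As String
    Dim re As Object
    Set re = CreateObject("VBScript.RegExp")   ' late binding, Windows only
    re.Global = True

    ' Basic Latin letters plus the Greek and Coptic block (U+0370-U+03FF)
    ' and the Greek Extended block (U+1F00-U+1FFF).
    Const LETTER As String = "[A-Za-z\u0370-\u03FF\u1F00-\u1FFF]"

    ' Two or more letters, then a period or comma, and (as a lookahead,
    ' so it is not consumed) another letter directly afterwards.
    re.Pattern = "(" & LETTER & "{2,}[.,])(?=" & LETTER & ")"

    ' Re-emit the matched text followed by one space.
    SpaceAfterSentencePunctuation = re.Replace(s, "$1 ")
End Function
```

Calling it as Debug.Print SpaceAfterSentencePunctuation("cases.But U.S.A.") from the Immediate window shows the effect on a test string. Digits are not in the letter class, so a number such as 100,000.34 is not matched by this rule.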

macropod
10-04-2012, 07:49 PM
The "([a-z][.,])([A-Z0-9])" test will only fix problems like 'cases.Βut' where the letters are from the ASCII character set. In this case your 'Β' is not an ASCII 66 'B', but is the Greek Beta! As I said, the code hadn't been written to handle Greek text. Try the following. As you'll see, handling Greek text adds a lot more complexity. The handling of quotes is, as I said in my previous post, limited. It's a little 'richer' now, but the results can't be guaranteed if the quotes were completely misplaced.
Sub CleanUpText()
    ' Turn Off Screen Updating
    Application.ScreenUpdating = False
    With ActiveDocument.Content.Find
        .ClearFormatting
        .Replacement.ClearFormatting
        .Forward = True
        .Wrap = wdFindContinue
        .Format = False
        .MatchAllWordForms = False
        .MatchSoundsLike = False
        .MatchWildcards = True
        'Replace all multiple spaces with single spaces
        .Text = "[ ]{2;}"
        .Replacement.Text = " "
        .Execute Replace:=wdReplaceAll
        'Swap spaces followed by punctuation & closing formatted quotes & brackets
        .Text = "( )([.,Ąż”’;\!\?:\]\)\}\]]{1;})"
        .Replacement.Text = "\2\1"
        .Execute Replace:=wdReplaceAll
        .Execute Replace:=wdReplaceAll
        'Swap opening formatted quotes & brackets followed by spaces
        .Text = "([\{\[\(“`]{1;})( )"
        .Replacement.Text = "\2\1"
        .Execute Replace:=wdReplaceAll
        .Execute Replace:=wdReplaceAll
        'Insert a space between all lower-case letters followed by
        ' punctuation or a bracket then an upper-case letter or number.
        .Text = "([a-z" & ChrW(&H3AC) & "-" & ChrW(&H3CE) & ChrW(&H1F00) & "-" & ChrW(&H1F07) & _
            ChrW(&H1F10) & "-" & ChrW(&H1F15) & ChrW(&H1F20) & "-" & ChrW(&H1F27) & _
            ChrW(&H1F30) & "-" & ChrW(&H1F37) & ChrW(&H1F40) & "-" & ChrW(&H1F45) & _
            ChrW(&H1F50) & "-" & ChrW(&H1F57) & ChrW(&H1F60) & "-" & ChrW(&H1F67) & _
            ChrW(&H1F70) & "-" & ChrW(&H1F87) & ChrW(&H1F90) & "-" & ChrW(&H1F97) & _
            ChrW(&H1FA0) & "-" & ChrW(&H1FA7) & ChrW(&H1FB0) & "-" & ChrW(&H1FB7) & _
            ChrW(&H1FC2) & "-" & ChrW(&H1FC7) & ChrW(&H1FD0) & "-" & ChrW(&H1FD7) & _
            ChrW(&H1FE0) & "-" & ChrW(&H1FE7) & "][.,Ąż”’;\!\?:\]\)\}\]\{\[\(“`]{1,})([A-Z0-9" & _
            ChrW(&H386) & "-" & ChrW(&H3AB) & ChrW(&H1F08) & "-" & ChrW(&H1F0F) & _
            ChrW(&H1F18) & "-" & ChrW(&H1F1D) & ChrW(&H1F28) & "-" & ChrW(&H1F2F) & _
            ChrW(&H1F38) & "-" & ChrW(&H1F3F) & ChrW(&H1F48) & "-" & ChrW(&H1F4D) & _
            ChrW(&H1F59) & "-" & ChrW(&H1F5F) & ChrW(&H1F68) & "-" & ChrW(&H1F6F) & _
            ChrW(&H1F88) & "-" & ChrW(&H1F8F) & ChrW(&H1F98) & "-" & ChrW(&H1F9F) & _
            ChrW(&H1FA8) & "-" & ChrW(&H1FAF) & ChrW(&H1FB8) & "-" & ChrW(&H1FBC) & _
            ChrW(&H1FC8) & "-" & ChrW(&H1FCC) & ChrW(&H1FD8) & "-" & ChrW(&H1FDB) & _
            ChrW(&H1FE8) & "-" & ChrW(&H1FEC) & ChrW(&H1FF8) & "-" & ChrW(&H1FFC) & "])"
        .Replacement.Text = "\1 \2"
        .Execute Replace:=wdReplaceAll
        .Execute Replace:=wdReplaceAll
        'Replace all multiple spaces with single spaces
        .Text = "[ ]{2;}"
        .Replacement.Text = " "
        .Execute Replace:=wdReplaceAll
        'Close up spaces between numbers with periods or commas
        .Text = "([0-9][.,]) ([0-9])"
        .Replacement.Text = "\1\2"
        .Execute Replace:=wdReplaceAll
        'Fix leading spaces
        .Text = "([^09-^13])[ ]{1;}"
        .Replacement.Text = "\1"
        .Execute Replace:=wdReplaceAll
        'Fix trailing spaces
        .Text = "[ ]{1;}([^09-^13])"
        .Replacement.Text = "\1"
        .Execute Replace:=wdReplaceAll
        'Fix double quote formatting
        .Text = """"
        .Replacement.Text = "^&"
        .Execute Replace:=wdReplaceAll
    End With
    ' Restore Screen Updating
    Application.ScreenUpdating = True
End Sub
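
The difference between the ASCII 'B' and the Greek Beta mentioned in this post can be checked from the VBA editor's Immediate window. This is a small illustrative sketch; the procedure name is made up, and the Greek letter is built with ChrW because the VBA editor may not display it when typed directly.

```vba
Sub ShowLookAlikeCodePoints()
    ' Hypothetical demo: the two letters look the same on screen
    ' but have different character codes.
    Dim latinB As String
    Dim greekBeta As String

    latinB = "B"                   ' Latin capital letter B
    greekBeta = ChrW(&H392)        ' U+0392 GREEK CAPITAL LETTER BETA

    Debug.Print AscW(latinB)       ' 66
    Debug.Print AscW(greekBeta)    ' 914
End Sub
```

Because 914 lies outside the A-Z range, a range such as [A-Z] does not cover the Greek letter, which is why the revised macro adds the Greek ranges with ChrW.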

translator_
10-08-2012, 03:57 PM
You are right, it does get complex. So we can dispense with quotes to simplify. When I ran it, I got: "The find what expression contains a Pattern match expression which is not valid". Debugging highlighted the third line:


ChrW(&H1FE8) & "-" & ChrW(&H1FEC) & ChrW(&H1FF8) & "-" & ChrW(&H1FFC) & "])"
.Replacement.Text = "\1 \2"
.Execute Replace:=wdReplaceAll

Also, the result was punctuation/quote marks duplication:

Original:
Mr Berlusconi is on trial in two corruption cases.Βut legislation being discussed in “ parliament ” ;would in effect stop him going to court ?

After macro:

Mr Berlusconi is on trial in two corruption cases.Βut legislation being discussed in ““parliament” ”; ; would in effect stop him going to court? ?

macropod
10-08-2012, 08:53 PM
Re:

"The find what expression contains a Pattern match expression which is not valid"
I missed one of the required separator changes! Change:
ChrW(&H1FE0) & "-" & ChrW(&H1FE7) & "][.,Ąż”’;\!\?:\]\)\}\]\{\[\(“`]{1,})([A-Z0-9" & _
to:
ChrW(&H1FE0) & "-" & ChrW(&H1FE7) & "][.,Ąż”’;\!\?:\]\)\}\]\{\[\(“`]{1;})([A-Z0-9" & _

As for

the result was punctuation/quote marks duplication:
I am unable to reproduce that behaviour and there is nothing in the code I posted that would be capable of doing so.

translator_
10-09-2012, 10:03 AM
I did try on a different PC with both Word 2003 and 2010. Same results (i.e. punctuation duplication, using your latest version with the correction).

However, the issue from the other thread did not appear on this PC:
http://www.vbaexpress.com/forum/showthread.php?t=43864

Maybe something to do with locale? Who knows...

macropod
10-10-2012, 03:04 AM
As per the other thread, this is something that's probably due to differences in addins and/or configuration that you'll have to sort out.