Finding ill converted Word Perfect Characters



gmaxey
12-08-2013, 05:54 PM
I have some documents that were converted from old Word Perfect files. During the conversion, opening and closing quotation marks were replaced with an "A" and an at symbol (@), e.g., "Hello" looks like AHello@.

VBA isn't able to detect these characters accurately, e.g., if I select one and run:

?Asc(Selection.Text)

it returns 40 regardless of whether I select the A or the @.

To find these instances, I am using:

(A)(*{1,})(\@)

which does find the instances, but this also finds a lot of false positives, e.g.,

"See Chapter A123 and then email your thoughts to thoughts@msn.com"

The code above would find "A123 and then email your thoughts to thoughts@" which isn't a quoted phrase.

Looking for help on how to minimize or eliminate the false positives.

Thanks

Jay Freedman
12-09-2013, 04:03 PM
Hi Greg!
The ASCII 40 is an indication that Word has "protected" a symbol from a non-Unicode font. Klaus Linke, who was a Word MVP about 10 years ago, studied this stuff more than anyone I know. Here are a couple of posts he made on the subject:
https://groups.google.com/forum/#!topic/microsoft.public.word.printingfonts/pPtEuYqrlvQ
https://groups.google.com/forum/?hl=en#!topic/microsoft.public.word.vba.beginners/DUfymg_R_LU

Also the article http://www.word.mvps.org/FAQs/MacrosVBA/FindReplaceSymbols.htm may be useful.

gmaxey
12-09-2013, 06:13 PM
Jay,

Thanks for the reply. I've looked at those and unless I am missing the boat, I don't see a solution there.

Here is a bit of sample text from these troublesome documents: AQuote1@ some other text AQuote2@

This text needs to read "Quote1" some other text "Quote2"

If I select the A or the @ and run ?AscW(Selection.Text) in the immediate window, I get a return of 40 for both of them. This certainly sounds like the issue that Klaus addressed.

However, if I run:


With oRng.Find
    .Text = "^u40"
    .Execute
End With

Neither is found. That code only finds opening parentheses, as I would expect.

I tried Klaus's unprotect symbols code, but nothing was found or modified.

This is what I am doing now. It works, but it is slow in large documents because it has to process all of the valid capital A characters along with the rogue ones :-(


Sub ScratchMacro()
    'A basic Word macro coded by Greg Maxey
    Dim oRng As Word.Range
    Set oRng = ActiveDocument.Range
    With oRng.Find
        .Text = "A"
        .MatchCase = True
        While .Execute
            If oRng.Text = Chr(40) Then
                oRng.Text = Chr(34)
            End If
            oRng.Collapse wdCollapseEnd
        Wend
    End With
    Set oRng = ActiveDocument.Range
    With oRng.Find
        .Text = "@"
        .MatchCase = True
        While .Execute
            If oRng.Text = Chr(40) Then
                oRng.Text = Chr(34)
            End If
            oRng.Collapse wdCollapseEnd
        Wend
    End With
End Sub
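
For anyone adapting the macro above: the two loops differ only in the search text, so they could be folded into one helper. The following is a minimal sketch under that assumption, not a faster algorithm. The names FixRogueQuotes and ReplaceProtected are hypothetical, and the explicit .Wrap and .MatchWildcards settings are additions that the original macro leaves at whatever the last Find used.

```vba
' Sketch only: same test as ScratchMacro (text reads as Chr(40)),
' with the duplicated loop moved into a hypothetical helper.
Sub FixRogueQuotes()
    Application.ScreenUpdating = False
    ReplaceProtected "A"
    ReplaceProtected "@"
    Application.ScreenUpdating = True
End Sub

' ReplaceProtected is a hypothetical name, not part of the Word object model.
Private Sub ReplaceProtected(ByVal strFind As String)
    Dim oRng As Word.Range
    Set oRng = ActiveDocument.Range
    With oRng.Find
        .ClearFormatting
        .Text = strFind
        .MatchCase = True
        .MatchWildcards = False   ' so "@" is treated as a literal character
        .Forward = True
        .Wrap = wdFindStop        ' assumption: stop at the end of the document
        Do While .Execute
            ' Rogue characters report "(" as their text; ordinary ones do not
            If oRng.Text = Chr(40) Then oRng.Text = Chr(34)
            oRng.Collapse wdCollapseEnd
        Loop
    End With
End Sub
```

Turning off screen updating may trim some of the run time, but every ordinary "A" is still visited, so the basic cost described above remains.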

Jay Freedman
12-09-2013, 07:17 PM
Greg, I'd like to poke at this some more. Can you share the document with me? Thx, Jay

gmaxey
12-09-2013, 07:31 PM
Jay,

Some sample text with all the issues is on the way sepcor via e-mail.

macropod
12-09-2013, 07:49 PM
Hi Greg,

If you select the problem characters then use Insert|Symbol, Word should display them and their Unicode values - that's what you need to use for the Find expression (I'm suspecting FF20 (@) & FF21 (A), which you can insert into your Find expression as ChrW(65312) & ChrW(65313) or as ChrW(&HFF20) & ChrW(&HFF21)).

gmaxey
12-09-2013, 08:10 PM
Paul,

No joy. I tried those codes explicitly as well as Klaus Linke's code to find a wide range of characters. The only thing that finds them is the code posted earlier or this variation:


Sub ScratchMacroII()
    'A basic Word macro coded by Greg Maxey
    Dim oRng As Word.Range
    Set oRng = ActiveDocument.Range
    With oRng.Find
        .Text = ChrW(65)
        .MatchCase = True
        While .Execute
            If oRng.Text = Chr(40) Then
                oRng.Text = Chr(34)
            End If
            oRng.Collapse wdCollapseEnd
        Wend
    End With
    Set oRng = ActiveDocument.Range
    With oRng.Find
        .Text = ChrW(64)
        .MatchCase = True
        While .Execute
            If oRng.Text = Chr(40) Then
                oRng.Text = Chr(34)
            End If
            oRng.Collapse wdCollapseEnd
        Wend
    End With
End Sub


It's like they have a Chr value of the symbol they represent on the screen, but an Asc value of 40.

Sample file attached.

Jay Freedman
12-09-2013, 08:27 PM
Paul, in the sample that Greg sent, selecting the character and opening the Insert|Symbol dialog just displays the A and @ characters, as if they were typed from the keyboard, but ?AscW(Selection.Text) does return decimal 40.

Also, the searches for ChrW(&HFF20) and ChrW(&HFF21) don't find the characters. In fact, the following macro finds nothing in the upper Unicode range in the document:


Sub x()
    Dim rg As Range
    Set rg = ActiveDocument.Range
    With rg.Find
        .MatchWildcards = True
        .Text = "[" & ChrW(&HF020) & "-" & ChrW(&HF0FF) & "]"
        While .Execute
            rg.Select
            Stop
        Wend
    End With
End Sub

In other words, this is not behaving like a symbol inserted by Word. The converter has done a nasty bit of work.

macropod
12-09-2013, 10:35 PM
Hi Greg,

You might be able to achieve your goal with the following interactive code:

Sub Demo()
    Dim StrTmp As String, Rslt
    With ActiveDocument.Range
        With .Find
            .ClearFormatting
            .Replacement.ClearFormatting
            .Text = "<" & ChrW(65) & "[!" & ChrW(65) & "]{1,}"
            .Replacement.Text = ""
            .Forward = True
            .Wrap = wdFindStop
            .Format = False
            .MatchWildcards = True
            .Execute
        End With
        Do While .Find.Found
            StrTmp = .Text
            StrTmp = Left(StrTmp, InStrRev(StrTmp, "("))
            .End = .Start + InStrRev(StrTmp, "(")
            .Select
            Do While Rslt <> vbCancel
                Rslt = MsgBox("Extend (Yes), Contract (No) or Process (Cancel)?", vbYesNoCancel)
                If Rslt = vbYes Then
                    With Selection
                        .MoveEndUntil cset:="(", Count:=wdForward
                        .End = .End + 1
                    End With
                End If
                If Rslt = vbNo Then
                    With Selection
                        .End = .End - 1
                        .MoveEndUntil cset:="(", Count:=wdBackward
                    End With
                End If
            Loop
            Rslt = vbNullString
            .End = Selection.End
            .Characters.First = Chr(34)
            .Characters.Last = Chr(34)
            .Collapse wdCollapseEnd
            .Find.Execute
        Loop
    End With
    Application.ScreenUpdating = True
End Sub
The code executes a Find, then runs a message box soliciting a user response. Unfortunately, with only three message box options, it's a bit limiting & counterintuitive.

Frosty
12-11-2013, 09:20 AM
Just jumping in a little... But I'm assuming there is no font name information. In years past, some conversion process from word perfect would leave a different font applied to just that character (WP Typographical Symbols, or something).

Quick stab in the dark...

Frosty
12-11-2013, 09:35 AM
Greg,
Shoot me the sample document too. Although I'm curious: is the original in .doc, .docx, or .docx while maintaining compatibility with previous versions?

It may be that some of the information which was retained in previous formats has simply been discarded at some point.

Edit: apparently can't remove my signature, which isn't really needed in this thread ;)

gmaxey
12-12-2013, 06:06 PM
Jason

Thanks for your reply. Document is on the way via e-mail. You are correct: Selection.Font.Name does return WP TypographicSymbols. However, attempting to find just "A" where the font name = WP TypographicSymbols comes up with nothing.
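
A small diagnostic along these lines can make the check repeatable. It is only a sketch: ReportSelectedCharacter is a hypothetical name, and it simply prints the values discussed in this thread (the text, the AscW code, and the font name) for whatever single character is selected. The Range-level font name is printed as well, purely for comparison.

```vba
' Diagnostic sketch: select exactly one character, then run this.
' Output goes to the Immediate window (Ctrl+G in the VBA editor).
Sub ReportSelectedCharacter()
    If Len(Selection.Text) <> 1 Then
        MsgBox "Select exactly one character first."
        Exit Sub
    End If
    Debug.Print "Text: "; Selection.Text
    Debug.Print "AscW: "; AscW(Selection.Text)
    Debug.Print "Selection.Font.Name: "; Selection.Font.Name
    Debug.Print "Selection.Range.Font.Name: "; Selection.Range.Font.Name
End Sub
```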

Frosty
12-13-2013, 12:05 PM
I played around with trying font substitution, saving the document into different formats, and a number of other things. Despite being able to select a known bad character and getting Selection.Font.Name = "WP TypographicSymbols" and Selection.Range.Font.Name = "Times New Roman", I can't find a way to use the .Find object to successfully find these characters. That means I can't think of a way to optimize what would be a lengthy search through a long document. Obviously you can reduce the processing by hiding the entire Word process while performing the selection.

It's not hard to build a collection of ranges which have that particular condition, but it is heavy processing time, and it seems like you already have a workable process, so I don't know that it would be worth it to reinvent the wheel.
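
As a rough illustration of collecting such ranges, here is a sketch that stores each rogue character's range instead of replacing it. It is an assumption-laden example: CollectRogueRanges and ListRogueRanges are hypothetical names, and the test used is the Chr(40) check from Greg's macro rather than a font comparison.

```vba
' CollectRogueRanges is a hypothetical helper, not a Word API.
' It returns a Collection of Range objects whose text reads as "(".
Function CollectRogueRanges(ByVal strFind As String) As Collection
    Dim colHits As New Collection
    Dim oRng As Word.Range
    Set oRng = ActiveDocument.Range
    With oRng.Find
        .ClearFormatting
        .Text = strFind
        .MatchCase = True
        .MatchWildcards = False
        .Forward = True
        .Wrap = wdFindStop
        Do While .Execute
            ' Duplicate gives an independent Range that survives the Collapse
            If oRng.Text = Chr(40) Then colHits.Add oRng.Duplicate
            oRng.Collapse wdCollapseEnd
        Loop
    End With
    Set CollectRogueRanges = colHits
End Function

Sub ListRogueRanges()
    Dim colA As Collection, colAt As Collection
    Set colA = CollectRogueRanges("A")
    Set colAt = CollectRogueRanges("@")
    Debug.Print "Rogue A count: "; colA.Count
    Debug.Print "Rogue @ count: "; colAt.Count
End Sub
```

The stored ranges could then be reviewed or replaced in a second pass. The scan itself is no cheaper than the existing macro.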

I still think (as I said in an email) that something has been lost in the translation, and that if you had the original source, you could probably extract some additional info to help the process. But you basically have a translator for specific characters, and you have to analyze the false positives. I don't know what can really be done to improve upon that when the .Find object behaves so poorly on this sample.

Frosty
12-13-2013, 12:19 PM
Oh, and this is one other brainstorm -- if you look at the .xml in IE... and do a search on WP TypographicSymbols, you'll see your characters as well as specific character codes. So *technically*, you could rip out the xml and replace it... doing finds on the font in the xml, then changing the character codes... but again, if you have a working process... this is just a different flavor.
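
To get a first look at that XML without leaving Word, something like the following sketch might help. It assumes the Range.WordOpenXML property (Word 2007 and later) exposes the same font references. That output is a flat package that also includes style and font-table parts, so some hits may not be character runs. ShowSymbolFontXml and the 20-snippet limit are illustrative choices only.

```vba
' Sketch: print a little XML context around each mention of the font name.
Sub ShowSymbolFontXml()
    Const FONT_NAME As String = "WP TypographicSymbols"
    Const MAX_SNIPPETS As Long = 20   ' arbitrary limit to keep the output short
    Dim strXml As String
    Dim lngPos As Long, lngStart As Long, lngCount As Long

    strXml = ActiveDocument.Content.WordOpenXML
    lngPos = InStr(1, strXml, FONT_NAME, vbTextCompare)
    Do While lngPos > 0
        lngCount = lngCount + 1
        If lngCount <= MAX_SNIPPETS Then
            lngStart = lngPos - 60
            If lngStart < 1 Then lngStart = 1
            ' Show roughly 60 characters before and 100 after the match start
            Debug.Print Mid$(strXml, lngStart, 160)
        End If
        lngPos = InStr(lngPos + Len(FONT_NAME), strXml, FONT_NAME, vbTextCompare)
    Loop
    Debug.Print "Occurrences: "; lngCount
End Sub
```

This only inspects the markup. Rewriting the character codes and loading the XML back into the document is a separate step that is not sketched here.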