Solved: Converting symbols to numbers [Archive]


defcon_3

03-19-2012, 08:11 PM

Hi everyone,

I'm having a hard time working out the right algorithm for my problem. I have a small project whose task is to separate authors from their affiliations. I have already managed that; what is giving me a hard time is matching each author to their affiliations using the indicator, and the indicator is not always the same. Sample:

Chun Hay Ko†‡, Wing Sum Siu†‡§, Hing Lok WongII, Wai Ting Shum†‡, Kwok Pui Fung†‡,II Clara Bik San Lau†‡, and Ping Chung Leung*†‡§
†Institute of Chinese Medicine, ‡State Key Laboratory of Phytochemistry and Plant Resources in West China§Department of Orthopaedics and Traumatology, Jockey Club Centre for Osteoporosis Care and Control and IISchool of Biomedical Sciences, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong SAR, China

The special characters (†, ‡, §, II) are the indicators I'm talking about. The expected output should look like this:

Source:

Chun Hay^Ko†‡
Wing Sum^Siu†‡§
Hing Lok^WongII
Wai Ting^Shum†‡
Kwok Pui^Fung†‡II
Clara Bik San^Lau†‡
Ping Chung^Leung†‡§

†Institute of Chinese Medicine,
‡State Key Laboratory of Phytochemistry and Plant Resources in West China
§Department of Orthopaedics and Traumatology, Jockey Club Centre for Osteoporosis Care and Control
IISchool of Biomedical Sciences, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong SAR,^China

Expected Output:

Chun Hay^Ko12
Wing Sum^Siu123
Hing Lok^Wong4
Wai Ting^Shum12
Kwok Pui^Fung124
Clara Bik San^Lau12
Ping Chung^Leung123

1Institute of Chinese Medicine,
2State Key Laboratory of Phytochemistry and Plant Resources in West China
3Department of Orthopaedics and Traumatology, Jockey Club Centre for Osteoporosis Care and Control
4School of Biomedical Sciences, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong SAR,^China

Note: The indicators vary from source to source. The solution I have in mind is to convert the indicator string after each author to numbers, but how should the matching numbers then be applied to the affiliations? Any suggestions?

Thank you.

macropod

03-19-2012, 09:06 PM

If there can be more than 9 affiliations, you'll need a different approach; either leading 0s or some form of affiliation separator (eg ',' or '-').

In any event, given what you've described, it looks like a simple Find/Replace operation. For example:
Find = †
Replace = 1
Find = ‡
Replace = 2
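
A minimal sketch of the two options mentioned above for more than 9 affiliations (leading zeros or a separator). The helper names PaddedTag and SeparatedTag are hypothetical and are not part of any macro posted in this thread:

```vba
' Hypothetical helpers, for illustration only.
Function PaddedTag(ByVal n As Long) As String
    ' Fixed two-digit width: 3 -> "03", 12 -> "12"
    PaddedTag = Format(n, "00")
End Function

Function SeparatedTag(ByVal n As Long, ByVal isFirst As Boolean) As String
    ' Comma-separated tags: the first tag is "3", later tags are ",3"
    If isFirst Then
        SeparatedTag = CStr(n)
    Else
        SeparatedTag = "," & CStr(n)
    End If
End Function
```

With either form, a string such as 0412 or 4,12 can be read back unambiguously, which plain 412 cannot.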

defcon_3

03-19-2012, 09:13 PM

Yes, I already thought about that, but it only works for authors that use those particular indicators. The problem is that the indicators are not always the same; they depend on the source.

I will try another workaround and post back the result.

Thanks.

macropod

03-19-2012, 09:35 PM

If your symbols were all from the upper ASCII range (character codes 128 to 255), it'd be a simple matter to use a loop with a Find/Replace. However, with character pairs like 'II', you're bound to run into problems. For example, how would you process 'Robert Smith II†‡§', where the II is a generation indicator and not an affiliation?

defcon_3

03-19-2012, 10:01 PM

That is another problem I have to address. Can you show me the Find/Replace loop? Thanks.

macropod

03-19-2012, 10:19 PM

A Find/Replace loop of the kind described might look like:
Sub ApplyNumbers()
    Application.ScreenUpdating = False
    Dim i As Long, j As Long
    ' Try every character code in the upper range
    For i = 128 To 255
        With ActiveDocument.Range
            With .Find
                .ClearFormatting
                .Format = False
                .MatchWholeWord = False
                .Wrap = wdFindContinue
                .Text = Chr(i)
                .Execute
            End With
            ' Each symbol actually present in the document gets the next number
            If .Find.Found Then j = j + 1
            With .Find
                .Replacement.Text = j
                .Execute Replace:=wdReplaceAll
            End With
        End With
    Next
    Application.ScreenUpdating = True
End Sub

defcon_3

03-19-2012, 10:40 PM

OK, thank you. I'll test it with different scenarios.

defcon_3

03-19-2012, 11:07 PM

To be accurate, I guess I have to search for all the extended ASCII characters. The only problem for me is when the indicator is made of letters, because it will be difficult to detect, especially if there is no space between the author and the indicator, as in Wong Fei Hongabc, where Wong Fei Hong is the author and abc is the indicator.

macropod

03-20-2012, 12:36 AM

That's really a data problem, more than a code problem, because your data lacks adequate differentiation between the names and tags. There's little that code can do to address the potential breadth of issues involved.

defcon_3

03-25-2012, 06:31 PM

macropod,

I have some changes to the format produced by the ApplyNumbers() macro above.

How can I get a result like this one?

Chun Hay Ko%1,2,3

The first number should be prefixed with a percent sign (%1), and each subsequent number should be prefixed only with a comma (,2,3,4).

macropod

03-25-2012, 06:37 PM

You could try something like:
Sub ApplyNumbers()
    Application.ScreenUpdating = False
    Dim i As Long, j As Long
    For i = 128 To 255
        With ActiveDocument.Range
            With .Find
                .ClearFormatting
                .Format = False
                .MatchWholeWord = False
                .Wrap = wdFindContinue
                .Text = Chr(i)
                .Execute
            End With
            If .Find.Found Then j = j + 1
            With .Find
                ' The first number gets a percent sign; later numbers get a comma
                If j = 1 Then
                    .Replacement.Text = "%" & j
                Else
                    .Replacement.Text = "," & j
                End If
                .Execute Replace:=wdReplaceAll
            End With
        End With
    Next
    Application.ScreenUpdating = True
End Sub

defcon_3

03-25-2012, 06:43 PM

That was fast. Thanks a lot. I was thinking of adding a new variable; I never thought it would be a simple If statement. :) Thanks again, macropod.

Another thing came to mind. What if the indicator is already a number? Do you think we can handle that as well, to get the same result, like this:

Instance A
Chun Hay Ko†‡§ - done
Instance B
Chun Hay Ko123 - new condition
Instance C
Chun Hay Koabc - not possible

Result
Chun Hay Ko%1,2,3

macropod

03-25-2012, 07:24 PM

Given that the macro only operates on the upper ASCII characters (codes 128 to 255), it will ignore any numbers, letters etc. in your data.

As I said before, you have some basic data issues to resolve. Using undifferentiated letters and numbers as tags is a bad idea. With 'Chun Hay Ko123', for example, how would the macro be able to determine whether that's a simple 123 or, perhaps, 1,2,3 or 1,23, etc?

defcon_3

03-25-2012, 07:46 PM

Hmm, that's still a problem. I got the data from books that were OCR'ed. In the source those tags were superscript, but when the text was extracted the result was something like Chun Hay Koabc or Chun Hay Ko123. I wonder if IsNumeric and IsAlpha also work in VBA? By the way, thanks again.

fumei

03-25-2012, 08:35 PM

IsNumeric works. However, I agree with macropod, there is a data issue here. Knowing it comes from OCR is significant. There is not an OCR system I have ever seen that is 100%. I think you can probably get close, but I seriously doubt you (or anyone else) can come up with anything that you can count on 100%.
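
For reference, VBA has a built-in IsNumeric but no built-in IsAlpha. A minimal sketch of a stand-in follows, assuming a hypothetical function named IsAlpha, plain unaccented Latin letters only, and the default Option Compare Binary:

```vba
' Hypothetical stand-in for the missing IsAlpha; the name is an assumption.
Function IsAlpha(ByVal s As String) As Boolean
    Dim k As Long
    If Len(s) = 0 Then Exit Function      ' empty string -> False
    For k = 1 To Len(s)
        ' Reject the string as soon as one character is not an unaccented letter
        If Not (Mid$(s, k, 1) Like "[A-Za-z]") Then Exit Function
    Next k
    IsAlpha = True
End Function

Sub DemoChecks()
    Debug.Print IsNumeric("123")   ' True
    Debug.Print IsNumeric("abc")   ' False
    Debug.Print IsAlpha("abc")     ' True
    Debug.Print IsAlpha("a1")      ' False
End Sub
```

Neither check can tell where a name ends and a tag begins in a string like Chun Hay Koabc, so the data issue described above remains.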

defcon_3

03-25-2012, 09:44 PM

I see. Thanks for the very informative feedback. Thanks, guys.

Talis

03-25-2012, 10:07 PM

Why don't posters state what the background of their project is right at the beginning?
Here we have someone who has obtained OCR'ed manuscripts and the problem is with superscripted characters which have been converted to plain text which means that, as you would expect, there is no space between them and the preceding word.
Furthermore, the superscripts do not follow any discernible pattern - as was gradually revealed by the poster.
How on earth you mentors keep your cool is beyond me!

defcon_3

03-25-2012, 10:48 PM

It was stated, Talis, that the indicator is not always the same. My project first dealt with symbols, which was the lead of the query, and then came the what-ifs; that's why the scope expanded. If it is a bad idea to post the main point directly and then add conditions once it has been answered, then sorry about that. The problem here is the type of data, not where I got the data. I do appreciate all of the help and I learn from it. Thanks a lot as well :)

fumei

03-25-2012, 10:56 PM

LOL. As I stated, knowing the source is from OCR is significant, and yes, significant information would best be offered at the start. However as you are well aware, sometimes information is not offered, or even answered when we ask questions.

In this particular case, it is marked as Solved. My guess is that it mostly is. Not totally, but if the OP is happy then that is good.

Yes, it sure would be nice if posters gave a good start, including relevant background. Unfortunately, posters often do not realize what they really want; asking for things that they THINK they want - and this is sometimes contrary to what they actually NEED. Good posters can learn though. Just as we all try to learn.

It is the ones that resist the idea they can possibly be incorrect, these are the ones we lose our cool over...sometimes. Obviously we should NOT lose our cool, and I think - for the most part - we don't.

defcon_3

03-25-2012, 11:01 PM

Thanks fumei for the kind words. Lesson learned :)
And yes, it's already solved, as the answer satisfies my query. Only certain new conditions were not resolved, due to serious data problems, but the main query was solved. Thanks again.

macropod

03-25-2012, 11:20 PM

Hi defcon,

It could have made a huge difference to the processing if you'd mentioned the superscripting. For the VBA processing it doesn't really matter how you get your data but, if your OCR process preserves superscripting (and some do), the idea would be to focus on superscripted characters rather than worrying about whether they're high ASCII, low ASCII, numeric, alpha, etc., and to match those strings up with the (endnote/footnote?) paragraphs beginning with the same characters.
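
If superscript formatting does survive the OCR step, a search on formatting rather than on character codes might look like the following. This is a rough sketch and was not posted in the thread: the macro name ListSuperscriptRuns is made up, and it only prints each superscripted run to the Immediate window without matching it to an affiliation paragraph.

```vba
' Hypothetical sketch: list every run of superscripted text in the document.
Sub ListSuperscriptRuns()
    Dim rng As Range
    Set rng = ActiveDocument.Range
    With rng.Find
        .ClearFormatting
        .Font.Superscript = True   ' search on formatting, not on characters
        .Text = ""
        .Format = True
        .Forward = True
        .Wrap = wdFindStop
    End With
    Do While rng.Find.Execute
        Debug.Print rng.Text
        ' Guard against looping forever on a run at the very end of the document
        If rng.End >= ActiveDocument.Content.End - 1 Then Exit Do
        rng.Collapse wdCollapseEnd
    Loop
End Sub
```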

defcon_3

03-25-2012, 11:26 PM

Thanks Paul, I get the point. I thought it would not be necessary, as I based my post on the output data from the OCR; that's why I didn't mention it at first. And sadly, yes, the superscripts were not preserved. I guess I have to look for a good alternative application that preserves superscripts so that the macro can handle different scenarios.

Thanks Guys :)

defcon_3

03-26-2012, 11:20 PM

Paul, I think I have overlooked some data.

Using this:

If j = 1 Then
.Replacement.Text = "%" & j
Else
.Replacement.Text = "," & j
End If

on my data, I ran into a problem.

Like this example

Antonietta^Baiano†‡

Carmela^Terracone‡

The result is

Antonietta^Baiano%1,2

Carmela^Terracone,2

The expected output should be

Antonietta^Baiano%1,2

Carmela^Terracone%2

I just realized that we cannot put a % only in front of the number 1. The first number after the author name should always start with a %.

Thanks.

macropod

03-26-2012, 11:29 PM

Try:
Sub ApplyNumbers()
    Application.ScreenUpdating = False
    Dim i As Long, j As Long
    For i = 128 To 255
        With ActiveDocument.Range
            With .Find
                .ClearFormatting
                .Format = False
                .MatchWholeWord = False
                .Wrap = wdFindContinue
                .Text = Chr(i)
                .Execute
            End With
            If .Find.Found Then j = j + 1
            ' Replace occurrences one at a time so the prefix can depend on context
            While .Find.Found
                ' A letter just before the symbol means this is the first tag after a name
                If .Duplicate.Characters.First.Previous Like "[A-Za-z]" Then
                    .Duplicate.Text = "%" & j
                Else
                    .Duplicate.Text = "," & j
                End If
                .Collapse wdCollapseEnd
                .Find.Execute
            Wend
        End With
    Next
    Application.ScreenUpdating = True
End Sub

defcon_3

03-26-2012, 11:32 PM

Oh, that's a bit advanced for me to follow. Thank you, it works.
I'll be testing it with different scenarios.

Edited: Tested on different scenarios and it indeed worked.

What exactly does this mean?
"[A-Za-z]"

Thanks so much Paul.
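
For readers with the same question: in VBA, "[A-Za-z]" is a pattern for the Like operator that matches exactly one character in the range A to Z or a to z (under the default Option Compare Binary). The macro uses it to check whether the character just before a symbol is a letter, that is, the end of an author's name. If so, the number gets a % prefix; otherwise it gets a comma. A minimal sketch of that decision on plain strings, using a hypothetical helper named PrefixFor that is not part of the macro:

```vba
' Hypothetical helper mirroring the macro's test; for illustration only.
Function PrefixFor(ByVal previousChar As String) As String
    If previousChar Like "[A-Za-z]" Then
        PrefixFor = "%"    ' symbol follows a letter: first tag after a name
    Else
        PrefixFor = ","    ' symbol follows a digit or any other character
    End If
End Function

Sub DemoPrefix()
    Debug.Print PrefixFor("e")   ' %
    Debug.Print PrefixFor("1")   ' ,
End Sub
```

This is why Carmela^Terracone‡ becomes Carmela^Terracone%2: the character before ‡ is the letter e. In Antonietta^Baiano†‡ the † has already been replaced by %1 when ‡ is processed, so the character before ‡ is the digit 1 and the result is ,2.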