PDA

View Full Version : A (fiendly) Challenge to brettdj



Bob Phillips
05-25-2005, 04:34 PM
In the few weeks that I have visiting this forum, I have seen a number of responses from brettdj advocating the use of Regular Expressions.I have a problem with RegExp, so I thought I would throw it out as a challenge to see if anyone can take what I have done to its conclusion. I have addressed this post at brettdj, but I would love to hear suggestions from anyone. What's in it for you? Okay, a virtual bottle of Lagavulin to the winner (virtual means that, you win it, but as I can't deliver, I will drink it for you - I will let you know what it tastes like though :) ).

As a quick aside, I tend not to advocate the use of RegExp for problem solving. Not because they are not good, they are great, and are the ideal tool for most (all?) text pasring problems. I don't advocate them on the basis of future maintainability. IMO, RegExp are not intuitive and not something that you use every day. RegExp to me seem somthing for the real obscure language function geek, or the super brains like Harlan Grove, not for us lesser mortals. Couple that with the fact that they are not like the high-level languages that we mainly use nowadays, this means that when you want to change a RegExp, finding the skills is difficult, so we either pay big bucks for a RegExp geek, or we do it another way. I know which way my vote goes.

Anyway I digress, and onto the problem. I am sure you have come across as many questions as I have regarding splitting a name pair (surname & forename, or vices versa) into separate components. This has always seemed an ideal candidate for RegExp to me (although I never offer it in my responses to such questions for the reasons stated above), but when I was buiding my sort add-in I decided to use RegExp in the function to sort by name. Obviously I need to cater for all varities of name, such as
- Mr Alan Jones
- Ian St John
- Peter McDougall
- Peter MacDougall
- Baron von Richtofen
etc. etc.

The problem is, with my RegExp skills (or lack of such), I couldn't get a RegExp to do it all. I had to add some code to pre-process the data before I passed it to my RegExp routine. The things I couldn't manage were the Mc, Mac issues (they need to be processed as the first M names, and St, also as the first S, and ensuring that the Fornename and Surname are capitalized (so that Jim jones doesn't get sorted after Alan Minter), and treating von and Von, van and Van as the same.

This is the RegExp parser, nothing unusal here


'---------------------------------------------------------------------
Private Function RESubString(Inp As String, _
Pattern As String, _
Optional N As Long = 0) As String
'---------------------------------------------------------------------
Dim oRegExp As Object, m As Object
On Error GoTo RE_error
Set oRegExp = CreateObject("VBScript.RegExp")
oRegExp.Pattern = Pattern
oRegExp.Global = True
Set m = oRegExp.Execute(Inp)
RESubString = IIf(m.Count > 0, m(N).Value, "")
GoTo RE_Exit
RE_error:
RESubString = "RE Error"
RE_Exit:
Set oRegExp = Nothing
On Error GoTo 0
End Function


This is the original calling routine which sets up the parsing string (you can see one of my other attempts)


'---------------------------------------------------------------------
Private Function LastName(nme As String)
'---------------------------------------------------------------------
' Function: Extracts the last name from a names string
' Synopsis: Calls 2 functions:-
' ReFornmat - frigs the data for St, Mc and Mac, and
' capitalizesc all names
' RESubString - runs a regular expression to return the
' lats name
'---------------------------------------------------------------------
Dim sREgExp As String

' sREgExp = "\b([a-z]+ +)*(O'|Mc|Mac)?[A-Z](\w+\S?)*(-[A-Z](\w+\S?)*)?\b(?=(( +)(Sr\.?|Jr\.?|[IVX][IVX]*))|,|\s*$)"
sREgExp = "\b([a-z]+\s+)*[A-Z](\w+\S?)*([-'][A-Z](\w+\S?)*)?\b(?=(\s+([JS]r\.?|[IVX]+))?\s*$|,)"
LastName = RESubString(ReFormat(nme), sREgExp)

End Function


And these are all my pre-processing routines which I wrote to make up for the deficiencies of my RegExp parsing expression.


Private Function ReFormat(Name As String)

ReFormat = Capitalize(AdjustNamePreps(Name))

End Function


'---------------------------------------------------------------------------
Public Function Capitalize(Name As String)
'---------------------------------------------------------------------------
Dim aParts
Dim i As Long

aParts = pzSplit(LCase(RemoveMultipleSpaces(Name)), " ")
For i = LBound(aParts, 1) To UBound(aParts, 1)
aParts(i) = UCase(Left(aParts(i), 1)) & Right(aParts(i), Len(aParts(i)) - 1)
Next i
Capitalize = pzJoin(aParts, " ")

End Function


'---------------------------------------------------------------------------
Public Function AdjustNamePreps(Name As String)
'---------------------------------------------------------------------------
Dim i As Long, j As Long

For i = 1 To Len(Name)
If Mid(Name, i, 1) = "S" Then
If Mid(Name, i + 1, 1) = "t" And Mid(Name, i + 2, 1) = " " Then
Name = Left(Name, i) & Chr(65) & _
Mid(Name, i + 1, 1) & Mid(Name, i + 3, 255)
Exit For
End If
End If
Next i

For i = 1 To Len(Name)
If Mid(Name, i, 1) = "M" Then
If Mid(Name, i + 1, 1) = "c" Then
Name = Left(Name, i) & Chr(65) & Mid(Name, i + 1, 255)
Exit For
ElseIf Mid(Name, i + 1, 1) = "a" And Mid(Name, i + 2, 1) = "c" Then
Name = Left(Name, i) & Chr(65) & Mid(Name, i + 1, 255)
Exit For
End If
End If
Next i

AdjustNamePreps = Name

End Function

BlueCactus
05-25-2005, 05:29 PM
Nice one xld! I've never investigated RegExp, and I have to congratulate you on giving me a one-line reason for my aversions:


sREgExp = "\b([a-z]+\s+)*[A-Z](\w+\S?)*([-'][A-Z](\w+\S?)*)?\b(?=(\s+([JS]r\.?|[IVX]+))?\s*$|,)"

:rofl: I wish you luck in your quest!

brettdj
05-25-2005, 05:37 PM
. In the few weeks that I have visiting this forum, I have seen a number of responses from brettdj advocating the use of Regular ExpressionsYou must have come accross some of my older posts as I haven't been hugely active lately. But yes, I am RegExp fan.

. I had to add some code to pre-process the data before I passed it to my RegExp routine. The things I couldn't manage were the Mc, Mac issues (they need to be processed as the first M names, and St, also as the first S, and ensuring that the Fornename and Surname are capitalized (so that Jim jones doesn't get sorted after Alan Minter), and treating von and Von, van and Van as the sameYep, RegExp is fantastic when parsing/seraching a particular patterm

When the rules have a number of exceptions or checks then its a choice of using multiple RegExps and/or doing "pre or post processing" as you did above.

For example, when I wrote code to change cell references from absolute to relative or vice versa, the patten I used for the cell address was
(\$?)([A-Z]{1,2})(\$?)(\d{1,5})
The code then has a step to check that any five digit number is less than 65537, and that the next character is not numeric, ie that the numeric portion is a valid cell reference

A longer pattern could have done the job with the RegExp alone, with an additional check that the next character neccesitates a word break
(\$?)([A-Z]{1,2})(\$?)(\d{1,4}|[1-5]\d{4}|6[0-4]\d{3}|65[0-4]\d{2}|655[0-2]\d{1}|6553[0-6])\b

But I figured it was more transparent to work with a shorter pattern and then post process in the code.

When I've parsed street addresses I've had to do similar pre-work to catch the "inconsistencies" in how the data may be presented.

So its a choice between RegEx world and "normal" coding, ie in your code below you could use a RegExp Replace to change your case from lower to upper case rather than Ucase.

One of VBscripts problems is that is doesn't have the lookahead or lookbehind capability of RegExp, this adds to the prework needed.

Anyhow ......... is there more info on your sort program?

I'll have a crack at this with as much RegExp as possible but I don' tthink it will be pretty.

Cheers

Dave

brettdj
05-25-2005, 10:35 PM
how about a series of RegExps, ie


Option Explicit
Function RevisedLastName(Source As String)
Dim Regex As Object, UpCase As String, StrMan As String
Set Regex = CreateObject("vbscript.regexp")
'remove all "."
Source = Replace(Source, ".", "")
'append Saint or St as SAaint or SAt
With Regex
.ignorecase = True
.Global = False
.Pattern = ".+\b(S)(aint|t|)\b\s+(\b[a-z]+$)"
Source = Regex.Replace(Source, "$1A$2$3")
End With
' Replace Mac or Mc at the start of the last word if the next letter is capitalised
' Note that this will miss lower case Mac's but there is no foolprooh test
' ie Macarthur will not be changed, MacMillan will be
With Regex
.ignorecase = False
.Pattern = ".+\b(M)(ac|c)([A-Z]{1}[a-z]+$)"
Source = Regex.Replace(Source, "$1A$2$3")
End With
'Capitalise string properly for last word.
' Note, can't use ApplicationProper in case name is like MacDougall.
With Regex
.ignorecase = False
.Pattern = "(^.+\b)([a-z])([a-z]*$)"
UpCase = UCase(Regex.Replace(Source, "$2"))
Source = .Replace(Source, "$1" & UpCase & "$3")
End With
'return last names
With Regex
.ignorecase = True
.Global = False
.Pattern = "(^.+)(V(o|a)n)(\b\s+\b[a-z]+)$"
'Check for von or van
If .Test(Source) = True Then
StrMan = .Replace(Source, "$2$4")
'Ensure that Von or Van is also capitalised
RevisedLastName = UCase(Left(StrMan, 1)) & Right(StrMan, Len(StrMan) - 1)
Else
'No von or van so just return last name
.Pattern = "^.+\b\s+(\b[a-z]+)$"
RevisedLastName = .Replace(Source, "$1")
End If
End With
Set Regex = Nothing
End Function

Bob Phillips
05-26-2005, 01:17 AM
You must have come accross some of my older posts as I haven't been hugely active lately. But yes, I am RegExp fan.
Yep, RegExp is fantastic when parsing/seraching a particular patterm

Ir wasn't necessarily solutions in RegExp that I referred to, but comments made. remember, I have only been on this board a month.


But I figured it was more transparent to work with a shorter pattern and then post process in the code.

Take you point, but it would still be nice to see a complete RegExp solution, if only for the satisfaction of doing so. In many ways this is an intellectual exercise, as my code does what I want (within the limits of my testing thus far), but that doesn't mean I like it.


One of VBscripts problems is that is doesn't have the lookahead or lookbehind capability of RegExp, this adds to the prework needed.

Don't quote me, check it yourself, but I think that VBScript 5.5 introduced look-ahead!


I'll have a crack at this with as much RegExp as possible but I don' tthink it will be pretty.

That will be fine, no RegExp is pretty IMO.

brettdj
05-26-2005, 02:50 AM
Have you tried the UDF above?


Don't quote me, check it yourself, but I think that VBScript 5.5 introduced look-ahead!Unfortunately not

From http://aspnet.4guysfromrolla.com/articles/022603-1.aspx (http://aspnet.4guysfromrolla.com/articles/022603-1.aspx)

Lookaround is a feature that is partially implemented in JScript but not in VBScript. There are two directions of lookaround - lookahead and lookbehind - and two flavors of each direction - positive assertion and negative assertion. The syntax for each is:

(?=...) - Positive lookAHEAD
(?!...) - Negative lookAHEAD
(?<=...) - Positive lookBEHIND
(?<!...) - Negative lookBEHIND

Cheers

Dave

Bob Phillips
05-26-2005, 04:44 PM
Dave,

Finally created some test data for it (my data was more of a demo and didn't cover enough conditions).

So far it looks good. Major fault with it at present is that it doesn't handle Jr and Snr correctly, so Bill Oakes Jr somes out as Jr. I think it should parse Jr, Jr., Jr, Junior, with or without capitals as Jr. Similalrly for Sr, Sr., Snr, Snr., and Senior to Snr, but of course retaing the surname.

I'll keep on testing some more.

Cheers Bob

brettdj
05-26-2005, 05:11 PM
Bob,

So it needs a search for the Jr or Sr variations as the last word, and if this word exists it should be appended to the last name(s) as Jr. or Snr. ??

Cheers

Dave

Bob Phillips
05-26-2005, 05:16 PM
So it needs a search for the Jr or Sr variations as the last word, and if this word exists it should be appended to the last name(s) as Jr. or Snr. ??

Yeah, you've got it.

brettdj
05-26-2005, 05:44 PM
how about


Option Explicit
Function RevisedLastName(Source As String)
Dim Regex As Object
Dim AmericanName As String
Dim UpCase As String, StrMan As String
Set Regex = CreateObject("vbscript.regexp")
'remove all "."
Source = Replace(Source, ".", "")
'Find and remove for now any variation of Senior or Junior at the end of the string
'Store this value in AmericanName
With Regex
.ignorecase = True
.Global = False
.Pattern = "(.+\b)\s+(Senio|S|Junio|J)(r)$"
If Regex.test(Source) = True Then
AmericanName = " " & Regex.Replace(Source, "$2$3") & "."
Source = Regex.Replace(Source, "$1")
End If
End With
'append Saint or St as SAaint or SAt
With Regex
.ignorecase = True
.Pattern = ".+\b(S)(aint|t|)\b\s+(\b[a-z]+$)"
Source = Regex.Replace(Source, "$1A$2$3")
End With
' Replace Mac or Mc at the start of the last word if the next letter is capitalised
' Note that this will miss lower case Mac's but there is no foolproof test
' ie Macarthur will not be changed, MacMillan will be
With Regex
.ignorecase = False
.Pattern = ".+\b(M)(ac|c)([A-Z]{1}[a-z]+$)"
Source = Regex.Replace(Source, "$1A$2$3")
End With
'Capitalise string properly for last word.
' Note, can't use ApplicationProper in case name is like MacDougall.
With Regex
.ignorecase = False
.Pattern = "(^.+\b)([a-z])([a-z]*$)"
UpCase = UCase(Regex.Replace(Source, "$2"))
Source = .Replace(Source, "$1" & UpCase & "$3")
End With
'return last names
With Regex
.ignorecase = True
.Global = False
.Pattern = "(^.+)?(V(o|a)n)(\b\s+\b[a-z]+)$"
'Check for von or van
If .test(Source) = True Then
StrMan = .Replace(Source, "$2$4")
'Ensure that Von or Van is also capitalised
RevisedLastName = UCase(Left(StrMan, 1)) & Right(StrMan, Len(StrMan) - 1) & AmericanName
Else
'No von or van so just return last name
.Pattern = "^.+\b\s+(\b[a-z]+)$"
RevisedLastName = .Replace(Source, "$1") & AmericanName
End If
End With
Set Regex = Nothing
End Function

brettdj
05-26-2005, 06:01 PM
Change this line from
AmericanName = " " & Regex.Replace(Source, "$2$3") & "."
to
AmericanName = " " & UCase(Left(Regex.Replace(Source, "$2"), 1)) & "r."

to force the Jr. rather than Junior.

Cheers

Dave

brettdj
05-30-2005, 07:02 PM
Do I get the Lagavulin?

MOS MASTER
06-28-2005, 02:24 PM
Wow missed this topic somewhere I think. :yes

Excellent material looks nice Dave. I do hope you make it a kb entry? (Or Bob's sollution if he wins...or better yet both)

Now for the contest who won Bob? Better yet who's drinking "Lagavulin" (If I'd only knew what that was...) :rofl:

mdmackillop
06-28-2005, 04:11 PM
sl?inte (slan?tchuh): A drinking toast

Bob Phillips
06-28-2005, 06:01 PM
Excellent material looks nice Dave. I do hope you make it a kb entry? (Or Bob's sollution if he wins...or better yet both)

In reality it is being won at the moment because I was looking for a single regular expresssion, and we currently have a series. Also there are still a few anomalies that I need to get back to Dave on, but I haven't yet gotten around to. SO I still owe Dave a response.


Better yet who's drinking "Lagavulin" (If I'd only knew what that was...) :rofl:

Well I am for sure! As for not knowing what Lagavulin is Joost, don't they educate you in your country? Lagavulin is one of the finest examples of God's nectar, the real ambrosia of the Gods, the water of life, the reason for living ... single malt Scotch Whisky... phew, I feel better now!

MOS MASTER
06-29-2005, 09:59 AM
Well I am for sure! As for not knowing what Lagavulin is Joost, don't they educate you in your country? Lagavulin is one of the finest examples of God's nectar, the real ambrosia of the Gods, the water of life, the reason for living ... single malt Scotch Whisky... phew, I feel better now!
Indeed Bob, we are a dumb uneducated mob over here. (not worthy of God's nectar) How did I ever get by not knowing this?footinmout

We seldom get out of our cave and if we do we look for female companions who are dumb enough to come home with us. (and enjoy a simple beer)

But again being ignorant has the benefit of not having to carry al that weight in knowledge!

Let the games continue! (Hope you find that single line sollution) :whistle:

brettdj
08-14-2005, 06:38 PM
In reality it is being won at the moment because I was looking for a single regular expresssion, and we currently have a series. Also there are still a few anomalies that I need to get back to Dave on, but I haven't yet gotten around to. SO I still owe Dave a response.

If would be the mother of all RegExp's to do this in one hit -my question would be "why?"

Bob, let me know what the anomalies are and I'll tidy them up

Cheers

Dave

Sir Babydum GBE
08-15-2005, 03:02 AM
Bob, Can I have some of your Lagavulin for being so lovely to you in your first few months?
If so, enjoy it and think of me! :)

brettdj
12-30-2013, 05:47 AM
Have you tried the UDF above?

Unfortunately not

From http://aspnet.4guysfromrolla.com/articles/022603-1.aspx (http://aspnet.4guysfromrolla.com/articles/022603-1.aspx)

Lookaround is a feature that is partially implemented in JScript but not in VBScript. <snip>

Cheers

Dave

I note that this is incorrect and Bob was right, lookahead - but not lookbehind - is available

Aussiebear
12-31-2013, 03:43 AM
Left field post or was it the Lagavulin??

brettdj
01-04-2014, 01:58 AM
Left field post or was it the Lagavulin??
I never saw the Lagavulin!

Came across this by chance when I googling for my historical regexp threads :)