Bob Phillips
05-25-2005, 04:34 PM
In the few weeks that I have been visiting this forum, I have seen a number of responses from brettdj advocating the use of Regular Expressions. I have a problem with RegExp, so I thought I would throw it out as a challenge to see if anyone can take what I have done to its conclusion. I have addressed this post to brettdj, but I would love to hear suggestions from anyone. What's in it for you? Okay, a virtual bottle of Lagavulin to the winner (virtual means that you win it, but as I can't deliver it, I will drink it for you. I will let you know what it tastes like though :) ).
As a quick aside, I tend not to advocate the use of RegExp for problem solving. That is not because they are not good; they are great, and are the ideal tool for most (all?) text parsing problems. I don't advocate them because of future maintainability. IMO, RegExp are not intuitive and not something that you use every day. RegExp to me seem something for the real obscure language function geek, or the super brains like Harlan Grove, not for us lesser mortals. Couple that with the fact that they are not like the high-level languages that we mainly use nowadays, and it means that when you want to change a RegExp, finding the skills is difficult, so we either pay big bucks for a RegExp geek, or we do it another way. I know which way my vote goes.
Anyway, I digress, so on to the problem. I am sure you have come across as many questions as I have regarding splitting a name pair (surname & forename, or vice versa) into separate components. This has always seemed an ideal candidate for RegExp to me (although I never offer it in my responses to such questions for the reasons stated above), but when I was building my sort add-in I decided to use RegExp in the function to sort by name. Obviously I need to cater for all varieties of name, such as
- Mr Alan Jones
- Ian St John
- Peter McDougall
- Peter MacDougall
- Baron von Richtofen
etc. etc.
The problem is that, with my RegExp skills (or lack of them), I couldn't get a RegExp to do it all. I had to add some code to pre-process the data before I passed it to my RegExp routine. The things I couldn't manage were: the Mc and Mac issues (they need to be sorted as the first of the M names); St (likewise, as the first of the S names); ensuring that the Forename and Surname are capitalized (so that Jim jones doesn't get sorted after Alan Minter); and treating von and Von, van and Van as the same.
This is the RegExp parser, nothing unusual here
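'---------------------------------------------------------------------
Private Function RESubString(Inp As String, _
                             Pattern As String, _
                             Optional N As Long = 0) As String
'---------------------------------------------------------------------
' Returns the N-th (zero-based) match of Pattern in Inp, "" if there is
' no such match, or "RE Error" if the expression cannot be evaluated
    Dim oRegExp As Object, m As Object

    On Error GoTo RE_error
    Set oRegExp = CreateObject("VBScript.RegExp")
    oRegExp.Pattern = Pattern
    oRegExp.Global = True
    Set m = oRegExp.Execute(Inp)
    ' IIf evaluates both of its branches, so m(N) would raise an error
    ' whenever there is no match; test the count explicitly instead
    If m.Count > N Then
        RESubString = m(N).Value
    Else
        RESubString = ""
    End If
    GoTo RE_Exit
RE_error:
    RESubString = "RE Error"
RE_Exit:
    Set oRegExp = Nothing
    On Error GoTo 0
End Function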
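To show what the N argument selects, here is a minimal sketch of how the match collection returned by Execute is laid out. AllMatches is a hypothetical helper written only for this illustration (it is not part of the add-in), and the sample pattern is an assumption chosen to keep the output easy to read.

```vba
' Hypothetical helper: joins every match of Pattern in Text with "|"
' so that the zero-based position of each match is easy to see.
Function AllMatches(ByVal Text As String, ByVal Pattern As String) As String
    Dim re As Object, m As Object, result As String

    Set re = CreateObject("VBScript.RegExp")
    re.Pattern = Pattern
    re.Global = True          ' collect every match, not just the first
    For Each m In re.Execute(Text)
        If Len(result) > 0 Then result = result & "|"
        result = result & m.Value
    Next m
    AllMatches = result
End Function

Sub DemoAllMatches()
    ' A capital letter followed by one or more word characters
    Debug.Print AllMatches("Mr Alan Jones", "[A-Z]\w+")   ' Mr|Alan|Jones
End Sub
```

With that pattern and input, N = 0 would pick "Mr", N = 1 "Alan" and N = 2 "Jones".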
This is the original calling routine, which sets up the parsing string (the commented-out line is one of my other attempts)
'---------------------------------------------------------------------
Private Function LastName(nme As String)
'---------------------------------------------------------------------
' Function: Extracts the last name from a name string
' Synopsis: Calls 2 functions:-
'           ReFormat    - frigs the data for St, Mc and Mac, and
'                         capitalizes all names
'           RESubString - runs a regular expression to return the
'                         last name
'---------------------------------------------------------------------
    Dim sREgExp As String

'   sREgExp = "\b([a-z]+ +)*(O'|Mc|Mac)?[A-Z](\w+\S?)*(-[A-Z](\w+\S?)*)?\b(?=(( +)(Sr\.?|Jr\.?|[IVX][IVX]*))|,|\s*$)"
    sREgExp = "\b([a-z]+\s+)*[A-Z](\w+\S?)*([-'][A-Z](\w+\S?)*)?\b(?=(\s+([JS]r\.?|[IVX]+))?\s*$|,)"
    LastName = RESubString(ReFormat(nme), sREgExp)
End Function
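For anyone who wants to poke at the expression on its own, the sketch below applies the active pattern directly to the text as typed, skipping the ReFormat step. RawSurname is a hypothetical name used only for this illustration, and the results shown are for the unmodified input, not for what LastName returns after pre-processing.

```vba
' Hypothetical helper: applies the surname pattern to Text exactly as typed
' (no ReFormat) and returns the first match, or "" if there is none.
Function RawSurname(ByVal Text As String) As String
    Dim re As Object, matches As Object

    Set re = CreateObject("VBScript.RegExp")
    ' Same expression as in LastName, split over two lines for readability
    re.Pattern = "\b([a-z]+\s+)*[A-Z](\w+\S?)*([-'][A-Z](\w+\S?)*)?\b" & _
                 "(?=(\s+([JS]r\.?|[IVX]+))?\s*$|,)"
    re.Global = True
    Set matches = re.Execute(Text)
    If matches.Count > 0 Then RawSurname = matches(0).Value
End Function

Sub DemoRawSurname()
    Debug.Print RawSurname("Alan Jones Jr")         ' Jones
    Debug.Print RawSurname("Baron von Richtofen")   ' von Richtofen
    Debug.Print RawSurname("Smith, John")           ' Smith
End Sub
```

The lookahead at the end is what anchors the match: the surname must be followed by an optional Jr/Sr or Roman-numeral suffix and the end of the string, or by a comma.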
And these are all my pre-processing routines which I wrote to make up for the deficiencies of my RegExp parsing expression.
Private Function ReFormat(Name As String)
    ReFormat = Capitalize(AdjustNamePreps(Name))
End Function
'---------------------------------------------------------------------------
Public Function Capitalize(Name As String)
'---------------------------------------------------------------------------
' Lower-cases the whole name, then upper-cases the first letter of each part
    Dim aParts
    Dim i As Long

    aParts = pzSplit(LCase(RemoveMultipleSpaces(Name)), " ")
    For i = LBound(aParts, 1) To UBound(aParts, 1)
        ' Mid(..., 2) is safe for an empty part, whereas Right with a
        ' length of -1 raises an error
        aParts(i) = UCase(Left(aParts(i), 1)) & Mid(aParts(i), 2)
    Next i
    Capitalize = pzJoin(aParts, " ")
End Function
'---------------------------------------------------------------------------
Public Function AdjustNamePreps(ByVal Name As String)
'---------------------------------------------------------------------------
' Inserts an "A" after the initial letter of St, Mc and Mac, which is
' intended to make these names sort ahead of the other S and M names.
' Name is passed ByVal so that the caller's string is not altered
    Dim i As Long

    ' "St " -> "SAt", also dropping the space that follows St
    For i = 1 To Len(Name)
        If Mid(Name, i, 1) = "S" Then
            If Mid(Name, i + 1, 1) = "t" And Mid(Name, i + 2, 1) = " " Then
                Name = Left(Name, i) & Chr(65) & _
                       Mid(Name, i + 1, 1) & Mid(Name, i + 3, 255)
                Exit For
            End If
        End If
    Next i

    ' "Mc" -> "MAc" and "Mac" -> "MAac"
    For i = 1 To Len(Name)
        If Mid(Name, i, 1) = "M" Then
            If Mid(Name, i + 1, 1) = "c" Then
                Name = Left(Name, i) & Chr(65) & Mid(Name, i + 1, 255)
                Exit For
            ElseIf Mid(Name, i + 1, 1) = "a" And Mid(Name, i + 2, 1) = "c" Then
                Name = Left(Name, i) & Chr(65) & Mid(Name, i + 1, 255)
                Exit For
            End If
        End If
    Next i
    AdjustNamePreps = Name
End Function
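To make the splice in AdjustNamePreps easier to follow, here is a minimal sketch of the same string arithmetic on its own. SpliceA is a hypothetical name used only for this illustration; it uses InStr to find the position rather than the character-by-character loop above.

```vba
' Hypothetical helper: inserts "A" immediately after the first character
' of the first case-sensitive occurrence of Prefix in Text.
Function SpliceA(ByVal Text As String, ByVal Prefix As String) As String
    Dim i As Long

    i = InStr(1, Text, Prefix, vbBinaryCompare)
    If i = 0 Then
        SpliceA = Text        ' prefix not present: leave the text alone
    Else
        SpliceA = Left(Text, i) & Chr(65) & Mid(Text, i + 1)
    End If
End Function

Sub DemoSpliceA()
    Debug.Print SpliceA("Peter McDougall", "Mc")     ' Peter MAcDougall
    Debug.Print SpliceA("Peter MacDougall", "Mac")   ' Peter MAacDougall
    Debug.Print SpliceA("Alan Jones", "Mc")          ' Alan Jones
End Sub
```

The St case works the same way, except that the space after St is dropped as well, so "Ian St John" becomes "Ian SAtJohn" before Capitalize is applied.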