PDA

View Full Version : Regular Expression Replace



Paul_Hossler
01-27-2012, 07:22 AM
I have a column of suffixs and a column of company names

I'm trying to use a regular expression replace to remove each suffix if it's on the end of each company name (entire column with a really big replace)

Ex. 100 suffixes (CO, COM, INC, INCORP, ...) and 300K lines of company name

"ACME INCORP" becomes "ACME"


Right now I use brute force: loop through each company for each suffix

I'd like to do the entire company column at once

1. I don't think .Replace with wildcards will work

2. I can't figure out a RegEx expression that will

ConOps and as far as I got is in the attachment

Appreciate any ideas, since right now it can take 20-30 minutes to go thru the data and clean it

Paul

Bob Phillips
01-27-2012, 09:46 AM
Paul,

I am assuming the slow loop is the data loop, and that you don't have many items in the replacement list. I would just use replace



Function ReplaceValues()
Dim i As Long

Application.ScreenUpdating = False

With Worksheets("Replacements")

For i = 2 To .Range("A2").End(xlDown).Row

Worksheets("Data").Columns("A").Replace _
What:=.Cells(i, "A").Value2, _
Replacement:=.Cells(i, "B").Value2, _
LookAt:=xlPart, _
MatchCase:=False
Next i
End With

Application.ScreenUpdating = True
End Function

Kenneth Hobs
01-27-2012, 09:56 AM
I would also recommend using manual calculation mode.

e.g.
http://vbaexpress.com/kb/getarticle.php?kb_id=1035

Paul_Hossler
01-27-2012, 02:03 PM
Bob / Ken

1. thanks
2. I believe that the .Replace xlPart would replace more than just the string when it's at the end

So if the suffix column contains

COMPANY
CO
INC.

and the data column contains

"ACME WIDGET COMPANY"
"COMPANY OF PROGRAMMERS, INC."

then

"ACME WIDGET COMPANY" should become "ACME WIDGET"

and "COMPANY OF PROGRAMMERS, INC." should become "COMPANY OF PROGRAMMERS"

Some sort of regular expression was the only thing I could think of

Right now it's a outer loop for each suffix (40-50), and an inner loop on the companies (300k-400k)


Paul

XLD -- as an aside, what's the advantage of .Value2 instead of just .Value ?

Kenneth Hobs
01-27-2012, 02:23 PM
The Replace that I was thinking of Paul was to Split() the string value and check the first and last elements of the array. You could then Join() the array or put non-blank elements into a dictionary object. It should go fairly quickly.

I use Value2 when I know that it is a number or string, and Value for date types. It is a tad faster I would guess because less guessing of type is done.

Paul_Hossler
01-27-2012, 03:08 PM
I was hoping that I could hit the entire 'Company' list with just a single request for each entry in the 'Suffix' list

I know that I'll need the outer 'Suffix' loop, but that's the short one

Paul

Kenneth Hobs
01-27-2012, 05:59 PM
Option Explicit
Option Compare Text

Sub RemoveSuffixes()
Dim suffRange As Range, c As Range, bc As Range, beforeRange As Range
Dim s As String, a() As String

On Error GoTo EndSub
' http://vbaexpress.com/kb/getarticle.php?kb_id=1035
SpeedOn

Set suffRange = Range("A2", Range("A" & Rows.Count).End(xlUp))
Set beforeRange = Range("C2", Range("C" & Rows.Count).End(xlUp))

For Each bc In beforeRange
If IsEmpty(bc) Then GoTo NextBC
a() = Split(bc.Value2, " ")
For Each c In suffRange
If a(UBound(a)) = c.Value2 Then
ReDim Preserve a(0 To (UBound(a) - 1))
bc.Value2 = Join(a(), " ")
GoTo NextBC
End If
Next c
NextBC:
Next bc

EndSub:
SpeedOff
End Sub

shrivallabha
01-27-2012, 09:36 PM
You really have huge data.

Supposing the Original Company names are in Column A and their replacements listed in Column C, how does this formula fare?

In Cell B2:

=SUBSTITUTE(A2," "&LOOKUP(9999,SEARCH($C$2:$C$101,A2,1),$C$2:$C$101),"")
And then copied down.

If it goes faster then it can be implemented through VBA. And it will be non-looping as we can use DataSeries.

Paul_Hossler
01-28-2012, 01:12 PM
Thnks, but the issue unfortunately is not a simple 1 for 1 replacement

ACME COMPANY --> ACME
BILL COMPANY --> BILL
CHARLES COMPANY --> CHARLES
DAN COMPANY --> DAN
DAN CO --> DAN
DAN COMP --> DAN


That would require a 2 column list for all the unique names that requre replacing and their replacement; that would be a maintenance nightmare

I am trying to make it rules based. For performance I'd like to apply the replace operation to an entire column without looping through the company column

So in the above,

(*)(COMPANY) ---> (\1) (and that pretty much is all the RegEx I know)

This would leave

COMPANY OF PEOPLE --> COMPANY OF PEOPLE since COMPANY is not the suffix


Right now the outer loop contains each suffix string (30-40), and the inner loop is for each of the 300k+ lines

So if the Right(Company, Len(suffix)) = suffix, then trim Len(suffix) from company

30-40 suffixes and 300k+ companies is a LOT of looping

Word and regular expression have a 'only if at the end of the word' option, and that's what I think I'm hoping I can do on an entire column

Paul

shrivallabha
01-28-2012, 09:48 PM
Thnks, but the issue unfortunately is not a simple 1 for 1 replacement

ACME COMPANY --> ACME
BILL COMPANY --> BILL
CHARLES COMPANY --> CHARLES
DAN COMPANY --> DAN
DAN CO --> DAN
DAN COMP --> DAN


That would require a 2 column list for all the unique names that requre replacing and their replacement; that would be a maintenance nightmare

......

Paul I think there's a bit of confusion, the formula should work with your suffix list. I'd think there will be few pitfalls but those deviations can be looked after.

I am attaching the concept as I had it in my mind.

Paul_Hossler
01-29-2012, 08:24 AM
I appreciate all the ideas, and while they will work, I'm still trying to improve performance (avoiding premature optimization as I go)

This is what I currently use, with 2 loops


'not using Ken's SpeedOn and SpeedOff yet
Sub WhatItIsNow()
Call WhatItIsNow_1(Worksheets("Sheet1").Range("A2:A8"), Worksheets("Sheet1").Range("C2:C11"))
End Sub

Sub WhatItIsNow_1(rSuffixs As Range, rSource As Range)
Dim aSuffix As Variant, aCompany As Variant
Dim iSuffix As Long, iCompany As Long


aSuffix = rSuffixs.Value
aCompany = rSource.Value

For iSuffix = LBound(aSuffix, 1) To UBound(aSuffix, 1)

aSuffix(iSuffix, 1) = UCase(aSuffix(iSuffix, 1))


For iCompany = LBound(aCompany, 1) To UBound(aCompany, 1)

If UCase(Right(aCompany(iCompany, 1), Len(aSuffix(iSuffix, 1)))) = aSuffix(iSuffix, 1) Then
aCompany(iCompany, 1) = Trim(Left(aCompany(iCompany, 1), Len(aCompany(iCompany, 1)) - Len(aSuffix(iSuffix, 1))))
End If

Next iCompany
Next iSuffix
rSource.Value = aCompany
End Sub



If I could use the built in Replace


Sub UsingBuiltinReplace()
Call UsingBuiltinReplace_1(Worksheets("Sheet1").Range("A2:A8"), Worksheets("Sheet1").Range("C2:C11"))
End Sub

Sub UsingBuiltinReplace_1(rSuffixs As Range, rSource As Range)
Dim rSuffix As Range

For Each rSuffix In rSuffixs.Cells

rSource.Replace What:=rSuffix.Value, Replacement:=vbNullString, _
LookAt:=xlPart, SearchOrder:=xlByRows, _
MatchCase:=False, SearchFormat:=False, ReplaceFormat:=False
Next
End Sub


but that does not limit the test to just the end of the string

Actually what I do now is not perfect either, since sometimes a Company value will be trimmed twice

Since the built in Replace allows me to 'hit' an entire column, I was hoping a RegEx would allow that also, thus eliminating the inner loop

Paul

mdmackillop
01-29-2012, 10:43 AM
Add an "xx" to the end of each company and adjust the search accordingly
This takes 18 sec with 470k names

Sub DoReplace()
Dim r As Range
Dim Suffix(), s

Suffix = Array(" coxx", " comxx", " companyxx", " incxx", " inc.xx", " incorpxx", "xx")


Set r = Range(Cells(2, 1), Cells(Rows.Count, 1).End(xlUp))
r.Offset(, 1) = "xx"
r.Offset(, 2).FormulaR1C1 = "=RC1 & RC2"
r.Offset(, 2).Value = r.Offset(, 2).Value
r.Offset(, 1).Clear

For Each s In Suffix
r.Offset(, 2).Replace What:=s, Replacement:="", LookAt:=xlPart
Next


End Sub

Paul_Hossler
01-29-2012, 11:36 AM
Now that is VERY clever :beerchug:

The first thing when I get back to work on Monday -- OK, the second thing after my coffee -- will be to try that out on the real data

Paul

mdmackillop
01-29-2012, 12:20 PM
Glad to help. Just hope it works on the real thing!