Consulting

Results 1 to 14 of 14

Thread: Regular Expression Replace

  1. #1
    VBAX Sage
    Joined
    Apr 2007
    Location
    United States
    Posts
    8,728
    Location

    Regular Expression Replace

    I have a column of suffixs and a column of company names

    I'm trying to use a regular expression replace to remove each suffix if it's on the end of each company name (entire column with a really big replace)

    Ex. 100 suffixes (CO, COM, INC, INCORP, ...) and 300K lines of company name

    "ACME INCORP" becomes "ACME"


    Right now I use brute force: loop through each company for each suffix

    I'd like to do the entire company column at once

    1. I don't think .Replace with wildcards will work

    2. I can't figure out a RegEx expression that will

    ConOps and as far as I got is in the attachment

    Appreciate any ideas, since right now it can take 20-30 minutes to go thru the data and clean it

    Paul
    Attached Files Attached Files

  2. #2
    Distinguished Lord of VBAX VBAX Grand Master Bob Phillips's Avatar
    Joined
    Apr 2005
    Posts
    25,453
    Location
    Paul,

    I am assuming the slow loop is the data loop, and that you don't have many items in the replacement list. I would just use replace

    Function ReplaceValues()
        Dim i As Long
        Application.ScreenUpdating = False
        With Worksheets("Replacements")
            For i = 2 To .Range("A2").End(xlDown).Row
                Worksheets("Data").Columns("A").Replace _
                What:=.Cells(i, "A").Value2, _
                Replacement:=.Cells(i, "B").Value2, _
                LookAt:=xlPart, _
                MatchCase:=False
            Next i
        End With
        Application.ScreenUpdating = True
    End Function
    ____________________________________________
    Nihil simul inventum est et perfectum

    Abusus non tollit usum

    Last night I dreamed of a small consolation enjoyed only by the blind: Nobody knows the trouble I've not seen!
    James Thurber

  3. #3
    VBAX Guru Kenneth Hobs's Avatar
    Joined
    Nov 2005
    Location
    Tecumseh, OK
    Posts
    4,956
    Location
    I would also recommend using manual calculation mode.

    e.g.
    http://vbaexpress.com/kb/getarticle.php?kb_id=1035

  4. #4
    VBAX Sage
    Joined
    Apr 2007
    Location
    United States
    Posts
    8,728
    Location
    Bob / Ken

    1. thanks
    2. I believe that the .Replace xlPart would replace more than just the string when it's at the end

    So if the suffix column contains

    COMPANY
    CO
    INC.

    and the data column contains

    "ACME WIDGET COMPANY"
    "COMPANY OF PROGRAMMERS, INC."

    then

    "ACME WIDGET COMPANY" should become "ACME WIDGET"

    and "COMPANY OF PROGRAMMERS, INC." should become "COMPANY OF PROGRAMMERS"

    Some sort of regular expression was the only thing I could think of

    Right now it's a outer loop for each suffix (40-50), and an inner loop on the companies (300k-400k)


    Paul

    XLD -- as an aside, what's the advantage of .Value2 instead of just .Value ?

  5. #5
    VBAX Guru Kenneth Hobs's Avatar
    Joined
    Nov 2005
    Location
    Tecumseh, OK
    Posts
    4,956
    Location
    The Replace that I was thinking of Paul was to Split() the string value and check the first and last elements of the array. You could then Join() the array or put non-blank elements into a dictionary object. It should go fairly quickly.

    I use Value2 when I know that it is a number or string, and Value for date types. It is a tad faster I would guess because less guessing of type is done.

  6. #6
    VBAX Sage
    Joined
    Apr 2007
    Location
    United States
    Posts
    8,728
    Location
    I was hoping that I could hit the entire 'Company' list with just a single request for each entry in the 'Suffix' list

    I know that I'll need the outer 'Suffix' loop, but that's the short one

    Paul

  7. #7
    VBAX Guru Kenneth Hobs's Avatar
    Joined
    Nov 2005
    Location
    Tecumseh, OK
    Posts
    4,956
    Location
    Option Explicit
    Option Compare Text
    
    Sub RemoveSuffixes()
        Dim suffRange As Range, c As Range, bc As Range, beforeRange As Range
        Dim s As String, a() As String
        On Error GoTo EndSub
        ' http://vbaexpress.com/kb/getarticle.php?kb_id=1035
        SpeedOn
        Set suffRange = Range("A2", Range("A" & Rows.Count).End(xlUp))
        Set beforeRange = Range("C2", Range("C" & Rows.Count).End(xlUp))
        For Each bc In beforeRange
            If IsEmpty(bc) Then GoTo NextBC
                a() = Split(bc.Value2, " ")
                For Each c In suffRange
                    If a(UBound(a)) = c.Value2 Then
                        ReDim Preserve a(0 To (UBound(a) - 1))
                        bc.Value2 = Join(a(), " ")
                        GoTo NextBC
                    End If
                Next c
            NextBC:
        Next bc
    EndSub:
      SpeedOff
    End Sub

  8. #8
    VBAX Expert shrivallabha's Avatar
    Joined
    Jan 2010
    Location
    Mumbai
    Posts
    750
    Location
    You really have huge data.

    Supposing the Original Company names are in Column A and their replacements listed in Column C, how does this formula fare?

    In Cell B2:
    =SUBSTITUTE(A2," "&LOOKUP(9999,SEARCH($C$2:$C$101,A2,1),$C$2:$C$101),"")
    And then copied down.

    If it goes faster then it can be implemented through VBA. And it will be non-looping as we can use DataSeries.
    Regards,
    --------------------------------------------------------------------------------------------------------
    Shrivallabha
    --------------------------------------------------------------------------------------------------------
    Using Excel 2016 in Home / 2010 in Office
    --------------------------------------------------------------------------------------------------------

  9. #9
    VBAX Sage
    Joined
    Apr 2007
    Location
    United States
    Posts
    8,728
    Location
    Thnks, but the issue unfortunately is not a simple 1 for 1 replacement

    ACME COMPANY --> ACME
    BILL COMPANY --> BILL
    CHARLES COMPANY --> CHARLES
    DAN COMPANY --> DAN
    DAN CO --> DAN
    DAN COMP --> DAN


    That would require a 2 column list for all the unique names that requre replacing and their replacement; that would be a maintenance nightmare

    I am trying to make it rules based. For performance I'd like to apply the replace operation to an entire column without looping through the company column

    So in the above,

    (*)(COMPANY) ---> (\1) (and that pretty much is all the RegEx I know)

    This would leave

    COMPANY OF PEOPLE --> COMPANY OF PEOPLE since COMPANY is not the suffix


    Right now the outer loop contains each suffix string (30-40), and the inner loop is for each of the 300k+ lines

    So if the Right(Company, Len(suffix)) = suffix, then trim Len(suffix) from company

    30-40 suffixes and 300k+ companies is a LOT of looping

    Word and regular expression have a 'only if at the end of the word' option, and that's what I think I'm hoping I can do on an entire column

    Paul

  10. #10
    VBAX Expert shrivallabha's Avatar
    Joined
    Jan 2010
    Location
    Mumbai
    Posts
    750
    Location
    Quote Originally Posted by Paul_Hossler
    Thnks, but the issue unfortunately is not a simple 1 for 1 replacement

    ACME COMPANY --> ACME
    BILL COMPANY --> BILL
    CHARLES COMPANY --> CHARLES
    DAN COMPANY --> DAN
    DAN CO --> DAN
    DAN COMP --> DAN


    That would require a 2 column list for all the unique names that requre replacing and their replacement; that would be a maintenance nightmare

    ......

    Paul
    I think there's a bit of confusion, the formula should work with your suffix list. I'd think there will be few pitfalls but those deviations can be looked after.

    I am attaching the concept as I had it in my mind.
    Attached Files Attached Files
    Regards,
    --------------------------------------------------------------------------------------------------------
    Shrivallabha
    --------------------------------------------------------------------------------------------------------
    Using Excel 2016 in Home / 2010 in Office
    --------------------------------------------------------------------------------------------------------

  11. #11
    VBAX Sage
    Joined
    Apr 2007
    Location
    United States
    Posts
    8,728
    Location
    I appreciate all the ideas, and while they will work, I'm still trying to improve performance (avoiding premature optimization as I go)

    This is what I currently use, with 2 loops

    'not using Ken's SpeedOn and SpeedOff yet
    Sub WhatItIsNow()
        Call WhatItIsNow_1(Worksheets("Sheet1").Range("A2:A8"), Worksheets("Sheet1").Range("C2:C11"))
    End Sub
     
    Sub WhatItIsNow_1(rSuffixs As Range, rSource As Range)
        Dim aSuffix As Variant, aCompany As Variant
        Dim iSuffix As Long, iCompany As Long
        aSuffix = rSuffixs.Value
        aCompany = rSource.Value
        For iSuffix = LBound(aSuffix, 1) To UBound(aSuffix, 1)
            aSuffix(iSuffix, 1) = UCase(aSuffix(iSuffix, 1))
            For iCompany = LBound(aCompany, 1) To UBound(aCompany, 1)
                If UCase(Right(aCompany(iCompany, 1), Len(aSuffix(iSuffix, 1)))) = aSuffix(iSuffix, 1) Then
                    aCompany(iCompany, 1) = Trim(Left(aCompany(iCompany, 1), Len(aCompany(iCompany, 1)) - Len(aSuffix(iSuffix, 1))))
                End If
            Next iCompany
        Next iSuffix
        rSource.Value = aCompany
    End Sub

    If I could use the built in Replace

    Sub UsingBuiltinReplace()
        Call UsingBuiltinReplace_1(Worksheets("Sheet1").Range("A2:A8"), Worksheets("Sheet1").Range("C2:C11"))
    End Sub
     
    Sub UsingBuiltinReplace_1(rSuffixs As Range, rSource As Range)
        Dim rSuffix As Range
        For Each rSuffix In rSuffixs.Cells
            rSource.Replace What:=rSuffix.Value, Replacement:=vbNullString, _
            LookAt:=xlPart, SearchOrder:=xlByRows, _
            MatchCase:=False, SearchFormat:=False, ReplaceFormat:=False
        Next
    End Sub
    but that does not limit the test to just the end of the string

    Actually what I do now is not perfect either, since sometimes a Company value will be trimmed twice

    Since the built in Replace allows me to 'hit' an entire column, I was hoping a RegEx would allow that also, thus eliminating the inner loop

    Paul

  12. #12
    Administrator
    VP-Knowledge Base
    VBAX Grand Master mdmackillop's Avatar
    Joined
    May 2004
    Location
    Scotland
    Posts
    14,489
    Location
    Add an "xx" to the end of each company and adjust the search accordingly
    This takes 18 sec with 470k names

    Sub DoReplace()
        Dim r As Range
        Dim Suffix(), s
        Suffix = Array(" coxx", " comxx", " companyxx", " incxx", " inc.xx", " incorpxx", "xx")
        Set r = Range(Cells(2, 1), Cells(Rows.Count, 1).End(xlUp))
        r.Offset(, 1) = "xx"
        r.Offset(, 2).FormulaR1C1 = "=RC1 & RC2"
        r.Offset(, 2).Value = r.Offset(, 2).Value
        r.Offset(, 1).Clear
        For Each s In Suffix
            r.Offset(, 2).Replace What:=s, Replacement:="", LookAt:=xlPart
        Next
    End Sub
    MVP (Excel 2008-2010)

    Post a workbook with sample data and layout if you want a quicker solution.


    To help indent your macros try Smart Indent

    Please remember to mark threads 'Solved'

  13. #13
    VBAX Sage
    Joined
    Apr 2007
    Location
    United States
    Posts
    8,728
    Location
    Now that is VERY clever

    The first thing when I get back to work on Monday -- OK, the second thing after my coffee -- will be to try that out on the real data

    Paul

  14. #14
    Administrator
    VP-Knowledge Base VBAX Grand Master mdmackillop's Avatar
    Joined
    May 2004
    Location
    Scotland
    Posts
    14,489
    Location
    Glad to help. Just hope it works on the real thing!
    MVP (Excel 2008-2010)

    Post a workbook with sample data and layout if you want a quicker solution.


    To help indent your macros try Smart Indent

    Please remember to mark threads 'Solved'

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •