PDA

View Full Version : Solved: Need to Replace incorrectly spelled names, based on a list of before and after.



frank_m
12-21-2011, 04:19 AM
Thanks in advance for your time to help me with this dilemma.

In Column A i've put together a list of thousands of Customer Invoice names, where a few of the names are misspelled. I have already identified the spelling variations In Column B , and highlighted in blue the correct spelling, followed by the misspells.(seperated with commas)

I'd like to have code that looks at the misspelled name possibilities and makes the correction directly in column A, but also reports in Column C
the date and time, as well as the before and after spelling.

I've attached a workbook to make what I need more clear

Column A---------(Column B) to the left in blue is the correct spelling, other's are variations to look for
U.S FASTENER----(U.S. FASTENER, U.S FASTENER, US FASTENERS, U.S FASTENERS, U.S. FASTENERS)
U.S FASTENERS
U.S. FASTENERS
U.S. FASTENER
U.S FASTENERS
US FASTENERS-------------------(Column C) History of before and after change
U.S. FASTENER-------------------(Todays Date, U.S. FASTENER, changed from: US FASTENERS)

p45cal
12-21-2011, 06:11 AM
This should get you started (works on the active sheet):Sub correctSpelling()
With ActiveSheet
For Each cll In Intersect(.UsedRange, .Columns(1)).Cells
'cll.Select
Set FoundSpelling = Nothing
Set FoundSpelling = .Columns(2).Find(Application.Trim(cll.Value), LookIn:=xlFormulas, lookat:=xlPart)
If Not FoundSpelling Is Nothing Then
SpelArr = Split(Application.Trim(FoundSpelling.Value), ",")
If UBound(SpelArr) > 0 Then
TheRightSpelling = Application.Trim(SpelArr(0))
If cll.Value <> TheRightSpelling Then 'spelling needs correction:
.Cells(cll.Row, 3).Value = Date & " from " & cll.Value & " to " & TheRightSpelling
cll.Value = TheRightSpelling
End If
End If
End If
Next cll
End With
End Sub

frank_m
12-21-2011, 06:55 AM
Hi p45cal,

Thanks so much kind sir, (I've studied the results fairly carefully) and I'm nearly certain that catches everything very nicely. :bow:

Now I need to work on a validation strategy to prevent a new mess from being accumulated..

For questions on that I'm going to start a new thread.

Thanks again

Paul_Hossler
12-21-2011, 07:08 AM
I've had pretty good luck with a 'table' to standardize enteries. Not perfect, and does require manual maintenance.

The 'table' has a 'cleaned up' version of enteries that are used to match a standard

Something else to think about



Option Explicit
Function CorrectSpelling(s As String, r As Range) As String
Dim sOrig As String, sWork As String, sMatch As String, sNew As String
Dim i As Long

sOrig = s

sWork = Trim(UCase(sOrig))

sMatch = vbNullString

For i = 1 To Len(sWork)
Select Case Mid(sWork, i, 1)
Case "A" To "Z", "0" To "9"
sMatch = sMatch & Mid(sWork, i, 1)
End Select
Next i


Call TrimRight(sMatch, "S")
Call TrimRight(sMatch, " CO")
Call TrimRight(sMatch, " INC")
'etc.



i = 0
On Error Resume Next
i = Application.WorksheetFunction.Match(sMatch, r.Columns(1), 0)
On Error GoTo 0



'Todays Date, U.S. FASTENER, changed from: US FASTENERS
If i = 0 Then
CorrectSpelling = Format(Now, "m/d/yyyy") & " " & s & ", Not Changed"

ElseIf r.Cells(i, 2).Value = sOrig Then
CorrectSpelling = Format(Now, "m/d/yyyy") & " " & s & ", Not Changed"


Else
CorrectSpelling = Format(Now, "m/d/yyyy") & " " & r.Cells(i, 2).Value & ", changed from: " & sOrig
End If

End Function

Private Sub TrimRight(s As String, suffix As String)
If Right(s, Len(suffix)) = suffix Then
s = Left(s, Len(s) - Len(suffix))
End If
End Sub



Paul

p45cal
12-21-2011, 07:36 AM
(I've studied the results fairly carefully) and I'm nearly certain that catches everything very nicely.Just watch out for valid co. names such as FINITY. Because that name can be found in AFFINITY ENG it will get changed!

frank_m
12-21-2011, 08:01 AM
Thank the heavens your eyes are so much keener than mine.. That is likely going to be a problem down the road. Another example would CAL, that will be changed to CALIFORNIA METAL PRODUCTS when it is only CALIMETAL that should be changed.

I guess its back to the drawing board.

I'll take a look at your validation now. Thanks again

Edit: I see now that your last code is another method of attack for my situation.. Thanks buddy, I'll start fiddling with it now and report back.

Edit#2 - I need to take a nap for a few hours, as my heads getting dizzy trying to wrap my head around how pass the strings and range to the functions. I get that way, from both not enough sleep and sometimes just because I have a pea sized brain :rofl:

Those function's address the corrections only, am I right? ... not validation<- for entry validation I would think its best for me to post a new thread.

Paul_Hossler
12-21-2011, 09:22 AM
Just watch out for valid co. names such as FINITY. Because that name can be found in AFFINITY ENG it will get changed!


The table approach would handle things like that, but the down side is having to keep the table updated

The last time I used something like this, I added a test to catch matching first parts, so "CAL" would match "CALIF..." and "CALCI ..." and then match it to the standard. The data when I used this, didn't have too much overlap

Paul

shrivallabha
12-21-2011, 10:59 AM
Is this related to your case:
http://www.mrexcel.com/forum/showthread.php?t=195635&highlight=fuzzy+vlookup&page=3

frank_m
12-21-2011, 02:07 PM
HI shrivallabha,

Are you asking if the code at that link is helpful to me?

I thought perhaps you think that is my thread, and its not :dunno

:)

Edit: If you are asking if that code is helpful to me, I have to admit that I can't even figure out how to pass my strings to the function.
I know its simple stuff to most of you guys, but that's one of my big weak points to my small pea brain.
-- I took a two hour nap, thinking that would help, but how to use that function, or Paul's function ABOVE, is eluding me :wot

frank_m
12-21-2011, 02:40 PM
Since I've already done the job of listing all misspelled variation's and listed them in the column B cell,
I'm thinking it would be fairly straight forward to modify P45cal's code to only make a change if the column B cell contains the misspelled variation?

In other words.

In my posted workbook in post#1, the first item in Column A is:
U.S FASTENER

in the Column B cell of that same row, I have, a list starting with the correct spelling, followed by all of the misspelled variations that exist:
U.S. FASTENER, U.S FASTENER, US FASTENER, U.S FASTENERS, U.S. FASTENERS

What I'm suggesting is that I change my column B string to only contain the misspells
U.S FASTENER, US FASTENER, U.S FASTENERS, U.S. FASTENERS

so when the code finds something it wants to change it would first try to match it in the column B cell. If a match is not there, the variation correction is skipped.

Edit: Sorry, more brain freeze, i see now that I gave some misinformation in my last post, as it is not the column B cell of the same row that has the variations, but I could list the correct spelling in the column A cell of another sheet and the mispell variation's in the column B cell of that same row, if that helps.

frank_m
12-21-2011, 03:51 PM
I've attached a new workbook that has what I call a dictionary sheet.

Hope that helps

Paul_Hossler
12-21-2011, 08:15 PM
The thing that you might consider is if there is a slight variation in the data afterwhile that you did not plan for



U.S. FASTENER, U.S FASTENER, US FASTENER, U.S FASTENERS, U.S. FASTENERS


but what about things like U.S.<space><space>FASTENER or U.S.FASTENER

Paul

frank_m
12-21-2011, 09:05 PM
HI Paul,

In the cases you described I would just add those variations to my variation's list. - I'm simultaneously working on trying to develop a validation system to prevent future misspells. - This here is a preemptive measure as the existing data needs to be fixed. - Any new errors will be even easier to spot, as new entries are at the top of the list.

I'm able to pick out the variations visually, then I manually place them in the variations list. I've done this already done with 18,000 records, using the advanced filter unique values, and sorting, as the result is only about 125 different names, 25 of which are misspelled
In fact, being that there are only 25, I could do a manual find and replace for each variation, but sure would like to have a tool that I can re-use for both this and similar data fixing in other columns.

GTO
12-21-2011, 09:53 PM
Hi All,

This probably includes some unnecessary "safety's", but seems to handle cases such as mention by Pascal at post #5.

Tested against Rev2_SpellingCorrectionNeeded.xls and later against Rev3_SpellingCorrectionNeeded.xls (attachment at post #1).

I also tried against an altered 'Correct the Spelling Worksheet' where I tacked in "FINITY, finity, Finity,Finity Eng" in Col B and tacked in:

finity
finity ENG
finity
finity ENG
AFFINITY
AFFINITY
AFFINITY ENG
AFFINITY ENG

...in Col A.


Option Explicit

Sub AStart()
Call ReplaceSpecificMissSpelling
End Sub

Function ReplaceSpecificMissSpelling()
Dim REX As Object ' RegExp
Dim wks As Worksheet
Dim rngData As Range
Dim rngAnomalies As Range
Dim aryData As Variant
Dim aryAnomalies As Variant
Dim aryAdvise As Variant
Dim SplitVals As Variant
Dim lData As Long
Dim lAnamolies As Long
Dim i As Long
Dim ReplaceWith As String
Dim LookForPattern As String

Const STARTING_ROW As Long = 3

'----TEMP CODE
Dim HACK As Double: HACK = Timer
'----END TEMP CODE

'// This can be ditched when setting wks to a worksheet based on tab name or if //
'// simply using the sheet's codename. //
If Not TypeName(ActiveSheet) = "Worksheet" Then
Exit Function
ElseIf Not ActiveSheet.Type = xlWorksheet Then ' &HFFFFEFB9
Exit Function
End If

Set wks = ActiveSheet
With wks
'// Played with code long enough to handle empty cells in rngData, did not test //
'// against empty cells in rngAnomalies. //
Set rngData = .Range(.Cells(STARTING_ROW, 1), .Cells(.Rows.Count, 1).End(xlUp))
Set rngAnomalies = .Range(.Cells(STARTING_ROW, 2), .Cells(.Rows.Count, 2).End(xlUp))
End With

'// Plunk the ranges' vals into arrays. Add a "Column" to aryData to hold a flag //
'// and size an array for output in Column C. //
aryData = rngData.Value
aryAnomalies = rngAnomalies.Value
ReDim Preserve aryData(1 To UBound(aryData, 1), 1 To 2)
ReDim aryAdvise(1 To UBound(aryData, 1), 1 To 1)

'// Trim data in Col A, and set a flag indicating that the value has not been //
'// changed. //
For lData = 1 To UBound(aryData, 1)
aryData(lData, 1) = Trim(aryData(lData, 1))
aryData(lData, 2) = False
Next

Set REX = CreateObject("VBScript.RegExp")

'// For ea cell in amonalies... //
For lAnamolies = 1 To UBound(aryAnomalies, 1)
'// May need a better test, hopefully no cells contain a single comma. //
If InStr(1, aryAnomalies(lAnamolies, 1), ",") >= 1 Then
'// Split on commas, grab first element for good value, and loop through //
'// goofy values, building a pattern. Tacked in IF test after finding error//
'// in returns due to trailing comma. (like: "LPI, L.P.I,") //
SplitVals = Split(aryAnomalies(lAnamolies, 1), ",")
ReplaceWith = Trim(SplitVals(0))
LookForPattern = vbNullString
For i = LBound(SplitVals) + 1 To UBound(SplitVals)
If Not Trim(SplitVals(i)) = vbNullString Then
LookForPattern = LookForPattern & "^" & Trim(SplitVals(i)) & "$|"
End If
Next

'// Sfter fix above for trailing commas, this MAY no longer be necessary. //
'// I can see no harm, so left in. //
With REX
.Global = True
.IgnoreCase = True
.Pattern = "\|{2,}"
If .Test(LookForPattern) Then
LookForPattern = .Replace(LookForPattern, "|")
End If
End With

'// Strip final Alternate character (vertical bar) for the RegExp.Pattern //
If Right(LookForPattern, 1) = "|" Then
LookForPattern = Left(LookForPattern, Len(LookForPattern) - 1)
End If

With REX
.Global = False
.IgnoreCase = True
'// Update .Pattern for each anomally list //
.Pattern = LookForPattern

'// Loop through first "Column" of aryData... //
For lData = 1 To UBound(aryData, 1)
'// IF the element (cell value in Col A) is found in ANY of the //
'// alternations, AND, the cell val is NOT = to ReplaceWith (to take//
'// care of unnecessary replacing), AND the element has not already //
'// been changed, THEN plunk the advisement in the correct "row" of //
'// aryAdvise, correct the element in aryData and flip the flag. //
If .Test(aryData(lData, 1)) _
And Not aryData(lData, 1) = ReplaceWith _
And Not aryData(lData, 2) Then
'Todays Date, U.S. FASTENER, changed from: US FASTENERS
aryAdvise(lData, 1) = Format(Expression:=Date, _
Format:="dd mmm yyyy") & _
", " & _
ReplaceWith & _
", changed from: " & _
aryData(lData, 1)

aryData(lData, 1) = .Replace(aryData(lData, 1), ReplaceWith)
aryData(lData, 2) = True
End If
Next
End With
End If
Next

'// The second column of aryData "falls off"
rngData.Value = aryData
rngData.Offset(, 2).Value = aryAdvise

'----TEMP CODE
Dim TmpString As String
TmpString = "RegExp (late-bound) took: " & _
FormatNumber(Timer - HACK, 3, vbTrue, vbTrue, vbFalse) & " seconds."
Debug.Print TmpString
MsgBox TmpString
'----END TEMP CODE
End Function


Hope that helps,

Mark

GTO
12-21-2011, 10:02 PM
Sorry, forgot about tags mis-indenting continued lines. Here's wb.

frank_m
12-22-2011, 03:19 AM
HI Mark,

Awesome, this is processing things much in the way I was trying to ask for..

I'm trying to learn now to do a lot more testing before I give the final word on success, but so far your code seems to only change words that are included in my Column B predefined variations, (JUST AS I WANTED) and leaves everything else alone, including the examples that everyone here has made us aware of.

Edit: Found that it is missing changing UTI-MATE to ULTI-MATE, so I would have to guess it might miss other situations, but it ceratinly seems very close. - During the next day I'll try to come up with some new variations and/or items it may miss.

What can I say sir. You certainly have my bow of appreciation for your efforts :bow:

:friends:

I've attached a new workbook version 4, that has a few more examples and has a button to recover the pre-change data from a backup sheet, to make results more evident and multiple testing easier.

GTO
12-22-2011, 07:31 AM
The incorrect "UTI-MATE" shows up in rows: 29, 67 and 94 of Col A.

ULTI-MATE's aberrations are listed twice in Col B, at rows 13 and 16. Even with that, at least in my testing, all are corrected.

At what row is the incorrect "UTI-MATE" missed in your testing of workbook Rev3_SpellingCorrectionNeeded.xls from post #1?

Mark

PS - I took a super-quick look at mws01_Rev4.0_SpellingCorrectionNeeded.xls. "ULT-IMATE" is not listed as a correction to make.

frank_m
12-22-2011, 10:20 AM
Hi Mark,

You are correct.. This sure is one "thought to be discrepancy", that I'm so very happy to wrong about.

I had ULT-IMATE as the misspell in Col A and UTI-MATE in Col B.
--- I stared at them both several times, nearly side by side and the spelling looked the same to me :wot

So sorry partner

Great piece of code :thumb - Thankyou very much.

I've posted another workbook, so that if anyone happens to find it useful, my list is now correct, as to not be confusing..

frank_m
12-22-2011, 08:06 PM
Figured I'd share the final version, just incase anyone finds it useful.

I tweaked it a little, just to make it a more universal tool, by incorporating Column Constants.

By using constants at the begining of the code, the Offset does not need to be adjusted for Advise, or the Columns for ranges.
For instance, in my actual workbook the data is in Column 3, the Anomalies in Column 31 and Advise in Column 32

Option Explicit

Sub AStart()
Call ReplaceSpecificMissSpelling
End Sub

Function ReplaceSpecificMissSpelling()
Dim REX As Object ' RegExp
Dim wks As Worksheet
Dim rngData As Range
Dim rngAnomalies As Range
Dim aryData As Variant
Dim aryAnomalies As Variant
Dim aryAdvise As Variant
Dim SplitVals As Variant
Dim lData As Long
Dim lAnamolies As Long
Dim i As Long
Dim ReplaceWith As String
Dim LookForPattern As String

Const STARTING_ROW As Long = 3 '16 in my actual workbook
Const Data_Col As Long = 1 '3 in my actual workbook
Const Anomalies_Col As Long = 2 '31 in my actual workbook
Const AdviseCol As Long = 3 '32 in my actual workbook
'----TEMP CODE
Dim HACK As Double: HACK = Timer
'----END TEMP CODE

'// This can be ditched when setting wks to a worksheet based on tab name or if //
'// simply using the sheet's codename. //
If Not TypeName(ActiveSheet) = "Worksheet" Then
Exit Function
ElseIf Not ActiveSheet.Type = xlWorksheet Then ' &HFFFFEFB9
Exit Function
End If

Set wks = ActiveSheet
With wks
'// Played with code long enough to handle empty cells in rngData, did not test //
'// against empty cells in rngAnomalies. //
Set rngData = .Range(.Cells(STARTING_ROW, Data_Col), _
.Cells(.Rows.Count, Data_Col).End(xlUp))
Set rngAnomalies = .Range(.Cells(STARTING_ROW, Anomalies_Col), _
.Cells(.Rows.Count, Anomalies_Col).End(xlUp))
End With

'// Plunk the ranges' vals into arrays. Add a "Column" to aryData to hold a flag //
'// and size an array for output in the Advise Column (in this case Column C) //
aryData = rngData.Value
aryAnomalies = rngAnomalies.Value
ReDim Preserve aryData(1 To UBound(aryData, 1), 1 To 2)
ReDim aryAdvise(1 To UBound(aryData, 1), 1 To 1)

'// Trim Data_Col, (in the case the Data_Col Constant is set to 1 for Column A)
'// and set a flag indicating that the value has not been changed. // //
For lData = 1 To UBound(aryData, 1)
aryData(lData, 1) = Trim(aryData(lData, 1))
aryData(lData, 2) = False
Next

Set REX = CreateObject("VBScript.RegExp")

'// For ea cell in amonalies... //
For lAnamolies = 1 To UBound(aryAnomalies, 1)
'// May need a better test, hopefully no cells contain a single comma. //
If InStr(1, aryAnomalies(lAnamolies, 1), ",") >= 1 Then
'// Split on commas, grab first element for good value, and loop through //
'// goofy values, building a pattern. Tacked in IF test after finding error//
'// in returns due to trailing comma. (like: "LPI, L.P.I,") //
SplitVals = Split(aryAnomalies(lAnamolies, 1), ",")
ReplaceWith = Trim(SplitVals(0))
LookForPattern = vbNullString
For i = LBound(SplitVals) + 1 To UBound(SplitVals)
If Not Trim(SplitVals(i)) = vbNullString Then
LookForPattern = LookForPattern & "^" & Trim(SplitVals(i)) & "$|"
End If
Next

'// After fix above for trailing commas, this MAY no longer be necessary. //
'// I can see no harm, so left in. //
With REX
.Global = True
.IgnoreCase = True
.Pattern = "\|{2,}"
If .Test(LookForPattern) Then
LookForPattern = .Replace(LookForPattern, "|")
End If
End With

'// Strip final Alternate character (vertical bar) for the RegExp.Pattern //
If Right(LookForPattern, 1) = "|" Then
LookForPattern = Left(LookForPattern, Len(LookForPattern) - 1)
End If

With REX
.Global = False
.IgnoreCase = True
'// Update .Pattern for each anomally list //
.Pattern = LookForPattern

'// Loop through first "Column" of aryData... //
For lData = 1 To UBound(aryData, 1)
''For lData = 1 To UBound(aryData, 1)
'// IF the element (cell value in the Data_Col,(in this case Col A)) is found in ANY of //
'// the alternations, AND, the cell val is NOT = to ReplaceWith (to take//
'// care of unnecessary replacing), AND the element has not already //
'// been changed, THEN plunk the advisement in the correct "row" of //
'// aryAdvise, correct the element in aryData and flip the flag. //
If .Test(aryData(lData, 1)) _
And Not aryData(lData, 1) = ReplaceWith _
And Not aryData(lData, 2) Then
'Example: Todays Date, U.S. FASTENER, changed from: US FASTENERS
aryAdvise(lData, 1) = Format(Expression:=Date, _
Format:="dd mmm yyyy") & _
", " & _
ReplaceWith & _
", changed from: " & _
aryData(lData, 1)

aryData(lData, 1) = .Replace(aryData(lData, 1), ReplaceWith)
aryData(lData, 2) = True
End If
Next
End With
End If
Next

'// The second column of aryData "falls off"
rngData.Value = aryData
'AdviseCol(in this case 3) - Data_Col(in this case 1) determines the offset to be 2
rngData.Offset(, AdviseCol - Data_Col).Value = aryAdvise

'----TEMP CODE
Dim TmpString As String
TmpString = "RegExp (late-bound) took: " & _
FormatNumber(Timer - HACK, 3, vbTrue, vbTrue, vbFalse) & " seconds."
Debug.Print TmpString
MsgBox TmpString
'----END TEMP CODE
End Function

Thanks again Mark :friends: - Your comprehensive work will save me countless hours, as well as headaches...

[]

GTO
12-23-2011, 04:30 PM
Hi Frank,

Glad that seems to be working :-)

Mark