PDA

View Full Version : How to search for phonetic similarities in Addresses



psionic
09-30-2007, 09:20 PM
G'Day,

This would be my first post to the forum.

I am having some trouble working out what would be the best solution to implement for VBA Excel version 2002. :dunno

I receive a dynamic list of addresses from a database but found some manual typing errors in the address (some people are SO lazy not to fix them! :banghead: ) Thus I use the Excel to anaylse and priortise what needs fixing.

I wanted to produce a report that either lists on a number of possible matches made.

The trick is, they are phonetic repeat PHONETIC! :bug: in nature or typing according to what it sounds like.

Let me give you an example so you know what I am getting at. :yes

There are a list of fictional address for people in the spreadsheet in the suburb town of blue.


A B C
1 Name Street Suburb
2 Red Smith Street Blue
3 Black Hop Drive Blue
4 Grey Smi Street Blue
3 Pink Johnson Cresent Blue
5 Navy Smih Street Blue

From the list mentioned, how I can search the address column in this case B for any occurance of Smith Street (mind you I don't know what street I need to find to start with) and the output produces a match of Rows 2, 4, 5 because some letters are missing but the name is correct as Smith.

I welcome and am open to any discussion on this.

Thanks and warm regards,
Psionic

herzberg
09-30-2007, 10:04 PM
I think it's gonna be real tough, given the near countless ways a word can be misspelt. Perhaps you can look at arresting the problem at the source, i.e. during data input and ensuring that the entries are correct? Like they always say, "garbage in, garbage out". If the input is accurate, then you can skip this check entirely.

Of course, I know my suggestion goes somewhat in a circular logic, i.e. you will need a way to verify the input and to do so, would require some form of automation in checking again. Quite frankly, if I'm given this task, my proposed solution will be to hire a few interns and get them to manually eyeball through each entry!:devil2:

That said, I'm, too, interested on other's take on this problem.

psionic
10-04-2007, 10:26 PM
I think it's gonna be real tough, given the near countless ways a word can be misspelt. Perhaps you can look at arresting the problem at the source, i.e. during data input and ensuring that the entries are correct? Like they always say, "garbage in, garbage out". If the input is accurate, then you can skip this check entirely.

Of course, I know my suggestion goes somewhat in a circular logic, i.e. you will need a way to verify the input and to do so, would require some form of automation in checking again. Quite frankly, if I'm given this task, my proposed solution will be to hire a few interns and get them to manually eyeball through each entry!:devil2:

That said, I'm, too, interested on other's take on this problem.

I ought to provide some feedback so far. :mkay I noticed there are alot of questions coming into this forum and my questions are sort of pushed down the queue.

Doesn't this frustrates alot of people who might be waiting on an answer??!?! Hint: me! :eek:

Anyway, thanks for the programming debate on the who does first, the chicken or the egg or just pure logic that makes Mr Spook wonder too much (a credit to Science Fiction fans!)

Is there a way to ask the experts to have a go at it?

Charlize
10-05-2007, 05:10 AM
Don't claim to be an expert, but you could give this a try to start with. It will put a checkmark in column D for similar street names. Tried it with Smith street, smit street, smithe street, smite street ...Sub test_it()
Dim cell As Range
Dim rng As Range
Dim vSearch As String
vSearch = Application.InputBox("Give streetname (or part) to search for.", _
"Lookup similar streetnames ...", Type:=2)
Set rng = Range("B2:B" & Range("B" & Rows.Count).End(xlUp).Row)
For Each cell In rng
If SoundEx(cell.Value, Len(cell.Value), 1) = _
SoundEx(vSearch, Len(vSearch), 1) Then
cell.Offset(, 2).Font.Name = "Marlett"
cell.Offset(, 2).Value = "a"
End If
Next cell
End Sub

'
' v 1.0d TESTED-OK 20061009
' -----------------------
'
' The following SoundEx function is:
'
' (C) Copyright 2002 - 2006, Creativyst, Inc.
' ALL RIGHTS RESERVED
'
' For more information go to:
' http://www.Creativyst.com
' or email:
' Support@Creativyst.com
'
' Redistribution and use in source and binary
' forms, with or without modification, are
' permitted provided that the following conditions
' are met:
'
' 1. Redistributions of source code must
' retain the above copyright notice, this
' list of conditions and the following
' disclaimer.
'
' 2. Redistributions in binary form must
' reproduce the above copyright notice,
' this list of conditions and the
' following disclaimer in the
' documentation and/or other materials
' provided with the distribution.
'
' 3. All advertising materials mentioning
' features or use of this software must
' display the following acknowledgement:
' This product includes software developed
' by Creativyst, Inc.
'
' 4. The name of Creativyst, Inc. may not be
' used to endorse or promote products
' derived from this software without
' specific prior written permission.
'
' THIS SOFTWARE IS PROVIDED BY CREATIVYST CORPORATION
' ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES,
' INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
' WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
' PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL
' THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
' INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
' DAMAGES (INCLUDING, BUT NOT LIMITED TO,
' PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS
' OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
' HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
' WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
' (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY
' WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
' ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
'
'
'
Function SoundEx( _
ByVal WordString As String, _
ByVal LengthOption As Integer, _
ByVal CensusOption As Integer _
)
Dim WordStr As String
Dim b, b2, b3, SoundExLen, FirstLetter As String
Dim i As Integer
' Sanity
'
If (CensusOption > 0) Then
LengthOption = 4
End If
If (LengthOption > 0) Then
SoundExLen = LengthOption
End If
If (SoundExLen > 10) Then
SoundExLen = 10
End If
If (SoundExLen < 4) Then
SoundExLen = 4
End If
If (Len(WordString) < 1) Then
Exit Function
End If
' Copy to WordStr
' and UpperCase
'
WordStr = UCase(WordString)
' Convert all non-alpha
' chars to spaces. (thanks John)
'
For i = 1 To Len(WordStr)
b = Mid(WordStr, i, 1)
If (Not (b Like "[A-Z]")) Then
WordStr = Replace(WordStr, b, " ")
End If
Next i
' Remove leading and
' trailing spaces
'
WordStr = Trim(WordStr)
' sanity
'
If (Len(WordStr) < 1) Then
Exit Function
End If
' Perform our own multi-letter
' improvements
'
' double letters will be effectively
' removed in a later step.
'
If (CensusOption < 1) Then
b = Mid(WordStr, 1, 1)
b2 = Mid(WordStr, 2, 1)
If (b = "P" And b2 = "S") Then
WordStr = Replace(WordStr, "PS", "S", 1, 1)
End If
If (b = "P" And b2 = "F") Then
WordStr = Replace(WordStr, "PF", "F", 1, 1)
End If
WordStr = Replace(WordStr, "DG", "_G")
WordStr = Replace(WordStr, "GH", "_H")
WordStr = Replace(WordStr, "KN", "_N")
WordStr = Replace(WordStr, "GN", "_N")
WordStr = Replace(WordStr, "MB", "M_")
WordStr = Replace(WordStr, "PH", "F_")
WordStr = Replace(WordStr, "TCH", "_CH")
WordStr = Replace(WordStr, "MPS", "M_S")
WordStr = Replace(WordStr, "MPT", "M_T")
WordStr = Replace(WordStr, "MPZ", "M_Z")
End If
' end if(Not CensusOption)
'
' Sqeeze out the extra _ letters
' from above (not strictly needed
' in VB but used in C code)
'
WordStr = Replace(WordStr, "_", "")
' This must be done AFTER our
' multi-letter replacements
' since they could change
' the first letter
'
FirstLetter = Mid(WordStr, 1, 1)
' in case first letter is
' an h, a w ...
' we'll change it to something
' that doesn't match anything
'
If (FirstLetter = "H" Or FirstLetter = "W") Then
b = Mid(WordStr, 2)
WordStr = "-" + b
End If
' In properly done census
' SoundEx, the H and W will
' be squezed out before
' performing the test
' for adjacent digits
' (this differs from how
' 'real' vowels are handled)
'
If (CensusOption = 1) Then
WordStr = Replace(WordStr, "H", ".")
WordStr = Replace(WordStr, "W", ".")
End If
' Perform classic SoundEx
' replacements
' Here, we use ';' instead of zero '0'
' because of MS strangeness with leading
' zeros in some applications.
'
WordStr = Replace(WordStr, "A", ";")
WordStr = Replace(WordStr, "E", ";")
WordStr = Replace(WordStr, "I", ";")
WordStr = Replace(WordStr, "O", ";")
WordStr = Replace(WordStr, "U", ";")
WordStr = Replace(WordStr, "Y", ";")
WordStr = Replace(WordStr, "H", ";")
WordStr = Replace(WordStr, "W", ";")
WordStr = Replace(WordStr, "B", "1")
WordStr = Replace(WordStr, "P", "1")
WordStr = Replace(WordStr, "F", "1")
WordStr = Replace(WordStr, "V", "1")
WordStr = Replace(WordStr, "C", "2")
WordStr = Replace(WordStr, "S", "2")
WordStr = Replace(WordStr, "G", "2")
WordStr = Replace(WordStr, "J", "2")
WordStr = Replace(WordStr, "K", "2")
WordStr = Replace(WordStr, "Q", "2")
WordStr = Replace(WordStr, "X", "2")
WordStr = Replace(WordStr, "Z", "2")
WordStr = Replace(WordStr, "D", "3")
WordStr = Replace(WordStr, "T", "3")
WordStr = Replace(WordStr, "L", "4")
WordStr = Replace(WordStr, "M", "5")
WordStr = Replace(WordStr, "N", "5")
WordStr = Replace(WordStr, "R", "6")
'
' End Clasic SoundEx replacements
'
' In properly done census
' SoundEx, the H and W will
' be squezed out before
' performing the test
' for adjacent digits
' (this differs from how
' 'real' vowels are handled)
'
If (CensusOption = 1) Then
WordStr = Replace(WordStr, ".", "")
End If
' squeeze out extra equal adjacent digits
' (don't include first letter)
'
b = ""
b2 = ""
' remove from v1.0c djr: b3 = Mid(WordStr, 1, 1)
b3 = ""
For i = 1 To Len(WordStr) ' i=1 (not 2) in v1.0c
b = Mid(WordStr, i, 1)
b2 = Mid(WordStr, (i + 1), 1)
If (Not (b = b2)) Then
b3 = b3 + b
End If
Next i
WordStr = b3
If (Len(WordStr) < 1) Then
Exit Function
End If
' squeeze out spaces and zeros (;)
' Leave the first letter code
' to be replaced below.
' (In case it made a zero)
'
WordStr = Replace(WordStr, " ", "")
b = Mid(WordStr, 1, 1)
WordStr = Replace(WordStr, ";", "")
If (b = ";") Then ' only if it got removed above
WordStr = b + WordStr
End If
' Right pad with zero characters
'
b = String(SoundExLen, "0")
WordStr = WordStr + b
' Replace first digit with
' first letter
'
WordStr = Mid(WordStr, 2)
WordStr = FirstLetter + WordStr
' Size to taste
'
WordStr = Mid(WordStr, 1, SoundExLen)
' Copy WordStr to SoundEx
'
SoundEx = WordStr
End Function

Charlize
10-08-2007, 06:59 AM
? Are you still waiting for an answer or a different approach ...

I'm interested to know if the answer that I gave you, brings you closer to a solution.

psionic
10-09-2007, 06:13 PM
? Are you still waiting for an answer or a different approach ...

I'm interested to know if the answer that I gave you, brings you closer to a solution.

Hi Charlize,

Many thanks for your powerful message in VBA code. :bow:

I had a registered day off yesterday hence the delay in replying.

I am in the office so to speak which I will check out the solution and will let you know either at the end of the day or tomorrow on how it went.

Watch this space and thanks for your help to date.

Psionic.

psionic
10-10-2007, 06:38 PM
:buttkick: Ok, I had a go with the code and it's not bad at all!

I love the way you use the tick mark so I used the Auto Filter on the ones I need to correct them.

It worked for some searches as long it is short and only at the begining.
My testing environment has very long text. Average length is approx 46 characters long. I recognised that the success of the search will find the first 4 characters to the length of up to 10.

The more I worked on this I realised may need to adjust my specifications.

Charlize, :nooo:

Is there a option or a way to use from your code to use a normal Excel search for full text search as this would help me find proper strings in the middle of long texts or the provided search similar routine (currently not very good with long text).

I realised the search similar option doesn't search for text in the middle of a long string. Only at the beginning from 4 to 10 characters. :creator:

Thus from the example above I created a long text

1 Length=52 114 Red Smith Street Blue Centre at Clinic BLUE 2345
2 Length=40 2/50 Black Hop Drive Blue Med. BLUE 2345
3 Length=52 5423 Pink Johnson Cresent Blue Homer Drive BLUE 2345
4 Length=33 23 Grey Smi Street Blue BLUE 2345


So as an example, I can run the new code that does a normal search for string "Med." that results in line 2 ticked and then run the search as search similar on string "Smith" I ought to receive ticks on line 1 and 4.

As an afterthought, unless you might suggest to me to break down the text into numbers, Addresses, Suburb and then do a similar search on the addresses for better success, I would like to be able to find the similar search in the middle of the long text.

May the force be with you Charlize! :)

Psionic