PDA

View Full Version : Fuzzy matching scoring algorithm



Sorin Sion
10-25-2011, 01:48 PM
A common problem in geomarketing (and not only) is matching sets of addresses/names from various sources. A fuzzy matching algorithm aids in matching "dirty" data with some form of "standard" data, based on a similarity score. The length of the strings and of the compared lists greatly influences the matching speed, so you need fast algorithms to do the core job, that of scoring pairs of strings.
After trying several approaches I am now mildly content regarding the speed of the algorithm I developed but I am sure that there might be some other ways to tackle this problem and to further accelerate the matching process.
The algorithm in it’s current form computes the frequency of common characters between the two input strings and also the frequency of identical tuples (two-character sequences), weights them and builds a normalized score in the range of [0…1].
I released the source code under GNU Lesser GPL at http://code.google.com/p/fast-vba-fuzzy-scoring-algorithm/source/browse/trunk/Fuzzy1.
The project's main objective is to build the fastest possible similarity scoring algorithm and migrate it's logic in a DLL to be called in Excel/Access VBA modules.
Please visit the project’s page, check the code and, if interested, contribute in some way to it’s development.

Kind regards,

Sorin

Sorin Sion
11-02-2011, 09:21 AM
A week passed since I posted the code of the original fuzzy matching scoring function on Google Code, so I felt it was a good moment for an update on this development.
Using various techniques we were able cut the execution time almost in half (~44%), which I see as impressive!

The various partial effects are detailed in the logs posted at http://code.google.com/p/fast-vba-fuzzy-scoring-algorithm/source/browse/OutputLogs. Please use the AllTests sub included in the source and add your results. It will make for an interesting report on the performance of different processors and VBA platforms.

My thanks and profound appreciation goes to the following developers that contributed with code and ideas to this project:
* Felipe Costa Gualberto for the compiler directive that makes the timing code work with all VBA versions
* Doug Bliss and cHARLES_wILLIAMS for replacing Mid, Left, Ucase with their string-returning counterparts, Mid$, Left$, Ucase$
* Stuart McLachlan (PNG) for replacing the string filtering (cleansing) routine with an InStr version
* Malcolm Dixon for replacing the string filtering (cleansing) routine with an integer evaluating one
* Stuart McLachlan (PNG) and Doug Bliss again, for replacing the [s2 = Left(s2, p - 1) & "~" & Mid(s2, p + 1)] line with [Mid$(s2, p, 1) = "~"]

Special thanks for all of you that manifested their interest in this project. I hope you will find it instructive and useful.

What you'l find in the code posted at http://code.google.com/p/fast-vba-fuzzy-scoring-algorithm/source/browse/trunk/Fuzzy2:
- AllTests - the whole battery of tests
- TestFuzzy - tests various versions of Fuzzy function
- TestFilterVSWhole - tests the amount of time spent in filtering/cleansing input strings vs. the whole original function
- TestFilters - test various strategies of speeding the filtering code
- TestInStr - test the hypotesis that character order with InStr(1, "ABCD...789", x) searches is relevant
- TestInStrVSLike - tests the performance of InStr searches vs. using Like operator for filtering input strings
- Filter0..4 - various implementations of the filtering code
- Fuzzy0..6 and HotFuzz - various implementations of the complete algorithm, testing the different optimizing techniques

The last function, HotFuzz, embeds the best-performing techniques suggested by the developers during this week:
- using Mid$, Left$, Ucase$ instead of their variant output forms, Mid, Left, Ucase
- using the Mid$ statement for string replacement
- I (re)discovered the Like operator and I used it instead of the InStr searches in the filtering routine. It's not as straightforward and easy to customize as InStr but it definitely adds some speed.

I still owe a reply to Jim Cone (he suggested testing Microsoft's version of Fuzzy Lookup...
http://blogs.msdn.com/b/business_intelligence_labs/archive/2011/04/27/fuzzy-lookup-add-in-for-excel.aspx) and to Steve Bull (he sent me some data to test the algorithm's accuracy in matching lists of "clean" and "dirty" strings).
I'll come back to you guys as soon as I get the time to test.

Coding tips:
- use Mid$, Left$, Right$, Ucase$ instead of Mid, Left, Right, Ucase functions whenever possible
- learn to use efficiently the (often neglected) Like operator in VBA
- If using InStr(1, "ABCD...789", x) searches, optimize the character list/order based on character probability in your input strings
- use the Mid$ statement instead of other string replacement techniques whenever possible

I tried moving the HotFuzz code with Visual Studio Express 2010 into a DLL to be used by Excel and Access in VBA but I'm a total noob at this and I caught my ears in it so, please, if you are able to do this, upload a DLL with it to http://code.google.com/p/fast-vba-fuzzy-scoring-algorithm/downloads/entry . It will be very interesting to test the compiled vs interpreted performance.

Until the next update,
Kind regards,
Sorin