Introduction
Every computer scientist has heard of SoundEx. It is perhaps the most infamous text processing/searching algorithm around. SoundEx promises a great deal - that of matching words with similar sounding words, but actually delivers, at best, a large number of inaccurate matches. Even though SoundEx was patented, variations have arisen, whether through poor understanding of the algorithm or through attempts to improve its accuracy. Thus, this article presents four popular implementations of SoundEx written in C# and .Net to allow you to perform your own benchmarking on your own data sets.
SoundEx was originally invented to find alternative spellings for surnames. The problem came from mis-spellings on US census returns and therefore the SoundEx algorithms are biased towards English speaking names. There are further variants for some western European languages, but in general the algorithm has never been considered a panacea for all fuzzy text matching.
Most SoundEx algorithms are a variant on the following
- Retain the first letter of the name
- Drop all vowels, and h, y in all but the first letter
- Replace b,f,p,v with the number 1
- Replace c,g,j,k,q,s,x,z with the number 2
- Replace d and t with the number 3
- Replace l with 4, m and n with 5 and r with 6
- Finally convert to a 4 character code like {letter}{number}{number}{number} by truncation
Some algorithms have grouping and adjacent letter rules in addition.
Implementation Details
Several implementations of the same algorithm implies a strategy pattern, so a common base class named ISoundEx was created. The four implementations are:
- Miracode - The modified SoundEx from 1910
- Simplified - The original SoundEx from late 1800s
- KnuthEd2 - The SoundEx algorithm from The Art of Computer Programming
- SQLServer - SQL Server's variant on SoundEx
Each implements
GenerateSoundEx()
and
ValidateAlgorithm()
. A virtual member of ISoundEx is
EncodeChar()
which provides the basic character to number mapping specified by SoundEx.
protected virtual string EncodeChar(char c) {
switch(Char.ToLower(c)) {
case 'b':
case 'f':
case 'p':
case 'v':
return "1";
case 'c':
case 'g':
case 'j':
case 'k':
case 'q':
case 's':
case 'x':
case 'z':
return "2";
case 'd':
case 't':
return "3";
case 'l':
return "4";
case 'm':
case 'n':
return "5";
case 'r':
return "6";
default:
return string.Empty;
}
}
An aside into MSIL
How C# compiles switch
statements such as EncodeChar()
could be an article in itself. To demonstrate the optimizations it applies, look at the CIL disassembly below.
.method family hidebysig newslot virtual
instance string EncodeChar(char c) cil managed
{
.maxstack 2
.locals ([0] string CS$00000003$00000000,
[1] char CS$00000002$00000001)
IL_0000: ldarg.1
IL_0001: call char [mscorlib]System.Char::ToLower(char)
IL_0006: stloc.1
IL_0007: ldloc.1
IL_0008: ldc.i4.s 98
IL_000a: sub
IL_000b: switch (
IL_0076,
IL_007e,
IL_0086,
IL_00a6,
IL_0076,
IL_007e,
IL_00a6,
IL_00a6,
IL_007e,
IL_007e,
IL_008e,
IL_0096,
IL_0096,
IL_00a6,
IL_0076,
IL_007e,
IL_009e,
IL_007e,
IL_0086,
IL_00a6,
IL_0076,
IL_00a6,
IL_007e,
IL_00a6,
IL_007e)
IL_0074: br.s IL_00a6
IL_0076: ldstr "1"
IL_007b: stloc.0
IL_007c: br.s IL_00ae
IL_007e: ldstr "2"
IL_0083: stloc.0
IL_0084: br.s IL_00ae
IL_0086: ldstr "3"
IL_008b: stloc.0
IL_008c: br.s IL_00ae
IL_008e: ldstr "4"
IL_0093: stloc.0
IL_0094: br.s IL_00ae
IL_0096: ldstr "5"
IL_009b: stloc.0
IL_009c: br.s IL_00ae
IL_009e: ldstr "6"
IL_00a3: stloc.0
IL_00a4: br.s IL_00ae
IL_00a6: ldsfld string [mscorlib]System.String::Empty
IL_00ab: stloc.0
IL_00ac: br.s IL_00ae
IL_00ae: ldloc.0
IL_00af: ret
}
Basically, the C# compiler has re-ordered the characters in the switch statement so they are in ascending order according to the Unicode code points. It adds extra 'case' entries for missing letters as a sort of padding. The CIL switch statement is a lookup table. The value at the top of the stack when switch is executed is used as a zero-based index into the list of labels to jump to. This gives us a nicely optimized lookup table without us trying too hard!
Back to SoundEx
ISoundEx.ValidateAlgorithm()
provides a test of various names against the known SoundEx code for that name.
Dictionary Lookup
The code also provides a means to load a dictionary of words or surnames and perform a fast and efficient lookup of similar words. CustomDictionary
adds words to internal lookup tables with the following code:
public void AddWord(string s) {
if(_oAllWords.ContainsKey(s)==false) {
string zSoundEx=_oSoundExAlgorithm.GenerateSoundEx(s);
_oAllWords.Add(s, zSoundEx);
if(_oAllSoundEx.ContainsKey(zSoundEx)==false) {
_oAllSoundEx.Add(zSoundEx,new StringCollection());
}
((StringCollection)_oAllSoundEx[zSoundEx]).Add(s);
}
}
_oAllWords
is a StringDictionary containing a unique list of all the words in the dictionary with it's SoundEx code. _oAllSoundEx
is a HashTable containing a unique list of all the SoundEx values. Each entry in the HashTable contains a StringCollection of words with that SoundEx. Thus we have a two-way link between the SoundEx code and the words which have that SoundEx.
Two text files are provided. The first is the /usr/dict/words
database from all Unix systems. As this is a list of commonly used English language words and not surnames, it shows how inappropriate SoundEx is for these words. The second file is OldPend Surnames 2002-01.txt
which is a list of over 15,000 unique surnames from the Old Pendleton Database. This is a database of individuals whose families have roots in the Old Pendleton District of South Carolina and can be obtained from oldpendleton.homestead.com. This provides better SoundEx matches, but still shows how inaccurate the algorithm can be.
Conclusion
As can be seen from the above screenshot, the results vary. Names which in no-way can be considered related are returned while, conversely, names which should be related fail to be returned. Knuth's examples were 'Rogers' and 'Rodgers', 'Sinclair' and 'St. Clair' which fail.
Other similar sounding match algorithms have been invented in the 120 years following the introduction of SoundEx. Metaphone is one of these, which should be implementable easily within the code structure presented here.
Finally, if you intend on using SoundEx within a commercial application be aware of its pitfalls.