Abstract
Simple information searches -- name lookups, word searches, etc. -- are often implemented in terms of an exact match criterion. However, given both the diversity of homophonic (pronounced the same) words and names, as well as the propensity for humans to misspell surnames, this simplistic criterion often yields less than desirable results: reduced result sets that miss records differing only by a misplaced letter or a different national spelling.
This article series discusses Lawrence Phillips' Double Metaphone phonetic matching algorithm, and provides several useful implementations, which can be employed in a variety of solutions to create more useful, effective searches of proper names in databases and other collections.
Introduction
This article series discusses the practical use of the Double Metaphone algorithm to phonetically search name data, using the author's implementations written for C++, COM (Visual Basic, etc.), scripting clients (VBScript, JScript, ASP), SQL, and .NET (C#, VB.NET, and any other .NET language). For a discussion of the Double Metaphone algorithm itself, and Phillips' original code, see Phillips' article in the June 2000 CUJ, available here.
Part I introduces Double Metaphone and describes the author's C++ implementation and its use. Part II discusses the use of the author's COM implementation from within Visual Basic. Part III demonstrates use of the COM implementation from ASP and with VBScript. Part IV shows how to perform phonetic matching within SQL Server using the author's extended stored procedure. Part V demonstrates the author's .NET implementation. Finally, Part VI closes with a survey of phonetic matching alternatives, and pointers to other resources.
Background
Part I of this article series discussed the Double Metaphone algorithm, its origin and use, and the author's C++ implementation. While this section summarizes the key information from that article, readers are encouraged to review the entire article, even if the reader has no C++ experience.
The Double Metaphone algorithm, developed by Lawrence Phillips and published in the June 2000 issue of C/C++ Users Journal, is part of a class of algorithms known as "phonetic matching" or "phonetic encoding" algorithms. These algorithms attempt to detect phonetic ("sounds-like") relationships between words. For example, a phonetic matching algorithm should detect a strong phonetic relationship between "Nelson" and "Nilsen", and no phonetic relationship between "Adam" and "Nelson."
Double Metaphone works by producing one or possibly two phonetic keys, given a word. These keys represent the "sound" of the word. A typical Double Metaphone key is four characters long, as this tends to produce the ideal balance between specificity and generality of results.
The first, or primary, Double Metaphone key represents the American pronunciation of the source word. All words have a primary Double Metaphone key.
The second, or alternate, Double Metaphone key represents an alternate, national pronunciation. For example, many Polish surnames are "Americanized", yielding two possible pronunciations, the original Polish, and the American. For this reason, Double Metaphone computes alternate keys for some words. Note that the vast majority (very roughly, 90%) of words will not yield an alternate key, but when an alternate is computed, it can be pivotal in matching the word.
To compare two words for phonetic similarity, one computes their respective Double Metaphone keys, and then compares each combination:
- Word 1 Primary - Word 2 Primary
- Word 1 Primary - Word 2 Alternate
- Word 1 Alternate - Word 2 Primary
- Word 1 Alternate - Word 2 Alternate
If a word has no alternate key, the comparisons involving that key are simply not performed.
Depending upon which of the above comparisons matches, a match strength is computed. If the first comparison matches, the two words have a strong phonetic similarity. If the second or third comparison matches, the two words have a medium phonetic similarity. If the fourth comparison matches, the two words have a minimal phonetic similarity. Depending upon the particular application requirements, one or more match levels may be excluded from match results.
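The comparison rules above can be sketched in a few lines of C#. This is a minimal illustration, not the author's code: the MatchStrength enum and PhoneticMatch class are hypothetical names, the keys are passed in as plain strings, and a null or empty alternate key is assumed to mean that no alternate was produced.

```csharp
using System;

// Hypothetical names: MatchStrength and PhoneticMatch are not part of the
// author's library; they only illustrate the four comparisons listed above.
public enum MatchStrength { None, Minimal, Medium, Strong }

public static class PhoneticMatch
{
    // Assumption: a null or empty alternate key means "no alternate key".
    public static MatchStrength Compare(string primary1, string alternate1,
                                        string primary2, string alternate2)
    {
        // Without two primary keys there is nothing to compare.
        if (String.IsNullOrEmpty(primary1) || String.IsNullOrEmpty(primary2))
            return MatchStrength.None;

        bool hasAlt1 = !String.IsNullOrEmpty(alternate1);
        bool hasAlt2 = !String.IsNullOrEmpty(alternate2);

        // Primary vs. primary: strong match.
        if (primary1 == primary2)
            return MatchStrength.Strong;

        // Primary vs. alternate, in either direction: medium match.
        if (hasAlt2 && primary1 == alternate2)
            return MatchStrength.Medium;
        if (hasAlt1 && alternate1 == primary2)
            return MatchStrength.Medium;

        // Alternate vs. alternate: minimal match.
        if (hasAlt1 && hasAlt2 && alternate1 == alternate2)
            return MatchStrength.Minimal;

        return MatchStrength.None;
    }
}
```

An application that wants only strong and medium matches would discard any pair for which Compare returns Minimal or None.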
.NET implementation
The .NET implementation of Double Metaphone is very similar in design and use to the C++ implementation presented in Part I. To use the .NET implementation, simply add the Metaphone.NET.dll assembly to your project's references in Visual Studio .NET, import the nullpointer.Metaphone namespace into the source files, and instantiate the DoubleMetaphone or ShortDoubleMetaphone class, for string or unsigned short Metaphone keys, respectively.
For example, to compute the Metaphone keys for the name "Nelson", code similar to that listed below may be used (C# code listed; the .NET implementation is callable from VB.NET, J#, and all other .NET languages):
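using System;
using nullpointer.Metaphone;
// Compute both keys for "Nelson" when the object is constructed
DoubleMetaphone mphone = new DoubleMetaphone("Nelson");
// Print the primary and alternate keys, separated by a space
Console.WriteLine("{0} {1}", mphone.PrimaryKey, mphone.AlternateKey);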
Note that the Metaphone keys are obtained via the PrimaryKey and AlternateKey properties.
As with the C++ implementation, an existing instance of the DoubleMetaphone or ShortDoubleMetaphone class can be reused to compute the Metaphone keys for a new word by calling the computeKeys method:
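using System;
using nullpointer.Metaphone;
// Create an empty instance, then compute keys for a word on demand
DoubleMetaphone mphone = new DoubleMetaphone();
mphone.computeKeys("Nelson");
// Print the primary and alternate keys, separated by a space
Console.WriteLine("{0} {1}", mphone.PrimaryKey, mphone.AlternateKey);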
As with all of the implementations presented in this article series, a sample application written in C#, CS Word Lookup, is presented to demonstrate the use of the .NET implementation. CS Word Lookup uses a Hashtable collection to map each Metaphone phonetic key to an ArrayList containing the words that produce that key.
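The key-to-word-list mapping described above might look roughly like the following. This is a sketch under stated assumptions, not the CS Word Lookup source: PhoneticIndex and KeyFunction are hypothetical names, and the phonetic keys are supplied by a delegate so that the example stands alone without the Metaphone assembly.

```csharp
using System;
using System.Collections;

// Hypothetical delegate: returns the phonetic keys for a word (primary first,
// then the alternate, if any). Not part of the author's library.
public delegate string[] KeyFunction(string word);

// Hypothetical class illustrating a Hashtable of ArrayLists keyed by
// phonetic key; this is not the actual CS Word Lookup code.
public class PhoneticIndex
{
    private readonly Hashtable index = new Hashtable();
    private readonly KeyFunction keysFor;

    public PhoneticIndex(KeyFunction keysFor)
    {
        this.keysFor = keysFor;
    }

    // File the word under each of its non-empty phonetic keys.
    public void Add(string word)
    {
        foreach (string key in keysFor(word))
        {
            if (String.IsNullOrEmpty(key)) continue;
            ArrayList words = (ArrayList)index[key];
            if (words == null)
            {
                words = new ArrayList();
                index[key] = words;
            }
            if (!words.Contains(word)) words.Add(word);
        }
    }

    // Return every indexed word that shares at least one key with the query.
    public ArrayList Lookup(string word)
    {
        ArrayList results = new ArrayList();
        foreach (string key in keysFor(word))
        {
            if (String.IsNullOrEmpty(key)) continue;
            ArrayList words = (ArrayList)index[key];
            if (words == null) continue;
            foreach (string candidate in words)
            {
                if (!results.Contains(candidate)) results.Add(candidate);
            }
        }
        return results;
    }
}
```

In real use, the delegate would presumably wrap a DoubleMetaphone instance and return its PrimaryKey and AlternateKey values. Filing each word under both keys means a single Hashtable probe per query key finds strong, medium, and minimal matches alike; ranking them is left to the caller.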
Performance notes
While the .NET CLR performs reasonably well, the C++ implementation of Double Metaphone will likely be significantly faster than the .NET version. The C++ version judiciously avoids memory allocation and buffer copies, while the .NET implementation is unable to avoid them. The ambitious reader is encouraged to optimize the .NET implementation, perhaps by using the unsafe keyword to perform direct memory access, at the expense of verifiable, type-safe code.
Conclusion
This brief article introduced the author's .NET implementation of Double Metaphone, including code snippets and a brief discussion of performance issues. Continue to Part VI for a review of alternative phonetic matching techniques, and a list of phonetic matching resources, including links to other Double Metaphone implementations.
History
- 7-22-03 Initial publication
- 7-31-03 Added hyperlinks between articles in the series
Article Series