Introduction
This article describes a simple implementation of the fuzzy (string) search. It can be used for approximate string matching (for more information, see http://en.wikipedia.org/wiki/Fuzzy_string_searching).
Other algorithms for approximate string searching exist (e.g., Soundex), but those aren't as easy to implement. The algorithm in this article is easy to implement, and can be used for tasks where approximate string searching is used in an easy way.
A List<string>
is used for searching, and therefore it's quite easy to search a database.
Background
The algorithm used the Levenshtein-distance for determining how exact a string from a word list matches the word to be found. Information about the Levenshtein-distance can be found at http://en.wikipedia.org/wiki/Levenshtein_distance.
Using the code
The following example will show how simply the class can be used.
static void Main(string[] args)
{
string word = "Code Project";
List<string> wordList = new List<string>
{
"Code Project",
"Code project",
"codeproject",
"Code Projekt",
"Kode Project",
"Other Project"
};
List<string> foundWords = FuzzySearch.Search(
word,
wordList,
0.70);
foundWords.ForEach(i => Console.WriteLine(i));
Console.ReadKey();
}
Output:
Code Project
Code project
codeproject
Code Projekt
Kode Project
Implementation
A basic approach is shown. Instead of the Levenshtein-distance, a more optimized algorithm could be used - but here, a quite simple implementation is given for clarity reasons.
Levenshtein-distance
For computing the Levenshtein-distance, I use the following algorithm:
public static int LevenshteinDistance(string src, string dest)
{
int[,] d = new int[src.Length + 1, dest.Length + 1];
int i, j, cost;
char[] str1 = src.ToCharArray();
char[] str2 = dest.ToCharArray();
for (i = 0; i <= str1.Length; i++)
{
d[i, 0] = i;
}
for (j = 0; j <= str2.Length; j++)
{
d[0, j] = j;
}
for (i = 1; i <= str1.Length; i++)
{
for (j = 1; j <= str2.Length; j++)
{
if (str1[i - 1] == str2[j - 1])
cost = 0;
else
cost = 1;
d[i, j] =
Math.Min(
d[i - 1, j] + 1,
Math.Min(
d[i, j - 1] + 1,
d[i - 1, j - 1] + cost));
if ((i > 1) && (j > 1) && (str1[i - 1] ==
str2[j - 2]) && (str1[i - 2] == str2[j - 1]))
{
d[i, j] = Math.Min(d[i, j], d[i - 2, j - 2] + cost);
}
}
}
return d[str1.Length, str2.Length];
}
The Searching
In the search process, for each word in the wordlist, the Levenshtein-distance is computed, and with this distance, a score. This score represents how good the strings match. The input argument fuzzyness
determines how much the strings can differ.
public static List<string> Search(
string word,
List<string> wordList,
double fuzzyness)
{
List<string> foundWords = new List<string>();
foreach (string s in wordList)
{
int levenshteinDistance =
LevenshteinDistance(word, s);
int length = Math.Max(word.Length, s.Length);
double score = 1.0 - (double)levenshteinDistance / length;
if (score > fuzzyness)
foundWords.Add(s);
}
return foundWords;
}
LINQ-variant
This piece of code could be written in LINQ too.
public static List<string> Search(
string word,
List<string> wordList,
double fuzzyness)
{
List<string> foundWords =
(
from s in wordList
let levenshteinDistance = LevenshteinDistance(word, s)
let length = Math.Max(s.Length, word.Length)
let score = 1.0 - (double)levenshteinDistance / length
where score > fuzzyness
select s
).ToList();
return foundWords;
}
History
- 2009 June 1st: Initial release.