Introduction
A while ago, we wrote an application for a client who was running a competition that included a tie breaker question. The tie breaker had to be unique so there was a need to deduplicate the entries to remove multiple matches (due to a quaint little law where if you awarded a prize to one person based on their tie breaker and somebody else could prove they had entered with the same tie breaker, then you had to award them the prize as well; providing they satisfied all other prize winning criteria). The remaining entries would be judged by a panel to see which ones were the best.
As you can imagine, the fact that the tie breakers were input manually made this potentially fraught due to the possibility of data capture miskeys or incorrect spelling by the entrant. To get round this, the client requested a naive comparer which used some simple rules to identify potential duplicate entries. This article is a version of that comparer.
The Basics
The first thing you need to do is load in the string
you want to compare against and set up whether or not you want to strip HTML tags from the text when performing comparisons. This is a little extension that I added to the original design so that it could be used to detect whether or not an article here on CodeProject was just a copy of the base template. In the original, HTML tags weren't removed.
The basic algorithm is to remove HTML tags using the following regular expression:
</?\w+((\s+\w+(\s*=\s*(?:"".*?""|'.*?'|[^'"">\s]+))?)+\s*|\s*)/?>
This expression simply grabs all text inside angle brackets, and the application replaces them with an empty string using a Regex.Replace
.
Using a similar technique, all vowels, whitespace and special characters are removed from the text. The text can then be compared.
The method to actually remove the "offending" tags is shown here:
private string BreakdownText(string source)
{
if (string.IsNullOrEmpty(source))
throw new ArgumentNullException("source");
if (_tagType == StripTagType.StripTags)
{
source = ReplaceCharacters(source, _tagRemover, string.Empty);
}
source = ReplaceCharacters(source, _chars, string.Empty);
return ReplaceCharacters(source, _vowels, string.Empty).ToLowerInvariant();
}
private string ReplaceCharacters(string source, string regexText, string replaceText)
{
Regex regex = new Regex(regexText,
RegexOptions.IgnoreCase | RegexOptions.Multiline |
RegexOptions.IgnorePatternWhitespace);
return regex.Replace(source, replaceText);
}
The order in which items are removed is important. You want to strip the HTML tags out before you remove the none-alphabetic and vowel characters. By the way, the _vowels
and _char
regular expressions are defined as [eiouy]
and [^b-z]
respectively.
If the two sanitized pieces of text you want to compare match, then you have a duplicate. However, as we have said, there's always the possibility of there being a misspelling so the comparer goes further and looks through the two pieces of text to identify characters that don't match. If the number of characters falls outside a particular threshold, then we know that the two pieces of text are sufficiently different to count them as unique.
private bool Compare(string source, string target)
{
int targetLength = target.Length;
StringBuilder unmatched = new StringBuilder();
while (target.Length > 0 && source.Length > 0 && unmatched.Length <= _diffCharacters)
{
if (target[0] == source[0])
{
target = target.Remove(0, 1);
source = source.Remove(0, 1);
}
else
{
int index = source.IndexOf(target[0]);
if (index > 0)
{
unmatched.Append(source.Substring(0, index));
source = source.Remove(0, index);
}
}
}
return unmatched.Length <= _diffCharacters;
}
And that's it - a naive comparer. To run it, all you need to do is this:
static void Main(string[] args)
{
string s = "<H1>This is a sample.</H1>";
NaiveComparer c = new NaiveComparer(s, StripTagType.StripTags, 10);
if (c.Compare("Hello" + s + "aeiou123"))
{
Console.WriteLine("It matches");
}
if (!c.Compare("Hellot there from this completely arbitrary " +
"and totally different way of matching <<<< <<<<<< " + s +
">>>>>>>>>>>>>>-----"))
Console.WriteLine("No match");
Console.ReadLine();
}
I hope that this is useful for you, on those occasions where you need to compare to see if items are sufficiently different to allow them through.
Edit History
- 07 May 08 Fixed bug found by Chris Maunder. Thanks Chris.