A Naive String Comparer

Pete O'Hanlon

4.63/5 (18 votes)

6 May 2008CPOL2 min read

222

A class to perform a naive comparison of two chunks of text to see if they look to be the same.

Download source - 4.35 KB

Introduction

A while ago, we wrote an application for a client who was running a competition that included a tie breaker question. The tie breaker had to be unique so there was a need to deduplicate the entries to remove multiple matches (due to a quaint little law where if you awarded a prize to one person based on their tie breaker and somebody else could prove they had entered with the same tie breaker, then you had to award them the prize as well; providing they satisfied all other prize winning criteria). The remaining entries would be judged by a panel to see which ones were the best.

As you can imagine, the fact that the tie breakers were input manually made this potentially fraught due to the possibility of data capture miskeys or incorrect spelling by the entrant. To get round this, the client requested a naive comparer which used some simple rules to identify potential duplicate entries. This article is a version of that comparer.

The Basics

The first thing you need to do is load in the string you want to compare against and set up whether or not you want to strip HTML tags from the text when performing comparisons. This is a little extension that I added to the original design so that it could be used to detect whether or not an article here on CodeProject was just a copy of the base template. In the original, HTML tags weren't removed.

The basic algorithm is to remove HTML tags using the following regular expression:

</?\w+((\s+\w+(\s*=\s*(?:"".*?""|'.*?'|[^'"">\s]+))?)+\s*|\s*)/?>

This expression simply grabs all text inside angle brackets, and the application replaces them with an empty string using a Regex.Replace.

Using a similar technique, all vowels, whitespace and special characters are removed from the text. The text can then be compared.

The method to actually remove the "offending" tags is shown here:

/// <summary>
/// Break the relevant text item down into a string 
/// that is stripped of all HTML tags, vowels and 
/// whitespace characters.
/// </summary>
/// <param name="source">The item that needs to be "cleansed" 
/// using the relevant regular expressions.</param>
/// <returns>The "cleansed" string.</returns>
/// <remarks>This implementation starts off by removing all HTML tags, 
/// before removing any vowels and
/// whitespace characters.</remarks>
private string BreakdownText(string source)
{
  if (string.IsNullOrEmpty(source))
    throw new ArgumentNullException("source");
  // First of all, strip the tags out of the source.
  if (_tagType == StripTagType.StripTags)
  {
    source = ReplaceCharacters(source, _tagRemover, string.Empty);
  }
  source = ReplaceCharacters(source, _chars, string.Empty);
  return ReplaceCharacters(source, _vowels, string.Empty).ToLowerInvariant();
}
/// <summary>
/// Method to replace one set of characters with another.
/// </summary>
/// <param name="source">The source text to replace the characters in.</param>
/// <param name="regexText">The regular expression defining the characters to replace.
/// </param>
/// <param name="replaceText"></param>
/// <returns></returns>
private string ReplaceCharacters(string source, string regexText, string replaceText)
{
  Regex regex = new Regex(regexText,
  RegexOptions.IgnoreCase | RegexOptions.Multiline | 
		RegexOptions.IgnorePatternWhitespace);
  return regex.Replace(source, replaceText);
}

The order in which items are removed is important. You want to strip the HTML tags out before you remove the none-alphabetic and vowel characters. By the way, the _vowels and _char regular expressions are defined as [eiouy] and [^b-z] respectively.

If the two sanitized pieces of text you want to compare match, then you have a duplicate. However, as we have said, there's always the possibility of there being a misspelling so the comparer goes further and looks through the two pieces of text to identify characters that don't match. If the number of characters falls outside a particular threshold, then we know that the two pieces of text are sufficiently different to count them as unique.

/// <summary>
/// This method calculates the number of unmatched characters 
/// in the source string and determines
/// whether or not it falls within the target threshold.
/// </summary>
/// <param name="source">The source to check.</param>
/// <param name="target">The target string to check against.</param>
/// <returns>True if the string matches (or matches to within the tolerance), 
/// false otherwise.</returns>
private bool Compare(string source, string target)
{
  int targetLength = target.Length; 
  StringBuilder unmatched = new StringBuilder();
  while (target.Length > 0 && source.Length > 0 && unmatched.Length <= _diffCharacters)
  {
    if (target[0] == source[0])
    {
      target = target.Remove(0, 1);
      source = source.Remove(0, 1);
    }
    else
    {
      // Now, look for the next instance of this letter in source.
      int index = source.IndexOf(target[0]);
      if (index > 0)
      {
        unmatched.Append(source.Substring(0, index));
        source = source.Remove(0, index);
      }
    }
  }
  return unmatched.Length <= _diffCharacters;
}

And that's it - a naive comparer. To run it, all you need to do is this:

static void Main(string[] args)
{
  string s = "<H1>This is a sample.</H1>";
  NaiveComparer c = new NaiveComparer(s, StripTagType.StripTags, 10);
  if (c.Compare("Hello" + s + "aeiou123"))
  {
    Console.WriteLine("It matches");
  }
  if (!c.Compare("Hellot there from this completely arbitrary " + 
    "and totally different way of matching <<<< <<<<<< " + s + 
    ">>>>>>>>>>>>>>-----"))
    Console.WriteLine("No match");
  Console.ReadLine();
}

I hope that this is useful for you, on those occasions where you need to compare to see if items are sufficiently different to allow them through.

Edit History

07 May 08 Fixed bug found by Chris Maunder. Thanks Chris.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)