Click here to Skip to main content
65,938 articles
CodeProject is changing. Read more.
Articles
(untagged)

Boyer-Moore and related exact string matching algorithms

0.00/5 (No votes)
20 Jan 2006 1  
Implementations of Boyer-Moore and several related string matching algorithms in C# 2.0.

Introduction

This article presents a straightforward implementation of the Boyer-Moore string matching algorithm and a number of related algorithms.

Background

When searching for one string within another, .NET programmers normally reach for String.IndexOf. For the vast majority of applications, that's just fine. However, when searching very long strings, there are other algorithms that deliver more performance than the basic algorithm that .NET provides. This article gives simple implementations of several of those algorithms.

It's worth noting that String.IndexOf performs a much more complex comparison than these algorithms provide, since String.IndexOf performs a culture-sensitive comparison - this is a highly non-trivial matter when a case insensitive match is desired! If you need culture sensitive matching, stick to String.IndexOf.

What this article doesn't cover

This article does not attempt to teach you how to write high performance string matching algorithms, nor does it try to explain the algorithms that are presented. If you're looking for that kind of information, consult an algorithms book such as Knuth or Cormen, et al. An excellent online reference for string matching algorithms is the Handbook of Exact String Matching Algorithms.

Using the code

The BoyerMooreSearch class provides implementations of the original Boyer-Moore algorithm plus a number of related algorithms. An implementation that uses String.IndexOf is also provided for comparison.

The text to be found is passed to the BoyerMooreSearch constructor. Each searching algorithm is presented using a common IEnumerable<int> model.

namespace StringMatch
{
  public class BoyerMoore
  {
    /// <summary>

    /// Constructor

    /// </summary>

    /// <param name="pattern">Pattern for search</param>

    public BoyerMoore(string pattern) { ... }

    /// <summary>

    /// Return all matched of the pattern

    /// in the specified text using algorithm name

    /// </summary>

    /// <param name="text">text to be searched</param>

    /// <param name="startingIndex">Index at 

    ///       which search begins</param>

    /// <returns>IEnumerable which returns

    /// the indexes of pattern matches</returns>

    public IEnumerable<int> AlgorithmNameMatch(string text, 
                                 int startingIndex) { ... }

    /// <summary>

    /// Return all matched of the pattern in the

    /// specified text using algorithm name

    /// </summary>

    /// <param name="text">text to be searched</param>

    /// <returns>IEnumerable which returns the indexes of pattern matches</returns>

    public IEnumerable<int> AlgorithmNameMatch(string text) { ... }
  }
}

Each algorithm returns an IEnumerable<int> that yields all of the indexes where the pattern (passed to the constructor) is found in the text (passed to the algorithm). This design allows the pre-computation that all of the Boyer-Moore related algorithms make use of to be shared across any number of searches for a specific pattern.

BoyerMoore bm = new BoyerMoore("abcd");
foreach (int index in bm.BoyerMooreMatch("abcdsflkjwertabc" + 
         "dfglkjdfgcdjweertopabjsdvacbwer" + 
         "kjhsdfkljhacbsdflkjsdflkjabcd"))
  Console.WriteLine("Matched at index {0}", index);

This example will list matches at 0, 13, and 72.

The Algorithms

The BoyerMoore class implements the following algorithms:

  • BCLMatch - search using String.IndexOf.
  • HorspoolMatch - search using the Horspool algorithm. This is functionally identical to the "Quick Search" algorithm and is presented in some algorithms books as being the Boyer-Moore algorithm. Due to its simplicity, this is often the highest performing of the Boyer-Moore-like algorithms even though others have better theoretical performance.
  • BoyerMooreMatch - search using the Boyer-Moore algorithm.
  • TurboBoyerMooreMatch - search using the Turbo Boyer-Moore algorithm.
  • ApostolicoGiancarloMatch - search using the Apostolico-Giancarlo algorithm. Although this algorithm has the best theoretical performance (when considering only character comparisons, as is traditional when evaluating matching algorithms), it is often the slowest in practice due to the complexity of the code.

The demo program

The supplied demo program is a simple console application that searches a specified file for all occurrences of a string, reporting the time taken by each algorithm. Running the demo application on my machine (3Ghz P4) searching a large text file for a frequently occurring string, yielded the following times:

Finding all occurrences of pattern 
         in path (7324800 characters in length)
String.indexOf found 87473 matches in 0.0579839 seconds
Horspool found 21832 matches in 0.0180492 seconds
Boyer-Moore found 87473 matches in 0.0196312 seconds
Turbo Boyer-Moore found 87473 matches in 0.0282827 seconds
Apostolico-GiancarloMatch found 87473 matches in 0.0563225 seconds

I my experience, the ratios shown here are typical, but your mileage will vary depending on what you're looking for and how many partial matches there are in the searched text.

History

  • 17-Jan-2006 - Initial version.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

A list of licenses authors might use can be found here