Introduction
This article presents a straightforward implementation of the Boyer-Moore string matching algorithm and a number of related algorithms.
Background
When searching for one string within another, .NET programmers normally reach for String.IndexOf
. For the vast majority of applications, that's just fine. However, when searching very long strings, there are other algorithms that deliver more performance than the basic algorithm that .NET provides. This article gives simple implementations of several of those algorithms.
It's worth noting that String.IndexOf
performs a much more complex comparison than these algorithms provide, since String.IndexOf
performs a culture-sensitive comparison - this is a highly non-trivial matter when a case insensitive match is desired! If you need culture sensitive matching, stick to String.IndexOf
.
What this article doesn't cover
This article does not attempt to teach you how to write high performance string matching algorithms, nor does it try to explain the algorithms that are presented. If you're looking for that kind of information, consult an algorithms book such as Knuth or Cormen, et al. An excellent online reference for string matching algorithms is the Handbook of Exact String Matching Algorithms.
Using the code
The BoyerMooreSearch
class provides implementations of the original Boyer-Moore algorithm plus a number of related algorithms. An implementation that uses String.IndexOf
is also provided for comparison.
The text to be found is passed to the BoyerMooreSearch
constructor. Each searching algorithm is presented using a common IEnumerable<int>
model.
namespace StringMatch
{
public class BoyerMoore
{
public BoyerMoore(string pattern) { ... }
public IEnumerable<int> AlgorithmNameMatch(string text,
int startingIndex) { ... }
public IEnumerable<int> AlgorithmNameMatch(string text) { ... }
}
}
Each algorithm returns an IEnumerable<int>
that yields all of the indexes where the pattern (passed to the constructor) is found in the text (passed to the algorithm). This design allows the pre-computation that all of the Boyer-Moore related algorithms make use of to be shared across any number of searches for a specific pattern.
BoyerMoore bm = new BoyerMoore("abcd");
foreach (int index in bm.BoyerMooreMatch("abcdsflkjwertabc" +
"dfglkjdfgcdjweertopabjsdvacbwer" +
"kjhsdfkljhacbsdflkjsdflkjabcd"))
Console.WriteLine("Matched at index {0}", index);
This example will list matches at 0, 13, and 72.
The Algorithms
The BoyerMoore
class implements the following algorithms:
BCLMatch
- search using String.IndexOf
.
HorspoolMatch
- search using the Horspool algorithm. This is functionally identical to the "Quick Search" algorithm and is presented in some algorithms books as being the Boyer-Moore algorithm. Due to its simplicity, this is often the highest performing of the Boyer-Moore-like algorithms even though others have better theoretical performance.
BoyerMooreMatch
- search using the Boyer-Moore algorithm.
TurboBoyerMooreMatch
- search using the Turbo Boyer-Moore algorithm.
ApostolicoGiancarloMatch
- search using the Apostolico-Giancarlo algorithm. Although this algorithm has the best theoretical performance (when considering only character comparisons, as is traditional when evaluating matching algorithms), it is often the slowest in practice due to the complexity of the code.
The demo program
The supplied demo program is a simple console application that searches a specified file for all occurrences of a string, reporting the time taken by each algorithm. Running the demo application on my machine (3Ghz P4) searching a large text file for a frequently occurring string, yielded the following times:
Finding all occurrences of pattern
in path (7324800 characters in length)
String.indexOf found 87473 matches in 0.0579839 seconds
Horspool found 21832 matches in 0.0180492 seconds
Boyer-Moore found 87473 matches in 0.0196312 seconds
Turbo Boyer-Moore found 87473 matches in 0.0282827 seconds
Apostolico-GiancarloMatch found 87473 matches in 0.0563225 seconds
I my experience, the ratios shown here are typical, but your mileage will vary depending on what you're looking for and how many partial matches there are in the searched text.
History
- 17-Jan-2006 - Initial version.