Click here to Skip to main content
65,938 articles
CodeProject is changing. Read more.
Articles
(untagged)

Efficient Boyer-Moore Search in Unicode Strings

0.00/5 (No votes)
30 Jun 2005 1  
An article on implementing Boyer-Moore algorithm for Unicode strings in C#.

Boyer-Moore search demo

Introduction

Suppose you have to perform a case-insensitive text search over a Unicode string. You probably know that Boyer-Moore algorithm is the most efficient algorithm for such a task. This article shows how to implement the Boyer-Moore algorithm for Unicode, using C#.

Background

The efficiency of a search algorithm can be estimated by the number of symbol comparisons it takes to scan the whole text in search for the pattern match. The Boyer-Moore algorithm is efficient because it avoids some unnecessary comparisons and produces longer shifts of the pattern along the text. With Boyer-Moore, it is also easy to perform efficient case-insensitive searches. I will not go into details of the Boyer-Moore algorithm as you can find them elsewhere. It is interesting to note however that there are several implementations of the main idea of the algorithm that differ in complexity. Simple implementations that use one-dimensional shift tables are not very efficient when searching for complex patterns that contain several repeating sub-patterns. The implementation I show here uses a two-dimensional shift table. The best way to explain an algorithm is to explain the data structures it uses, and the best way to explain a data structure is to demonstrate it using a simple example. Following is a simple example of the two-dimensional shift table for Boyer-Moore. Suppose we search a string that consists only of characters a, b, c, d, e (in any number and sequence) for a pattern "abdab" (note that "ab" sequence occurs twice in this pattern). The shift-table for the pattern "abdab" would look like this:

Simple shift table

The number of columns in the table is equal to the number of characters in the pattern and the number of rows is equal to the number of characters in the charset. As comparison in Boyer-Moore starts from the last character of the pattern, the table is built from right to left. Each table cell contains values by which the pattern should be shifted given that all the previous pattern characters (the characters to the right of the current character) match corresponding text characters. When we start to compare pattern characters to substring characters from right to left we find the shift value for the pattern in the cell whose column corresponds to the pattern character and whose row corresponds to the substring character.

The shift values are calculated for every character that may occur in the string where we search. You may have noticed that zero shifts occur in the cells for which pattern characters match the charset characters. This means that while the pattern matches the substring, the pattern shouldn't be shifted. If the table was scanned from right to left and in every column we get zero shift for the corresponding substring then we have found the substring that matches the pattern. Other shift values are calculated so as to produce maximum shifts without skipping sequences that might be parts of the matching substring. The shift of five, for example, is the "full" shift of the five-character pattern and the shift of three reflects the fact that the two last characters of the pattern coincide with the two first characters (so if the two last characters of the substring match the pattern while others do not, we shift the pattern by three, and not by five, because these two characters may be the beginning of the matching substring). From a broader perspective, we may view the shift table as a representation of a finite automaton whose states are the values stored in the table.

It is easy to build a shift table with the number of rows equal to the number of charset characters if the charset is single-byte. But for Unicode charset, such a table would be quite large and inefficient. It is easy to solve this problem if we notice that table rows differ from each other only for the characters that actually appear in the pattern. In the table shown above, you can see that the rows for the characters c and e contain the same set of shift values and this set would be valid for any other character that doesn't occur in the pattern. So we might store table rows corresponding to characters found in the pattern in some fast-access structure, say a hash-table, and keep a separate set of shift values for all other characters that might appear in the string.

The code

Now when we have discussed how the shift table is built and how it works, let's have a look at the code that builds the thing. There is BMSearcher class in the demo project that performs the actual search. The class' constructor takes a pattern and builds a shift table for it.

public class BMSearcher
{
  // Shift table for chars present in the pattern

  protected Hashtable PatternCharShifts;
  // Shifts for all other chars

  private int[] OtherCharShifts;
  // Length of the search pattern

  private int PatternLength;

  public BMSearcher(string Pattern)
  {
    PatternCharShifts = new Hashtable();
    // Building shift table

    PatternLength = Pattern.Length;
    int MaxShift = PatternLength;
    // Constructing the table where number

    // of columns is equal to PatternLength

    // and number of rows is equal to the

    // number of distinct chars in the pattern

    for (int i = 0; i < PatternLength; i++)
    if (!PatternCharShifts.ContainsKey(Pattern[i]))
    PatternCharShifts.Add(Pattern[i], new int[PatternLength]);
    OtherCharShifts = new int[PatternLength];
    // Filling the last column of the

    // table with maximum shifts (pattern length)

    foreach(DictionaryEntry Row in PatternCharShifts)
    ((int[])Row.Value)[PatternLength - 1] = MaxShift;
    OtherCharShifts[PatternLength - 1] = MaxShift;
    // Calculating other shifts (filling each column

    // from PatternLength - 2 to 0 (from right to left)

    for(int i = PatternLength - 1; i >= 0; i--)
    {
      // Suffix string contains the characters

      // right to the character being processsed

      string Suffix = new String(Pattern.ToCharArray(), 
                          i + 1,  PatternLength - i - 1);
      // if Pattern begins with Suffix

      // the maximum shift is equal to i + 1

      if (Pattern.StartsWith(Suffix))
      MaxShift = i + 1;
      // Store shift for characters not present in the pattern

      OtherCharShifts[i] = MaxShift;
      // We shorten patter by one char in NewPattern.

      string NewPattern = new string(Pattern.ToCharArray(), 
                                     0, Pattern.Length -1);
      if ((NewPattern.LastIndexOf(Suffix) > 0) || (Suffix.Length == 0))
      foreach(DictionaryEntry Row in PatternCharShifts)
      {
        string NewSuffix  = (char)Row.Key + Suffix;
        // Calculate shifts:

        //Check if there are other occurences 

        //of the new suffix in the pattern

        // If several occurences exist, we need the rightmost one

        int NewSuffixPos = NewPattern.LastIndexOf(NewSuffix);
        if (NewSuffixPos >= 0) 
          ((int[])Row.Value)[i] = i - NewSuffixPos;
        else 
          ((int[])Row.Value)[i] = MaxShift;
        // Storing 0 if characters

        // in a row and a columnt are the same

        if ((char)Row.Key == Pattern[i])
        ((int[])Row.Value)[i] = 0;
      }
      else
      foreach(DictionaryEntry Row in PatternCharShifts)
      {
        // if Suffix doesn't occure in NewPattern

        // we simply use previous shift value

        ((int[])Row.Value)[i] = MaxShift;
        if ((char)Row.Key == Pattern[i])
           ((int[])Row.Value)[i] = 0;
      }
   }
}

Constructor stores the table in a PatternCharShifts object while shift values for characters not present in the pattern are stored in an OtherCharShifts array. PatternCharShifts member is declared protected. The reason for this will be shown later. There is a GetTable method in BMSearcher that returns the string representation of the table built by the constructor (the rows in the table are separated by line breaks). This method is used in the attached demo. The code shown above uses System.Collections.Hashtable for storing shifts as I did in the first version of the demo. Now I reimplemented the demo with the custom hash table class BMHashTable to avoid performance penalty caused by boxing the arguments. It doesn't make any difference to the algorithm itself. Let's now look at the Search method that does the actual search. This method takes two arguments: the string where the search should be performed and position in the string where the search should start from. The method returns the index of the first (starting from the Pos) matching substring. The index is relative to the beginning of the string. The method returns -1 if no match is found.

public int Search(string Text, int StartPos)
{
  int Pos = StartPos;
  while (Pos <= Text.Length - PatternLength)
  for(int i = PatternLength - 1; i >= 0; i--)
  {
    if(PatternCharShifts.ContainsKey((object)Text[Pos + i]))
    {
      // pattern contains char Text[Pos + i]

      int Shift = 
       ((int[])PatternCharShifts[(object)Text[Pos + i]])[i];
      if (Shift != 0)
      {
        Pos += Shift; // shifting

        break;
      }
      else
      if (i == 0)
      // we came to the leftmost pattern

      // character so pattern matches

        return Pos;
        // return matching substring start index

    }
    else
    {
      Pos += OtherCharShifts[i]; // shifting

      break;
    }
  }
  // Nothing is found

  return -1;
}

The BMSearcher class performs case-sensitive search. Implementing case-insensitive search with Boyer-Moore algorithm is quite easy - all we have to do is modify our shift table. We need to add new rows to the table so that the table would contain rows representing pattern characters in both upper and lower case. For example, if we have a pattern "Abc", we need to build the table as described above and then add three rows for a, B, and C to implement the case-insensitive search. Since the shift values should be the same for the same characters in different cases, we may build a table for case-sensitive search and then simply copy rows for the case-complement characters. If the shift table is modified this way, we can use the same search routine for the case-insensitive search.

This technique is implemented in a class CIBMSearcher derived from BMSearcher. Here is the constructor for the class:

public CIBMSearcher(string Pattern, bool CaseSensitive) : base(Pattern)
{
  if(!CaseSensitive)
  {
    Hashtable TmpHT = new Hashtable();
    foreach(DictionaryEntry Row in PatternCharShifts)
    {
      if(!TmpHT.ContainsKey(Row.Key))
        TmpHT.Add(Row.Key, Row.Value);
      char ch = (char)Row.Key;
      if (char.IsLower(ch))
        ch = char.ToUpper(ch);
      else
        ch = char.ToLower(ch);
      if(!TmpHT.ContainsKey(ch))
        TmpHT.Add(ch, Row.Value);
    }
    PatternCharShifts = TmpHT;
  }
}

This constructor calls the base constructor for building case-sensitive Boyer-Moore table and then, if CaseSensitive value is false, adds new rows to the table.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

A list of licenses authors might use can be found here