Click here to Skip to main content
65,938 articles
CodeProject is changing. Read more.
Articles / Languages / C#

A .NET implementation for the Knuth-Moris-Pratt (KMP) algorithm

4.67/5 (5 votes)
4 Apr 2009CPOL3 min read 48.3K   1.3K  
A .NET implementation for the Knuth-Moris-Pratt (KMP) algorithm.

Introduction

This article about the Knuth-Moris-Pratt algorithm (KMP). KMP is a string matching algorithm that allows you to search patterns in a string in O(n) time and O(m) pre-proccesing time, where n is the text length and m is the pattern length.

Background

The KMP algorithm first calculates a transition array that tells how many shifts the pattern is shifted when a mismatch occurs.

Implementation

The PrefixArray class takes a string parameter, the pattern, and is responsible for calculating the prefix function and returning an array that contains the transition indexes.

In calculating the prefix array, care has been taken to give maximum performance; hence, the general implementation has been tweaked in several places.

C#
for(int i = 1 ; i < pattern.Length ; i++) 
{ 
    temp        = SubCharArray(i, patternArray);
    hArray[i]   = GetPrefixLegth(temp, firstChar);
}

The code above shows computing the prefix function; the loop in the code iterates through the pattern and calculates the prefix function for each index. (Note: the iteration starts from 1 as we know that the transition at index 0 is 0.)

The temp array represents all the characters from the 0th index of the pattern to the index of the current loop. This character array is passed into the GetPrefixLength() function, which actually computes the prefix function.

Next, we have the code for the GetPrefixLength() function:

C#
private static int GetPrefixLegth(char[] array, char charToMatch)
{ 
    for(int i = 2; i < array.Length; i++)
    {
     //if it is a match
       if(array[i] == charToMatch)
       { 
              if(IsSuffixExist(i, array))
              {
              //Return on the first prefix which is the largest
              return array.Length - i;
              }
       }
    }
    return 0;
}

This function takes in the array we discussed as a parameter, and also the first character of the pattern (charToMatch).

The array is iterated and searched for a match with the first character of the pattern (the first character needs to be matched, the prefix starts with the first character). If the first character exist in the array, then we calculate the longest suffix that is a prefix of the pattern.

Obviously, the first match gives us the longest suffix that is a prefix of the pattern.

The code below depicts the function IsSufficExists():

C#
private static bool IsSuffixExist(int index, char[] array)
{                    
  //Keep track of the prefix index
  int k = 0;
  for(int i = index; i < array.Length ; i++)
  {
      //A mismatch so return
       if(array[i] != array[k]){ return false;}
       k++;
  }
  return true;
}

Next, we will look into the code snippet that actually finds the occurrence of the pattern in a string using the transition array for the pattern.

C#
for(int i = 0; i < charArray.Length; i++)
{
  //If there is a match
  if(charArray[i] == patternArray[k])
  { 
       //Move the pattern index by one
       k++; 
  }
  else
  {
       //There is a mismatch..so move the pattern 

       //The amount to move through the pattern 
       int prefix = transitionArray[k];
       //if the current char does not match
       //the char in the pattern that concides
       //when moved then shift the pattern entirley, so 
       //we dont make a unnecssary comparision
       if(prefix + 1 > patternArray.Length && 
          charArray[i] != patternArray[prefix + 1])
       {
              k = 0;
       }
       else
       {
              k = prefix;
       }
   }

The loop iterates through the string to search the pattern. If the current character matches the character in the pattern index (the character index that should be matched in the iteration), then there is a match, and we increment the index of the pattern and continue.

If there is a mismatch, then we get the transition index for the specific index, and we see if the character at the transition index + 1 matches with the character that is being matched (this is done so we don’t need to unnecessarily match the character again). If there is a match, then we move the pattern index with the transition index; if not, we move the pattern index with 0.

Next, we see if the pattern index is equal to the pattern length; if so, then we have a match. This code snippet is shown below:

C#
//A complet match, if kis
//equal to pattern length
if(k == patternArray.Length)
{
    //Add it to our result
    result.Add(i - (patternArray.Length - 1));
    //Set k as if the next character is a mismatch
    //therefore we don’t miss out any other containing pattern
    k  = transitionArray[k - 1];
}

Using the code

In order to use the code, all you got to do is build the files in the source provided, and use the static method GetAllOccurences(string pattern, string text) of the KMPUtil class like this:

KMPUtil.GetAllOccurences("ab", "abhsdsabsbabaa");

This method will return an ArrayList of indexes where the pattern occurs in the text (zero based).

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)