Introduction
This is a tip about how I use Morris Pratt string searching algorithm in byte arrays to get the same efficient result. I'm assuming that you've understood the basic concept of this algorithm so it won't be repeated in this topic. Also, you should already know extension methods before you get started with this topic.
Background
In 2013, I built an extension method of Morris Pratt searching algorithm to search specific string in text file. But later, I had an idea and tried to apply it on every enumerable-related type if each element in it is equatable. So I expanded my method and then get the same efficient result if I search it in byte array, which means I can directly search string in raw text file and don't have to use ASCII or Unicode to decode it at first. The program became more efficient in text searching.
Using the Code
It's easy to use these extensions. First of all, let's take a look of their prototypes:
public static int MorrisPrattSearchFirst<T>(this T[] t, T[] p)
where T : IEquatable<T> { }
public static int MorrisPrattSearchFirst<T>(this T[] t, T[] p, int startIndex)
where T : IEquatable<T> { }
where t
is the parent array and p
is the specified array for searching. The second method allows users to search p
in t
starting at non-zero index. Constraint on type parameter T
makes sure that each element in both arrays is equatable. Returned value represents the index of first matched result (-1
if not found). Both of them are defined under the CollectionAssistant.IEnumerableExtentions
class.
Here's a complete example of using them. First, we search key word "Ishmael
" in character arrays from the text file MobyDick.txt. And next search it in ASCII-encoded byte arrays. Then compare the efficiency.
using System;
using System.Diagnostics;
using System.IO;
using System.Text;
namespace CollectionAssistant.Demo
{
class Program
{
static void Main(string[] args)
{
string keyWord = "Ishmael";
char[] originalCharArray = File.ReadAllText("MobyDick.txt").ToCharArray();
char[] keyWordCharArray = keyWord.ToCharArray();
int index = -1;
int pos = 0;
Console.WriteLine("Start searching key word in character array.");
Stopwatch watch = Stopwatch.StartNew();
do
{
index = originalCharArray.MorrisPrattSearchFirst(keyWordCharArray, pos);
if (index >= 0)
{
Console.WriteLine("Key word \"{0}\" found. Index: {1}", keyWord, index);
pos += (index + keyWordCharArray.Length);
}
} while (index >= 0 && pos < originalCharArray.Length);
watch.Stop();
Console.WriteLine("Elapsed ticks: {0}", watch.ElapsedTicks);
Console.ReadKey(true);
pos = 0;
Console.WriteLine("Start searching key word in byte array.");
byte[] originalByteArray = File.ReadAllBytes("MobyDick.txt");
byte[] keyWordByteArray = Encoding.Default.GetBytes(keyWord);
watch.Restart();
do
{
index = originalByteArray.MorrisPrattSearchFirst(keyWordByteArray, pos);
if (index >= 0)
{
Console.WriteLine("Key word \"{0}\" found. Index: {1}", keyWord, index);
pos += (index + keyWordByteArray.Length);
}
} while (index >= 0 && pos < originalByteArray.Length);
watch.Stop();
Console.WriteLine("Elapsed ticks: {0}", watch.ElapsedTicks);
Console.ReadKey(true);
}
}
}
The result shows that both of them performed almost same efficiently: about 3000 ~ 8000 ticks in this case.
Conclusion
Searching text in character arrays and in byte arrays are both efficient while using Morris Pratt algorithm. But sometimes, it's better in byte arrays especially for some situations such as text file or multipart/form-data - which means parent byte array possibly contains both text and non-text data and can't be decoded into character array or string.
You can obtain the complete source code from here.