Introduction
The problem of searching for a string
within another string is important in Computer Science. Most programmers are
familiar with the C strstr function which provides the caller with an ability
to search for a string needle within a string haystack. This article is about a
similar function which is generated using a Linux shell script and provides a
similar facility.
Background
The approach taken for implementing this
version of string searching is to generate a search function with a variable
degree of search branches. The code searches linearly within the string
haystack using the multiple search logic each with different offsets into the
string within the main search loop.
The design of the search function is as
follows:
#include <string.h>
#include <iostream>
using namespace std;
#define STRSTR(S,T,P) \
{ \
register const char *s = (S), *t = (T); \
while(*s && *t && *s++ == *t++); \
if(!*t) { \
(P)=(S); \
} else { \
(P)=NULL; \
} \
}
int
_strstr(const char *str, const char *f)
{
int len = strlen(str);
int m = 4;
int i = len / 4 + 1;
int *k= new int[m];
const char *orig = str, *p;
memset(k,0x0,sizeof(int) * m );
for(int j = 1; j < m; j++) {
k[j] += (k[j-1]+i);
cout << "k[j]: "<< k[j] << endl;
}
cout << "i : " << i << endl;
while(i--) {
cout << "k[0] " << ((str + k[0]) - orig) << " len " << len << endl;
if(*str && ((str + k[0]) - orig) < len && *(str + k[0]) == *f) {
STRSTR(str+k[1],f,p);
if(p) {
cout << "k[0]: " << p << endl;
return p - orig;
}
}
cout << "k[1] " << ((str + k[1]) - orig) << " len " << len << endl;
if(*str && ((str + k[1]) - orig) < len && *(str + k[1]) == *f) {
STRSTR(str+k[1],f,p);
if(p) {
cout << "k[1]: " << p << endl;
return p - orig;
}
}
cout << "k[2] " << ((str + k[2]) - orig) << " len " << len << endl;
if(*str && ((str + k[2]) - orig) < len && *(str + k[2]) == *f) {
STRSTR(str+k[2],f,p);
if(p) {
cout << "k[2]: " << p << endl;
return p - orig;
}
}
cout << "k[3] " << ((str + k[3]) - orig) << " len " << len << endl;
if(*str && ((str + k[3]) - orig) < len && *(str + k[3]) == *f) {
STRSTR(str+k[3],f,p);
if(p) {
cout << "k[3]: " << p << endl;
return p - orig;
}
}
str++;
}
}
Where in the above code, STRSTR
is a macro that tests
a possible match by comparing the entire string haystack S with the string
needle T to be found. If found, the results are placed into pointer P,
otherwise P is NULL. The variable m represents the number of search logics that
are generated. In this example, m is equal to four, and there are four if
statements of the form:
if(*str && ((str + k[0]) - orig) < len && *(str + k[0]) == *f)
These if statements check the input string haystack
represented by pointer 'str
' to see if the first character is a match to the
first character of the string needle represented by pointer 'f
'. However, care
must be taken to assert that the test is not out of bounds from the length of
string 'str
'. This is done in the first part of the test:
*str && ((str + k[0]) - orig) < len
The array k
contains m offsets into string 'str
'. In
this way, during a linear scan of the string 'str
', multiple offsets into the
string can be scanned in the same pass. This approach is similar to how a loop
would be unraveled for optimization. Once there is a possible match,
represented by the following condition within the if statement:
*(str + k[0]) == *f
The macro STRSTR
is invoked and a test to see if a
match is found is made.
Clearly, in the worst case, the algorithm will scan
the length of the entire string haystack which is N, and perform a constant m
such comparisons where the worst case for the comparisons is given by scanning
the length of the entire string needle which is M. Thus, the algorithm is O(N *
M). This will happen in cases where the string haystack is of the form 'aaaa'
and the string needle is of the form 'ab'. A pass through 'aaaa' will have each
of the m search logics match the first character of 'ab' causing STRSTR to
execute and scan all of the string needle and failing each time. However, if
the string needle does exist within the string haystack, then the algorithm
should execute much faster than that as it would likely find a possible match
within a pass of the string haystack which is enhanced by the multiple search
logics scanning at different positions within the string haystack. In the
average case, the code will likely take O(N/2) scan of the haystack (similar to
linear search) with all proceeding constant m tests failing and one succeeding
taking O(M) time for an average running time of O(N/2 + M). The best case is
that the algorithm matches the first characters of the string haystack with the
string needle, then immediately goes to STRSTR to scan for the needle in the
haystack and returns success which would be O(M) time.
Using the Code
A Bash script called gen_string_match.sh
is provided to emit the C++ code. It allows the user to generate as large or
as small a number of internal
test branches to run within the main search loop. To
run the script and
compile the emitted C++ code, type (on Linux):
./gen_string_matching.sh
g++ strstr.C -D_TEST_STRING_MATCH -o test_string_match
The shell script does not generate a header file, so
to use the '_strstr' function, declare in a header:
extern "C" char *_strstr(const char *s, const char *t);
Points of Interest
The code is generated to allow the
user to use a function that has many search logics within the linear pass of
the string haystack. If this is equal to the length of the string haystack,
then the code will match the needle within the first pass.