Click here to Skip to main content
65,938 articles
CodeProject is changing. Read more.
Articles
(untagged)

single_pass_search: Generic sequence searching through single-pass iterators

0.00/5 (No votes)
11 Oct 2008 2  
This article presents a generic sequence searching template function, which is more versatile than std::search

Contents

Introduction

std::search is a notable STL algorithm very useful and powerful. How interesting would it be to have an even more versatile sequence searching algorithm? Can we actually have something more generic than std::search? More efficient maybe? Or is it possible to have something that is more generic than std::search and more efficient at the same time?

Primary qualities

single_pass_search is a template function which performs generic sequence searching similarly to the std::search [a5] STL algorithm. The most important quality of single_pass_search and its main advantage over std::search is its capability to search for a sub-sequence in a search-range that is accessed through a pair of single-pass (input) iterators, [a6,b1] while std::search requires that the search-range must be accessed via forward iterators [a7,b2] at least.

For example, single_pass_search can easily search a file directly through a pair of istream [a9] iterators, while std::search requires the interference of an intermediate container in order to obtain the file data via a pair of forward iterators at least. With data sources like istream, [a10] using single_pass_search instead of std::search [a5] is by far simpler, more efficient and less memory consuming as well, because the std::search algorithm additionally requires an intermediate container and some stream buffering logic.

Secondary qualities

Furthermore, another interesting advantage of the single_pass_search is its superior time complexity compared to the std::search STL algorithm. [a5] The internal algorithm used in single_pass_search is an amelioration of the original Knuth-Morris-Pratt algorithm [c3], implemented by David R. Musser and Gor V. Nishanov [c5,c6] and further modified, in order to work seamlessly with single-pass iterators as input. For a search-range which contains n elements and a pattern sequence of m elements, single_pass_search will have worst case time complexity O(n + m). [c3,c5] Under the same circumstances, std::search is expected to have worst case time complexity no better than O(n x m), since in all the most notable STL implementations, [a1,a2,a3,a4] std::search is implemented using merely the straight-forward (brute-force) search algorithm. [c1]

Limitations

It is essential to clarify though, that single_pass_search can only access via input iterators [a6,b1] the search-range and not the pattern sequence as well. For the pattern sequence forward iterators [a7,b2] are still required, similarly to the std::search STL algorithm. [a5] However, accessing the pattern sequence through input iterators is not that significant at all, because the pattern sequence typically contains a very small number of elements, compared to the elements contained in the search-range, and copying all these elements into an intermediate container is rather inexpensive in terms of processing power and memory consumption.

Oddities and drawbacks

On the other hand, the main oddity and drawback of single_pass_search is related to its return value. Immediately after detecting in the search-range the first sub-sequence which matches the search criteria, single_pass_search returns an iterator pointing to the last element of the particular sub-sequence. Under the same circumstances, std::search [a5] would return an iterator pointing to the first element of the particular sub-sequence, which is a more intuitive behavior. After all, it is a well established practice to represent and refer to a sequence by pointing to its first element.

However, this single_pass_search oddity is well justified and actually arises from the need to support the single-pass (input) iterators. More specifically, after detecting in the search-range a sub-sequence which matches the search criteria, we cannot backtrack to its first element, because the single-pass iterators will not allow us to do so! [a6,b1] Only the iterator pointing to the last element of the particular sub-sequence is still valid and can represent the sub-sequence found.

Moreover, this single_pass_search drawback is probably less inconvenient than it seems at first glance. First of all, since we perform exact pattern matching, there is no information loss due to the lack of access to the elements of the sub-sequence found. We actually know all the elements of this sub-sequence even before the search starts.

Furthermore, there are some cases when having handy an iterator which points to the last element of the sub-sequence found can be rather convenient! For example, lets consider a case in which we need to count the occurrences of a particular pattern in a given search-range. Whenever a matching sub-sequence is found, an iterator which points at the last element of the particular sub-sequence can be conveniently used to resume searching for the next sub-sequence, by just incrementing it once. (see example) If we had an iterator pointing to the first sub-sequence element instead, then we should have to deal with the additional task of skipping the currently found sub-sequence, before resuming to search for the next sub-sequence.

Syntax description

single_pass_search is an overloaded name, there are actually four single_pass_search functions and each of them provides a different set of input parameters.

template <typename InputIterator, typename ForwardIterator>
inline InputIterator single_pass_search(
        InputIterator searchRange,
        InputIterator searchRangeEnd,
        ForwardIterator pattern,
        ForwardIterator patternEnd)
[in] searchRange An input iterator, [a6] which points at the first element of the search-range.
[in] searchRangeEnd An input iterator, [a6] which denotes the end of the search-range.
[in] pattern A forward iterator, [a7] which points at the first element of the pattern sequence.
[in] patternEnd A forward iterator, [a7] which points beyond the end of the pattern sequence.
[out] return value On success the return value is an input iterator [a6] which points to the last element of the first sub-sequence of the search-range that satisfies the search criteria. On failure the return value is searchRangeEnd.

Searches for a sub-sequence within the search-range [searchRange, searchRangeEnd), which is identical to the pattern sequence [pattern, patternEnd), when compared element-by-element. Immediately after locating the first sub-sequence that matches these criteria, it returns an iterator pointing to the last element of the particular sub-sequence, otherwise it returns searchRangeEnd when the search-range is exhausted.

template <typename InputIterator, typename ForwardIterator, typename BinaryPredicate>
InputIterator single_pass_search(
        InputIterator searchRange,
        InputIterator searchRangeEnd,
        ForwardIterator pattern,
        ForwardIterator patternEnd,
        BinaryPredicate pred)
[in] searchRange An input iterator, [a6] which points at the first element of the search-range.
[in] searchRangeEnd An input iterator, [a6] which denotes the end of the search-range.
[in] pattern A forward iterator, [a7] which points at the first element of the pattern sequence.
[in] patternEnd A forward iterator, [a7] which points beyond the end of the pattern sequence.
[in] pred A binary predicate, [a8] which will used instead of the operator== whenever an equality test between two elements is performed.
[out] return value On success the return value is an input iterator [a6] which points to the last element of the first sub-sequence of the search-range that satisfies the search criteria. On failure the return value is searchRangeEnd.

Searches for a sub-sequence within the search-range [searchRange, searchRangeEnd), which is identical to the pattern sequence [pattern, patternEnd), when compared element-by-element through the user defined predicate pred. Immediately after locating the first sub-sequence that matches these criteria, it returns an iterator pointing to the last element of the particular sub-sequence, otherwise it returns searchRangeEnd when the search-range is exhausted.

template< typename SinglePassRange, typename ForwardRange>
inline typename boost::range_iterator<SinglePassRange>::type
single_pass_search(
        SinglePassRange& searchRange,
        ForwardRange& patternRange )
[in] searchRange A Boost single-pass range object reference, [b4] which defines the search-range.
[in] patternRange A Boost forward range object reference, [b4] which defines the pattern sequence.
[out] return value On success the return value is a Boost single-pass iterator [b1] which points to the last element of the first sub-sequence of the search-range that satisfies the search criteria. On failure the return value is boost::end(searchRange).

Searches for a sub-sequence within the search-range defined by searchRange, which is identical to the pattern sequence defined by pattern, when compared element-by-element. Immediately after locating the first sub-sequence that matches these criteria, it returns an iterator pointing to the last element of the particular sub-sequence, otherwise it returns boost::end(searchRange) when the search-range is exhausted.

Important: In order to work with Boost-range objects, it is required that the USE_BOOST_RANGE preprocessor flag is set to 1. (For example by writing: #define USE_BOOST_RANGE 1 before the #include "single_pass_search.h" code line.)

template< typename SinglePassRange, typename ForwardRange, typename BinaryPredicate>
inline typename boost::range_iterator<SinglePassRange>::type
single_pass_search(
        SinglePassRange& searchRange,
        ForwardRange& patternRange,
        BinaryPredicate pred)
[in] searchRange A Boost single-pass range object reference, [b4] which defines the search-range.
[in] patternRange A Boost forward range object reference, [b4] which defines the pattern sequence.
[in] pred A binary predicate, [a8] which will used instead of the operator== whenever an equality test between two elements is performed.
[out] return value On success the return value is a Boost single-pass iterator [b1] which points to the last element of the first sub-sequence of the search-range that satisfies the search criteria. On failure the return value is boost::end(searchRange).

Searches for a sub-sequence within the search-range defined by searchRange, which is identical to the pattern sequence defined by pattern, when compared element-by-element through the user defined predicate pred. Immediately after locating the first sub-sequence that matches these criteria, it returns an iterator pointing to the last element of the particular sub-sequence, otherwise it returns boost::end(searchRange) when the search-range is exhausted.

Important: In order to work with Boost-range objects, it is required that the USE_BOOST_RANGE preprocessor flag is set to 1. (For example by writing: #define USE_BOOST_RANGE 1 before the #include "single_pass_search.h" code line.)

Internal algorithm

single_pass_search is using internally the good-suffix shift (matching shift) technique. [c4,c2,c3] Traditionally the good-suffix shift is used for speed optimization, but it also has the interesting attribute to eliminate the need for multiple passes in the searched data, during the searching procedure. Hence, the core idea behind the single_pass_search function, is the use of the good-suffix shift technique, not so much for improving the performance, but mainly in order to be able to perform generic sequence searching through single-pass iterators. Of course, any performance improvements caused by the algorithmic superiority are also welcomed, although performance still remains a secondary goal.

Among the first and most notable algorithms that utilize the good-suffix shift technique are the Morris-Pratt [c2] and Knuth-Morris-Pratt [c3] algorithms. More recently, in the excellent paper A Fast Generic Sequence Matching Algorithm, [c5] David R. Musser and Gor V. Nishanov have presented Algorithm-L, which significantly improves the Knuth-Morris-Pratt [c3] algorithm, that is itself an amelioration of the older Morris-Pratt [c2] algorithm. The internal implementation of single_pass_search is actually an Algorithm-L amelioration, [c5,c6] designed to work seamlessly with single-pass iterators as input and further optimized for even better performance.

Complexity and performance

When searching a search-range which contains n elements for a pattern sequence of m elements, single_pass_search will have worst case time-complexity O(n + m), [c3,c5] which is clearly superior compared to the O(n x m) complexity of the std::search STL algorithm. [a5] In addition, David R. Musser and Gor V. Nishanov have done comprehensive performance tests in their corresponding paper, [c5,c6] which strongly indicate that Algorithm-L, used internally in the single_pass_search, outperforms the straight-forward search algorithm [c1] in most usage cases.

Moreover, I have tryed to optimize further the already efficient Algorithm-L, achieving even higher performance in practice. Namely, single_pass_search is able to detect the patterns which does not require any backtracking when a mismatch occurs and includes special code for processing these frequently occurring cases very efficiently. Consequently, although performance is still a secondary goal, single_pass_search is expected to perform quite well compared to the std::search STL algorithm, [a5] having in mind that currently, in all notable STL implementations [a1,a2,a3,a4] std::search is implemented based merely on the straight-forward (brute-force) search algorithm. [c1]

Conclusion

The most interesting thing about single_pass_search, is that it can do almost everything that the std::search STL algorithm does, in a more generic way (via single-pass iterators) and with a superior worst case complexity as well. This is a rare case when a piece of code can be redesigned to become more generic and more efficient at the same time! Of course the usage of the single_pass_search is not as intuitive as I would like it to be, but this is probably a small price to pay, if you need this kind of versatility.

Example

Bellow follows the source code of a simple example program, which counts the occurrences of a particular string-pattern in a given file. This example code is expected to compile without problems on most modern C++ compilers. After compilation the executable program will require two command-line parameters in order to run: The first parameter defines the path of the file to be searched, while the second parameter determines the string-pattern which we are searching for.

While studding the following source code, it is easy to distinguish the do-while loop which counts the pattern occurrences. The great simplicity and small size of this do-while loop is virtually indicating that single_pass_search can be quite convenient to use in practice. (see also: Syntax description)

#include "single_pass_search.h"

#include <iostream>
#include <fstream>
#include <iterator>
#include <string>

int main(int argc, char* argv[])
{
   std::ifstream infile;

   //argv[1]: file path, argv[2]: pattern string
   if (argc != 3)
   {
       std::cout << "Usage: <executable> <file-path> <pattern-string>" << std::endl;
       return -1;
   }

   if (*argv[1] != 0)
        infile.open(argv[1], std::ios::in);

   if ( infile.is_open() )
   {
      typedef std::istreambuf_iterator<char> stream_it_t;
      std::string pattern(argv[2]);

      stream_it_t it, end_it;
      long count = -1;

      do //now count the pattern occurrences (see also: Syntax description)
      {
         it = dpx::single_pass_search(stream_it_t(infile), end_it, pattern.begin(), pattern.end());
         ++count;

      } while (it != end_it);

      infile.close();

      std::cout << "Found " << count << " occurrences of \"" << pattern << "\"" << std::endl;
   }
   else
      std::cout << "Failed to open input file" << std::endl;

   getchar();

   return 0;
}

Revision History

  • 11 October 2008
    • Performance improvements.

  • 13 July 2008
    • Initial release.

References

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

A list of licenses authors might use can be found here