(untagged)

find_first_of: A performance pitfall among the STL algorithms

Jim Xochellis

0.00/5 (No votes)

18 Jun 2007

This article is discussing the performance problems found in the most notable find_first_of implementations and suggests useful improvements and workarounds.

Download performance tests - 10.9 Kb

Introduction
The algorithm usage and behavior
find_first_of versus strpbrk
- find_first_of seems superior in theory
- The performance comparison reveals an unpleasant surprise
A closer look at the find_first_of implementation
Performance tests
Conclusions
More about this article
- Acknowledgments
- Revision history
References

Introduction

This article discusses the efficiency problems of the find_first_of generic STL algorithm in the most notable STL implementations [1,2,3,4]. The following paragraphs will prove that, in these STL implementations, find_first_of is virtually a performance pitfall among the STL algorithms and must be avoided in performance-critical applications. Furthermore, this article also presents solutions that can be used by the STL implementers in order to improve the performance of the find_first_of and alternatives which the C++ developers can utilize instead of using this inefficient generic algorithm.

My intention is to move straight to the point, without lengthy introductions about the C++ templates or the STL itself. In order to understand this article, you need to have a good understanding of the C++ templates, the standard template library (STL) and, of course, the C++ language.

The algorithm usage and behavior

In order to sufficiently describe the usage and behavior of the algorithm, I have borrowed the following three subsections from the excellent SGI STL documentation. [5] Please see the original documentation for more.

Prototype

find_first_of is an overloaded name; there are actually two find_first_of functions.

template <class InputIter, class ForwardIter>
InputIter find_first_of(InputIter first1, InputIter last1,
    ForwardIter first2, ForwardIter last2);

template <class InputIter, class ForwardIter, class BinaryPredicate>
InputIter find_first_of(InputIter first1, InputIter last1,
    ForwardIter first2, ForwardIter last2, BinaryPredicate comp);

Description

find_first_of is similar to find in that it performs a linear search through a range of Input Iterators. The difference is that while find searches for one particular value, find_first_of searches for any of several values. Specifically, find_first_of searches for the first occurrence in the range [first1, last1) of any of the elements in [first2, last2). Note that this behavior is reminiscent of the function strpbrk from the standard C library.

The two versions of find_first_of differ in how they compare elements for equality. The first uses operator== and the second uses an arbitrary user-supplied function object, comp. The first version returns the first iterator i in [first1, last1) such that for some iterator j in [first2, last2), *i == *j. The second returns the first iterator i in [first1, last1) such that for some iterator j in [first2, last2), comp(*i, *j) is true. As usual, both versions return last1 if no such iterator i exists.

Complexity

At most, there are (last1 - first1) * (last2 - first2) comparisons.

find_first_of versus strpbrk

The find_first_of STL algorithm is a successor of the older strpbrk C library function. The main advantage of the find_first_of algorithm is its generality and flexibility. It can easily work with a variety of containers and arbitrary types of elements, while strpbrk only works with null-terminated char strings. But what if we just need to work with char strings or with std::string objects, which can be easily converted to null-terminated char strings? Should we prefer the newer find_first_of algorithm or just use the older strpbrk function?

find_first_of seems superior in theory

One of the most remarkable qualities of STL is the fact that it aims both towards generality and efficiency at the same time! More specifically, in 1995 the designer of STL, Alexander Stepanov, affirmed in his article in BYTE magazine [7] that generic STL algorithms should not impose any performance penalty compared to their non-generic versions written in C/C++ code. In the same year, Stepanov said in an interview, "Efficiency is a fundamental concern of mine. It is silly to abstract an algorithm in such a way that when you instantiate it back it becomes inefficient." [8] Consequently, there is virtually no reason for any developer to suspect that the use of the find_first_of STL algorithm may introduce any performance problems.

Furthermore, find_first_of is more secure than strpbrk because it performs boundary checks, reducing the possibility of having buffer overruns. Last but not least, most C++ developers have read and heard repeatedly that it is preferable to use the available STL algorithms, like find_first_of, instead of the equivalent C library functions like strpbrk. According to the above facts, the most reasonable choice for C++ developers is to always favor the generic, flexible and safer find_first_of STL algorithm, even in cases where the older strpbrk C library function is also applicable.

The performance comparison reveals an unpleasant surprise

Fortunately, in an article like this we can afford to spend time in performance tests, even in the least suspicious situations. Therefore we have done a performance comparison between the find_first_of STL algorithm and the strpbrk function. [s5] The results of these performance tests in two different configurations [s5] have proved that the find_first_of algorithm performs miserably when the number of the values we are searching for (M=last2-first2) starts to increase! Not only is the find_first_of STL algorithm much slower than the strpbrk C library function, but it also seems to have time complexity no better than O(N*M). In contrast, strpbrk has a superior O(N) time complexity (N=last1-first1).

Having inferior complexity is the worst kind of inefficiency, since it practically means that the performance will become constantly worse when the size of a particular problem factor increases. Unfortunately, the developer who prefers to use the find_first_of STL algorithm will have to suffer a severe performance penalty, even though he has made a very reasonable choice. Hence, find_first_of turns out to be a dangerous performance pitfall among the STL algorithms.

A closer look at the find_first_of implementation

It would be unfair not to mention that in some modern compilers the strpbrk C library function is highly optimized and tuned for maximum CPU exploitation, or even implemented using assembly instructions. However, these advantages in the strpbrk implementations cannot justify its superior complexity compared to the find_first_of implementations. The time complexity cannot be affected by the level of code optimization or even by switching between C/C++ and assembly. It is practically hardware independent, as well. Only design faults or unfortunate design decisions in the core algorithm can actually damage the time complexity. Consequently, it would be very interesting to proceed taking a closer look at the most notable find_first_of implementations in order to determine the source of the performance problems. Afterwards, we will also try to improve the efficiency of this generic algorithm.

Identifying the source of the problem

The following find_first_of implementation is part of the SGI STL version 3.3. [1,5] This is a rather typical implementation, very similar to the implementations currently used by the very popular MS-VC++ and GCC-C++ compilers [2,4] and also identical to the STLPort STL implementation. [3]

template<typename InputIter, typename ForwardIter>
InputIter find_first_of(InputIter first1, 
    InputIter last1, ForwardIter first2, ForwardIter last2)
{
    for ( ; first1 != last1; ++first1)
    {
        for (ForwardIter iter = first2; iter != last2; ++iter)
        {
            if (*first1 == *iter)
                return first1;
        }
    }

    return last1;
}

Apparently this typical find_first_of implementation consists of two nested for-loops. The outer loop picks one-by-one the elements of the search range [first1, last1) and executes the inner loop, which is displayed bold in the above code snippet. The inner loop runs for each search range element; its sole task is to discover if the element in question is included in the range [first2, last2). In case an inclusion is detected, find_first_of immediately returns the current element (first1). Otherwise, it continues until the outer loop is exhausted and finally returns last1.

Given that the [first2, last2) range virtually defines the set of elements we are searching for, the inner for-loop of the above find_first_of implementation actually tests for set-membership. However, loops are very inefficient when used for set-membership testing and this technique is evidently the source of the find_first_of performance problems. Namely, the inner-loop of the above find_first_of implementation carries out (last2 - first2) iterations in the case of a mismatch. The worst case scenario is to have a mismatch for all elements of the search range [first1, last1), performing the total of (last1 - first1) * (last2 - first2) iterations, exactly like the SGI STL documentation [5] mentions. Nevertheless, the performance comparison between find_first_of and strpbrk [s5] has already proved that this number of iterations is rather excessive. Under the same circumstances, the complexity of the strpbrk C library function is no worse than O(last1 - first1), regardless of the [last2 - first2) range size.

dpx::find_first_of: an efficient specialization

Since the source of the find_first_of performance problems has been identified, it is now very important to search for techniques that can be used by the STL implementers in order to improve the efficiency of this generic algorithm. Interestingly enough, in the case that the input ranges of the find_first_of contain signed or unsigned chars, we can take advantage of the fact that one-byte integers can only have up to 256 distinct values. Namely, a find_first_of implementation specialized for signed or unsigned chars can easily prepare a 256 bytes array named MapArray in such a way that, for every one-byte integer ch it will be MapArray[ch]=1 if the value of ch is included in the range [first2, last2) and MapArray[ch]=0 otherwise.

After having this array prepared, the whole inner for-loop of the typical find_first_of implementation presented above [s4.1] can be effectively replaced by the expression: if (MapArray[*first1] != 0) return first1. This replacement will reduce the main iterations to the amount of (last1 - first1), while adding (last2 - first2)+256 operations because of the MapArray preparation. Consequently, the complexity of this specialized find_first_of implementation will be (last1 - first1) + (last2 - first2). This outruns by far the most commonly used find_first_of implementations [1,2,3,4] while being comparable to the time complexity of the strpbrk C library function.

The entire code of this enhanced find_first_of specialization is presented in the following snippet. Displayed in bold are the portions that have been modified compared to the typical find_first_of implementation of the previous subsection. [s4.1] The same code has been benchmarked in two different test configurations [s5] and has been proved to be very efficient in practice, exactly like its theoretical complexity indicates.

// An efficient find_first_of implementation for 

// one-byte elements only (specialization)


template<typename InputIter, typename ForwardIter>
InputIter find_first_of(InputIter first1, 
    InputIter last1, ForwardIter first2, ForwardIter last2)
{
    unsigned char MapArray[256];
    memset(MapArray, 0, 256);

    for (ForwardIter iter = first2; iter != last2; ++iter)
        MapArray[(unsigned char) *iter] = 1;

    for ( ; first1 != last1; ++first1)
    {
        if (MapArray[(unsigned char) *first1] != 0)
            return first1;
    }

    return last1;
}

/*
 *  Copyright (c) 2007
 *  Jim Xochellis
 *
 *  Permission to use, copy, modify, distribute and 
 *  sell this software for any purpose
 *  is hereby granted without fee, provided that the 
 *  above copyright notice appears in
 *  all copies and that both that copyright notice 
 *  and this permission notice appear in
 *  supporting documentation. Jim Xochellis makes 
 *  no representations about the suitability
 *  of this software for any purpose. It is provided 
 *  "as is" without express or implied
 *  warranty.
 *
 */

Although the dpx::find_first_of specialization is only applicable with one-byte integers, this fact does not reduce its significance at all. The specialization is still more generic than the strpbrk C library function. Hence, it can effectively moderate the drawbacks of the find_first_of algorithm compared to its successor. Furthermore, it will eliminate the frustration that C++ developers will feel when they discover the performance penalty that they have to pay each time they prefer find_first_of instead of strpbrk. In addition, if a specialization like this is included in an STL implementation, the performance of find_first_of will be improved automatically, without requiring any special handing by the developer.

Moreover, I could add to the above reasoning that many C++ developers are still very hesitant when they are about to use a generic STL algorithm, even though most of these algorithms are quite elegant, very versatile and STL guarantees their performance. Having STL algorithms that perform much worse than their C library counterparts -- i.e. find_first_of vs strpbrk [s3.2] -- will make programmers even more reluctant to exploit these excellent programming tools. In addition, since Alexander Stepanov affirmed a long time ago [7] that the generic STL algorithms should not impose any performance penalty compared to their non-generic versions written in C/C++ code, it would be important for the reputation of the STL to eliminate any severe performance drawbacks that may still exist among its generic algorithms. For all these reasons, the use of a find_first_of specialization for one-byte integers could be quite beneficial.

However, there is also a price to pay when using the dpx::find_first_of specialization. Namely, the byte-array which is used by this specialization to dramatically speed up the set-membership testing consumes 256 bytes of RAM when the algorithm is running. This fact may be prohibitive in cases when RAM is limited, such as embedded systems. A possible workaround is to replace the array of 256 bytes with an array of 256 bits. The bit array will consume only 32 bytes of RAM -- 8 times less -- while reducing the performance of the algorithm only slightly and without affecting its complexity. The final choice clearly depends on RAM availability.

Some other alternatives

Apart from the fact that the find_first_of algorithm can be significantly improved, in the case of the one-byte integers, C++ developers can also utilize some other efficient alternatives that C++ and STL offer, provided that they are aware of the find_first_of performance problems. For instance, they have at their disposal associative containers like set [9] and hash_set [10], which are specifically designed for efficient set-membership testing. Consequently, if the elements of the [first2, last2) range are loaded in an associative container, then an adequate functor in conjunction with the find_if [11] STL algorithm can effectively imitate the find_first_of behavior in a much more efficient way. The following code snippet presents two examples that demonstrate how to use the set and hash_set containers respectively in order to efficiently replicate the find_first_of algorithm.

//---------------------------------------------------

//SetMembershipFunctor: a functor used in conjunction

//with find_if to produce find_first_of alternatives

//---------------------------------------------------


template<typename Set>
struct SetMembershipFunctor : public std::unary_function<
    typename Set::value_type, bool>
{
    public:
        SetMembershipFunctor(Set *setPtr) 
    { 
        mSetPtr = setPtr; 
    }

    inline bool operator() (const typename Set::value_type val) const
    { 
        return mSetPtr->find(val) != mSetPtr->end(); 
    }

    protected:
        Set *mSetPtr;
};

//-----------------------------------------------

//A find_first_of alternative based on a std::set

//-----------------------------------------------


typedef std::set<element_type> set_of_elements;
set_of_elements s1(search_elements.begin(), search_elements.end());
SetMembershipFunctor<set_of_elements> memberOfSet1(&s1);

search_range_type::const_iterator it1 =
    std::find_if(search_range.begin(), search_range.end(), memberOfSet1);

//-----------------------------------------------

//A find_first_of alternative based on a hash_set

//-----------------------------------------------


typedef hash_set<element_type> hash_set_of_elements; 
hash_set_of_elements s2(search_elements.begin(), search_elements.end());
SetMembershipFunctor<set_of_elements> memberOfSet2(&s2);

search_range_type::const_iterator it2 =
    std::find_if(search_range.begin(), search_range.end(), memberOfSet2);

Furthermore, we can also effectively imitate the behaviour of the find_first_of algorithm by just using sorted ranges. Namely, if we transform the [first2, last2) elements into a sorted range, we can afterwards use an adequate functor in conjunction with the find_if [11] STL algorithm, like the following code snippet demonstrates, and outrun by far the most notable find_first_of implementations.

//-----------------------------------------------------------

//SortedRangeMembershipFunctor: a functor used in conjunction

//with find_if to produce find_first_of alternatives

//-----------------------------------------------------------


template<typename ForwardIter>
struct SortedRangeMembershipFunctor
    : public std::unary_function<
        typename std::iterator_traits<ForwardIter>::value_type, bool>
{
    public:
        SortedRangeMembershipFunctor(ForwardIter first, ForwardIter last)
        : mFirst(first), mLast(last) {}

    inline bool operator() (
        const typename std::iterator_traits<ForwardIter>::value_type val) const
    { 
        return std::binary_search(mFirst, mLast, val); 
    }

    protected:
        ForwardIter mFirst, mLast;
};

//----------------------------------------------------

//A find_first_of alternative based on a sorted ranges

//----------------------------------------------------


typedef std::vector<test_string::value_type> sorted_range;
sorted_range elementsRange(search_elements.begin(), search_elements.end());
std::sort(elementsRange.begin(), elementsRange.end());

SortedRangeMembershipFunctor<sorted_range::const_iterator>
    memberOfRange(elementsRange.begin(), elementsRange.end());

search_range_type::const_iterator it =
    std::find_if(search_range.begin(), search_range.end(), memberOfRange);

Theoretically, the set-membership testing has only constant time complexity on the hash_set container, while the complexity of the same procedure on the std::set container and on sorted ranges is logarithmic. Consequently, the expected time complexities of the find_first_of alternatives are O(last1-first1) for the hash_set based solution and O((last1-first1)*log(last2-first2)) for the other two approaches. The performance tests [s5] have actually confirmed the above complexity estimations, proving in practice the superiority of these approaches against the most notable find_first_of implementations. [2,4]

Unfortunately, the above workarounds are only compatible with a subset of the data types which the find_first_of algorithm supports. Namely, they can only work with items which are either sortable or good candidates for hash_set elements. Consequently, these alternatives cannot completely and unconditionally replace the more generic find_first_of STL algorithm.

Performance tests

In the previous sections, this article discussed the performance problems found in the most notable find_first_of implementations [s4.1]. This article has also presented an efficient find_first_of specialization [s4.2] and some other options. [s4.3] In this next section, we are going to measure and compare the speed of these techniques in practice, by actually running and timing them. The performance tests have been based upon the following rules:

The test scenario is rather simple. Two notable find_first_of implementations [2,4], the strpbrk C library function [6], the dpx::find_first_of specialization [s4.2] and some other find_first_of alternatives presented in this article [s4.3] have been timed while processing large amounts of data. Namely, every tested algorithm has scanned and overtaken exactly N items, looking for the first occurrence of any one out of M different values. In order to ensure accurate time measurements, the input data have been fabricated in such a way that the search criteria are only met at the N+1 item. N=last1-first1 & M=last2-first2 in find_first_of [s2.1].
The number N of the items that are overtaken during the searching procedure has been kept the same in all tests. Thus, this factor can be safely excluded from the reasons that can cause timing variations.
The only factor that actually varies during the tests is the number M of the values we are searching for. The primary mission of these performance tests is to demonstrate how the increment of the M factor affects the elapsed times in each tested algorithm.
One-byte integers are the only item type that has been used in these tests. This was actually a rather imperative decision, since both the strpbrk C library function and the dpx::find_first_of specialization are only applicable with one-byte integers.
Two distinct configurations have been used in the tests. In each configuration the hardware, operating system, compiler, find_first_of (STL) implementation and strpbrk (C library) implementation are intentionally different. All tests have been performed twice: once in each configuration, using exactly the same test scenario and code.
The code used for the tests is included at the top of this article and can be used both for verification of the test results and also for further experimentation.

The test results have been visualized in the following two graphs: one graph for each of the two test configurations and one graph-line for each tested algorithm. The vertical axes in both graphs represent the elapsed time, while the horizontal axes represent the number M of the values we are searching for. The exact values of the elapsed times have been omitted, since they are configuration-dependent and not particularly useful. What matters the most is the variation of the elapsed times when M is being increased.

Performance test A

Configuration

HW	Intel Pentium III - 800 Mhz
OS	Fedora Core 5 Linux
SW	GCC 4.1.0

Description

A	std::find_first_of	[s4.1] [2]
B	C library strpbrk	[s4.1] [6]
C	dpx::find_first_of	[s4.2]
D	std::set	[s4.3] [9]
E	hash_set	[s4.3] [10]
F	sorted range	[s4.3]

Performance test B

Configuration

HW	Intel(R) Core(TM)2 6400 - 2.13 GHz
OS	Windows XP Pro
SW	Microsoft Visual C++ 2005

Description

A	std::find_first_of	[s4.1] [4]
B	C library strpbrk	[s4.1] [6]
C	dpx::find_first_of	[s4.2]
D	std::set	[s4.3] [9]
E	hash_set	[s4.3] [10]
F	sorted range	[s4.3]

Points of interest

Evidently, the elapsed times of all the tested algorithms depend on the number N of the items that are overtaken during the searching procedure. On the other hand, the number M of the values we are searching for has an impact on the performance of the above algorithms. This can vary from insignificant to severe, depending on the design of each algorithm. These facts can be visually observed in the graphs. Thus, the six tested algorithms can actually be divided into three basic categories according to their time complexities:

Insignificant impact of M, time complexity O(N). Both graphs demonstrate that the performance of the strpbrk (line B) function, the dpx::find_first_of specialization (line C) and the find_first_of alternative that is based on the hash_set (line E) remain practically unaffected when M is increased. However, the hash_set based alternative is significantly slower than the other two counterparts.
Mild impact of M, time complexity O(N*log(M)). The find_first_of alternatives which are based on std::set (line D) and on sorted ranges (line F) are slowing down a little bit when M is increased.
Severe impact of M, time complexity O(N*M). The find_first_of implementations (line A) of both the test configurations are slowing down very much when M is increased. These two implementations are by far the most inefficient algorithms in these performance tests.

Conclusions

Undoubtedly the find_first_of generic STL algorithm is a generalization of the older strpbrk C library function, more versatile and superior in theory. [s3.1] Unfortunately, in the most notable STL implementations [1,2,3,4] the find_first_of algorithm has been abstracted disregarding its efficiency drawbacks [s4.1]. This makes it quite dangerous in performance-critical applications! [s3.2]

The worst part of the problem is that only a few C++ developers are likely to suspect the performance drawbacks of the find_first_of generic algorithm compared to the strpbrk C library function. [s3] The average programmer will endure a completely unexpected performance penalty, falling practically into a performance pitfall. Fortunately, there is a way out of this pitfall, provided that the STL implementers are willing to incorporate in their libraries some efficient find_first_of specializations for one-byte integers, like the one presented in this article. [s4.2]

However, things can be much better in the rare case that the programmer is fully aware of the find_first_of efficiency problems. Many efficient alternatives and workarounds [s4.3] can be utilized by C++ developers in order to avoid using this hopelessly inefficient generic STL algorithm. It turns out that find_first_of is probably an STL algorithm that we should learn how to avoid!

My general conclusion, after studying many notable STL implementations, is that although the generic STL algorithms can theoretically compete with the performance of their non-generic counterparts, there is still much to be done before this can actually happen in practice. Unfortunately, I am not particularly pleased with the relatively slow rate that the STL generic algorithms evolve regarding their efficiency. This is ultimately why I write articles like this one and like my previous article about the inefficiency of the search_n implementations. [12] As an epilogue, I would like to repeat once more the words of Alexander Stepanov [8], "It is silly to abstract an algorithm in such a way that when you instantiate it back it becomes inefficient."

References

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

A list of licenses authors might use can be found here

find_first_of: A performance pitfall among the STL algorithms

Contents

Performance test A

Performance test B

Points of interest

License