(untagged)

A Multiple Substring Search Class: CIVStringSet

Scot Brennecke

0.00/5 (No votes)

25 Apr 2010

A string array class using MFC or STL that performs very fast multiple string searches

Introduction

There are probably dozens of reasons a programmer might want to search a text string for the presence of any of several substrings simultaneously. The brute force approach taken all too often is to simply loop through a list of tokens, then search the string sequentially, beginning to end, for each one. Needless to say, this is very inefficient, and can even become outrageously long if the list of tokens is long and/or the searched string is long. I have created a class to simplify the task and perform it very efficiently. The class is offered in two versions now: one using MFC CString and CStringArray, and another using STL std::string and std::vector.

Acknowledgement

In the June 2002 issue of C/C++ Users Journal, Moishe Halibard and Moshe Rubin published an article entitled "A Multiple Substring Search Algorithm", in which they greatly detailed their algorithm for an efficient finite state machine (FSM) to assist in searching for multiple substrings in one pass. My class is based on this algorithm.

Update Details

For those of you who have seen the versions of these classes that I published in 2002, here is a summary of the major changes I've made:

Support for a string of delimiter characters.
Acceleration of the FindNext method to skip over delimiters at the start of a search, and skip to the next delimiter when a word is rejected.
Replacement of the FSM implementation to use a std::vector of a new SEntry struct type instead of a dynamic array of DWORDs.
More use of the more appropriate size_t type instead of int.
The sample project uses VC++ 9.0 (VS2008) instead of the archaic and buggy VC++ 6.0.

Class Interface

The CIVStringSet class is derived from either CStringArray or std::vector, so as to easily inherit the capability of a dynamic array of CStrings or std::strings.

However, many of the base class methods have been hidden by my class to prevent any insertion or removal of strings except at the end of the array. It is critical that the array index for a given word not be changed after it is added to the FSM.

The public interfaces to the classes are as follows:

struct SEntry
{
    SEntry( DWORD dwState = 0UL, DWORD dwIndex = 0UL )
        : m_dwState( dwState ), m_dwIndex( dwIndex ) {}

    DWORD m_dwState ;
    DWORD m_dwIndex ;
} ;

class CIVStringSet : public CStringArray
{
public:
    CIVStringSet( DWORD dwInitialWidth = 64 ) ;  		// Initial width of FSM
    virtual ~CIVStringSet() ;

    bool     Add( LPCTSTR pszWord ) ;                     	// a single word
    bool     Add( const CString & rstrWord ) ;            	// a single word
    size_t   Add( LPCTSTR pszWords, LPCTSTR pszDelims ) ; 	// multiple words, 
					// delimited with chars from pszDelims
    size_t   Add( LPCTSTR pszzWords, size_t nWords ) ;    	// nWords words, 
					// each 0 term'd, with extra 0 at end
    size_t   Add( const std::set<CString< & sstrWords ) ; 	// all the elements of a 
							// set of CStrings
    size_t   Add( const CStringArray & astrWords ) ;      	// all the elements 
							// of a CStringArray
    size_t   Add( const CStringList & lstrWords ) ;       	// all the elements of a 
							// CStringList

    void     SetDelims( LPCTSTR pszDelims ) { m_strDelims = pszDelims ; } // delimiters 
							// to use in word search
    UINT     FindFirstIn( CString strText, size_t & rnFirst ) ; // Begin iteration
    UINT     FindNext( size_t & rnNext ) ;                      // Continue iteration

    typedef std::pair<size_t,UINT>              CWordPosPair ;     // first is index 
				// of word in array, second is position in text
    typedef std::list< std::pair<size_t,UINT< < CWordPosPairList ; // list of pairs to 
						// be returned by FindAllIn
    size_t   FindAllIn( CString strText, CWordPosPairList & rlstrWords ) ; // Iterate 
						// all at once and make list
} ;

#ifdef _UNICODE
typedef std::wstring _tstring ;
#else
typedef std::string _tstring ;
#endif

class CIVStringSet : public std::vector<_tstring>

{
public:
    CIVStringSet( unsigned long dwInitialWidth = 64 ) ;  // Initial width of FSM
    virtual ~CIVStringSet() ;

    bool        Add( LPCTSTR pszWord ) ;                         // a single word
    bool        Add( const _tstring & rstrWord ) ;               // a single word
    size_t      Add( LPCTSTR pszWords, LPCTSTR pszDelims ) ;     // multiple words, 
					// delimited with chars from pszDelims
    size_t      Add( LPCTSTR pszzWords, size_t nWords ) ;        // nWords words, 
					// each 0 term'd, with extra 0 at end
    size_t      Add( const std::set<_tstring> & sstrWords ) ;    // all the elements of 
							 // a set of strings
    size_t      Add( const std::vector<_tstring> & astrWords ) ; // all the elements 
						// of an array of strings
    size_t      Add( const std::list<_tstring> & lstrWords ) ;   // all the elements 
							 // of a list of strings

    void        SetDelims( LPCTSTR pszDelims ) { m_strDelims = pszDelims ; } //delimiters
							// to use in word search
    UINT        FindFirstIn( const _tstring & strText, size_t & rnFirst ) ;  // Begin 
								// iteration
    UINT        FindNext( size_t & rnNext ) ;                                // Continue 
								// iteration

    typedef std::pair<size_t,UINT>              CWordPosPair ;     // first is index 
				// of word in array, second is position in text
    typedef std::list< std::pair<size_t,UINT> > CWordPosPairList ; // list of pairs 
						// to be returned by FindAllIn
    size_t      FindAllIn( _tstring strText, CWordPosPairList & rlstrWords ) ; // Iterate
						// all at once and make list
} ;

The constructor takes a single, optional argument that determines the initial width of the FSM. The width will need to be at least as much as the combined lengths of all the strings in the array. The FSM will be enlarged, as needed as strings are added to the collection. A good guess of the total length when constructing may save a trivial reallocation later.

There are seven overloads of the Add method, allowing you to add one word at a time, or an entire set, list, or array of words at once. See the demo project for an example of using one form of Add, as well as FindFirstIn and FindNext.

As strings are added to the collection, the FSM is updated. Once all strings are added, no additional preparation is required to do a search. There are three methods for searching: FindFirstIn, FindNext, and FindAllIn. The first two are used to iterate through the found substrings one at a time. The third method uses the first two to build a CWordPosPairList for you. Simply pass in the text string to be searched as the first parameter to FindFirstIn or FindAllIn, and the class will find all instances of any strings in the array.

FindFirstIn and FindNext both return the offset within the searched text at which a token was found. The size_t & parameter will be filled upon return with the index in the string array of the string token that was found. FindAllIn returns the number of tokens found, and takes a CWordPosPairList & as its second parameter. A CWordPosPairList & is an STL list of pairs. Each pair returned contains the index and offset of a found token. See the demo program for an example of using this.

Demo Program

By request, I have also provided a simple dialog-based demo program that shows the usage of the Add and Find methods. It uses the form of Add that takes a delimited string, parses it to create the list of seek tokens, then fills a list box with those found.

Source Highlights

For those who are curious, but don't want to bother downloading the source, here are some of the most interesting functions. If the code and comments aren't satisfactory documentation, let me know.

//~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
// InsertWord
//          put the new word into the FSM
//~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
bool CIVStringSet::InsertWord( LPCTSTR pszWord, unsigned long dwIndex )
{
    bool bRetVal = false ;

    size_t nLen = _tcslen( pszWord ) ;
    if ( m_dwMaxUsedState < MAXDWORD )    	// if we've got over 4 billion states, 
					// something's wrong
    {
        unsigned long dwCurState = 0UL ;
        for ( UINT nChar = 0U ; nChar < nLen ; nChar++ )
        {
            size_t          nIdxChar = 
		min(static_cast<size_t>(pszWord[nChar]), c_nMaxChar) ;
            SEntry        & rEntry   = m_apnFSM[dwCurState][nIdxChar] ;  // the state, 
						// and possibly also an index
            unsigned long   dwState  = rEntry.m_dwState ;
            bool            bNew     = (dwState == 0UL) ;   	// no previous words 
							// have been here

            if ( bNew )
                dwState = ++m_dwMaxUsedState ;  // use a new state number

            // special case - last character of substring
            if ( nChar == ( nLen-1 ) )
            {
                // Put both the state number and the word index in the entry
                rEntry.m_dwState = dwState ;
                rEntry.m_dwIndex = dwIndex ;
                break ;
            }
            else if ( bNew ) // if current entry for this char value and state 
				// is still zero, add to FSM
                rEntry.m_dwState = dwState ;

            dwCurState = dwState ;  // prepare for next iteration
            if ( m_apnFSM.size() <= dwCurState )
                m_apnFSM.resize( ((dwCurState * 2) + 63) & 0xFFC0, 
			vector( c_nMaxChar ) ) ;
        }

        bRetVal = true ;
    }

    return bRetVal ;
}

//~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
// FindFirstIn
//~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
UINT CIVStringSet::FindFirstIn( const _tstring & strText, size_t & rnFirst )
{
    // The real work is done by FindNext, but we must initialize the starting
    //    character index and string buffer first
    m_nCurTextChar = 0U ;
    m_strSearch = strText ;

    return FindNext( rnFirst ) ;
}

//~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
// FindNext
//~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
UINT CIVStringSet::FindNext( size_t & rnNext )
{
    unsigned long dwCurState = 0UL ; // start in state 0
    UINT nIdx ;
    UINT nStartChar = MAXDWORD ;
    size_t nLen = m_strSearch.length() ;

    // Start with the character following the last one examined
    for ( nIdx = m_nCurTextChar ; nIdx < nLen ; nIdx++ )
    {
        size_t nNextIdx = m_strSearch.find_first_not_of( m_strDelims, nIdx ) ;
        if ( nNextIdx != nIdx )
        {
            // Skip over delimiters
            nStartChar = MAXDWORD ;
            if ( ( nNextIdx != _tstring::npos )
              && ( nNextIdx < nLen )
               )
            {
                dwCurState = 0UL ;
                nIdx = nNextIdx ;
            }
            else
                break ;
        }

        // if we are in state 0, save the index of the starting character,
        //   in case we must backtrack or for the return value
        if ( dwCurState == 0UL )
            nStartChar = nIdx ;

        // follow up the table states
        size_t nIdxChar = min(static_cast<size_t>(m_strSearch[nIdx]), c_nMaxChar) ;
        const SEntry & rEntry = m_apnFSM[dwCurState][nIdxChar] ;

        // if we reach an entry with a an index, we may have found a full match
        //    if so, return the word we found
        if ( ( rEntry.m_dwIndex != 0UL )
          && ( ( nIdx == nLen-1 )
            || ( m_strDelims.find_first_of( m_strSearch[nIdx+1] ) != _tstring::npos )
             )
           )
        {
            size_t nIndex = rEntry.m_dwIndex - 1 ;
            ASSERT(nIndex < size());
            rnNext = nIndex ;
            m_nCurTextChar = nIdx + 1 ;  // set the index for the next time
            break ;
        }
        else
            dwCurState = rEntry.m_dwState ;

        if ( dwCurState == 0 )
        {
            nStartChar = MAXDWORD ;  // in case we're about to terminate the loop
            nIdx = m_strSearch.find_first_of( m_strDelims, nIdx ) ; // advance to 
							// the next delimiter
        }
    }

    // return the index of the start of the word, or -1 if no word found
    return (( nIdx < nLen ) ? nStartChar : MAXDWORD) ;
}

Updates

1 July 2002 - Added STL version
24 April 2010 - Revised MFC and STL versions

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

A list of licenses authors might use can be found here