Introduction
There are probably dozens of reasons a programmer might want to search a text string for the presence of any of several substrings simultaneously. The brute force approach taken all too often is to simply loop through a list of tokens, then search the string
sequentially, beginning to end, for each one. Needless to say, this is very inefficient, and can even become outrageously long if the list of tokens is long and/or the searched string
is long. I have created a class to simplify the task and perform it very efficiently. The class is offered in two versions now: one using MFC CString
and CStringArray
, and another using STL std::string
and std::vector
.
Acknowledgement
In the June 2002 issue of C/C++ Users Journal, Moishe Halibard and Moshe Rubin published an article entitled "A Multiple Substring Search Algorithm", in which they greatly detailed their algorithm for an efficient finite state machine (FSM) to assist in searching for multiple substrings in one pass. My class is based on this algorithm.
Update Details
For those of you who have seen the versions of these classes that I published in 2002, here is a summary of the major changes I've made:
- Support for a
string
of delimiter characters.
- Acceleration of the
FindNext
method to skip over delimiters at the start of a search, and skip to the next delimiter when a word is rejected.
- Replacement of the FSM implementation to use a
std::vector
of a new SEntry struct
type instead of a dynamic array of DWORD
s.
- More use of the more appropriate
size_t
type instead of int.
- The sample project uses VC++ 9.0 (VS2008) instead of the archaic and buggy VC++ 6.0.
Class Interface
The CIVStringSet
class is derived from either CStringArray
or std::vector
, so as to easily inherit the capability of a dynamic array of CString
s or std::string
s.
However, many of the base class methods have been hidden by my class to prevent any insertion or removal of string
s except at the end of the array. It is critical that the array index for a given word not be changed after it is added to the FSM.
The public
interfaces to the classes are as follows:
struct SEntry
{
SEntry( DWORD dwState = 0UL, DWORD dwIndex = 0UL )
: m_dwState( dwState ), m_dwIndex( dwIndex ) {}
DWORD m_dwState ;
DWORD m_dwIndex ;
} ;
class CIVStringSet : public CStringArray
{
public:
CIVStringSet( DWORD dwInitialWidth = 64 ) ; virtual ~CIVStringSet() ;
bool Add( LPCTSTR pszWord ) ; bool Add( const CString & rstrWord ) ; size_t Add( LPCTSTR pszWords, LPCTSTR pszDelims ) ; size_t Add( LPCTSTR pszzWords, size_t nWords ) ; size_t Add( const std::set<CString< & sstrWords ) ; size_t Add( const CStringArray & astrWords ) ; size_t Add( const CStringList & lstrWords ) ;
void SetDelims( LPCTSTR pszDelims ) { m_strDelims = pszDelims ; } UINT FindFirstIn( CString strText, size_t & rnFirst ) ; UINT FindNext( size_t & rnNext ) ;
typedef std::pair<size_t,UINT> CWordPosPair ; typedef std::list< std::pair<size_t,UINT< < CWordPosPairList ; size_t FindAllIn( CString strText, CWordPosPairList & rlstrWords ) ; } ;
#ifdef _UNICODE
typedef std::wstring _tstring ;
#else
typedef std::string _tstring ;
#endif
class CIVStringSet : public std::vector<_tstring>
{
public:
CIVStringSet( unsigned long dwInitialWidth = 64 ) ; virtual ~CIVStringSet() ;
bool Add( LPCTSTR pszWord ) ; bool Add( const _tstring & rstrWord ) ; size_t Add( LPCTSTR pszWords, LPCTSTR pszDelims ) ; size_t Add( LPCTSTR pszzWords, size_t nWords ) ; size_t Add( const std::set<_tstring> & sstrWords ) ; size_t Add( const std::vector<_tstring> & astrWords ) ; size_t Add( const std::list<_tstring> & lstrWords ) ;
void SetDelims( LPCTSTR pszDelims ) { m_strDelims = pszDelims ; } UINT FindFirstIn( const _tstring & strText, size_t & rnFirst ) ; UINT FindNext( size_t & rnNext ) ;
typedef std::pair<size_t,UINT> CWordPosPair ; typedef std::list< std::pair<size_t,UINT> > CWordPosPairList ; size_t FindAllIn( _tstring strText, CWordPosPairList & rlstrWords ) ; } ;
The constructor takes a single, optional argument that determines the initial width of the FSM. The width will need to be at least as much as the combined lengths of all the string
s in the array. The FSM will be enlarged, as needed as string
s are added to the collection. A good guess of the total length when constructing may save a trivial reallocation later.
There are seven overloads of the Add
method, allowing you to add one word at a time, or an entire set, list, or array of words at once. See the demo project for an example of using one form of Add
, as well as FindFirstIn
and FindNext
.
As string
s are added to the collection, the FSM is updated. Once all string
s are added, no additional preparation is required to do a search. There are three methods for searching: FindFirstIn
, FindNext
, and FindAllIn
. The first two are used to iterate through the found substrings one at a time. The third method uses the first two to build a CWordPosPairList
for you. Simply pass in the text string
to be searched as the first parameter to FindFirstIn
or FindAllIn
, and the class will find all instances of any string
s in the array.
FindFirstIn
and FindNext
both return the offset within the searched text at which a token was found. The size_t &
parameter will be filled upon return with the index in the string
array of the string
token that was found. FindAllIn
returns the number of tokens found, and takes a CWordPosPairList &
as its second parameter. A CWordPosPairList &
is an STL list of pairs. Each pair returned contains the index and offset of a found token. See the demo program for an example of using this.
Demo Program
By request, I have also provided a simple dialog-based demo program that shows the usage of the Add
and Find
methods. It uses the form of Add
that takes a delimited string
, parses it to create the list of seek tokens, then fills a list box with those found.
Source Highlights
For those who are curious, but don't want to bother downloading the source, here are some of the most interesting functions. If the code and comments aren't satisfactory documentation, let me know.
bool CIVStringSet::InsertWord( LPCTSTR pszWord, unsigned long dwIndex )
{
bool bRetVal = false ;
size_t nLen = _tcslen( pszWord ) ;
if ( m_dwMaxUsedState < MAXDWORD ) {
unsigned long dwCurState = 0UL ;
for ( UINT nChar = 0U ; nChar < nLen ; nChar++ )
{
size_t nIdxChar =
min(static_cast<size_t>(pszWord[nChar]), c_nMaxChar) ;
SEntry & rEntry = m_apnFSM[dwCurState][nIdxChar] ; unsigned long dwState = rEntry.m_dwState ;
bool bNew = (dwState == 0UL) ;
if ( bNew )
dwState = ++m_dwMaxUsedState ;
if ( nChar == ( nLen-1 ) )
{
rEntry.m_dwState = dwState ;
rEntry.m_dwIndex = dwIndex ;
break ;
}
else if ( bNew ) rEntry.m_dwState = dwState ;
dwCurState = dwState ; if ( m_apnFSM.size() <= dwCurState )
m_apnFSM.resize( ((dwCurState * 2) + 63) & 0xFFC0,
vector( c_nMaxChar ) ) ;
}
bRetVal = true ;
}
return bRetVal ;
}
UINT CIVStringSet::FindFirstIn( const _tstring & strText, size_t & rnFirst )
{
m_nCurTextChar = 0U ;
m_strSearch = strText ;
return FindNext( rnFirst ) ;
}
UINT CIVStringSet::FindNext( size_t & rnNext )
{
unsigned long dwCurState = 0UL ; UINT nIdx ;
UINT nStartChar = MAXDWORD ;
size_t nLen = m_strSearch.length() ;
for ( nIdx = m_nCurTextChar ; nIdx < nLen ; nIdx++ )
{
size_t nNextIdx = m_strSearch.find_first_not_of( m_strDelims, nIdx ) ;
if ( nNextIdx != nIdx )
{
nStartChar = MAXDWORD ;
if ( ( nNextIdx != _tstring::npos )
&& ( nNextIdx < nLen )
)
{
dwCurState = 0UL ;
nIdx = nNextIdx ;
}
else
break ;
}
if ( dwCurState == 0UL )
nStartChar = nIdx ;
size_t nIdxChar = min(static_cast<size_t>(m_strSearch[nIdx]), c_nMaxChar) ;
const SEntry & rEntry = m_apnFSM[dwCurState][nIdxChar] ;
if ( ( rEntry.m_dwIndex != 0UL )
&& ( ( nIdx == nLen-1 )
|| ( m_strDelims.find_first_of( m_strSearch[nIdx+1] ) != _tstring::npos )
)
)
{
size_t nIndex = rEntry.m_dwIndex - 1 ;
ASSERT(nIndex < size());
rnNext = nIndex ;
m_nCurTextChar = nIdx + 1 ; break ;
}
else
dwCurState = rEntry.m_dwState ;
if ( dwCurState == 0 )
{
nStartChar = MAXDWORD ; nIdx = m_strSearch.find_first_of( m_strDelims, nIdx ) ; }
}
return (( nIdx < nLen ) ? nStartChar : MAXDWORD) ;
}
Updates
- 1 July 2002 - Added STL version
- 24 April 2010 - Revised MFC and STL versions