(untagged)

A C++ STL String Tokenizer

George Anescu

0.00/5 (No votes)

13 Oct 2001

A C++ STL Tokenizer class capable to tokenize a string when the set of character separators is specified by another string

Download demo project - 5 Kb

Introduction

The class CTokenizer I am presenting in this article is capable of tokenizing an STL string when the set of character separators is specified by a predicate class. This is a very generally designed as a class template:

template <class Pred>
void CTokenizer { /*...*/ };

The separating (tokenizing) criteria being implemented in the argument predicate class Pred. The predicate classes are usually derived from unary_function<char, bool> and implement the () operator. I am giving only three examples of predicate classes: CIsSpace where the set of separators contains the white spaces 0x09-0x0D and 0x20, CIsComma where the separator is the comma character ',' and CIsFromString where the set of separators is specified by the characters in a STL string. Other predicate classes can be easily added as needed.

Implementation

First I will present the implemented predicates.

For the case when the separators are white spaces 0x09-0x0D and 0x20;

class CIsSpace : public unary_function<char, bool>
{
public:
  bool operator()(char c) const;
};

inline bool CIsSpace::operator()(char c) const
{
  // isspace<char> returns true if c is a white-space character 

  // (0x09-0x0D or 0x20)

  return isspace<char>(c);
}

For the case where the separator is the comma character ',':

class CIsComma : public unary_function<char, bool>
{
public:
  bool operator()(char c) const;
};

inline bool CIsComma::operator()(char c) const
{
  return (',' == c);
}

For the case where the separator is a character from a set of characters given in a STL string:

class CIsFromString : public unary_function<char, bool>
{
public:
  //Constructor specifying the separators

  CIsFromString::CIsFromString(string const& rostr) : m_ostr(rostr) {}
  bool operator()(char c) const;

private:
  string m_ostr;
};

inline bool CIsFromString::operator()(char c) const
{
  int iFind = m_ostr.find(c);
  if(iFind != string::npos)
    return true;
  else
    return false;
}

Finally the string tokenizer class implementing the Tokenize() function is a static member function. Notice that CIsSpace is the default predicate for the Tokenize() function.

template <class Pred=CIsSpace>
class CTokenizer
{
public:
  //The predicate should evaluate to true when applied to a separator.

  static void Tokenize(vector<string>& roResult, string const& rostr, 
                       Pred const& roPred=Pred());
};

//The predicate should evaluate to true when applied to a separator.

template <class Pred>
inline void CTokenizer<Pred>::Tokenize(vector<string>& roResult, 
                                            string const& rostr, Pred const& roPred)
{
  //First clear the results vector

  roResult.clear();
  string::const_iterator it = rostr.begin();
  string::const_iterator itTokenEnd = rostr.begin();
  while(it != rostr.end())
  {
    //Eat seperators

    while(roPred(*it))
      it++;
    //Find next token

    itTokenEnd = find_if(it, rostr.end(), roPred);
    //Append token to result

    if(it < itTokenEnd)
      roResult.push_back(string(it, itTokenEnd));
    it = itTokenEnd;
  }
}

How to use

The following code snippet is showing some simple usage examples, one for each one of the implemented predicates:

//Test CIsSpace() predicate

cout << "Test CIsSpace() predicate:" << endl;
//The Results Vector

vector<string> oResult;
//Call Tokeniker

CTokenizer<>::Tokenize(oResult, " wqd \t hgwh \t sdhw \r\n kwqo \r\n  dk ");
//Display Results

for(int i=0; i<oResult.size(); i++)
  cout << oResult[i] << endl;

//Test CIsComma() predicate

cout << "Test CIsComma() predicate:" << endl;
//The Results Vector

vector<string> oResult;
//Call Tokeniker

CTokenizer<CIsComma>::Tokenize(oResult, "wqd,hgwh,sdhw,kwqo,dk", CIsComma());
//Display Results

for(int i=0; i<oResult.size(); i++)
  cout << oResult[i] << endl;

//Test CIsFromString predicate

cout << "Test CIsFromString() predicate:" << endl;
//The Results Vector

vector<string> oResult;
//Call Tokeniker

CTokenizer<CIsFromString>::Tokenize(oResult, ":wqd,;hgwh,:,sdhw,:;kwqo;dk,", 
                                          CIsFromString(",;:"));
//Display Results

cout << "Display strings:" << endl;
for(int i=0; i<oResult.size(); i++)
  cout << oResult[i] << endl;

Conclusion

The project StringTok.zip attached to this article includes the source code of the presented CTokenizer class and some test code. I am interested in any opinions and new ideas about this implementation.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

A list of licenses authors might use can be found here