The purpose of TokenIterator is to provide an easy-to-use, familiar, and customizable way
to step through the tokens contained in a string. Our hypothetical example breaks a string
into words and outputs the words as a list. Here is the code.
#include <iostream>
#include <iterator>
#include <string>
#include <algorithm>
#include "tokenizer.h"
int main(){
    using namespace std;
    using namespace jrb_stl_extensions;
    string s;
    cout << "Please enter a string with punctuation\n";
    getline(cin,s);
    // begin walks the tokens of s; the default-constructed iterator marks the end
    TokenIterator<string> begin(s), end;
    // write each token on its own line
    copy(begin,end,ostream_iterator<string>(cout,"\n"));
}
A few words of explanation. This class and its supporting classes are packaged in the
jrb_stl_extensions namespace.
Easy to use. Only 2 lines do the actual work.
Let's analyze this a bit more. TokenIterator has a default template parameter that
specifies the TokenizerFunc, which is an STL-style functor. The default is PunctSpaceTokenizer,
which splits the string on whitespace and punctuation. Its constructor has this signature:
PunctSpaceTokenizer(bool returnPunct = false, StringType p = WT_Punctuation1,
StringType w = WT_Whitespace)
returnPunct - Tells whether we want the punctuation returned as tokens. Returning the punctuation
can be important when, say, building a mathematical expression parser: we want to skip whitespace,
but we do not want to skip the +'s and -'s.
p - Punctuation. Two constants are provided. WT_Punctuation1 is all the punctuation
on a standard American keyboard. WT_Punctuation2 is the same as WT_Punctuation1, except
that it does not have - (hyphen/dash) or ' (apostrophe/single quote). The reason for this is
that some words are hyphenated or have apostrophes (like can't) and we want to keep them as one
token.
w - Whitespace. The difference between whitespace and punctuation is that punctuation can be returned
using the returnPunct flag. WT_Whitespace is a constant that holds the standard whitespace characters.
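To make the three parameters concrete, here is a minimal stand-alone sketch of the behavior described above. It does not use tokenizer.h: splitPunctSpace, kPunctuation and kWhitespace are hypothetical names of my own, the character sets are assumptions rather than the library's actual constants, and returning each punctuation character as its own token is also an assumption.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-ins for WT_Whitespace and WT_Punctuation1; the real
// values live in tokenizer.h and may differ.
const std::string kWhitespace = " \t\n\r";
const std::string kPunctuation = "!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~";

// Splits s into tokens. Whitespace is always skipped. Punctuation always
// ends a word, and is emitted as a one-character token only when
// returnPunct is true (an assumption about the library's behavior).
std::vector<std::string> splitPunctSpace(const std::string& s,
                                         bool returnPunct = false,
                                         const std::string& punct = kPunctuation,
                                         const std::string& white = kWhitespace) {
    std::vector<std::string> tokens;
    std::string current;
    for (std::string::size_type i = 0; i < s.size(); ++i) {
        const char c = s[i];
        const bool isPunct = punct.find(c) != std::string::npos;
        const bool isWhite = white.find(c) != std::string::npos;
        if (!isPunct && !isWhite) {
            current += c;                // ordinary character: extend the word
            continue;
        }
        if (!current.empty()) {          // any delimiter closes the open word
            tokens.push_back(current);
            current.clear();
        }
        if (isPunct && returnPunct) {
            tokens.push_back(std::string(1, c));
        }
    }
    if (!current.empty()) tokens.push_back(current);
    return tokens;
}
```

With returnPunct set to true, the string "3 + 4-x" comes back as 3, +, 4, -, x. With the default of false it comes back as 3, 4, x. Passing a punctuation set without the apostrophe keeps can't as a single token, which is the idea behind WT_Punctuation2.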
Now, let's look at the familiar part. TokenIterator is an STL forward iterator and can be used
with any STL algorithm that accepts one, such as copy. In addition, copying a TokenIterator
will NOT result in the whole string being copied. Since the string is NEVER modified, a reference
counted pointer is shared among all TokenIterators referring to a particular string.
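The sharing idea can be sketched as follows. This is not the class's actual implementation: SharedTextCursor is a hypothetical name, and std::shared_ptr (C++11) stands in for whatever reference-counted pointer tokenizer.h really uses.

```cpp
#include <cstddef>
#include <memory>
#include <string>

// Hypothetical stand-in for the iterator's internals: the string is stored
// once behind a reference-counted pointer, and each cursor keeps only that
// pointer plus its own position.
class SharedTextCursor {
public:
    explicit SharedTextCursor(const std::string& s)
        : text_(std::make_shared<std::string>(s)), pos_(0) {}

    // The implicit copy constructor copies the pointer and the position,
    // not the characters.
    const std::string* buffer() const { return text_.get(); }
    long owners() const { return text_.use_count(); }
    std::size_t position() const { return pos_; }
    void advance() { if (pos_ < text_->size()) ++pos_; }

private:
    std::shared_ptr<const std::string> text_;  // const: the text is never modified
    std::size_t pos_;
};
```

Copying a SharedTextCursor leaves both copies pointing at the same buffer, while each one advances independently. That is cheap enough for the pass-by-value style that STL algorithms use for iterators.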
On to customizability. PunctSpaceTokenizer might not suffice for all your needs. Not to worry,
TokenIterator is easily customized by supplying your own TokenizerFunc.
Here are the requirements for TokenizerFunc.
- Typedefs - TokenType
This refers to the type of the token. For our examples this is string. It is the
return type of operator*(). A token type richer than a string might be needed by, for example,
a parser that returns an object containing the string and other identifying information.
- operator()(...)
This has the following prototype:
iter operator()(iter* pTokEnd,iter end,TokenType& tToken)
Return value - This returns the start of the next token in the string.
pTokEnd - This should be set to the STL-style end position (i.e. one past the end) of the token in the string.
end - This is the end position of the string, and is passed into the functor.
tToken - This should be set to the current token. When TokenType is string, this will be [retval,pTokEnd).
If you want an example, study the CSVTokenizer functor.
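For a smaller starting point, here is a hedged sketch of a functor shaped like the requirements above. It splits on whitespace only. Two things are assumptions that I have not checked against tokenizer.h: that on entry *pTokEnd holds the position where scanning should resume, and that returning end signals that no token was found. SpaceOnlyTokenizer and collectTokens are hypothetical names, and collectTokens is only a stand-in driver so the sketch can be tried without TokenIterator.

```cpp
#include <string>
#include <vector>

// A whitespace-only tokenizer following the TokenizerFunc prototype.
// Assumption: on entry *pTokEnd is where scanning resumes, and a return
// value equal to end means the input is exhausted.
struct SpaceOnlyTokenizer {
    typedef std::string TokenType;
    typedef std::string::const_iterator iter;

    iter operator()(iter* pTokEnd, iter end, TokenType& tToken) const {
        iter start = *pTokEnd;
        while (start != end && isSpace(*start)) ++start;  // skip leading blanks
        iter stop = start;
        while (stop != end && !isSpace(*stop)) ++stop;     // consume the word
        *pTokEnd = stop;              // one past the last character of the token
        tToken.assign(start, stop);   // the token is [start, stop)
        return start;                 // start of the token just found
    }

    static bool isSpace(char c) {
        return c == ' ' || c == '\t' || c == '\n' || c == '\r';
    }
};

// Hypothetical driver standing in for TokenIterator: calls the functor
// until it reports the end of the input.
template <class TokenizerFunc>
std::vector<typename TokenizerFunc::TokenType>
collectTokens(const std::string& s, TokenizerFunc func) {
    std::vector<typename TokenizerFunc::TokenType> out;
    std::string::const_iterator tokEnd = s.begin();
    typename TokenizerFunc::TokenType token;
    while (func(&tokEnd, s.end(), token) != s.end()) {
        out.push_back(token);
    }
    return out;
}
```

Under those assumptions, collectTokens("  hello   world ", SpaceOnlyTokenizer()) yields the two tokens hello and world. Check the calling convention against the real headers before relying on it.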
We will now examine using CSVTokenizer. CSVTokenizer breaks a string into comma-separated fields.
The separator is a template parameter, and can be any character. Assuming that a comma is the separator,
the string will be broken into fields separated by commas, unless the commas are inside quotes.
In addition, the constructor takes an escape character that defaults to \ ('\\' in C syntax).
That character acts like the same character in C. For example,
\" means a literal " and \\ means a literal \.
Perhaps an example will help:
John \"Big John\" Doe,"1111 Anytown, USA 12345" will be broken into
John "Big John" Doe
1111 Anytown, USA 12345
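The rules above can be sketched as a stand-alone function. This is not the CSVTokenizer source: splitCsvLine is a hypothetical name, and dropping the surrounding quotes from the output is an assumption based on the example just shown.

```cpp
#include <string>
#include <vector>

// Splits one line into fields. A separator inside double quotes does not
// end the field, and the escape character makes the next character literal.
// Assumption: the quotes themselves are dropped from the output.
std::vector<std::string> splitCsvLine(const std::string& line,
                                      char separator = ',',
                                      char escape = '\\') {
    std::vector<std::string> fields;
    std::string field;
    bool inQuotes = false;
    for (std::string::size_type i = 0; i < line.size(); ++i) {
        const char c = line[i];
        if (c == escape && i + 1 < line.size()) {
            field += line[++i];        // keep the escaped character as-is
        } else if (c == '"') {
            inQuotes = !inQuotes;      // toggle quoting, drop the quote itself
        } else if (c == separator && !inQuotes) {
            fields.push_back(field);   // separator outside quotes ends the field
            field.clear();
        } else {
            field += c;
        }
    }
    fields.push_back(field);           // the last field has no trailing separator
    return fields;
}
```

Fed the sample line above, this sketch produces the same two fields. Note that it returns one empty field for an empty line, which may or may not match what CSVTokenizer does.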
An example of using it follows. Add the following lines at the end of main in our previous sample.
cout << "Please enter a comma separated line of fields\n";
getline(cin,s);
TokenIterator<string,CSVTokenizer<string> > begin2(s),end2;
copy(begin2,end2,ostream_iterator<string>(cout,"\n"));
This will output the fields.
Well, that is my overview of TokenIterator. I hope you enjoy using this class.
Note: When the sample is compiled with MSVC 6, two warnings result: one about
main not having a return statement, and one saying that a template resulted in an
identifier that was truncated to 255 characters in the debug information.
John Bandela
Copyright 2000 John R. Bandela