The C++ source files for the string tokenisers discussed in this post and the Splitting strings post, plus the code for Removing whitespace and Static assert in C++, can be found here.
One of the more curious omissions from the C++ standard library is a string
splitter, e.g., a function that can take a string
and split it up into its constituent parts, or tokens, based on some delimiter. There is one in other popular languages ((C# – String.Split
, Java – String.split
, Python – string.split
, etc.), but C++ programmers are left to roll their own, or use one from a third-party library like the boost::tokenizer
(or the one I presented in Splitting strings).
There are many ways of doing this; the Stack Overflow question How do I tokenize a string in C++? has 23 answers at the time of writing, and those contain 20 different solutions (boost::tokenizer
and strtok
are suggested multiple times).
The strtok
recommendations, however, all have comments pointing out the problems with this function – it’s destructive, and not reentrant (it can’t be nested or run in parallel on multiple string
s). As functions go, strtok
has a rather poor reputation – there’s even a popular reentrant version, strtok_r
, available in many C library implementations, though it’s not a standard function.
So it’s a good thing that there are so many other ways of splitting string
s, isn’t it? Well, yes, but the clever thing with strtok
, which you don’t get from any of the string
splitters, not even the flexibility-obsessed Boost, is that you can change the token separator as you go along. The splitters often give the option to provide a selection of possible delimiters like this:
char punct = " ,.?!;:-";
std::string text = "Silly, right? This is - obviously - an example: Boo!";
std::vector<std::string> tokens;
std::vector<char> separators(punct, punct + strlen(punct));
tokenise_string(text, separators, tokens);
But what if there are characters that are both legal inside some tokens, and separators for others? What if you have to parse string
s with the format [Last name],[First name] – [Profession] – [Age] like this table:
Bradshawe, Adam - Colonel, retired - 73
Burton-West,Jenny - Surgeon - 37
Smith,Ben - Taxi driver - 56
While a string
splitter would stumble over this – since both space, hyphen and comma can be both delimiters and valid content – old strtok
doesn’t even raise an eyebrow:
const char* table = "Jones, Adam - Colonel, retired - 73\n"
"Burton-West,Jenny - Surgeon - 37\n"
"Smith,Ben - Taxi driver - 56";
char* changeable = (char*) malloc(strlen(table)); strcpy(changeable, table);
char *last, *first, *profession, *age;
last = strtok(changeable, " ,");
while (last)
{
first = strtok(NULL, " -");
profession = strtok(NULL, "-");
int proflen = strlen(profession);
if (profession[proflen - 1] == ' ')
profession[proflen - 1] = '\0';
age = strtok(NULL, " -\n");
last = strtok(NULL, " ,");
}
free(changeable);
In other words, strtok
offers unique functionality not found in the string
splitters. So let’s design a C++ version that is efficient, non-destructive, and reentrant. Thanks to the object-orientation support of C++, we can let each tokeniser have a const
reference to the string
we’re tokenising. This gives us both reentrancy and non-destructiveness. And while we’re at it, let’s have a flag to decide whether we should include empty tokens or not – something quite useful that’s missing from strtok
.
class string_tokeniser
{
const std::string& source_;
bool empty_;
std::string::size_type current_;
std::string::difference_type length_;
std::string::size_type next_;
public:
string_tokeniser(const std::string& source, bool empty = false);
...
We’ll also need variables to keep track of where we are, where to start looking for next token, and so on, which I’ve also included above.
Now, what do we want to do with this? Well, we actually want to do two distinct tasks – advance to the next token, and extract a token. In strtok
, those are done at the same step, but since we’re not constrained by the limitations of C, it’s better to keep it tidy. Like I did in my string splitter, I’ll overload the advancing function to allow the user to give a single character, a selection of characters, or a whole string
as a delimiter.
...
bool next(const std::string::value_type& separator);
bool next(const std::vector<char>& separators);
bool next(const std::string& separator);
...
These all return true
if a new token was found in the string
. We’ll also need a way of accessing the tokens found, and some housekeeping:
...
bool get_token(std::string& token) const;
std::string get_token() const;
void reset();
bool has_token() const;
bool at_end() const;
const std::string& get_source() const;
};
That’s the interface, let’s do some implementation. The way this will work is that we keep hold of a current location and token length, which are used to retrieve the token using std::string::substr
, while also keeping the next location in which to start the search for a token. This will have to be next_ = current_ + length_ + length of delimiter
, so the next search does not pick up the last delimiter.
When the string_tokeniser
is first created, we have no search results, so need to initialize appropriately. The same values are set on a reset, and used to get the current token and check status:
string_tokeniser::string_tokeniser(const std::string& source, bool empty )
: source_(source)
, current_(std::string::npos)
, length_(0)
, next_(source.empty() ? std::string::npos : 0)
, empty_(empty)
{}
void string_tokeniser::reset()
{
current_ = std::string::npos;
next_ = source_.empty() ? std::string::npos : 0;
length_ = 0;
}
bool string_tokeniser::has_token() const
{
return (std::string::npos != current_);
}
bool string_tokeniser::at_end() const
{
return (std::string::npos == next_);
}
const std::string& string_tokeniser::get_source() const
{
return source_;
}
bool string_tokeniser::get_token(std::string& token) const
{
if (!has_token())
return false;
if (0 == length_)
token.clear();
else
token = source_.substr(current_, length_);
return true;
}
std::string string_tokeniser::get_token() const
{
if (!has_token() || (0 == length_))
return std::string();
return source_.substr(current_, length_);
}
Right, that just leaves the implementation of the key function: next()
. Since there are three overloads, with almost identical implementation, the sensible thing is to break out most of the common stuff into a helper function. Unfortunately, we can’t easily do that, since if we do not care about empty tokens, we have to recurse and try to find the next, in the case of repeated delimiters, which means it will have to be aware of which overload to chose. Instead, we’ll break out the common handling into a template function:
In class declaration:
...
template <typename T>
bool handle_next(size_t advance, const T& separator);
public:
...
Implementation:
bool string_tokeniser::next(const std::string::value_type& separator)
{
current_ = next_;
if (at_end())
return false;
next_ = source_.find(separator, current_);
return handle_next(1, separator);
}
bool string_tokeniser::next(const std::string& separator)
{
current_ = next_;
if (at_end())
return false;
next_ = source_.find(separator, current_);
return handle_next(separator.size(), separator);
}
bool string_tokeniser::next(const std::vector<char>& separators)
{
current_ = next_;
if (at_end())
return false;
next_ = source_.find_first_of(&separators[0], current_, separators.size());
return handle_next(1, separators);
}
template <typename T>
bool string_tokeniser::handle_next(size_t advance, const T& separator)
{
if (std::string::npos == next_)
{
length_ = source_.size() - current_;
}
else
{
length_ = next_ - current_;
next_ += advance;
if ((0 == length_) && !empty_)
{
return next(separator);
}
}
if (0 < length_)
return true;
if (!empty_ || (std::string::npos == next_))
{
current_ = next_;
return false;
}
return true;
}
There, all done. And because we can use both single characters, a selection of characters, and string
s as delimiters, the equivalent of the strtok
example avoids the need to trim spaces:
std::string table = "Jones, Adam - Colonel, retired - 73\n"
"Burton-West,Jenny - Surgeon - 37\n"
"Smith,Ben - Taxi driver - 56";
std::string last, first, profession, age;
std::vector<char> comma_space;
comma_space.push_back(' ');
comma_space.push_back(',');
std::string sp_dash_sp(" - ");
char endl('\n');
string_tokeniser tok(table);
while (!tok.at_end())
{
tok.next(comma_space);
tok.get_token(last);
tok.next(sp_dash_sp);
tok.get_token(first);
tok.next(sp_dash_sp);
tok.get_token(profession);
tok.next(endl);
tok.get_token(age);
}
Because it is non-destructive, it is by necessity less efficient than strtok
, since the tokens have to be copied. On the other hand, if you have to copy the const char*
to a char*
buffer to use strtok
, maybe the efficiency loss isn’t that bad.
As always, if you found this interesting or useful, or have suggestions for improvements, please let me know.
Update: Jens Ayton has informed me that C11 introduced strtok_s
, which is even safer, but not, I believe, incorporated into C++11 .