Back in the dawn of time, when men were real men, bytes were real bytes, and floating point numbers were real, um, reals, the journeyman test of every aspiring programmer was to write their own text editor. (This was way before the concept of “life” had been invented, so no-one knew they were supposed to have one.)
Nowadays, we know better, and don’t write new code to solve problems that have already been solved. Well, unless we need an XML parser – everybody (including myself, but that’s a post for another time) has written one of those – or at least a string tokeniser (aka splitter).
Other languages get tokenisers for free (C# – String.Split, Java – String.split, Python – string.split, and so on, and even C has strtok), but not C++. Which is why it’s something almost every C++ programmer writes, at some point or other.
Of course, you can use the rather nifty boost::tokenizer, if the place where you work is okay with using Boost (a surprising number of places aren’t, for various reasons), or find one of the numerous example implementations out there. Like this one, for instance:
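    void tokenise_string(const std::string& str,
                         const std::string& separator,
                         std::vector<std::string>& tokens,
                         bool empty = false)
    {
        const std::string::size_type strlength = str.length();
        const std::string::size_type seplength = separator.length();
        if (0 == seplength)
        {
            // An empty separator matches at every position and would never
            // advance, so treat the whole string as a single token.
            if (empty || 0 != strlength)
                tokens.push_back(str);
            return;
        }
        std::string::size_type prev = 0;
        std::string::size_type next = str.find(separator, prev);
        while (std::string::npos != next)
        {
            // Skip zero-length tokens unless the caller asked for them.
            if (empty || prev != next)
                tokens.push_back(str.substr(prev, next - prev));
            prev = next + seplength;
            next = str.find(separator, prev);
        }
        // Whatever follows the last separator is the final token.
        if (empty || prev != strlength)
            tokens.push_back(str.substr(prev, strlength - prev));
    }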
There’s not that much to say about this. Pass in a string to split up into tokens, the separator to look for, and an output parameter which will hold the tokens when we’re done. What makes this implementation slightly different from some others is that the separator is a std::string, and treated as such. Other implementations I’ve seen take a char (or even std::string::value_type) as a separator, or a string which is treated as a list of possible separators (like “.!?” to split a text into sentences).
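To make the difference between those two conventions concrete, here is a minimal sketch. The helper names split_on_string and split_on_any are made up for this illustration, and both always keep empty tokens, so the contrast is easy to see.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: the separator is matched as one whole string.
std::vector<std::string> split_on_string(const std::string& str,
                                         const std::string& separator)
{
    std::vector<std::string> tokens;
    if (separator.empty())   // nothing to search for: one token
    {
        tokens.push_back(str);
        return tokens;
    }
    std::string::size_type prev = 0;
    std::string::size_type next = str.find(separator, prev);
    while (std::string::npos != next)
    {
        tokens.push_back(str.substr(prev, next - prev));
        prev = next + separator.length();
        next = str.find(separator, prev);
    }
    tokens.push_back(str.substr(prev));
    return tokens;
}

// Hypothetical helper: any single character in `chars` is a separator.
std::vector<std::string> split_on_any(const std::string& str,
                                      const std::string& chars)
{
    std::vector<std::string> tokens;
    std::string::size_type prev = 0;
    std::string::size_type next = str.find_first_of(chars, prev);
    while (std::string::npos != next)
    {
        tokens.push_back(str.substr(prev, next - prev));
        prev = next + 1;   // a matched separator is always one character
        next = str.find_first_of(chars, prev);
    }
    tokens.push_back(str.substr(prev));
    return tokens;
}
```

Given "a, b,c" and the separator ", ", the first helper only splits where a comma is followed by a space, while the second splits at every comma and every space, which is exactly why the same argument can mean two quite different things.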
I dislike passing a list of separators as a plain string, as it’s ambiguous – is the separator used as a full string or as an array of characters? Rather, I’d prefer to make it explicit by overloading the function:
    void tokenise_string(const std::string& str,
                         const std::vector<char>& separators,
                         std::vector<std::string>& tokens,
                         bool empty = false)
    {
        const std::string::size_type strlength = str.length();
        // Each separator is a single character, so always skip exactly one.
        const std::string::size_type seplength = 1;
        // find_first_of() wants the candidate characters as a string.
        const std::string sep(separators.begin(), separators.end());
        std::string::size_type prev = 0;
        std::string::size_type next = str.find_first_of(sep, prev);
        while (std::string::npos != next)
        {
            if (empty || prev != next)
                tokens.push_back(str.substr(prev, next - prev));
            prev = next + seplength;
            next = str.find_first_of(sep, prev);
        }
        if (empty || prev != strlength)
            tokens.push_back(str.substr(prev, strlength - prev));
    }
However, there is a problem here, in that std::string can be implicitly created from a native array of characters, and std::vector can’t:
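    std::vector<std::string> output;
    std::string input = "What, me worry? Nah.";
    char separators[] = {'.','?'};
    // Compiles, but quietly picks the std::string overload: the array
    // decays to a char pointer (and isn't even null-terminated here).
    tokenise_string(input, separators, output);
    // To get the list-of-characters overload, build the vector by hand.
    std::vector<char> sep_array(&separators[0], &separators[2]);
    tokenise_string(input, sep_array, output);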
For now, that is. C++1x will have an initializer_list constructor for std::vector, which will make things interesting here.
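As a rough idea of what that could look like, here is a minimal sketch assuming a C++11 (or later) compiler. The tokenise_string below is a compact stand-in for the list-of-characters overload, and split_sentences is a hypothetical wrapper added only for the demonstration.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Compact stand-in for the list-of-characters overload.
void tokenise_string(const std::string& str,
                     const std::vector<char>& separators,
                     std::vector<std::string>& tokens,
                     bool empty = false)
{
    const std::string sep(separators.begin(), separators.end());
    std::string::size_type prev = 0;
    std::string::size_type next = str.find_first_of(sep, prev);
    while (std::string::npos != next)
    {
        if (empty || prev != next)
            tokens.push_back(str.substr(prev, next - prev));
        prev = next + 1;
        next = str.find_first_of(sep, prev);
    }
    if (empty || prev != str.length())
        tokens.push_back(str.substr(prev));
}

// Hypothetical wrapper: the separators are written in place with a
// braced list, so no intermediate array or iterator pair is needed.
std::vector<std::string> split_sentences(const std::string& input)
{
    std::vector<std::string> output;
    tokenise_string(input, std::vector<char>{'.', '?'}, output);
    return output;
}
```

The type is spelled out on purpose. With both overloads in scope, I’d expect a bare {'.', '?'} to be ambiguous, because std::string gains an initializer_list<char> constructor as well. Interesting, as promised.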
By the way, the benefit of treating a separator string as one single separator is, of course, that it lets us parse telegrams:
    std::vector<std::string> output;
    std::string input = "NO TIME FOR WRENCHES STOP HAMMER TIME STOP";
    std::string separator = "STOP";
    tokenise_string(input, separator, output);
I should probably mention, too, that the empty parameter lets us specify whether to include empty tokens in the output. In most cases, I don’t want to, but there are times it’s significant, if only to indicate whether the string started or ended with a separator.
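A small sketch of what the flag changes, assuming it defaults to false. The split wrapper is a hypothetical convenience for returning the tokens by value, and the guard against an empty separator is an addition for safety rather than part of the basic algorithm.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Compact restatement of the whole-string splitter.
void tokenise_string(const std::string& str,
                     const std::string& separator,
                     std::vector<std::string>& tokens,
                     bool empty = false)
{
    const std::string::size_type strlength = str.length();
    const std::string::size_type seplength = separator.length();
    if (0 == seplength)
    {
        // Guard (an addition): an empty separator would never advance.
        if (empty || 0 != strlength)
            tokens.push_back(str);
        return;
    }
    std::string::size_type prev = 0;
    std::string::size_type next = str.find(separator, prev);
    while (std::string::npos != next)
    {
        if (empty || prev != next)
            tokens.push_back(str.substr(prev, next - prev));
        prev = next + seplength;
        next = str.find(separator, prev);
    }
    if (empty || prev != strlength)
        tokens.push_back(str.substr(prev, strlength - prev));
}

// Hypothetical convenience wrapper returning the tokens by value.
std::vector<std::string> split(const std::string& str,
                               const std::string& separator,
                               bool empty)
{
    std::vector<std::string> tokens;
    tokenise_string(str, separator, tokens, empty);
    return tokens;
}
```

For the input ",a,,b," split on ",", the flag decides between two tokens ("a" and "b") and five (an empty one before "a", one between the doubled commas, and one after the trailing comma).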
Finally, here’s a function you see implemented and talked about a lot less often than its counterpart. If you want to split, presumably you’ll also want to merge, at some point. While it’s a very simple function, I’ve found it handy to have it available, so the merging is consistently done:
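    void merge_tokens(const std::vector<std::string>& tokens,
                      const std::string& separator,
                      std::string& output)
    {
        // Join the tokens, putting the separator between (not after) them.
        if (!tokens.empty())
        {
            output = tokens.front();
            // begin() + 1 rather than ++(begin()): incrementing a temporary
            // fails to compile where the iterator is a plain pointer.
            for (std::vector<std::string>::const_iterator i =
                     tokens.begin() + 1;
                 i != tokens.end(); ++i)
            {
                output += separator + *i;
            }
        }
    }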
Here we see the difference the empty flag makes in a call, by the way:
    std::vector<std::string> split1, split2;
    std::string input = "/usr/tmp";
    std::string separator = "/";
    tokenise_string(input, separator, split1);
    tokenise_string(input, separator, split2, true);
    std::string merged1, merged2;
    merge_tokens(split1, separator, merged1);
    merge_tokens(split2, separator, merged2);
    assert(input != merged1);
    assert(input == merged2);
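The same check can be wrapped up as a round-trip test. This is only a sketch: round_trips is a hypothetical helper, the two functions are compact restatements of the ones in this post, and the guard against an empty separator is an addition.

```cpp
#include <cassert>
#include <string>
#include <vector>

void tokenise_string(const std::string& str,
                     const std::string& separator,
                     std::vector<std::string>& tokens,
                     bool empty = false)
{
    if (separator.empty())   // guard (an addition): nothing to split on
    {
        if (empty || !str.empty())
            tokens.push_back(str);
        return;
    }
    std::string::size_type prev = 0;
    std::string::size_type next = str.find(separator, prev);
    while (std::string::npos != next)
    {
        if (empty || prev != next)
            tokens.push_back(str.substr(prev, next - prev));
        prev = next + separator.length();
        next = str.find(separator, prev);
    }
    if (empty || prev != str.length())
        tokens.push_back(str.substr(prev));
}

void merge_tokens(const std::vector<std::string>& tokens,
                  const std::string& separator,
                  std::string& output)
{
    output.clear();
    for (std::vector<std::string>::size_type i = 0; i < tokens.size(); ++i)
    {
        if (i != 0)
            output += separator;   // separator goes between tokens only
        output += tokens[i];
    }
}

// Hypothetical helper: does split-then-merge give back the input?
bool round_trips(const std::string& input,
                 const std::string& separator,
                 bool empty)
{
    std::vector<std::string> tokens;
    tokenise_string(input, separator, tokens, empty);
    std::string merged;
    merge_tokens(tokens, separator, merged);
    return merged == input;
}
```

With empty tokens kept, every input I tried comes back unchanged, including ones that start or end with the separator. Without them, leading, trailing, and doubled separators are lost on the way.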