Splitting strings

Orjan Westin

10 Aug 2010

Splitting or tokenizing a string into substrings divided by a separator

Back in the dawn of time, when men were real men, bytes were real bytes, and floating point numbers were real, um, reals, the journeyman test of every aspiring programmer was to write their own text editor. (This was way before the concept of “life” had been invented, so no-one knew they were supposed to have one.)

Nowadays, we know better, and don’t write new code to solve problems that have already been solved. Well, unless we need an XML parser – everybody (including myself, but that’s a post for another time) has written one of those – or at least a string tokeniser (aka splitter).

Other languages get tokenisers for free (C# – String.Split, Java – String.split, Python – str.split, and so on, and even C has strtok), but not C++. Which is why it’s something almost every C++ programmer writes, at some point or other.

Of course, you can use the rather nifty boost::tokenizer, if the place where you work is okay with using Boost (a surprising number of places aren’t, for various reasons), or find one of the numerous example implementations out there. Like this one, for instance:

C++

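void tokenise_string(const std::string& str,
  const std::string& separator,
  std::vector<std::string>& tokens,
  bool empty /* = false */)
{
  const std::string::size_type strlength = str.length();
  const std::string::size_type seplength = separator.length();

  // An empty separator matches at every position and would never advance,
  // so treat the whole string as a single token instead of looping forever
  if (0 == seplength)
  {
    if (empty || 0 != strlength)
      tokens.push_back(str);
    return;
  }

  std::string::size_type prev = 0;
  std::string::size_type next = str.find(separator, prev);

  // Each pass copies the text between the previous separator and the next one
  while (std::string::npos != next)
  {
    if (empty || prev != next)
      tokens.push_back(str.substr(prev, next - prev));
    prev = next + seplength;
    next = str.find(separator, prev);
  }
  // Whatever follows the last separator is the final token
  if (empty || prev != strlength)
    tokens.push_back(str.substr(prev, strlength - prev));
}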

There’s not that much to say about this. Pass in a string to split up into tokens, what separator to look for, and an output parameter which will hold the tokens when we’re done. What makes this implementation slightly different from some others is that the separator is a std::string, and treated as such. Other implementations I’ve seen take a char (or even std::string::value_type) as a separator, or a string which is treated as a list of possible separators (like “.!?” to split a text into sentences).

I dislike the latter, as it’s ambiguous – is the separator used as a full string or as an array of characters? Rather, I’d prefer to make it explicit by overloading the function:

C++

void tokenise_string(const std::string& str,
  const std::vector<char>& separators,
  std::vector<std::string>& tokens,
  bool empty /* = false */)
{
  const std::string::size_type strlength = str.length();
  // Every separator is a single character
  const std::string::size_type seplength = 1;
  // find_first_of wants the candidate characters as a string
  const std::string sep(separators.begin(), separators.end());

  std::string::size_type prev = 0;
  std::string::size_type next = str.find_first_of(sep, prev);

  while (std::string::npos != next)
  {
    if (empty || prev != next)
      tokens.push_back(str.substr(prev, next - prev));
    prev = next + seplength;
    next = str.find_first_of(sep, prev);
  }
  if (empty || prev != strlength)
    tokens.push_back(str.substr(prev, strlength - prev));
}

However, there is a problem here, in that std::string can be implicitly created from a native array of characters, and std::vector can’t:

C++

std::vector<std::string> output;
std::string input = "What, me worry? Nah.";
// Null-terminated, so the implicit conversion to std::string below is well defined
char separators[] = {'.','?','\0'};

// Will call std::string separator version, which we probably don't intend
tokenise_string(input, separators, output);

// Must set up a vector explicitly (leaving out the terminator)
std::vector<char> sep_array(&separators[0], &separators[2]);

// Will call std::vector separator version
output.clear();
tokenise_string(input, sep_array, output);

For now, that is. C++0x (the standard that became C++11) adds an initializer_list constructor to std::vector, which will make things interesting here.
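As a rough sketch of what that could look like with a C++11 compiler: the vector of separators can be built in place from a braced list. The name split_any_of and the choice to return the tokens are my own assumptions for illustration, not part of the functions above. As far as I can tell, if both a std::string overload and a std::vector<char> overload exist, a bare {'.', '?'} argument would be ambiguous, since both types can be constructed from a braced list of chars, so the sketch names the vector type explicitly.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper (not from the article): splits on any one of the
// given characters and returns the tokens.
std::vector<std::string> split_any_of(const std::string& str,
  const std::vector<char>& separators,
  bool keep_empty = false)
{
  // find_first_of wants the candidate characters as a string
  const std::string sep(separators.begin(), separators.end());
  std::vector<std::string> tokens;

  std::string::size_type prev = 0;
  std::string::size_type next = str.find_first_of(sep, prev);

  while (std::string::npos != next)
  {
    if (keep_empty || prev != next)
      tokens.push_back(str.substr(prev, next - prev));
    prev = next + 1;  // every separator is a single character
    next = str.find_first_of(sep, prev);
  }
  if (keep_empty || prev != str.length())
    tokens.push_back(str.substr(prev));
  return tokens;
}

void example_initializer_list()
{
  const std::string input = "What, me worry? Nah.";
  // The vector is built in place from a braced list (requires C++11)
  const std::vector<std::string> tokens =
    split_any_of(input, std::vector<char>{'.', '?'});
  assert(tokens.size() == 2);
  assert(tokens[0] == "What, me worry");
  assert(tokens[1] == " Nah");
}
```

Note that the second token keeps its leading space, since only '.' and '?' are treated as separators.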

By the way, the benefit of treating a separator string as one single separator is, of course, that it lets us parse telegrams:

C++

std::vector<std::string> output;
std::string input = "NO TIME FOR WRENCHES STOP HAMMER TIME STOP";
std::string separator = "STOP";
tokenise_string(input, separator, output);
// Now output has two strings: "NO TIME FOR WRENCHES " and " HAMMER TIME "

I should probably mention, too, that the empty parameter lets us specify whether to include empty tokens in the output. In most cases, I don’t want to, but there are times it’s significant, if only to indicate whether the string started or ended with a separator.
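To make the effect of that flag concrete, here is a small sketch. The helper split_tokens is a hypothetical, condensed take on the string-separator loop that returns its tokens rather than filling an output parameter; the name, the keep_empty parameter and the guard against an empty separator are assumptions of mine.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: same splitting idea as above, returning the tokens.
std::vector<std::string> split_tokens(const std::string& str,
  const std::string& separator,
  bool keep_empty)
{
  std::vector<std::string> tokens;
  if (separator.empty())
  {
    // Nothing to split on: treat the whole string as one token
    if (keep_empty || !str.empty())
      tokens.push_back(str);
    return tokens;
  }

  std::string::size_type prev = 0;
  std::string::size_type next = str.find(separator, prev);

  while (std::string::npos != next)
  {
    if (keep_empty || prev != next)
      tokens.push_back(str.substr(prev, next - prev));
    prev = next + separator.length();
    next = str.find(separator, prev);
  }
  if (keep_empty || prev != str.length())
    tokens.push_back(str.substr(prev));
  return tokens;
}

void example_empty_flag()
{
  const std::string input = ",a,,b,";

  // Empty tokens dropped: "a", "b"
  const std::vector<std::string> compact = split_tokens(input, ",", false);
  assert(compact.size() == 2);
  assert(compact[0] == "a" && compact[1] == "b");

  // Empty tokens kept: "", "a", "", "b", ""
  const std::vector<std::string> full = split_tokens(input, ",", true);
  assert(full.size() == 5);
  assert(full.front().empty() && full.back().empty());
}
```

With empty tokens kept, the leading and trailing empty strings record that the input started and ended with a separator, and the one in the middle records the doubled comma.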

Finally, here’s a function you see implemented and talked about a lot less often than its counterpart. If you want to split, presumably you’ll also want to merge, at some point. While it’s a very simple function, I’ve found it handy to have it available, so the merging is consistently done:

C++

void merge_tokens(const std::vector<std::string> &tokens,
  const std::string& separator,
  std::string& output)
{
  if (!tokens.empty())
  {
    output = tokens.front();
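    // begin() + 1 rather than ++(begin()): incrementing a temporary
    // does not compile where the iterator is a plain pointer
    for (std::vector<std::string>::const_iterator i =
      tokens.begin() + 1;
      i != tokens.end(); ++i)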
    {
      output += separator + *i;
    }
  }
}

Here we see the difference the empty flag makes in a call, by the way:

C++

std::vector<std::string> split1, split2;
std::string input = "/usr/tmp";
std::string separator = "/";

tokenise_string(input, separator, split1);
tokenise_string(input, separator, split2, true);

std::string merged1, merged2;

merge_tokens(split1, separator, merged1);
merge_tokens(split2, separator, merged2);

assert(input != merged1);  // Initial / removed
assert(input == merged2);

License

This article, along with any associated source code and files, is licensed under The BSD License