This article is about a tokenizer function that provides a highly customizable way of breaking up strings. I wrote it because std::string doesn't supply methods to break up its contents efficiently, and I didn't want to use another class to do it. The function is implemented using only the methods provided by std::string.
For my CSV-like text file class, which I will present in another article, I needed a function that could break up strings into a series of tokens. After a search on Google for the term "tokenizer", the only useful thing I found was the boost::tokenizer class. After tinkering with it a bit, I decided to implement my own function because I didn't want to define types for the various TokenizerFunction models. However, I liked the features provided by the boost class and implemented some of them in my function.
- All of the delimiter, quote, and escape characters are fully customizable.
- Multiple characters can be specified for each delimiter group.
- Quote text to protect it from being tokenized.
- Escape single characters to protect them.
- Optionally keep delimiters as tokens.
To use the function, you just need to provide an input string, a vector that will receive the output, and the various delimiters. Optionally, you can pass quote and/or escape characters.
The defaults for the delimiters are the common CSV ones (Space, TAB, Comma, Colon, Semicolon). The default quotes are " and ', and the default escape character is the backslash (\). By default, no delimiter characters are preserved.
The function prototype:
void tokenize ( const string& str, vector<string>& result,
const string& delimiters, const string& delimiters_preserve,
const string& quote, const string& esc );
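Since everything after result has a default, the declaration in tokenizer.h presumably carries default arguments along these lines (a sketch based on the defaults described above; the exact values are in the header):
void tokenize ( const string& str, vector<string>& result,
                const string& delimiters = " \t,:;",
                const string& delimiters_preserve = "",
                const string& quote = "\"'",
                const string& esc = "\\" );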
- str: The input string. This is the original string that will be tokenized.
- result: The tokens. This vector holds all the generated tokens.
- delimiters: The delimiters which split the input string. Default: the common CSV ones (Space, TAB, Comma, Colon, Semicolon).
- delimiters_preserve: Delimiters which also split the input string, but which appear in the result as tokens of their own. No default characters.
- quote: The quote characters. Matching quote characters protect the enclosed text. Default: " and '
- esc: The escape characters. These protect single characters. Default: backslash (\)
Example:
#include <iostream>
#include <string>
#include <vector>
#include "tokenizer.h"

using namespace std;

int main ()
{
    // sample input; replace it with your own text
    string input = "one,two;three four";
    string delimiter = ",\t";     // plain delimiters: comma and TAB
    string keep_delim = ";:";     // delimiters that are kept as tokens
    string quote = "\'\"";        // quote characters
    string esc = "\\#";           // escape characters
    vector<string> tokens;

    tokenize ( input, tokens, delimiter, keep_delim, quote, esc );

    // print each token on its own line
    vector<string>::iterator token;
    for ( token = tokens.begin(); tokens.end() != token; ++token )
    {
        cout << *token << endl;
    }

    return 0;
}
By simply running the demo application, you will get the following output:
Demo application for the tokenizer function.
The tokens are in []:
This;string,is for demonstration.
[This]
[string]
[is]
[for]
[demonstration.]
Delimiters can be preserved: sqrt(17 * (20 + a))
[sqrt] [(] [17] [*] [(] [20] [+] [a] [)] [)]
"This;string;contains;quoted;text";and;escaped\;characters.
[This;string;contains;quoted;text]
[and]
[escaped;characters.]
You can also pass parameters to the demo application, or edit and use the included batch file:
TokenizerDemo filename [delimiters] [preserved delimiters]
[quote chars] [escape chars]
- All parameters are optional, but you cannot skip a parameter. E.g. if you don't want to provide quote chars but need the escape chars, you must pass an empty parameter in its place, like this: "" (see the example after this list).
- You must quote the space character if you want to use it as a delimiter. E.g. if you want to use comma, semicolon, space, and colon: ",; :".
- A " must be quoted, too, like this: """.
- Only the first 15 lines of a file will be processed.
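For example, to use comma, semicolon, space, and colon as plain delimiters, no preserved delimiters, no quote characters, and # as the escape character, the call could look like this (demo.txt is just a placeholder for your own file):
TokenizerDemo demo.txt ",; :" "" "" "#"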
Essentially, the string is iterated character by character, and each character is appended to the current token string. Every time a delimiter character is encountered, the token string is stored in the result and cleared for the next token. Along the way, special cases like quotes and escaped characters are handled.
The first part of the function clears the result vector, and initializes variables that hold the current position of the character in the string, the state of quotes, and the current token. The second part is the loop that performs the splitting, and the third part adds the remaining token, if there is one left, to the result.
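The first part isn't shown in the snippets below, but assuming the variable names that appear later in the walkthrough, it presumably looks roughly like this (a sketch, not the exact code from tokenizer.h):
result.clear();
string::size_type pos = 0;     // current position in the input string
char ch = 0;                   // the character being examined
char current_quote = 0;        // quote character that opened a quoted section
char delimiter = 0;            // preserved delimiter that should be emitted
bool quoted = false;           // inside a quoted section?
bool escaped = false;          // is the current character escaped?
bool token_complete = false;   // is the current token ready to be stored?
string token;                  // the token currently being built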
The loop:
For every character in the string
    Test if it is an escape character
        If yes, skip all other tests
    Test if it is a quote character
        If yes, skip all other tests
    Test if it is a delimiter
        Token is complete
    Test if it is a delimiter which should be preserved
        Token is complete
        Flag the delimiter to be added
    Append the character to the current token if it isn't a special one
    If the token is complete and not empty
        Add the token to the results
    If the delimiter is preserved
        Add it to the results
The loop iterates through the string character by character and performs several tests on each character to decide what to do with it. Before any test is done, it is assumed that the character isn't one of the special characters:
string::size_type len = str.length();
while ( len > pos ) {
    ch = str.at(pos);
    delimiter = 0;
    bool add_char = true;
    escaped = false;        // reset the per-character escape state
After extracting the character from the string, a check is done to see if it belongs to the group of escape characters. If it does, the position is advanced by one to get the next character, if there is at least one more left. No further tests are necessary, because an escaped character will be added to the current token regardless of what it is:
if ( string::npos != esc.find_first_of(ch) ) {
    ++pos;
    if ( pos < len ) {
        ch = str.at(pos);
        add_char = true;
    } else {
        add_char = false;
    }
    escaped = true;
}
Next, if the character belongs to the group of quote characters, a check is made to see if a quote is already open. If no quote is open, the open-quote state is set; if it is set and the character matches the opening quote, it is closed. While in the open-quote state, no delimiter checks are done, and any special character is added to the current token:
if ( false == escaped ) {
    if ( string::npos != quote.find_first_of(ch) ) {
        if ( false == quoted ) {
            quoted = true;
            current_quote = ch;
            add_char = false;
        } else if ( current_quote == ch ) {
            quoted = false;
            current_quote = 0;
            add_char = false;
        }
    }
}
If the character doesn't match one of the above groups, it is checked to see if it belongs to the group of delimiters. If it does, and the token string isn't empty, the token is flagged to be complete:
if ( false == escaped && false == quoted ) {
    if ( string::npos != delimiters.find_first_of(ch) ) {
        if ( false == token.empty() ) {
            token_complete = true;
        }
        add_char = false;
    }
}
...and if the delimiter should be preserved, it will be indicated by the add-delimiter flag:
bool add_delimiter = false;
if ( false == escaped && false == quoted ) {
    if ( string::npos != delimiters_preserve.find_first_of(ch) ) {
        if ( false == token.empty() ) {
            token_complete = true;
        }
        add_char = false;
        delimiter = ch;
        add_delimiter = true;
    }
}
If the character isn't a special character, it will be appended to the end of the token-string:
if ( true == add_char ) {
    token.push_back( ch );
}
If the token isn't empty and flagged to be complete, it is added to the results and reset for the next token:
if ( true == token_complete && false == token.empty() ) {
    result.push_back( token );
    token.clear();
    token_complete = false;
}
If the delimiter is flagged as a preserved one, it will be added to the results as a token:
if ( true == add_delimiter ) {
    string delim_token;
    delim_token.push_back( delimiter );
    result.push_back( delim_token );
}
When the loop finishes and the input string doesn't end with a delimiter, there may be a token left that hasn't been added yet, because the token-complete flag is only set in the delimiter tests; there could also be an unclosed quote. Whatever the reason, if the token buffer isn't empty, it is added to the results.
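Each loop iteration presumably ends by advancing pos (not shown in the snippets above); the final step after the loop only needs to flush that leftover token. A minimal sketch, using the same names as above:
if ( false == token.empty() ) {
    result.push_back( token );
}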
This is the second approach to the implementation. In the original function, I put all the special characters into one string and located the next one of them with the string::find_first_of method. This turned out to be unwieldy, because I had to double-check and handle exceptions like quotes and escaped characters.
After thinking about it for a few minutes, I decided to iterate through the string character by character in the function and check whether each character belongs to any of the special character groups. The difference between the two approaches is that in the first one, I have the positions (begin and end) of the substring to copy into the token string, while in the second one, I simply append characters to a token string and clear it every time a delimiter is found.
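For comparison, here is a minimal sketch of that first, position-based approach, limited to plain delimiters and ignoring quotes and escapes (the function name is made up for illustration; this is not the shipped code):
#include <string>
#include <vector>

using namespace std;

// first approach (sketch): locate token boundaries with find_first_of /
// find_first_not_of and copy each substring into the result
void tokenize_by_position ( const string& str, vector<string>& result,
                            const string& delimiters )
{
    result.clear();
    string::size_type begin = str.find_first_not_of( delimiters );
    while ( string::npos != begin ) {
        // npos as the end means "up to the end of the string"
        string::size_type end = str.find_first_of( delimiters, begin );
        result.push_back( str.substr( begin, end - begin ) );
        begin = str.find_first_not_of( delimiters, end );
    }
}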
I want to thank you, the reader, for your interest and feedback, and I want to thank the kind people who told me in the Lounge how to write a good introduction. However, I don't know if it turned out to be a good one.
I don't know whether the words 'tokenizer' and 'tokenizing' officially exist - for 'tokenized', the dictionary says something like 'translated to tokens' - but the meaning should have become clear anyhow ;)
Any unanswered questions? Feel free to ask. :)
The zlib/libpng license.
This software is provided 'as-is', without any express or implied warranty. In no event will the authors be held liable for any damages arising from the use of this software. Permission is granted to anyone to use this software for any purpose, including commercial applications, and to alter it and redistribute it freely, subject to the following restrictions:
- The origin of this software must not be misrepresented; you must not claim that you wrote the original software. If you use this software in a product, an acknowledgment in the product documentation would be appreciated but is not required.
- Altered source versions must be plainly marked as such, and must not be misrepresented as being the original software.
- This notice may not be removed or altered from any source distribution.
- 2006-02-10 - Initial version.
- 2006-03-05 - Bug fix and some minor article changes. Thanks Elias for pointing this one out.