Introduction
Interpreting strings and extracting information from them has always been necessary, and it usually means writing complex-looking code and logic.
Many things can be handled with tokenizers; however, these leave little room for variants of a string or for separators that vary from case to case.
Regular expressions allow more complex processing, but it is a pain to create them, debug them and understand them a few months later.
This article shows how to use a very small class that provides a "natural" way of dealing with complex extraction patterns.
Note: The article and sources have been updated to fix the problems of the first version.
Background
The best way to demonstrate the need is to provide two simple examples:
Example 1
We need to process strings from a log file like the following:
------------------snip------------------
Process: Tsk Mgr.EXE - Start Date: 2008-20-01 Duration: 00:01:54 - Description: Task Manager
Process: EXPLORER.EXE - Start Date: 2008-20-01 Duration: 00:01:54 - Description: Explorer
End-of-day
Process: Tsk Mgr.EXE - Start Date: 2008-21-01 Duration: 00:00:12 - Description: Task Manager
...
------------------snap------------------
The parts of interest are the process name, the start date and the duration. We need to extract them.
The log file lines look similar. The first problem is that there are lines we don't need, like "End-of-day" or empty lines.
The second problem is that we cannot really use standard tokenizers, as there is no usable separation character. A space cannot be used, as the process name "Tsk Mgr.EXE" contains a space itself, which would shift all the following tokens.
The colon (:) *could* be used; however, we would then have to remove the " - Start Date" from the process name, and the duration (00:01:54) contains colons itself, although we would like to get it in one piece.
If we had to explain to a human how to extract the values, we would tell them to take the part after "Process:" up to the dash "-" before "Start Date:", then the part after "Start Date:" up to "Duration:", and finally the part after "Duration:" up to the dash before "Description:".
Written in a "masked" way, the string processing format looks like:
Process: #### - Start Date: #### Duration: #### - Description: Task Manager
What we need to parse in our application are the "####" parts. And what would be more natural than being able to enter exactly this mask pattern in our application? To access the interesting parts later, we should be able to give them variable names, marked in the mask string by surrounding percentage signs (%). So our mask string would look like
Process: %proc% - Start Date: %date% Duration: %duration% - Description: %desc%
Ideally, we would run each line of the log file through this mask and, if it succeeds, query the variables "proc", "date" and "duration" and ignore "desc". If it fails, i.e. if the input string does not match the mask format, we would just continue with the next log line.
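The matching idea described above can be sketched in a few lines of standard C++. This is only an illustration under stated assumptions, not the article's CStringSplitter: the function name matchSimpleMask is hypothetical, it uses std::string instead of CString, the '%' marker is hard-coded, and it neither supports escaping nor auto-terminates an unclosed variable.

```cpp
#include <map>
#include <string>

// Hypothetical helper (not part of the article's sources): matches `line`
// against `mask`, where %name% marks a variable and everything else is fixed
// text. On success the captured values are stored in `values`.
bool matchSimpleMask(const std::string& line, const std::string& mask,
                     std::map<std::string, std::string>& values)
{
    std::size_t linePos = 0;   // next unread character of the input line
    std::size_t maskPos = 0;   // next unread character of the mask
    std::string pendingVar;    // variable whose value has not been closed yet
    bool hasPending = false;
    values.clear();

    for (;;) {
        // The fixed text runs up to the next '%' or to the end of the mask.
        const std::size_t open = mask.find('%', maskPos);
        const std::string fixed = mask.substr(
            maskPos, open == std::string::npos ? std::string::npos : open - maskPos);

        if (!hasPending) {
            // Leading fixed text must appear at the very start of the line.
            if (line.compare(linePos, fixed.size(), fixed) != 0) return false;
            linePos += fixed.size();
        } else if (fixed.empty()) {
            if (open != std::string::npos) return false;  // two adjacent variables
            values[pendingVar] = line.substr(linePos);    // last variable takes the rest
            linePos = line.size();
        } else {
            // The pending variable ends where the next fixed text begins.
            const std::size_t found = line.find(fixed, linePos);
            if (found == std::string::npos) return false;
            values[pendingVar] = line.substr(linePos, found - linePos);
            linePos = found + fixed.size();
        }

        if (open == std::string::npos) break;             // mask fully consumed

        const std::size_t close = mask.find('%', open + 1);
        if (close == std::string::npos) return false;     // this sketch rejects unclosed names
        pendingVar = mask.substr(open + 1, close - open - 1);
        hasPending = true;
        maskPos = close + 1;
    }
    return linePos == line.size();                        // no unmatched text may remain
}
```

Called with a "Process: ..." log line and the mask shown above, the sketch fills "proc", "date", "duration" and "desc"; for "End-of-day" or an empty line it returns false, so the caller simply moves on to the next line.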
Example 2
We need to extract version information from strings like:
<softwareA> v1.4
<softwareB> v5
<softwareC> v1.3.1 Beta 5
<softwareD> v8.4.87.405 Alpha
The only thing these strings have in common is the space and the lowercase "v" right before the version number. Again, we are only interested in part of each string: the plain version numbers, without the postfixed "Alpha" or "Beta 5".
In this case, we would have two different masks: one with additional text after the version number and one without. The first mask would look like:
%software% v%version% %postfix%
where we have only two fixed elements:
- the " v" and
- the space between the version and the postfix
The second mask has no postfix:
%software% v%version%
We would first check the mask with the postfix and, if it fails (because there is no postfix), the one without.
(Checking only the second mask would always succeed, but the version would then include the words "Beta" and "Alpha" as well, which we don't want.)
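The "most specific mask first" order can be captured in a small helper. The following is a sketch under assumptions: firstMatchingMask and the MaskMatcher callback type are hypothetical names, and the callback stands in for whatever matcher is used (for example a thin wrapper around the class's matchMask method).

```cpp
#include <functional>
#include <string>
#include <vector>

// Signature of any mask matcher: returns true if `line` fits `mask`.
// Hypothetical type, introduced only for this illustration.
using MaskMatcher =
    std::function<bool(const std::string& line, const std::string& mask)>;

// Tries the masks in the given order and returns the index of the first one
// that matches, or -1 if none does. Put the most specific mask first.
int firstMatchingMask(const std::string& line,
                      const std::vector<std::string>& masks,
                      const MaskMatcher& matches)
{
    for (std::size_t i = 0; i < masks.size(); ++i) {
        if (matches(line, masks[i])) {
            return static_cast<int>(i);
        }
    }
    return -1;
}
```

For the version strings, the mask list would hold "%software% v%version% %postfix%" first and "%software% v%version%" second, so the shorter mask is only tried when no postfix is present.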
Handling strings this way makes it much easier to adapt to many different tasks without having to reprogram the post-tokenizer logic that is usually involved when extracting data from strings.
The idea originally comes from a very good MP3 tag editor, which extracts data such as artist and track name from the variously formatted filenames on freedb.org by using placeholder variables in the way described above. This quickly attracted me, as it is very easy to understand. Many times I wished I had access to such a function, and I never found anything similar, so I decided to code it myself.
Using the code
The string splitter contains three classes: the main class CStringSplitter, which is used by the application code, and the two helper classes CSearchStringChar and CSearchStringStr, which the main class uses to parse the mask and the string.
The usage of the string splitter is demonstrated in this small piece of code.
In the example, we use the percentage symbol '%' to indicate the start and end of a variable name:
#include "StringSplitter.h"

// '%' marks both the start and the end of a variable name
CStringSplitter Split( _T('%'), _T('%') );
CString strLogLine,
        strValue,
        strMask = _T("Process: %proc% - Start Date: %date% Duration: %duration% - Description: %desc%");

while ( !isEndOfFile() )
{
    strLogLine = getNextLine();
    // Lines that do not fit the mask (e.g. "End-of-day") are skipped
    if ( Split.matchMask( strLogLine, strMask ) )
    {
        if ( Split.getValue( strValue, _T("proc") ) )
        {
            ...
        }
        if ( Split.getValue( strValue, _T("date") ) )
        {
            ...
        }
        if ( Split.getValue( strValue, _T("duration") ) )
        {
            ...
        }
    }
}
When using the default opening and closing parentheses to indicate the placeholders, the mask in the example above would be:
Process: (proc) - Start Date: (date) Duration: (duration) - Description: (desc)
The CStringSplitter class
The following functionality is provided by the class to process a string.
Configuring the start and end characters of a variable part in the mask
The CStringSplitter class can be constructed with two optional parameters that specify the start and end characters for the variable names. These default to the opening and closing parentheses '(' and ')'.
The opening and closing characters can be the same (e.g. '%' as in the code example above).
Parsing a source string against a mask
The method matchMask checks an input line against the mask. It returns true if the processing was successful and false if not, i.e. if the mask does not fit the input string or if the mask contains syntax errors.
Querying the values of the placeholders
After the source string has been processed successfully, the method getValue can be called to get the contents of the requested variable. Its first parameter is a string reference that receives the value of the requested variable, if it exists; the second parameter is the variable name. The return value is true if the variable is found or false if it does not exist.
Note: Variable names are treated as case-insensitive! This can be changed in the getValue method by using the strcmp function instead of the stricmp function.
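To illustrate the case-insensitive lookup in portable terms (stricmp is not part of standard C++), here is a small sketch. The names equalsIgnoreCase and lookupValue and the vector-of-pairs storage are assumptions made for this example; the class's internal storage is not described in the article.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <utility>
#include <vector>

// Compares two ASCII strings without regard to case (stand-in for stricmp).
bool equalsIgnoreCase(const std::string& a, const std::string& b)
{
    return a.size() == b.size() &&
           std::equal(a.begin(), a.end(), b.begin(),
                      [](unsigned char x, unsigned char y) {
                          return std::tolower(x) == std::tolower(y);
                      });
}

// Hypothetical lookup in the style of getValue: `value` is only written
// when the variable exists, and the return value reports whether it did.
bool lookupValue(std::string& value, const std::string& name,
                 const std::vector<std::pair<std::string, std::string>>& variables)
{
    for (const auto& entry : variables) {
        if (equalsIgnoreCase(entry.first, name)) {
            value = entry.second;
            return true;
        }
    }
    return false;
}
```

Replacing equalsIgnoreCase with a plain `==` comparison would make the lookup case-sensitive, which mirrors the strcmp / stricmp switch mentioned in the note.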
Processing multiple lines with the same mask
When thousands of lines have to be processed with the same mask, the mask needs to be pre-parsed only once. This makes the string matching even faster.
The first step is to set up the mask by calling the setMask method with the mask string as the only parameter (which corresponds to the second parameter of matchMask). It returns true if the mask is valid or false if not.
The strings can then be matched in a loop by calling the matchLastMask method with the string to be matched as the only parameter (which corresponds to the first parameter of matchMask). It returns true if the processing was successful and false if not, i.e. if the mask does not fit the input string.
After success, the placeholder values can be retrieved as described above.
Here's the earlier example with the necessary changes: the call to setMask before the loop, and matchLastMask in place of matchMask inside it.
CStringSplitter Split( _T('%'), _T('%') );
CString strLogLine,
        strValue,
        strMask = _T("Process: %proc% - Start Date: %date% Duration: %duration% - Description: %desc%");

// Pre-parse the mask once, outside the loop
Split.setMask( strMask );
while ( !isEndOfFile() )
{
    strLogLine = getNextLine();
    // Only the input line is passed; the stored mask is reused
    if ( Split.matchLastMask( strLogLine ) )
    {
        if ( Split.getValue( strValue, _T("proc") ) )
        {
            ...
        }
        ...
    }
}
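Why pre-parsing pays off can be shown with a two-phase sketch in standard C++. It is an assumption-laden illustration, not the class's actual implementation: PreparsedMask, prepare and match are hypothetical names that only play the roles of setMask and matchLastMask, '%' is hard-coded, and escaping is left out.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical two-phase matcher: prepare() splits the mask once,
// match() reuses the stored pieces for every input line.
class PreparsedMask {
public:
    // Splits the mask into fixed texts and variable names.
    // Returns false (and keeps the previous mask) if a '%' is not closed.
    bool prepare(const std::string& mask)
    {
        std::vector<std::string> fixed, names;
        std::size_t pos = 0;
        for (;;) {
            const std::size_t open = mask.find('%', pos);
            fixed.push_back(mask.substr(
                pos, open == std::string::npos ? std::string::npos : open - pos));
            if (open == std::string::npos) break;
            const std::size_t close = mask.find('%', open + 1);
            if (close == std::string::npos) return false;
            names.push_back(mask.substr(open + 1, close - open - 1));
            pos = close + 1;
        }
        fixed_.swap(fixed);
        names_.swap(names);
        return true;
    }

    // Matches one line against the stored mask; no mask parsing happens here.
    bool match(const std::string& line,
               std::map<std::string, std::string>& values) const
    {
        if (fixed_.empty()) return false;                  // prepare() not called yet
        if (line.compare(0, fixed_[0].size(), fixed_[0]) != 0) return false;
        std::size_t pos = fixed_[0].size();
        values.clear();
        for (std::size_t i = 0; i < names_.size(); ++i) {
            const std::string& next = fixed_[i + 1];       // text that ends variable i
            std::size_t end = line.size();                 // last variable takes the rest
            if (next.empty()) {
                if (i + 1 < names_.size()) return false;   // two adjacent variables
            } else {
                end = line.find(next, pos);
                if (end == std::string::npos) return false;
            }
            values[names_[i]] = line.substr(pos, end - pos);
            pos = end + next.size();
        }
        return pos == line.size();
    }

private:
    std::vector<std::string> fixed_;  // names_.size() + 1 entries once prepared
    std::vector<std::string> names_;
};
```

All searching for marker characters happens in prepare(); the per-line work in match() is reduced to a handful of find and substr calls on the input string.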
Validity of Masks and Syntax
The parser tolerates typing errors in the mask; however, a few syntax rules must be followed to get the expected result.
1. Two variables may not follow each other
At least one fixed character must separate them:
Parsing any string using the mask "(part1)(part2)" cannot, from a logical point of view, yield any usable results, as there is no separation between part1 and part2.
2. Using a variable's start character in fixed text
Given the following input string:
"Humidity %89"
When '%' is used as the variable start and end character, any doubled appearance of it in the fixed text section is interpreted as one occurrence of the character.
To handle the '%' sign in the fixed text, the mask string must be:
"Humidity %%%HUM%"
Note that there are three (3) sequential percentage signs:
The first two percentage signs belong to the fixed text; any doubled start character is interpreted as a single one, which gives "Humidity %".
The remaining portion, %HUM%, is the variable: one '%' as the start character, "HUM" as the variable name and the next '%' as the end character.
To prevent annoyances, the end character is not allowed in the variable name, even if it is doubled.
3. Improper ending
A variable that is not terminated at the end of the mask string is terminated automatically.
4. Mask validation (e.g. of user input)
To use this class with user input fields, the easiest way to check the validity of a mask is to simply call setMask with the string. If it returns true, the mask can be used (however, it cannot be guaranteed that the results are what the user intended - Microsoft's PSI-API is not yet ready for public use ;-)).
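The first three rules can be illustrated with a small mask tokenizer. This is a sketch under assumptions, not the parser shipped with the article: MaskToken and tokenizeMask are hypothetical names, a single `marker` character serves as both start and end character, and rejecting adjacent variables with a false return value is this sketch's choice (the article does not state how the class reports that case).

```cpp
#include <string>
#include <vector>

// One piece of a parsed mask (hypothetical type for this illustration).
struct MaskToken {
    bool isVariable;   // true: text is a variable name, false: fixed text
    std::string text;
};

// Splits `mask` into fixed-text and variable tokens.
// Returns false if two variables follow each other directly (rule 1).
bool tokenizeMask(const std::string& mask, char marker,
                  std::vector<MaskToken>& tokens)
{
    tokens.clear();
    std::string fixed;                       // fixed text collected so far
    std::size_t i = 0;
    while (i < mask.size()) {
        if (mask[i] != marker) {             // ordinary fixed character
            fixed += mask[i++];
            continue;
        }
        if (i + 1 < mask.size() && mask[i + 1] == marker) {
            fixed += marker;                 // rule 2: doubled marker is a literal
            i += 2;
            continue;
        }
        // A single marker starts a variable name. Fixed tokens are only
        // flushed right before a variable, so an empty `fixed` with existing
        // tokens means the previous token was a variable.
        if (fixed.empty() && !tokens.empty()) return false;
        if (!fixed.empty()) {
            tokens.push_back({false, fixed});
            fixed.clear();
        }
        const std::size_t close = mask.find(marker, i + 1);
        // Rule 3: a missing end character terminates the name at the mask's end.
        const std::size_t nameEnd =
            (close == std::string::npos) ? mask.size() : close;
        tokens.push_back({true, mask.substr(i + 1, nameEnd - i - 1)});
        i = (close == std::string::npos) ? mask.size() : close + 1;
    }
    if (!fixed.empty()) tokens.push_back({false, fixed});
    return true;
}
```

For the mask "Humidity %%%HUM%" and the marker '%', the sketch produces the fixed text "Humidity %" followed by the variable "HUM"; for "%part1%%part2%" it returns false.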
Other platforms
The code uses some specifics of Visual Studio 2005; however, it should be portable to other platforms with little effort.
The first is MFC's CString, used for internal storage of variables and placeholders and for returning a variable's contents. It can be replaced by virtually any other string class, as nothing more than simple assignment is used.
The string processing itself uses the oldie-but-goldie C library functions strlen, strcpy, strchr, strstr and stricmp, or rather their Visual Studio _t-prefixed equivalents (e.g. _tcslen), to achieve MBCS / UNICODE compatibility with the same code.
The strcpy_s function of VS 2005 can be replaced by the corresponding "unsafe" strcpy function for other compilers without problems, as the memory for the string is allocated directly before copying.
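The allocate-then-copy pattern that makes plain strcpy acceptable here could look like the following portable sketch. The function name duplicateCString and the use of std::unique_ptr are assumptions for illustration; the article's sources may manage the buffer differently.

```cpp
#include <cstring>
#include <memory>

// Hypothetical helper: copies a C string into a freshly allocated buffer.
// Because the buffer is sized from the source right before the copy,
// std::strcpy cannot overrun it.
std::unique_ptr<char[]> duplicateCString(const char* source)
{
    const std::size_t length = std::strlen(source);
    std::unique_ptr<char[]> copy(new char[length + 1]);  // +1 for the '\0'
    std::strcpy(copy.get(), source);
    return copy;
}
```

The buffer is released automatically when the returned pointer goes out of scope, so no explicit delete[] is needed in the calling code.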
Since the basic character handling functions are used instead of high-level string class routines, the mask parsing should be quite fast, as these functions are assembly-optimized in most compiler libraries.
History
1.0 2008-01-21 Public release
1.1 2008-01-25 Problems of 1.0 fixed:
- Variable open and close placeholders
- Double open characters inside fixed text are interpreted as single characters
- Failsafe with syntax errors in mask