A Tiny Variable String Splitter

mi-chi

25 Jan 2008, CPOL, 8 min read

Tokenize and access string contents using a format mask

Download StringSplitter_v11.zip - 3.37 KB

Introduction

Interpreting strings and extracting information from them has always been necessary, and it usually means writing complex-looking code and logic.

Many things can be handled with tokenizers, but they leave little room for strings that come in different variants or for separators that vary from case to case.

Regular expressions allow more complex processing, but it is a pain to create them, to debug them, and to understand them a few months later.

This article shows the usage of a very small class that provides a "natural" way of dealing with complex extraction patterns.

Note: The article and sources are updated to handle the problems of the first version.

Background

The best way to demonstrate the need is to provide two simple examples:

Example 1

We need to process strings from a log file like the following:

------------------snip------------------

Process: Tsk Mgr.EXE - Start Date: 2008-20-01 Duration: 00:01:54 - Description: Task Manager
Process: EXPLORER.EXE - Start Date: 2008-20-01 Duration: 00:01:54 - Description: Explorer
End-of-day

Process: Tsk Mgr.EXE - Start Date: 2008-21-01 Duration: 00:00:12 - Description: Task Manager
...

------------------snap------------------

The areas of interest are the process name, the start date and the duration. We need to extract them.

The log file lines look similar. The first problem is that there are lines we don't need, like "End-of-day" or the empty line.

The second problem is that standard tokenizers do not really help, as there is no single separator character. Space cannot be used, because the process name "Tsk Mgr.EXE" contains a space itself, which would shift all the following tokens.

The colon (:) *could* be used; however, we would then have to strip " - Start Date" from the process name, and the duration (00:01:54) would be split into three parts although we want it as one.

If we had to explain to a human how to extract the values, we would tell them to take the part after "Process:" up to the dash "-" before "Start Date:", then the part after "Start Date:" up to "Duration:", and finally the part after "Duration:" up to the dash before "Description:".

Written in a "masked" way, the string processing format looks like:

Process: #### - Start Date: #### Duration: #### - Description: Task Manager

What we need to parse in our application are the "####" parts. And what would be more natural than being able to enter exactly this mask pattern in our application? To access the interesting parts afterwards, we should be able to give them variable names, marked in the mask string by surrounding percentage signs (%). Our mask string would then look like this:

Process: %proc% - Start Date: %date% Duration: %duration% - Description: %desc%

Ideally, we would run each line of the log file through this mask and, if it succeeds, query the variables "proc", "date" and "duration" and ignore "desc". If it fails, i.e. if the input string does not match the mask format, we would just continue with the next log line.
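To make the idea concrete, here is a minimal sketch of how such a mask could be matched, written in standard C++ rather than with the classes of this article. The names Segment, splitMask and matchSegments are made up for this illustration and are not part of CStringSplitter. The sketch assumes that the whole input line must be consumed by the mask, and it does not handle doubled delimiter characters.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// One piece of a mask: fixed text, or the name of a variable.
struct Segment {
    bool isVariable;
    std::string text;
};

// Cut the mask into alternating fixed-text and variable segments.
// Simplification: doubled delimiter characters are not treated specially.
std::vector<Segment> splitMask(const std::string& mask, char open, char close)
{
    std::vector<Segment> segments;
    std::string::size_type pos = 0;
    while (pos < mask.size()) {
        std::string::size_type start = mask.find(open, pos);
        if (start == std::string::npos) {          // only fixed text is left
            segments.push_back(Segment{false, mask.substr(pos)});
            break;
        }
        if (start > pos)                           // fixed text before the variable
            segments.push_back(Segment{false, mask.substr(pos, start - pos)});
        std::string::size_type end = mask.find(close, start + 1);
        if (end == std::string::npos)              // unterminated name runs to the end
            end = mask.size();
        segments.push_back(Segment{true, mask.substr(start + 1, end - start - 1)});
        pos = end + 1;
    }
    return segments;
}

// Walk through the input: fixed text must match exactly, a variable takes
// everything up to the next fixed text (or the rest of the line if it is last).
bool matchSegments(const std::string& input, const std::vector<Segment>& segments,
                   std::map<std::string, std::string>& values)
{
    values.clear();
    std::string::size_type pos = 0;
    for (std::size_t i = 0; i < segments.size(); ++i) {
        assert(pos <= input.size());
        const Segment& seg = segments[i];
        if (!seg.isVariable) {
            if (input.compare(pos, seg.text.size(), seg.text) != 0)
                return false;                      // fixed text is not at this position
            pos += seg.text.size();
        } else if (i + 1 == segments.size()) {
            values[seg.text] = input.substr(pos);  // last variable: rest of the line
            pos = input.size();
        } else {
            const Segment& next = segments[i + 1];
            if (next.isVariable)
                return false;                      // two variables without separator
            std::string::size_type stop = input.find(next.text, pos);
            if (stop == std::string::npos)
                return false;                      // separator missing in the input
            values[seg.text] = input.substr(pos, stop - pos);
            pos = stop;
        }
    }
    return pos == input.size();                    // assumption: no trailing input allowed
}
```

With the mask above and '%' as both delimiters, the first log line yields "Tsk Mgr.EXE" for proc, "2008-20-01" for date and "00:01:54" for duration, while "End-of-day" and the empty line are rejected because they do not start with "Process: ".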

Example 2

We need to extract version information from strings like:

<softwareA> v1.4
<softwareB> v5
<softwareC> v1.3.1 Beta 5
<softwareD> v8.4.87.405 Alpha

The only thing these strings have in common is the space and lowercase "v" right before the version number. Again, we are only interested in the plain version numbers (1.4, 5, 1.3.1 and 8.4.87.405), without the postfixed "Alpha" or "Beta 5".

In this case, we would have two different masks: one with additional text after the version number and one without. The first mask would look like:

%software% v%version% %postfix%

where we have only two fixed elements:
- the " v" and
- the space between the version and the postfix

and another mask with no postfix

%software% v%version%

We would first check the mask with the postfix and, if it fails (because there is no postfix), the one without.
(Checking only the second mask would always succeed, but the version would then include the words "Beta" and "Alpha" as well, which we don't want.)
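For comparison, this is roughly the decision that the two masks express, written out by hand in standard C++. It is only a sketch: VersionInfo and extractVersion are hypothetical names, and the code assumes that the first " v" in the string introduces the version number.

```cpp
#include <cassert>
#include <string>

struct VersionInfo {
    std::string software;
    std::string version;
    std::string postfix;   // stays empty when the input has no postfix
};

// Hand-written counterpart of trying "%software% v%version% %postfix%" first
// and "%software% v%version%" second.
bool extractVersion(const std::string& input, VersionInfo& out)
{
    const std::string marker = " v";
    const std::string::size_type v = input.find(marker);
    if (v == std::string::npos)
        return false;                              // neither mask fits

    const std::string::size_type start = v + marker.size();
    assert(start <= input.size());
    const std::string::size_type space = input.find(' ', start);

    out.software = input.substr(0, v);
    if (space != std::string::npos) {
        // First mask fits: a space separates the version from the postfix.
        out.version = input.substr(start, space - start);
        out.postfix = input.substr(space + 1);
    } else {
        // Fall back to the second mask: the version runs to the end.
        out.version = input.substr(start);
        out.postfix.clear();
    }
    return true;
}
```

Every new string layout would need another function of this kind, which is exactly the post-processing logic that a mask is meant to replace.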

Handling strings this way makes it much easier to adapt to many different tasks without having to program the additional post-tokenizer logic that is usually involved when extracting data from strings.

The idea comes from a very good MP3 tag editor that extracts data such as artist and track name from the variously formatted filenames on freedb.org by using placeholder variables in the way described above. That approach attracted me quickly because it is so easy to understand. Many times I wished I had access to such a function, and as I never found anything similar, I decided to code it myself.

Using the code

The string splitter contains three classes:

The main class CStringSplitter, which will be used by the application code, and the two helper classes CSearchStringChar and CSearchStringStr used by the other class to parse the mask and the string.

The usage of the string splitter is demonstrated in this small piece of code.

In the example, we use the percentage symbol '%' to indicate the start and end of a variable name:

C++

#include "StringSplitter.h"

    CStringSplitter Split( _T('%'), _T('%') );
    CString strLogLine,
            strValue,
            strMask = _T("Process: %proc% - Start Date: %date% Duration: %duration% - Description: %desc%");

    while ( !isEndOfFile() )
    {
        strLogLine = getNextLine();

        if ( Split.matchMask( strLogLine, strMask ) )
        {
            if ( Split.getValue( strValue, _T("proc") ) )
            {
                // do something with strValue (proc)
                ...
            }
            if ( Split.getValue( strValue, _T("date") ) )
            {
                // do something with strValue (date)
                ...
            }
            if ( Split.getValue( strValue, _T("duration") ) )
            {
                // do something with strValue (duration)
                ...
            }
        }
    }

When using the default opening and closing parentheses to indicate the placeholders, the mask in the example above would be:

Process: (proc) - Start Date: (date) Duration: (duration) - Description: (desc)

The CStringSplitter class

The following functionality is provided by the class to process a string.

Configuring the start and end characters of a variable part in the mask

The CStringSplitter constructor takes two optional parameters that specify the start and end characters for the variable names. They default to the opening and closing parentheses '(' and ')'.

The opening and closing characters can be the same (e.g. '%' as in the code example above).

Parsing a source string against a mask

The method matchMask checks an input line against the mask. It returns true if the processing was successful, false if not, i.e. if the mask does not fit onto the input string or if the mask contains syntactical errors.

Querying the values of the placeholders

After successfully processing the source string, the method getValue can be called to get the contents of the requested variable. Its first parameter is a string reference which will contain the value of the requested variable, if it exists. The return value is true if the variable is found or false if it does not exist.

Note: Variable names are treated case-insensitively! This can be changed in the getValue method by using the strcmp function instead of stricmp.
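As an illustration of such a case-insensitive lookup, the following sketch uses standard C++ in place of stricmp, which is not part of the C++ standard. VariableList, equalsIgnoreCase and lookupValue are hypothetical names chosen for this example; the internal storage of the class may look different.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Name/value pairs as they might be collected while matching a mask.
typedef std::vector<std::pair<std::string, std::string> > VariableList;

// Portable stand-in for a stricmp-style equality test.
bool equalsIgnoreCase(const std::string& a, const std::string& b)
{
    if (a.size() != b.size())
        return false;
    for (std::string::size_type i = 0; i < a.size(); ++i) {
        // The cast avoids undefined behaviour for negative char values.
        if (std::tolower(static_cast<unsigned char>(a[i])) !=
            std::tolower(static_cast<unsigned char>(b[i])))
            return false;
    }
    return true;
}

// Follows the contract described for getValue: returns true and fills
// 'value' if the variable exists, returns false otherwise.
bool lookupValue(const VariableList& vars, const std::string& name, std::string& value)
{
    for (std::size_t i = 0; i < vars.size(); ++i) {
        if (equalsIgnoreCase(vars[i].first, name)) {
            value = vars[i].second;
            return true;
        }
    }
    return false;
}
```

Replacing equalsIgnoreCase with a plain == comparison would give the case-sensitive behaviour mentioned in the note.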

Processing multiple lines with the same mask

When thousands of lines have to be processed with the same mask, the mask needs to be parsed only once. This makes the string matching even faster.

The first step is to set up the mask by calling the setMask method with the mask string as the only parameter (which corresponds to the second parameter of matchMask). It returns true if the mask is valid or false if not.

Matching the strings can be done in a loop by calling the matchLastMask method with the string to be matched as the only parameter (which corresponds to the first parameter of matchMask). It returns true if the processing was successful, false if not, i.e. if the mask does not fit onto the input string.

After success, the placeholder values can be retrieved as described above.

Here's the earlier example with the necessary changes (the setMask call before the loop and the matchLastMask call inside it):

    CStringSplitter Split( _T('%'), _T('%') );
    CString strLogLine,
            strValue,
            strMask = _T("Process: %proc% - Start Date: %date% Duration: %duration% - Description: %desc%");

    Split.setMask( strMask );
    while ( !isEndOfFile() )
    {
        strLogLine = getNextLine();

        if ( Split.matchLastMask( strLogLine ) )
        {
            if ( Split.getValue( strValue, _T("proc") ) )
            {
                // do something with strValue (proc)
                ...
            }
            ...
        }
    }

Validity of Masks and Syntax

The parser is tolerant of typing errors; however, a few syntax rules must be followed to get the expected result.

1. Two variables may not follow each other

At least one fixed character must separate them:

Parsing any string using the mask "(part1)(part2)" cannot, logically, yield any usable results, as there is no separation between part1 and part2.

2. Using a variable's start character in fixed text

Given the following input string:

"Humidity %89"

When '%' is used as the variable start and end character, any doubled occurrence of it in the fixed text section is interpreted as one occurrence of the character.

To handle the '%' sign in the fixed text, the mask string must be:

"Humidity %%%HUM%"

Note that there are three (3) sequential percentage signs.

The first two belong to the fixed text: any doubled start character is interpreted as a single one, giving "Humidity %".

The third one is the start character of the variable, followed by "HUM" as the variable name and the final '%' as the end character.

In order to prevent annoyances, the end-character is not allowed in the variable name, even if it is doubled.

3. Improper ending

A variable that is not terminated before the end of the mask string is terminated automatically at the end of the string.
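Rules 1 to 3 can be sketched as a small mask parser in standard C++. This is not the code of CStringSplitter: MaskPart and parseMaskParts are hypothetical names, and rejecting two adjacent variables with a false return value is an assumption about how the rule is enforced.

```cpp
#include <string>
#include <vector>

// One piece of a parsed mask: fixed text, or the name of a variable.
struct MaskPart {
    bool isVariable;
    std::string text;
};

// Splits a mask into fixed text and variable names.
// Returns false if two variables follow each other directly (rule 1).
bool parseMaskParts(const std::string& mask, char open, char close,
                    std::vector<MaskPart>& parts)
{
    parts.clear();
    std::string fixed;                             // fixed text collected so far
    std::string::size_type i = 0;
    while (i < mask.size()) {
        if (mask[i] != open) {                     // ordinary fixed character
            fixed += mask[i++];
            continue;
        }
        if (i + 1 < mask.size() && mask[i + 1] == open) {
            fixed += open;                         // rule 2: doubled start char = one literal
            i += 2;
            continue;
        }
        // A single start character begins a variable name.
        if (fixed.empty() && !parts.empty() && parts.back().isVariable)
            return false;                          // rule 1: no separator between variables
        if (!fixed.empty()) {
            parts.push_back(MaskPart{false, fixed});
            fixed.clear();
        }
        std::string::size_type end = mask.find(close, i + 1);
        if (end == std::string::npos)
            end = mask.size();                     // rule 3: terminate at the end of the mask
        parts.push_back(MaskPart{true, mask.substr(i + 1, end - i - 1)});
        i = end + 1;
    }
    if (!fixed.empty())
        parts.push_back(MaskPart{false, fixed});
    return true;
}
```

For the mask "Humidity %%%HUM%" with '%' as both delimiters, this sketch produces the fixed text "Humidity %" followed by the variable HUM.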

4. Mask validation (e.g. of user input)

To use this class with user input fields, the easiest way to check the validity of a mask is to simply call setMask with the string. If it returns true, the mask can be used (however, it cannot be guaranteed that the results are what the user intended - Microsoft's PSI-API is not yet ready for public use ;-)).

Other platforms

The code uses some specifics of Visual Studio 2005; however, it should be portable to other platforms with little effort.

The first is MFC's CString for internal storage of variables and placeholders and for returning the variable's contents. This can be replaced by virtually any other string class, as nothing more than simple assignment is used.

The string processing itself uses the oldie-but-goldie C library functions strlen, strcpy, strchr, strstr and stricmp, or rather their _t-prefixed Visual Studio equivalents, to achieve MBCS / UNICODE compatibility using the same code.
The strcpy_s function of VS 2005 can be replaced by the corresponding "unsafe" strcpy function for other compilers without problems, as the memory for the string is allocated directly before copying.
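The reason why plain strcpy is acceptable here can be shown with a short sketch. The function name duplicateString is hypothetical and the class may organize its copies differently; the point is only that the buffer is sized from the source string immediately before the copy.

```cpp
#include <cstddef>
#include <cstring>

// Copies a C string into freshly allocated memory.
// The caller releases the result with delete[].
char* duplicateString(const char* source)
{
    const std::size_t length = std::strlen(source);
    char* copy = new char[length + 1];   // room for the text plus the terminating '\0'
    std::strcpy(copy, source);           // cannot overflow: buffer was sized from source
    return copy;
}
```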

Since basic character handling functions are used instead of high-level string class routines, the mask parsing should be quite fast, as these functions are assembly-optimized in most compiler libraries.

History

1.0 2008-01-21 Public release

1.1 2008-01-25 Problems of 1.0 fixed:

- Variable open and close placeholders
- Double open-characters inside fixed text are interpreted as single characters
- Failsafe handling of syntax errors in the mask

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)