Click here to Skip to main content
65,938 articles
CodeProject is changing. Read more.
Articles
(untagged)

How to build an in-memory state engine at runtime

0.00/5 (No votes)
28 Oct 2007 1  
A C++ template for an efficient in-memory state engine.

tag tree

Abstract

If you want to detect keywords or special tags in a data stream, you typically use a parser generator like yacc or ANTLR. Otherwise, you can use the power of Regular Expressions. In some cases, a handcrafted parser will fulfill the requirement. To simplify this task, I wrote the template tag_tree (as part of a larger library using boost) to create a kind of state engine in memory during runtime. It has a low memory footprint, and is more efficient than any tokenizer-string compare approach could be.

Introduction

A handwritten parser/scanner needs a lot of states, and looks similar to the following piece of code:

boost::tribool parser::consume(char input)
{
  switch (state_)
  {
  case state_1:
    if (input == 'P')
    {
      state_ = state_2;
      return boost::indeterminate;
    }
    else
    {
      return false;
    }
  case state_2:
    if (input == 'O')
    {
      return true;
    }
    ... <snip> ...
  default:
    return false;
  }
}

This is the implementation of a state engine, and is very error prone if you have to manage a lot of states. It works fine and fast for you, but there is a lot of work to modify the keywords. The following code snippet shows how you can solve this problem with the tag_tree template:

class parser
{
public:
    ...
    typedef tag_tree < std::string, unsigned >     TagTree_;
    typedef typename TagTree_::MyPtr_    TagTreePtr_;
    
    TagTree_    kt_;
    
};

parser::parser()
    : prev_( NULL )
{
    kt_.insertKeyWord( "POST", 10 );
    kt_.insertKeyWord( "HOST", 20 );
    kt_.insertKeyWord( "XONN", 30 );
}

unsigned parser::run()
{
    TagTreePtr_    prev = &kt_;
    for (;;)
    {
        TagTreePtr_ p = prev->findNode( next() );
        if (p == NULL)
        {
            break;
        }
        prev = p;
    }
    
    return (prev->isLeaf())
        ? prev->getValue()
        : -1;    //! nothing found
}

next() returns the next character from the data stream. While findNode() is matching, the method continues to loop. After breaking the loop, isLeaf() tests if a valid keyword was found and the method returns an identifier. It is quite possible to return the prev pointer itself. prev points in every case to a valid address, and contains all the information you need.

How does it work

The main idea is that each keyword builds a unique trace of linked nodes where each node represents a specific parser state. To build this trace, a tree of arrays will be used. The following figure shows a tree with the keywords PORT and POST:

tag tree

And here comes some comfort to you. If you prefer speed over memory, choose the maximum size for the arrays (256 for char, for example). Then choose the simplest hash algorithm that is possible - the ASCII code as the array index. To save memory, the array could be shrunken. But then we have to manage a collision list.

In the download section is a project named TagTree which contains a more sophisticated sample of usage for the tag_tree template.

Reference

tag_tree_base

Defines some typedefs and access methods for all node values. tag_tree_base is not restricted to strings. You can use any string-like container which defines a const_iterator and a value_type type.

tag_tree_base constructor and destructor

tag_tree_base::tag_tree_base( char_type c );
Parameters:
  • c: The character this node is representing.

getValue

value_type & tag_tree_base::getValue();
Return value:

Value of this node.

tag_tree constructor and destructor

tag_tree::tag_tree( char_type c );

tag_tree is designed for usage in a streaming parser. tag_tree can help you build compact state engines in memory, since each node represents a unique parser state.

Parameters:
  • c: The character this node is representing.

Setting all pointers to NULL is done by the containers.

findNode

tag_tree::MyPtr_ tag_tree::findNode( char_type c );
Parameters:
  • c: character to find.
Return value

A pointer to a node containing the specified character c or NULL.

empty

bool tag_tree::empty     (          )      const;
Return value

true if there are no outgoing node pointers.

The semantic is similar to the STL - container empty() method. If the node array nodeMap_ contains no pointers, then nodeList_ should not either. And it should be a leaf. Otherwise, something is going wrong.

leaf_count

size_t tag_tree::leaf_count( bool branch = false )     const;
Return value:

The count of keywords starting from this point.

dump

void tag_tree::dump(     ostream_type_ &      os,
        size_t      depth = 0L,
        size_t      branch = 0L
    )             const;
Parameters:
  • os: an open output stream, std::cout, for example.

Prints a plain list of all nodes and their attributes. Useful for debugging purposes.

lookUp

tag_tree::lookUp(    char_const_iterator from, char_const_iterator to ) const
Parameters:
  • from: An input iterator addressing the position of the first element in the keyword.
  • to: An input iterator addressing the position that is one past the final element in the keyword.
Return value:

The value of the keyword. If the keyword does not exist, a null value will be returned.

Looks up for the specified keyword.

History

  • 29 Oct., 2007 - Initial release.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

A list of licenses authors might use can be found here