Abstract
If you want to detect keywords or special tags in a data stream, you typically use a parser generator like yacc or ANTLR. Otherwise, you can use the power of Regular Expressions. In some cases, a handcrafted parser will fulfill the requirement. To simplify this task, I wrote the template tag_tree
(as part of a larger library using boost) to create a kind of state engine in memory during runtime. It has a low memory footprint, and is more efficient than any tokenizer-string compare approach could be.
A handwritten parser/scanner needs a lot of states, and looks similar to the following piece of code:
boost::tribool parser::consume(char input)
{
switch (state_)
{
case state_1:
if (input == 'P')
{
state_ = state_2;
return boost::indeterminate;
}
else
{
return false;
}
case state_2:
if (input == 'O')
{
return true;
}
... <snip> ...
default:
return false;
}
}
This is the implementation of a state engine, and is very error prone if you have to manage a lot of states. It works fine and fast for you, but there is a lot of work to modify the keywords. The following code snippet shows how you can solve this problem with the tag_tree
template:
class parser
{
public:
...
typedef tag_tree < std::string, unsigned > TagTree_;
typedef typename TagTree_::MyPtr_ TagTreePtr_;
TagTree_ kt_;
};
parser::parser()
: prev_( NULL )
{
kt_.insertKeyWord( "POST", 10 );
kt_.insertKeyWord( "HOST", 20 );
kt_.insertKeyWord( "XONN", 30 );
}
unsigned parser::run()
{
TagTreePtr_ prev = &kt_;
for (;;)
{
TagTreePtr_ p = prev->findNode( next() );
if (p == NULL)
{
break;
}
prev = p;
}
return (prev->isLeaf())
? prev->getValue()
: -1; }
next()
returns the next character from the data stream. While findNode()
is matching, the method continues to loop. After breaking the loop, isLeaf()
tests if a valid keyword was found and the method returns an identifier. It is quite possible to return the prev
pointer itself. prev
points in every case to a valid address, and contains all the information you need.
How does it work
The main idea is that each keyword builds a unique trace of linked nodes where each node represents a specific parser state. To build this trace, a tree of arrays will be used. The following figure shows a tree with the keywords PORT
and POST
:
And here comes some comfort to you. If you prefer speed over memory, choose the maximum size for the arrays (256 for char
, for example). Then choose the simplest hash algorithm that is possible - the ASCII code as the array index. To save memory, the array could be shrunken. But then we have to manage a collision list.
In the download section is a project named TagTree which contains a more sophisticated sample of usage for the tag_tree
template.
Reference
tag_tree_base
Defines some typedef
s and access methods for all node values. tag_tree_base
is not restricted to strings. You can use any string-like container which defines a const_iterator
and a value_type
type.
tag_tree_base::tag_tree_base( char_type c );
Parameters:
c
: The character this node is representing.
value_type & tag_tree_base::getValue();
Return value:
Value of this node.
tag_tree::tag_tree( char_type c );
tag_tree
is designed for usage in a streaming parser. tag_tree
can help you build compact state engines in memory, since each node represents a unique parser state.
Parameters:
c
: The character this node is representing.
Setting all pointers to NULL
is done by the containers.
tag_tree::MyPtr_ tag_tree::findNode( char_type c );
Parameters:
Return value
A pointer to a node containing the specified character c
or NULL
.
bool tag_tree::empty ( ) const;
Return value
true
if there are no outgoing node pointers.
The semantic is similar to the STL - container empty()
method. If the node array nodeMap_
contains no pointers, then nodeList_
should not either. And it should be a leaf. Otherwise, something is going wrong.
size_t tag_tree::leaf_count( bool branch = false ) const;
Return value:
The count of keywords starting from this point.
void tag_tree::dump( ostream_type_ & os,
size_t depth = 0L,
size_t branch = 0L
) const;
Parameters:
os
: an open output stream, std::cout
, for example.
Prints a plain list of all nodes and their attributes. Useful for debugging purposes.
tag_tree::lookUp( char_const_iterator from, char_const_iterator to ) const
Parameters:
from
: An input iterator addressing the position of the first element in the keyword.
to
: An input iterator addressing the position that is one past the final element in the keyword.
Return value:
The value of the keyword. If the keyword does not exist, a null value will be returned.
Looks up for the specified keyword.
History
- 29 Oct., 2007 - Initial release.