(untagged)

The biscuit parser library

mb2sync

0.00/5 (No votes)

5 Jul 2005

An object-oriented recursive-descent parser generator framework implemented using class templates.

Download demo project - 197 Kb

Preface

I was looking for a light and un-strict XML parser. Boost.Spirit and Boost.Xpressive were too big and their performance at compile-time was not good. I suspected that they were not static, but YARD written by Christopher Diggins was really static, small and fast. In the meantime, I noticed that the lazy instantiation of templates could allow us to write recursive grammars, and I found that I could bind YARD and the finite state machine found at C++ Template Metaprogramming. It was named biscuit.

Introduction

biscuit is an object-oriented recursive-descent parser generator framework implemented using class templates. The templates allow us to author Extended Backus-Normal Form (EBNF) in C++. Technical information about the same is available at YARD.

A simple EBNF grammar snippet:

group      ::= '(' expression ')'
factor     ::= integer | group
term       ::= factor (('*' factor) | ('/' factor))*
expression ::= term (('+' term) | ('-' term))*

is approximated using biscuit's facilities as seen in this code snippet:

struct expression ;
struct group      : seq< str<'('>, expression, str<')'> > { };
struct factor     : or_< integer, group > { };
struct term       : seq< factor, star< or_< seq< str<'*'>, 
                    factor >, seq< str<'/'>, factor > > > > { };
struct expression : seq< term, star< or_< seq< str<'+'>, 
                    term >, seq< str<'-'>, term > > > > { };

Through the magic of the lazy template instantiation, they are the perfect valid types. The production rule expression is in fact a type that has a static member function parse. As parse will be instantiated later by algorithms, all you have to do is not to define but to declare a type:

struct element; // forward declaration


struct content :
  seq<
    opt<CharData>,
    star<
      seq<
        or_<
          element, // the magical recursion!

          Reference,
          CDSect,
          PI,
          Comment
        >,
        opt<CharData>
      >
    >
  >
{ };

struct element :
  or_<
    EmptyElemTag, 
    seq<STag, content, ETag>
  >
{ };

Quick start

Get and install the latest release of Boost C++ Libraries. (biscuit uses only their headers.)

Include headers of biscuit:

#include "biscuit/algorithm.hpp"

#include "biscuit/parser.hpp"

using namespace biscuit;

Define your own parser type:

typedef seq<
  str<'/','*'>,
  star_until< any, str<'*','/'> >
> c_comment;

Call algorithms:

if (match<c_comment>("/* hello, biscuit */")) {
  //...

}

If you build the test project, it is required to build Boost.Test:
```
bjam -sTOOLS=vc-7_1 --with-test install
```

Basic concepts

Parser

A parser is any type that has a static member function:

template< class State, class UserState >
static bool parse(State& s, UserState& us);

As a parser is a type, it can't have any runtime-state. You can pass any UserState object to the algorithm and the object is passed to parse.

User state

Any type.

Forward range

A Forward Range is a concept similar to STL Container. More on this concept is available at Boost.Range.

Semantic action class

A semantic action class is any class of Function Object that has a member function:

void operator()(ForwardIter first, ForwardIter last, UserState& us);

Predefined parsers

Some parser templates are predefined as a means for parser composition and embedding.

Primitives

The table below lists EBNF and their equivalents in biscuit.

EBNF (or PERL)	biscuit	Meaning
.	`any`	any object
A \| B	`or_<A, B>`	alternation of A and B
A B	`seq<A, B>`	sequence of A and B
A*	`star<A>`	zero or more times, greedy
A+	`plus<A>`	one or more times, greedy
A?	`opt<A>`	zero or one time, greedy
A - B	`minus<A, B>`	matches A, but the sub-match of A doesn't match B
A{n,m}	r`epeat<A, n, m>`	between n and m times, greedy
A*? B	`star_until<A, B>`	zero or more As and B
"Diggins"	`str<'D','i','g','g','i','n','s'>`	string
^	`begin`	beginning of the sequence
$	`end`	end of the sequence
\n	`eol`	end of line
\d	`digit`	a digit
\D	`not_<digit>`	not a digit
\s	`space`	a space
\S	`not_<space>`	not a space
[0-9]	`char_range<'0','9'>`	characters in the range '0' through '9'
[abc]	`char_set<'a','b','c'>`	characters 'a','b', or 'c'
[0-9abc]	`or_< char_range<'0','9'>, char_set<'a','b','c'> >`	characters 'a','b','c' or in the range '0' though '9'
[^abc]	`not_< char_set<'a','b','c'> >`	not characters 'a','b', or 'c'
(?=A)	`before<A>`	positive look-ahead assertion
(?!A)	`not_< before<A> >`	negative look-ahead assertion

YARD and biscuit have no back-tracking on star operations. The maximum supported arity of parsers is now twenty.

Actor

actor is a parser that triggers a semantic action class object:

struct decorate_action
{
  template< class ForwardIter, class Stream >
  void operator()(ForwardIter first, ForwardIter last, 
                                          Stream& out)
  {
    out << "[";
    out << std::string(first, last);
    out << "]";
  }
};

struct xml_comment :
  seq<
    str<'<','!','-','-'>,
    star<
      or_<
        minus< any, str<'-'> >,
        seq<
          str<'-'>,
          minus< any, str<'-'> >
        >
      >
    >,
    str<'-','-','>'>
  >
{ };

struct xml_comment_action : actor< xml_comment, 
                                    decorate_action >
{ };

std::stringstream out;
std::string s0("<!-- xml comment no.1 -->");
match<xml_comment_action>(s0, out);
BOOST_CHECK( std::string("[<!-- xml comment no.1 -->]") == 
                                               out.str() );

You can pass a semantic action class to an actor, but not a function pointer. This trouble is fixed by the grammar below.

Directives

Directives are also parsers. Some ports of Boost.Spirit's directives:

Boost.Spirit	biscuit	Meaning
lexeme_d[A]	impossible	turn off white space skipping.
as_lower_d[A]	`as_lower<A>`	convert inputs to lower-case.
no_actions[A]	`no_actions<A>`	all semantic actions not to fire.
???	`definitive_actions<A>`	parse twice and suppress non-intended actions.
longest_d[A\|B]	`longest<A, B>`	choose the longest match.
shortest_d[A\|B]	`shortest<A, B>`	choose the shortest match.
limit_d[A]	`requires<A, PredicateClass>`	ensure that the result of a parser is constrained.
???	`transform<A, FunctorClass>`	convert inputs using functor.

Algorithms

Algorithms of biscuit work with Forward Range. Bear in mind that parsers don't know the value_type of the range. For instance, a parser str works fine, if the value_type of the range is comparable with char.

match

match returns true if a parser runs through the range; otherwise false:

BOOST_CHECK( match<xml_comment>("<!-- hello, xml comment -->"));
BOOST_CHECK( !match<xml_comment>
         ("<!-- not well-formed comment -- -->") );

search

search returns the first sub matched boost::iterator_range; otherwise an empty range:

std::string s0("  /* c comment no.1 */x int i; "
               "/* c comment no.2 */ i = 1; "
               "/* c comment no.3 */ ++i;  ");
boost::sub_range<std::string> sr = search<c_comment>(s0);
BOOST_CHECK( std::string(boost::begin(sr), 
    boost::end(sr)) == "/* c comment no.1 */" );

Ranges

filter_range

filter_range is a filtered boost::iterator_range by a parser:

typedef filter_range< c_comment, std::string > fr_t;
std::string s0("  /* c comment no.1 */ int i; "
               "/* c comment no.2 */ i = 1; "
               "/* c comment no.3 */ ++i;  ");
fr_t fr(s0);
BOOST_CHECK(( match< repeat< c_comment, 3 > >(fr) ));

There is no reason why chains of filter_range cannot work:

BOOST_CHECK((
  match< str<'x','y','z'> >(
    make_filter_range< alpha >(
      make_filter_range< not_<space> >(
        make_filter_range< not_<digit> >
                     ("x & 4 y . 125 %  z")
      )
    )
  )
));

match_results

match_results is a Forward Range of boost::iterator_range:

typedef match_results<c_comment, std::string> mrs_t;
typedef boost::range_const_iterator<mrs_t>::type iter_t;

std::string s0("  /* c comment no.1 */int i; "
               "/* c comment no.2 */i = 1; "
               "/* c comment no.3 */ ++i;  ");
mrs_t mrs(s0);
for (iter_t it = boost::const_begin(mrs); 
               it != boost::const_end(mrs); ++it)
{
  std::cout << std::string(boost::begin(*it), 
                boost::end(*it)) << std::endl;
}

Outputs:

/* c comment no.1 */
/* c comment no.2 */
/* c comment no.3 */

Grammar

As parsers are just types, they have no runtime-state. Non-type template argument is fairly limited. What should we do, if the value_type of the Forward Range is not an Integral Constant like char? C++ Template Metaprogramming says that member function pointers can bind templates and objects. Boost.Spirit makes an expression templates object from expression templates objects, but you can make an expression type from expression templates using biscuit.

grammar binds parsers and objects:

struct my_grammar : grammar< my_grammar, 
                  std::vector<std::string> >
{
  std::string text0() { return "hello"; };
  std::string text1() { return "grammar"; };
  std::string text2() { return "value"; };

  struct start :
    seq<
      value_<&my_grammar::text0>,
      value_<&my_grammar::text1>,
      value_<&my_grammar::text2>
    >
  { };
};

void grm_value_test()
{
  std::cout << "grm_value_test ing..." << std::endl;

  std::vector<std::string> texts;
  texts.push_back(std::string("hello"));
  texts.push_back(std::string("grammar"));
  texts.push_back(std::string("value"));
  
  my_grammar the_grammar;
  BOOST_CHECK( match<typename grammar_start<my_grammar>::type >
                                          (texts, the_grammar));
}

biscuit has no limitation on the value_type of Forward Range parsed. As std::vector<std::string> is a Forward Range of std::string, it works. Keep in mind that the UserState object is now your grammar object.

Full example

Here is a port of Boost.Spirit's calculator:

struct calculator : grammar< calculator, std::string >
{
  void do_int(iterator_type str, iterator_type end)
  {
    std::string s(str, end);
    std::cout << "PUSH(" << s << ')' << std::endl;
  }

  void do_add(iterator_type, iterator_type)   
      { std::cout << "ADD\n"; }
  void do_subt(iterator_type, iterator_type)  
      { std::cout << "SUBTRACT\n"; }
  void do_mult(iterator_type, iterator_type)  
      { std::cout << "MULTIPLY\n"; }
  void do_div(iterator_type, iterator_type)   
      { std::cout << "DIVIDE\n"; }
  void do_neg(iterator_type, iterator_type)   
      { std::cout << "NEGATE\n"; }

  struct expression;
  struct term;
  struct factor;

  struct expression :
    seq<
      term,
      star<
        or_<
          actor_< seq< str<'+'>, term >, 
                       &calculator::do_add >,
          actor_< seq< str<'-'>, term >, 
                       &calculator::do_subt >
        >
      >
    >
  { };
  
  struct term :
    seq<
      factor,
      star<
        or_<
          actor_< seq< str<'*'>, factor >, 
                        &calculator::do_mult >,
          actor_< seq< str<'/'>, factor >, 
                        &calculator::do_div >
        >
      >
    >
  { };

  struct factor :
    or_<
      actor_< plus<digit>, &calculator::do_int >,
      seq< str<'('>, expression, str<')'> >,
      actor_< seq< str<'-'>, factor >, 
                           &calculator::do_neg >,
      seq< str<'+'>, factor >
    >
  { };

  struct start : expression { };
};

actor_ makes an actor from the member function pointer. Enjoy the simplicity, compile-time performance and the small-size of the executable.

Debugger

biscuit emulates Boost.Spirit's debugging:

struct start_tag { };
struct integer_tag { };
struct factor_tag { };
struct term_tag { };
struct expression_tag { };

template< class ForwardRange >
struct calculator_debug :
  grammar< calculator_debug, ForwardRange >, 
                             private boost::noncopyable
{
  calculator_debug(std::stack<long>& eval_) : 
                                     eval(eval_) { }

  void push_int(iterator_type first, iterator_type last)
  {
    std::string s(first, last);
    // std::strtol(str, 0, 10);

    long n = boost::lexical_cast<long>(s); 
    eval.push(n);
    std::cout << "push\t" << long(n) << std::endl;
  }

  // ...


  struct integer;
  struct factor;
  struct term;
  struct expression;

  // struct start : debugger<start,

  //  also ok, but long...

  struct start : debugger<start_tag, 
    expression
  >
  { };

  struct integer : debugger<integer_tag,
              actor_< plus<digit>, 
              &calculator_debug::push_int >
  >
  { };
  
  // ...

debugger uses the type-name of the first argument for outputs. If your grammar is a class template like the above, type-name can be very long. I think you want to define start_tag etc. Well, the debugger automatically disappears on release-compile.

Outputs:

1 + 2
struct start_tag: "1+2"
  struct expression_tag: "1+2"
    struct term_tag: "1+2"
      struct factor_tag: "1+2"
        struct integer_tag: "1+2"
push    1
        /struct integer_tag: "+2"
      /struct factor_tag: "+2"
    /struct term_tag: "+2"
    struct term_tag: "2"
      struct factor_tag: "2"
        struct integer_tag: "2"
push    2
        /struct integer_tag: ""
      /struct factor_tag: ""
    /struct term_tag: ""
popped 1 and 2 from the stack. pushing 3 onto the stack.
  /struct expression_tag: ""
/struct start_tag: ""
-------------------------
Parsing succeeded
result = 3
-------------------------

Points of interest

You can find more on composing inlined algorithms at Boost.MPL TODO list. YARD and biscuit are examples of it. By the way, this article was actually a hopeless war vs. Boost.Spirit. But don't you think another war will break out?

A snippet:

using namespace cranberry;

int x = 7;
int a = apply< plus< _1, _2 > >(x)(8);
int b = apply< plus< _1, _2 > >()(7, 8);
int c = apply< plus< _1, _2 > >(7, 8)();
int d = apply< plus< _1, int_<8> > >(7)();
int e = apply< bind< _1, _2, _3 > >
                  (std::plus<int>())(7, 8);
int f = apply< bind< _1, _2, _3 > >
                  (&free_plus)(7, 8);

int ar[5] = {1,2,3,4,5};
std::transform(ar, ar+5, ar, 
   apply< plus< _1, _2 >, int const& >(x));
std::transform(ar, ar+5, ar, 
   apply< bind< _1, _2, _3 > >(std::plus<int>(), x));
std::transform(ar, ar+5, ar, 
   apply< bind< _1, _2, int_<7> > >(&free_plus));
std::for_each(ar, ar+5, apply< std_cout< _1 > >());

References

Release notes

Fixed the name confusion between limit and repeat.
Directives became first-class parsers.
Added debugger parser.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

A list of licenses authors might use can be found here