Preface
I was looking for a light and un-strict XML parser. Boost.Spirit and Boost.Xpressive were too big and their performance at compile-time was not good. I suspected that they were not static, but YARD written by Christopher Diggins was really static, small and fast. In the meantime, I noticed that the lazy instantiation of templates could allow us to write recursive grammars, and I found that I could bind YARD and the finite state machine found at C++ Template Metaprogramming. It was named biscuit.
Introduction
biscuit is an object-oriented recursive-descent parser generator framework implemented using class templates. The templates allow us to author Extended Backus-Normal Form (EBNF) in C++. Technical information about the same is available at YARD.
A simple EBNF grammar snippet:
group ::= '(' expression ')'
factor ::= integer | group
term ::= factor (('*' factor) | ('/' factor))*
expression ::= term (('+' term) | ('-' term))*
is approximated using biscuit's facilities as seen in this code snippet:
struct expression ;
struct group : seq< str<'('>, expression, str<')'> > { };
struct factor : or_< integer, group > { };
struct term : seq< factor, star< or_< seq< str<'*'>,
factor >, seq< str<'/'>, factor > > > > { };
struct expression : seq< term, star< or_< seq< str<'+'>,
term >, seq< str<'-'>, term > > > > { };
Through the magic of the lazy template instantiation, they are the perfect valid types. The production rule expression is in fact a type that has a static member function parse
. As parse
will be instantiated later by algorithms, all you have to do is not to define but to declare a type:
struct element;
struct content :
seq<
opt<CharData>,
star<
seq<
or_<
element,
Reference,
CDSect,
PI,
Comment
>,
opt<CharData>
>
>
>
{ };
struct element :
or_<
EmptyElemTag,
seq<STag, content, ETag>
>
{ };
Quick start
- Get and install the latest release of Boost C++ Libraries. (biscuit uses only their headers.)
- Include headers of biscuit:
#include "biscuit/algorithm.hpp"
#include "biscuit/parser.hpp"
using namespace biscuit;
- Define your own parser type:
typedef seq<
str<'/','*'>,
star_until< any, str<'*','/'> >
> c_comment;
- Call algorithms:
if (match<c_comment>("/* hello, biscuit */")) {
}
- If you build the test project, it is required to build Boost.Test:
bjam -sTOOLS=vc-7_1 --with-test install
Basic concepts
Parser
A parser is any type that has a static member function:
template< class State, class UserState >
static bool parse(State& s, UserState& us);
As a parser is a type, it can't have any runtime-state. You can pass any UserState
object to the algorithm and the object is passed to parse
.
User state
Any type.
Forward range
A Forward Range is a concept similar to STL Container. More on this concept is available at Boost.Range.
Semantic action class
A semantic action class is any class of Function Object that has a member function:
void operator()(ForwardIter first, ForwardIter last, UserState& us);
Predefined parsers
Some parser templates are predefined as a means for parser composition and embedding.
Primitives
The table below lists EBNF and their equivalents in biscuit.
EBNF (or PERL) |
biscuit |
Meaning |
. |
any |
any object |
A | B |
or_<A, B> |
alternation of A and B |
A B |
seq<A, B> |
sequence of A and B |
A* |
star<A> |
zero or more times, greedy |
A+ |
plus<A> |
one or more times, greedy |
A? |
opt<A> |
zero or one time, greedy |
A - B |
minus<A, B> |
matches A, but the sub-match of A doesn't match B |
A{n,m} |
repeat<A, n, m> |
between n and m times, greedy |
A*? B |
star_until<A, B> |
zero or more As and B |
"Diggins" |
str<'D','i','g','g','i','n','s'> |
string |
^ |
begin |
beginning of the sequence |
$ |
end |
end of the sequence |
\n |
eol |
end of line |
\d |
digit |
a digit |
\D |
not_<digit> |
not a digit |
\s |
space |
a space |
\S |
not_<space> |
not a space |
[0-9] |
char_range<'0','9'> |
characters in the range '0' through '9' |
[abc] |
char_set<'a','b','c'> |
characters 'a','b', or 'c' |
[0-9abc] |
or_< char_range<'0','9'>, char_set<'a','b','c'> > |
characters 'a','b','c' or in the range '0' though '9' |
[^abc] |
not_< char_set<'a','b','c'> > |
not characters 'a','b', or 'c' |
(?=A) |
before<A> |
positive look-ahead assertion |
(?!A) |
not_< before<A> > |
negative look-ahead assertion |
YARD and biscuit have no back-tracking on star operations. The maximum supported arity of parsers is now twenty.
Actor
actor
is a parser that triggers a semantic action class object:
struct decorate_action
{
template< class ForwardIter, class Stream >
void operator()(ForwardIter first, ForwardIter last,
Stream& out)
{
out << "[";
out << std::string(first, last);
out << "]";
}
};
struct xml_comment :
seq<
str<'<','!','-','-'>,
star<
or_<
minus< any, str<'-'> >,
seq<
str<'-'>,
minus< any, str<'-'> >
>
>
>,
str<'-','-','>'>
>
{ };
struct xml_comment_action : actor< xml_comment,
decorate_action >
{ };
std::stringstream out;
std::string s0("<!-- xml comment no.1 -->");
match<xml_comment_action>(s0, out);
BOOST_CHECK( std::string("[<!-- xml comment no.1 -->]") ==
out.str() );
You can pass a semantic action class to an actor, but not a function pointer. This trouble is fixed by the grammar below.
Directives
Directives are also parsers. Some ports of Boost.Spirit's directives:
Boost.Spirit |
biscuit |
Meaning |
lexeme_d[A] |
impossible |
turn off white space skipping. |
as_lower_d[A] |
as_lower<A> |
convert inputs to lower-case. |
no_actions[A] |
no_actions<A> |
all semantic actions not to fire. |
??? |
definitive_actions<A> |
parse twice and suppress non-intended actions. |
longest_d[A|B] |
longest<A, B> |
choose the longest match. |
shortest_d[A|B] |
shortest<A, B> |
choose the shortest match. |
limit_d[A] |
requires<A, PredicateClass> |
ensure that the result of a parser is constrained. |
??? |
transform<A, FunctorClass> |
convert inputs using functor. |
Algorithms
Algorithms of biscuit work with Forward Range. Bear in mind that parsers don't know the value_type of the range. For instance, a parser str
works fine, if the value_type of the range is comparable with char
.
match
match
returns true
if a parser runs through the range; otherwise false
:
BOOST_CHECK( match<xml_comment>("<!-- hello, xml comment -->"));
BOOST_CHECK( !match<xml_comment>
("<!-- not well-formed comment -- -->") );
search
search
returns the first sub matched boost::iterator_range
; otherwise an empty range:
std::string s0(" /* c comment no.1 */x int i; "
"/* c comment no.2 */ i = 1; "
"/* c comment no.3 */ ++i; ");
boost::sub_range<std::string> sr = search<c_comment>(s0);
BOOST_CHECK( std::string(boost::begin(sr),
boost::end(sr)) == "/* c comment no.1 */" );
Ranges
filter_range
filter_range
is a filtered boost::iterator_range
by a parser:
typedef filter_range< c_comment, std::string > fr_t;
std::string s0(" /* c comment no.1 */ int i; "
"/* c comment no.2 */ i = 1; "
"/* c comment no.3 */ ++i; ");
fr_t fr(s0);
BOOST_CHECK(( match< repeat< c_comment, 3 > >(fr) ));
There is no reason why chains of filter_range
cannot work:
BOOST_CHECK((
match< str<'x','y','z'> >(
make_filter_range< alpha >(
make_filter_range< not_<space> >(
make_filter_range< not_<digit> >
("x & 4 y . 125 % z")
)
)
)
));
match_results
match_results
is a Forward Range of boost::iterator_range
:
typedef match_results<c_comment, std::string> mrs_t;
typedef boost::range_const_iterator<mrs_t>::type iter_t;
std::string s0(" /* c comment no.1 */int i; "
"/* c comment no.2 */i = 1; "
"/* c comment no.3 */ ++i; ");
mrs_t mrs(s0);
for (iter_t it = boost::const_begin(mrs);
it != boost::const_end(mrs); ++it)
{
std::cout << std::string(boost::begin(*it),
boost::end(*it)) << std::endl;
}
Outputs:
/* c comment no.1 */
/* c comment no.2 */
/* c comment no.3 */
Grammar
As parsers are just types, they have no runtime-state. Non-type template argument is fairly limited. What should we do, if the value_type of the Forward Range is not an Integral Constant like char
? C++ Template Metaprogramming says that member function pointers can bind templates and objects. Boost.Spirit makes an expression templates object from expression templates objects, but you can make an expression type from expression templates using biscuit.
grammar
binds parsers and objects:
struct my_grammar : grammar< my_grammar,
std::vector<std::string> >
{
std::string text0() { return "hello"; };
std::string text1() { return "grammar"; };
std::string text2() { return "value"; };
struct start :
seq<
value_<&my_grammar::text0>,
value_<&my_grammar::text1>,
value_<&my_grammar::text2>
>
{ };
};
void grm_value_test()
{
std::cout << "grm_value_test ing..." << std::endl;
std::vector<std::string> texts;
texts.push_back(std::string("hello"));
texts.push_back(std::string("grammar"));
texts.push_back(std::string("value"));
my_grammar the_grammar;
BOOST_CHECK( match<typename grammar_start<my_grammar>::type >
(texts, the_grammar));
}
biscuit has no limitation on the value_type of Forward Range parsed. As std::vector<std::string>
is a Forward Range of std::string
, it works. Keep in mind that the UserState
object is now your grammar object.
Full example
Here is a port of Boost.Spirit's calculator:
struct calculator : grammar< calculator, std::string >
{
void do_int(iterator_type str, iterator_type end)
{
std::string s(str, end);
std::cout << "PUSH(" << s << ')' << std::endl;
}
void do_add(iterator_type, iterator_type)
{ std::cout << "ADD\n"; }
void do_subt(iterator_type, iterator_type)
{ std::cout << "SUBTRACT\n"; }
void do_mult(iterator_type, iterator_type)
{ std::cout << "MULTIPLY\n"; }
void do_div(iterator_type, iterator_type)
{ std::cout << "DIVIDE\n"; }
void do_neg(iterator_type, iterator_type)
{ std::cout << "NEGATE\n"; }
struct expression;
struct term;
struct factor;
struct expression :
seq<
term,
star<
or_<
actor_< seq< str<'+'>, term >,
&calculator::do_add >,
actor_< seq< str<'-'>, term >,
&calculator::do_subt >
>
>
>
{ };
struct term :
seq<
factor,
star<
or_<
actor_< seq< str<'*'>, factor >,
&calculator::do_mult >,
actor_< seq< str<'/'>, factor >,
&calculator::do_div >
>
>
>
{ };
struct factor :
or_<
actor_< plus<digit>, &calculator::do_int >,
seq< str<'('>, expression, str<')'> >,
actor_< seq< str<'-'>, factor >,
&calculator::do_neg >,
seq< str<'+'>, factor >
>
{ };
struct start : expression { };
};
actor_
makes an actor from the member function pointer. Enjoy the simplicity, compile-time performance and the small-size of the executable.
Debugger
biscuit emulates Boost.Spirit's debugging:
struct start_tag { };
struct integer_tag { };
struct factor_tag { };
struct term_tag { };
struct expression_tag { };
template< class ForwardRange >
struct calculator_debug :
grammar< calculator_debug, ForwardRange >,
private boost::noncopyable
{
calculator_debug(std::stack<long>& eval_) :
eval(eval_) { }
void push_int(iterator_type first, iterator_type last)
{
std::string s(first, last);
long n = boost::lexical_cast<long>(s);
eval.push(n);
std::cout << "push\t" << long(n) << std::endl;
}
struct integer;
struct factor;
struct term;
struct expression;
struct start : debugger<start_tag,
expression
>
{ };
struct integer : debugger<integer_tag,
actor_< plus<digit>,
&calculator_debug::push_int >
>
{ };
debugger
uses the type-name of the first argument for outputs. If your grammar
is a class template like the above, type-name can be very long. I think you want to define start_tag
etc. Well, the debugger automatically disappears on release-compile.
Outputs:
1 + 2
struct start_tag: "1+2"
struct expression_tag: "1+2"
struct term_tag: "1+2"
struct factor_tag: "1+2"
struct integer_tag: "1+2"
push 1
/struct integer_tag: "+2"
/struct factor_tag: "+2"
/struct term_tag: "+2"
struct term_tag: "2"
struct factor_tag: "2"
struct integer_tag: "2"
push 2
/struct integer_tag: ""
/struct factor_tag: ""
/struct term_tag: ""
popped 1 and 2 from the stack. pushing 3 onto the stack.
/struct expression_tag: ""
/struct start_tag: ""
-------------------------
Parsing succeeded
result = 3
-------------------------
Points of interest
You can find more on composing inlined algorithms at Boost.MPL TODO list. YARD and biscuit are examples of it. By the way, this article was actually a hopeless war vs. Boost.Spirit. But don't you think another war will break out?
A snippet:
using namespace cranberry;
int x = 7;
int a = apply< plus< _1, _2 > >(x)(8);
int b = apply< plus< _1, _2 > >()(7, 8);
int c = apply< plus< _1, _2 > >(7, 8)();
int d = apply< plus< _1, int_<8> > >(7)();
int e = apply< bind< _1, _2, _3 > >
(std::plus<int>())(7, 8);
int f = apply< bind< _1, _2, _3 > >
(&free_plus)(7, 8);
int ar[5] = {1,2,3,4,5};
std::transform(ar, ar+5, ar,
apply< plus< _1, _2 >, int const& >(x));
std::transform(ar, ar+5, ar,
apply< bind< _1, _2, _3 > >(std::plus<int>(), x));
std::transform(ar, ar+5, ar,
apply< bind< _1, _2, int_<7> > >(&free_plus));
std::for_each(ar, ar+5, apply< std_cout< _1 > >());
References
Release notes
- Fixed the name confusion between limit and repeat.
- Directives became first-class parsers.
- Added debugger parser.