(untagged)

An Introduction to the Boost Spirit Parser framework

Dave Handley

0.00/5 (No votes)

9 Oct 2004

Basic introduction to producing parsers with the boost::spirit library.

Introduction

This is the first of a two part article describing the use of the boost::spirit parser framework. Spirit is the best fully object oriented parser that I have seen, allowing a user to rapidly create fully functional parsers using highly object oriented code. This first article is intended to introduce a programmer to writing enough code to successfully parse some simple input. I will not deal with any semantic actions. The next article will actually plumb in some semantic actions and create a fully working modular arithmetic calculator.

Background

Over the years, it seems that I have been doomed to write numerous parsers. My first was a programming language parser, written directly in machine code on an 8-bit machine, written when I was about 16. Since then I've used Lex and Yacc; I've written my own generic LALR parser; done custom parsers; and pretty much faced every problem imaginable. I always liked Lex and Yacc, but have never been happy with them when I'm writing object oriented code. Even the supposedly object based variants of these tools are not completely satisfactory in my view. Last year, I decided to have a look at Spirit. Spirit had just been added to the Boost library, of which I am a huge fan, and I decided to have a go. Initially, I wrote a functional programming language, but in the article, I am using a modular arithmetic calculator. I wrote the modular arithmetic calculator to help me with a Masters course in number theory, but I don't intend to explain what modular arithmetic is, so if you can't figure out what is being returned from a divide, don't expect this article to answer the question!

Spirit is a fully object oriented LL parser and lexical analyzer. Since a lexical analyzer is actually nothing more than a parser optimized to process data into token streams, Spirit treats both parts of the process virtually identically. Essentially, in Spirit, a parser is a small object that knows how to parse a particular string. For example, real_p is actually a parser to parse a real number. This is one of a whole set of parsers that are built into Spirit. Other examples include: ch_p('+') or alpha_p, matching a single plus character or a single alphabetic character respectively.

Spirit then uses a whole variety of overloaded operators to allow a programmer to chain together various parsers using a syntax that is remarkably similar to Extended Backus Naur Form (EBNF). EBNF is a simple extension to BNF that introduces concepts such as the Kleene Star (familiar to those who use regular expressions, it means zero or more of the preceding entity). Using the Kleene Star allows many of the recursive definitions common to BNF to be replaced with iterative definitions.

Please note that this code will only compile on Visual C++ 7.1 or later, and requires the latest Boost library to be installed, and set up in your include path.

Writing a Parser

Writing a parser with Spirit could not be easier. A grammar must be described by inheriting off boost::spirit::grammar, and the parsing is carried out by calling the boost::spirit::parse function. That, to all intents and purposes, is it!

The grammar must follow a specific layout. The grammar itself inherits from the boost::spirit::grammar template, using the curiously recurring template pattern. Inside the grammar, you have to define a nested struct called definition that will contain the definition of your parser, along with a function called start() that returns the start rule. The following code snippet defines the parser used in my modular arithmetic calculator. The grammar includes two semantic actions, which I will ask you to ignore for the time being. The concept of semantic actions are intended for my next article.

struct Syntax :
    public boost::spirit::grammar<Syntax>
{
public:
    Syntax( CParser &parser );
    virtual ~Syntax();
    template <typename ScannerT>
    struct definition
    {
    public:
        definition( Syntax const &self )
        {
            integer =
                lexeme_d[ (+digit_p) ]
                ;
            factor =
                integer |
                vars |
                '(' >> expression >> ')' |
                ( '-' >> factor ) |
                ( '+' >> factor )
                ;
            term =
                factor >> *( ( '*' >> factor) | ( '/' >> factor ) )
                ;
            expression =
                term >> *( ( '+' >> term ) | ( '-' >> term ) )
                ;
            assignment =
                vars
                >> '=' >> expression
                ;
            var_decl =
                lexeme_d
                [
                    ( ( alpha_p >> *( alnum_p | '_' ) )
                    - vars )[vars.add]
                ]
                ;
            declaration =
                "int" >> var_decl >> *( ',' >> var_decl )
                ;
            baseExpression =
                str_p( "exit" )[*self.m_finishedProcessing] |
                str_p( "mod" ) >> integer |
                declaration |
                assignment |
                '?' >> expression
                ;
        }
        boost::spirit::symbols<int> vars;
        boost::spirit::rule<ScannerT> integer, factor, term,
            expression, assignment, var_decl, declaration,
            baseExpression;
        const boost::spirit::rule<ScannerT> &start() const { return baseExpression; }
    };
    friend struct definition;
private:
    Private::FinishedProcessing *m_finishedProcessing;
};

As you can see, the definition struct constructor actually describes the grammar as a series of EBNF rules. If we look in detail at a couple of these, the term rule consists of a single factor followed by zero or many (the Kleene Star, *) copies of a '*', followed by a factor or a '/', followed by a factor. It corresponds to the following EBNF rule:

term ::= factor ( ( '*' factor ) | ( '/' factor ) )*

Another rule to look a little more closely at is the integer rule. This uses the lexeme_d directive to tell the parser not to skip white space. Then, +digit_p is a parser matching one or more numeric characters.

Semantic actions are included in the var_decl and baseExpression rules, the code being: [vars.add] and [*self.m_finishedProcessing]. As I said earlier, I will not explain these in detail, although one of them is used to implement the symbol table for variables, and the other allows you to type exit to close the calculator.

Final points of note are that the rules that have been defined in the definition constructor, must be actually declared in the class, they are of the type boost::spirit::rule<ScannerT>. Also, the start function is simply going to return the entry point rule, in this case baseExpression.

Running the Parser

Running the parser is simplicity in itself. If we have a grammar defined called Syntax then a basic parser could be run on the line of text stored in std::string line using:

boost::spirit::parse_info<> info;
Syntax parser;
info = boost::spirit::parse( line.c_str(), parser, boost::spirit::space_p );

The results class boost::spirit::parse_info tells you whether the parser has successfully parsed all the input or not, and if not, it gives an indication of where failure occurred. The final parameter is a skip parser, and tells the parser to skip whitespace in the input. The parse function either takes a string, or a pair of iterators for begin and end, allowing considerable flexibility in usage.

Spirit Problems

I can't say that programming in the Spirit parser has not caused me some problems. I'll explain a few of them here to try and give you an idea of what you may face if you decide to use this library.

Compilation Errors

Compilation errors produced when using Spirit often run to multiple pages of impenetrable garbage. Unfortunately, the complex templatised code used with the Spirit library causes the Microsoft compiler to report the error as each layer of templatised code is unwrapped. In order to debug a compile problem, you quite often have to gradually work through each reported layer of errors, examining each carefully, only to eventually discover that you missed a semicolon. Also, under extreme circumstances, you can get an error that results in a C1001 internal compiler error with no idea of why that has happened. I have personally seen the dreaded C1001 simply because I had forgotten to declare a functor. Ouch! The writers of Spirit are pushing the capabilities of C++ to its limits, and unfortunately, the compiler writers are only just starting to implement many of these features. Spirit will not compile on Visual C++ 7.0, because it needs the partial template specialization features (and other features) only introduced on Visual C++ 7.1. Don't think that things will be any easier on other compilers. The errors reported by GCC can be just as impenetrable. Don't take this as a reason not to use Spirit, I think you gradually get used to the errors and you can live with them, but don't expect an easy ride straight away.

Compilation Times

Again, because Spirit pushes the compiler to its limits, expect compilation times to be slow, especially on complex grammars. Be very careful about including your parser in too many CPP files, and this issue will be manageable. The Spirit documentation covers various techniques to try and simplify your parsers and reduce compilation times should things get out of hand.

Minimal Rebuilds

As far as I can tell, the minimal rebuilds option in Visual Studio and Spirit do not mix. I have spent quite a while trying to debug a parser which isn't working as I expect, only to find that it isn't recompiling because the minimal rebuild framework fails to pick up a change in the depths of one of the template definitions. Turn minimal rebuild off as soon as you incorporate Spirit into your application, then you should have no problems.

Impenetrable Grammar

At first glance, the complex templates and huge amount of operator overloading that are all part of the Spirit library, can make the grammar impenetrable. My recommendation, at least initially, is to treat all the general plumbing code as boilerplate, and just concentrate on the main parser code. Brush up on your Extended Backus Naur Form (EBNF) and you'll see that everything looks similar. At this point, it all starts to make sense.

Conclusions

The boost::spirit library contains an impressive suite of classes to make the creation of object oriented parsers both quick and clean. It has the advantage of being very quick and easy to use without having to pre-process grammar files or auto-generate code using external tools. It also has its problems. In order to implement the library, the programmers have used many cutting edge features of ANSI Standard C++, many of which are only just starting to arrive in mainstream compilers. As such, a programmer can expect a few issues, most notably impenetrable error messages, and slow compilation times. My advice, nevertheless, is to have a go at using it, the clean definition of parsers makes it a definite candidate for writing simple parsers.

History

9 Oct 04: Version 1 released.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

A list of licenses authors might use can be found here