(untagged)

Crafting an interpreter Part 3 - Parse Trees and Syntax Trees

Martin.Holzherr

0.00/5 (No votes)

17 May 2005

Third article on building a language interpreter describing the generation of parse trees and syntax trees.

Introduction

In this third article on building an interpreter, I will show how to generate a physical tree from the source text, which must be parsed. Building a physical tree from the source text has many advantages: it allows detailed analysis, transformations and optimizations of the original source. A tree together with symbol tables compiled from tree visits can even serve as basis for a simple interpreter.

Tree generation is particularly easy for a recursive descent parser, since a recursive descent parser consists of a set of functions - one for each grammar rule. To generate a tree, the grammar-rule function must additionally add a new child to the already built tree, where the new child represents the source part matched by the corresponding grammar rule.

With a tree, language analysis and code generation becomes much easier. But despite this and the fact, that it is easy to generate such a physical tree with most parsers, early compilers (e.g. the first Pascal or C compilers) did not generate such a tree. This is due to the fact, that the overhead in memory usage can be overwhelming - and memory is a resource which was much more precious in the early days of computing.

In a PEG parser [1][2] as used in this article series, tree generation is complicated by backtracking. In case of a parsing failure, the part of the tree belonging to the failed rule must be removed again. Exceptions during parsing must be handled exactly like backtracking.

As a useful application of tree generation, I will present the core of a parser wizard, which takes a set of grammar rules in source text form and generates the skeleton of a parser for this grammar. Without having a parse tree, such a parser wizard (simple form of parser generator) would be difficult to build.

The Tree advantage

The following table shows some problems which can be easily solved when having a physical tree representing the source text, but which are difficult otherwise.

Task	Source	Tree Solution	Advantage of tree representation
Optimize Code	arithmetic expression: x= x + 1	arithmetic expression optimized: x += 1	Easy to recognize when looking from the parent node.
Recognize Out of Order Elements	PEG grammar rule: Int ( ('+' / '-') Int )*	template implementation of PEG rule: And< Int, OptRepeat< And< OneOfChars<'+','-'>, Int > > >	To automatically translate `Int(('+'/'-') Int)` into the template shown under Tree Solution, a tree representation can help very much. The problem here: appears at the end, but must be placed as `OptRepeat` at the front. Easy to handle when looking from the parent node.
Interpret loop	while loop: while(i>0){ j+= i;i--; }	using tree evaluating function calls while( ConditionTrue(t) ){ EvalStat(t->child); }	A tree can be easily evaluated by passing sibling and child nodes to evaluation functions.
Multiple Passes	any language source	using tree walks starting at `pContext.tree` if(G::Matches(pIn,&pContext)){ pContext.Analyze(); pContext.OptimizeTree(); return pContext.GenCode(); }	Multiple passes over the same source without the need for reparsing.

Parse Trees and Syntax Trees

A parse tree uses one physical tree node per nonterminal, what usually results in huge trees. A syntax tree, often called abstract syntax tree or abbreviated AST is a parse tree where most nonterminals have been removed. The difference is memory usage as the comparison of the parse and the syntax tree for the following PEG grammar shows:

//PEG grammar with start rule expr. 
//Note: this grammar does not accept any white space
expr : mexpr (('+'/'-')  mexpr )*;
mexpr: atom  ('*'  atom )*;
atom : [0-9]+ / '(' expr ')';

sample expr	Parse Tree	Abstract Syntax Tree
2*(3+5)	expr< mexpr< atom<'2'> '*' atom< expr< mexpr<atom<'3'>> '+' mexpr<atom<'5'>> > > > >	mexpr<'2' '*' expr<'3' '+' '5'>>
Storage usage: 8 bytes	Storage usage: >= 14*12=168 bytes	Storage usage: >= 7*12 =84 bytes

Compared to the source text, any tree representation will use much more memory, but parse trees can be especially huge. Tree building also takes much more time than just recognizing the input as shown by the following output taken from the demo program coming with this article:

-p
>> parse a 100000 char buffer filled with  2*(3+5)
>> CPU usage ( Syntax tree with 87504 nodes built) = 0.17 seconds
>> CPU usage ( Parse tree with 175003 nodes built) = 0.211 seconds
>> CPU usage (              Expression recognizer) = 0.03 seconds

A tenfold increase in parsing time compared to a simple recognizer is not uncommon.

A tree node typically has at least the following members:

two pointer members - one pointing to the first child, other pointing to the sibling.
a member used for node identification
a virtual destructor in order to support derived nodes.

My current PEG implementation has the following node interface:

template<typename It >
struct TNode{
   typedef  TNode<It>        Self;
            TNode (int id=0, It beg=0, It end=0, Self* firstChild=0);
   virtual ~TNode (){} //support derived nodes


   virtual Self* AddChild (TNode* pChild);
  
   int   id;
   Self    *next,*child;
   BegEnd<It>  match; //references the source text matched by this node

};

Integrating Tree Generation into a PEG Parser

Tree generation can be added to the template based implementation of a PEG parser just by the use of a handful of tree generating templates. To use these templates, the Context structure passed to the Matches function must be derived from the predefined TreeContext. Each of these tree generating templates has a type parameter responsible for recognizing a PEG fragment and an optional integer value parameter used as identifying key for the generated tree node. The following table lists the available tree generating templates.

Tree generating template	Meaning, Effect
`template<int id,typename TRule> struct TreeNTWithId`	Builds a tree node representing the rule `TRule`. This node gets the ID `id`. All tree nodes created inside `TRule` will be descendants of this tree node.
`template<int id,typename TRule> struct AstNTWithId`	Same effect as `TreeNTWithId`, but with automatic replacement by the child node if the child node has no sibling.
`template<typename TRule> struct TreeNT`	Same effect as `TreeNTWithId`, but the `id` set to `eTreeNT`.
`template<typename TRule> struct AstNT`	Same effect as `AstNTWithId`, with the `id` set to `eTreeNT`.
`template<int id,typename TTerminalRule> struct TreeCharsWithId`	Builds a tree node having the source text recognized by `TTerminalRule` as match string. This node gets the ID `id`.
`template<int id,typename TTerminalRule> struct TreeChars`	Same effect as `TreeCharsWithId`, with the `id` set to `eTreeSpecChar`.
`template<typename T0, typename T1, ...> struct TreeSafeAnd`	Replacement for And in case of using tree templates. The use of `TreeSafeAnd` guarantees correct behaviour in case of backtracking.

The practical usage of this tree generating templates will be shown with a grammar for expressions supporting the operations '+', '-', '*', where parentheses can be used around subexpressions.

Topic	Recognizer Grammar	Parse Tree Grammar
Grammar Specification	expr : mexpr (('+'/'-') mexpr ); mexpr: atom ('' atom )*; atom : [0-9]+ / '(' expr ')';	// ^ means: generate tree // node for next terminal expr : mexpr (^('+'/'-') mexpr ); mexpr: atom (^'' atom )*; atom : ^([0-9]+) / '(' expr ')';
Rules as Types	using namespace peg_templ; struct expr{ typedef //atom : [0-9]+ / '(' expr ')'; Or< PlusRepeat<In<'0', '9' > >, And<Char<'('>, expr, Char<')'> > > atom; typedef //mexpr: atom ('' atom ); And< atom, OptRepeat< And<Char<''>, atom > > > mexpr; typedef //expr : mexpr (('+'/'-') mexpr ); And< mexpr, OptRepeat< And< Or<Char<'+'>, Char<'-'> >, mexpr > > > expr_rule; template< typename It, typename Context > static bool Matches(It& p,Context* pC) { return expr_rule::Matches(p,pC); } };	using namespace peg_tree; enum{eAtom,eMExpr,eInt,eExpr}; struct exprT{ typedef TreeNTWithId<eAtom, Or< TreeCharsWithId<eInt, PlusRepeat<In<'0', '9' > > >, And<Char<'('>, exprT, Char<')'> > > > atom; typedef TreeNTWithId<eMExpr, And< atom, OptRepeat< TreeSafeAnd<TreeChars <Char<''> >, atom > > > > mexpr; typedef TreeNTWithId<eExpr, And< mexpr, OptRepeat< TreeSafeAnd< TreeChars<Or<Char <'+'>, Char<'-'> > >, mexpr > > > > exprT_rule; template< typename It, typename Context > static bool Matches(It& p,Context pC) { return exprT_rule::Matches(p,pC); } };
Calling the Parser	struct EmptyContext{}; bool CallRecognizingParser() { EmptyContext c; typedef const unsigned char* Iterator; const char* sTest= "2(3+5)"; Iterator it= (Iterator)sTest; return expr::Matches(it,&c)*; }	bool CallTreeBuildingParser() { typedef const unsigned char* Iterator; //either use TreeContext //or derived class peg_tree::TreeContext <Iterator> tc; const char* sTest= "2(3+5)"; Iterator it= (Iterator)sTest; return exprT::Matches(it,&tc);* }

The following chapters will introduce Parser Generators and Parser Wizards as examples of complex tree building parsers. A Parser Generator reads a grammar specification and outputs a Parser. A Parser generator typically builds a tree representation of the grammar specification provided by the user, because it must analyze and transform the user input before it can generate the appropriate parser.

Parser Generators

A well known product category in the parser business are the so called Parser Generators like e.g., YACC and its successors or to name a more recent development ANTLR. They take an annotated grammar in text form and generate a parser for it. These annotated grammars contain not only the grammar rules (using a notation close to what I introduced to explain PEG parsers), but also the code (the so called semantic actions), which shall be executed by the generated parser when it passes the annotation point. This is shown in the following excerpt from the ANTLR documentation [3] for a grammar which corresponds to the following PEG grammar:

//PEG grammar for expressions supporting 
//the operations '+','-','*' on integers.
expr : S mexpr S(('+'/'-') S mexpr S)*;
mexpr: atom  S('*' S atom S)*;
atom : [0-9]+ / '(' expr ')';
S    : [ \t\v\r\n]*;

ANTLR parser specification for expression recognizer (semantic actions bold) ANTLR parser specification for expression parser with semantic actions (semantic actions bold)

//Note, that ANTLR uses a separate scanner, PLUS below therefore corresponds to '+' which has been defined in the scanner specification.

class ExprParser extends Parser;

expr: mexpr ((PLUS|MINUS) mexpr)*
    ;

mexpr
    : atom (STAR atom)*
    ;

atom: INT
    | LPAREN expr RPAREN

class ExprParser extends Parser;

expr returns [int value=0]
{int x;}

    :      value=mexpr
        ( PLUS x=mexpr    {value += x;}
        | MINUS x=mexpr {value -= x;}

        )*
    ;

mexpr returns [int value=0]
{int x;}

    : value=atom 
      ( STAR x=atom {value *= x;} )*
    ;

atom returns [int value=0]

    : i:INT {value=
         Integer.parseInt(i.getText());}
    |   LPAREN value=expr RPAREN
    ;

Typically, a parser generator copies the code annotations (the bold parts in the ANTLR spec. above) to the appropriate places in the generated parser. The next step is to compile the generated code (using e.g., your Java compiler). In case of a compile error or an unexpected behaviour when testing the parser, you either go back editing the original grammar specification or you edit the generated parser code directly. In the second case, you'll break the relation between the specification and the generated parser. But sooner or later you will do so and when using ANTLR you will have the luck, that the generated code is quite readable. Unfortunately the same is not true for YACC type parsers, which generate really impenetrable code. Stroustrap, the inventor of C++, experienced this when using YACC for his Cfront C++ front-end. "In retrospect, for me and C++ it was a bad mistake." [to use YACC].. "Cfront has a YACC parser supplemented by much lexical trickery relying on recursive descent techniques."[4]

But I believe, that even with integrated parsers like boost::spirit or Perl 6, something similar to a parser generator can still be useful. Such a tool, which generates a starter parsing application, has a similar purpose as e.g., the MFC application wizard, namely to generate the initial objects and functions, which then will be edited by the programmer.

The PegWizard Parser Wizard

The PegWizard parser wizard has a similar job profile as the MFC Application Wizard:

The MFC Application Wizard generates an application having built-in functionality, that, when compiled, implements the basic features of a Windows executable application. The MFC starter application includes C++ source (.cpp), resource files, header (.h) files, and a project (.vcproj) file. The code generated in these starter files is based on MFC.[5]

Translated to the case of the PegWizard Parser Wizard:

The PegWizard Parser Wizard generates an application having built-in functionality, that, when compiled, implements the basic features of a parser application. The PegWizard starter application includes C++ source (.cpp) and header (.h) files. The code generated in these starter files is based on the PEG parsing methodology.

The PegWizard parser wizard should do the following:

Meaningful Parser Wizard Functionality	Implementation Status
Check completeness of grammar: for each used Nonterminal, there should be a corresponding rule.	Implemented in PegWizard 0.01
Generate Recognizers for arbitrary PEG grammar.	Implemented in PegWizard 0.01
Check for left recursive grammar rules (serious error).	Planned for version 0.02
Generate Tree Builder for arbitrary PEG grammar.	Planned for version 0.02
Support error annotations in PEG grammar.	Planned for version 0.02
Support File Iterators, Unicode Iterators..	Planned for version 1.00
Support for automatic error annotations.	Planned for version 1.00

The following table shows a sample usage of the PegWizard parser wizard V0.01.00:

command line activation	PegWizard -g TestGrammar\expr.gram -o TestGrammar
program output on command line	Peg0Wizard 0.01.00 (alpha) [20050508] INFO: file content of 'TestGrammar\expr.gram' read in INFO: saved the generated parser in 'TestGrammar\expr.cpp' program ended successfully
content of 'expr.gram' file	//PEG grammar for expressions supporting //the operations '+','-','' on integers. expr : S mexpr S(('+'/'-') S mexpr S); mexpr: atom S('' S atom S); atom : [0-9]+ / '(' expr ')'; S : [ \t\v\r\n]*;
code excerpt of generated parser	#include "peg_templ.h" #include <iostream> using namespace peg_templ; struct expr; //forward declarations struct mexpr; struct atom; struct S; struct expr{ //grammar rule implementation typedef And< S, mexpr, S, ...
test run output	'2*(3+5)' is matched by grammar

The input grammar file must adhere to the following PEG grammar:

Grammar:        S GrammarRule (S GrammarRule)* S ;
GrammarRule:    Nonterminal S ':' S RSideOfRule ;
RSideOfRule:    Alternative (S '/' S Alternative)* S ';' ;
Alternative:    Term (S Term)* ;
Term:        ('&' / '!')? S Factor S ( '*' / '+' / '?')?  ;
Factor:        (Nonterminal / Terminal / '(' S RSideOfRule S ')' ) ;
Nonterminal:    [a-zA-Z0-9_]+ ;
Terminal:    LiteralString | Charset | '.';
Charset:    '['  (CharsetRange / EscapeChar / CharsetChar   )+  ']' ;
LiteralString:  '\'' ( EscapeChar / !'\'' PrintChar )+ '\'' ;
CharsetChar:    !']' PrintChar ;
EscapeChar:    '\\' ( HexChar / PrintChar  ) ;
CharsetRange:    ( EscapeChar / CharsetChar ) '-' ( EscapeChar / CharsetChar ) ;
HexChar:    '(x|X)(' Hexdigit+ ')' ;
S:              ([ \v\t\r\n] | '//' [ \v\r\t]*  '\n')* ;
PrintChar:    //determined by 'isprint(c)' ;

Why the PegWizard internally builds a Parse Tree

The PegWizard reads a PEG grammar file and generates the corresponding C++ source file. This would be difficult without having a parse tree, because a PEG grammar uses postfix notation (e.g. the quantifiers '*', '?', '+' appear at the end of the construct to which they apply), but the generated code uses prefix operators. Furthermore, code generation is only one of many tasks of the PegWizard. Other tasks are checking for left recursive rules, checking for completeness of the grammar, automatic insertion of error handling code and so on. Without having a tree, this would require reparsing the grammar file.

Generally speaking, a parse or syntax tree makes almost anything easier. The only disadvantages of a physical tree are the high runtime and memory costs. But in the case of the PegWizard this does not matter, because grammar files are in general quite small. These days a language compiler builds almost certainly a physical tree out of the source text. Cfront [4], Stroustrap's C front end, e.g., builds tree structures for all global elements and for the current function.

Future Directions

The PegWizard presented in this article as a tree building parser will be improved and shall soon be a useful tool. Currently, it only generates recognizing parsers.

Besides teaching the subject - namely parsing - a goal of this ongoing article series is to build highly efficient parsers. This will be investigated by comparing hand crafted parsers with schematically constructed ones.

Take-Home Points

It is easy to build syntax trees or parse trees during parsing. A syntax tree makes life (parsing and the later steps) much easier, but has its costs. When parsing speed matters and it seems feasible to go without a syntax tree, this should be preferred.

History

2005 May: initial version.

References

Parsing Expression Grammars, Bryan Ford, MIT, January 2004 [^]
Parsing Expression Grammars, Wikipedia, the free encyclopedia [^]
ANTLR Parser Generator [^]
Bjarne Stroustrap, The Design and Evolution of C++, Addison Wesley 1995.
MFC Application Wizard [^]

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

A list of licenses authors might use can be found here