Introduction
In this third article on building an interpreter, I will show how to generate a physical tree from the source text, which must be parsed. Building a physical tree from the source text has many advantages: it allows detailed analysis, transformations and optimizations of the original source. A tree together with symbol tables compiled from tree visits can even serve as basis for a simple interpreter.
Tree generation is particularly easy for a recursive descent parser, since a recursive descent parser consists of a set of functions - one for each grammar rule. To generate a tree, the grammar-rule function must additionally add a new child to the already built tree, where the new child represents the source part matched by the corresponding grammar rule.
With a tree, language analysis and code generation becomes much easier. But despite this and the fact, that it is easy to generate such a physical tree with most parsers, early compilers (e.g. the first Pascal or C compilers) did not generate such a tree. This is due to the fact, that the overhead in memory usage can be overwhelming - and memory is a resource which was much more precious in the early days of computing.
In a PEG parser [1][2] as used in this article series, tree generation is complicated by backtracking. In case of a parsing failure, the part of the tree belonging to the failed rule must be removed again. Exceptions during parsing must be handled exactly like backtracking.
As a useful application of tree generation, I will present the core of a parser wizard, which takes a set of grammar rules in source text form and generates the skeleton of a parser for this grammar. Without having a parse tree, such a parser wizard (simple form of parser generator) would be difficult to build.
The Tree advantage
The following table shows some problems which can be easily solved when having a physical tree representing the source text, but which are difficult otherwise.
Task
| Source |
Tree Solution |
Advantage of tree representation |
Optimize Code |
arithmetic expression:
x= x + 1 |
arithmetic expression optimized:
x += 1 |
Easy to recognize when looking from the parent node. |
Recognize Out of Order Elements |
PEG grammar rule:
Int
(
('+' / '-') Int
)*
|
template implementation of PEG rule:
And<
Int,
OptRepeat<
And<
OneOfChars<'+','-'>,
Int
>
>
> |
To automatically translate Int(('+'/'-') Int)* into the template shown under Tree Solution, a tree representation can help very much. The problem here: * appears at the end, but must be placed as OptRepeat at the front. Easy to handle when looking from the parent node. |
Interpret loop |
while loop:
while(i>0){
j+= i;i--;
} |
using tree evaluating function calls
while( ConditionTrue(t) ){
EvalStat(t->child);
} |
A tree can be easily evaluated by passing sibling and child nodes to evaluation functions. |
Multiple Passes
| any language source |
using tree walks starting at pContext.tree if(G::Matches(pIn,&pContext)){
pContext.Analyze();
pContext.OptimizeTree();
return pContext.GenCode();
} |
Multiple passes over the same source without the need for reparsing. |
Parse Trees and Syntax Trees
A parse tree uses one physical tree node per nonterminal, what usually results in huge trees. A syntax tree, often called abstract syntax tree or abbreviated AST is a parse tree where most nonterminals have been removed. The difference is memory usage as the comparison of the parse and the syntax tree for the following PEG grammar shows:
//PEG grammar with start rule expr.
//Note: this grammar does not accept any white space
expr : mexpr (('+'/'-') mexpr )*;
mexpr: atom ('*' atom )*;
atom : [0-9]+ / '(' expr ')';
sample expr
| Parse Tree |
Abstract Syntax Tree |
2*(3+5) |
expr<
mexpr<
atom<'2'>
'*'
atom<
expr<
mexpr<atom<'3'>>
'+'
mexpr<atom<'5'>>
>
>
>
> |
mexpr<'2' '*' expr<'3' '+' '5'>>
|
Storage usage: 8 bytes |
Storage usage: >= 14*12=168 bytes |
Storage usage: >= 7*12 =84 bytes |
Compared to the source text, any tree representation will use much more memory, but parse trees can be especially huge. Tree building also takes much more time than just recognizing the input as shown by the following output taken from the demo program coming with this article:
-p
>> parse a 100000 char buffer filled with 2*(3+5)
>> CPU usage ( Syntax tree with 87504 nodes built) = 0.17 seconds
>> CPU usage ( Parse tree with 175003 nodes built) = 0.211 seconds
>> CPU usage ( Expression recognizer) = 0.03 seconds
A tenfold increase in parsing time compared to a simple recognizer is not uncommon.
A tree node typically has at least the following members:
- two pointer members - one pointing to the first child, other pointing to the sibling.
- a member used for node identification
- a virtual destructor in order to support derived nodes.
My current PEG implementation has the following node interface:
template<typename It >
struct TNode{
typedef TNode<It> Self;
TNode (int id=0, It beg=0, It end=0, Self* firstChild=0);
virtual ~TNode (){}
virtual Self* AddChild (TNode* pChild);
int id;
Self *next,*child;
BegEnd<It> match;
};
Integrating Tree Generation into a PEG Parser
Tree generation can be added to the template based implementation of a PEG parser just by the use of a handful of tree generating templates. To use these templates, the Context structure passed to the Matches
function must be derived from the predefined TreeContext
. Each of these tree generating templates has a type parameter responsible for recognizing a PEG fragment and an optional integer value parameter used as identifying key for the generated tree node. The following table lists the available tree generating templates.
Tree generating template |
Meaning, Effect |
template<int id,typename TRule> struct TreeNTWithId |
Builds a tree node representing the rule TRule . This node gets the ID id . All tree nodes created inside TRule will be descendants of this tree node. |
template<int id,typename TRule> struct AstNTWithId |
Same effect as TreeNTWithId , but with automatic replacement by the child node if the child node has no sibling. |
template<typename TRule> struct TreeNT |
Same effect as TreeNTWithId , but the id set to eTreeNT . |
template<typename TRule> struct AstNT |
Same effect as AstNTWithId , with the id set to eTreeNT . |
template<int id,typename TTerminalRule> struct TreeCharsWithId |
Builds a tree node having the source text recognized by TTerminalRule as match string. This node gets the ID id . |
template<int id,typename TTerminalRule> struct TreeChars |
Same effect as TreeCharsWithId , with the id set to eTreeSpecChar . |
template<typename T0, typename T1, ...> struct TreeSafeAnd |
Replacement for And in case of using tree templates. The use of TreeSafeAnd guarantees correct behaviour in case of backtracking. |
The practical usage of this tree generating templates will be shown with a grammar for expressions supporting the operations '+', '-', '*', where parentheses can be used around subexpressions.
Topic
| Recognizer Grammar |
Parse Tree Grammar |
Grammar Specification |
expr :
mexpr (('+'/'-') mexpr )*;
mexpr: atom ('*' atom )*;
atom : [0-9]+ / '(' expr ')';
|
// ^ means: generate tree
// node for next terminal
expr : mexpr (^('+'/'-') mexpr )*;
mexpr: atom (^'*' atom )*;
atom : ^([0-9]+) / '(' expr ')';
|
Rules as Types
| using namespace peg_templ;
struct expr{
typedef
Or<
PlusRepeat<In<'0',
'9' > >,
And<Char<'('>, expr,
Char<')'> >
> atom;
typedef
And<
atom,
OptRepeat<
And<Char<'*'>, atom >
>
> mexpr;
typedef
mexpr )*;
And<
mexpr,
OptRepeat<
And<
Or<Char<'+'>,
Char<'-'> >,
mexpr
>
>
> expr_rule;
template<
typename It,
typename Context
>
static bool
Matches(It& p,Context* pC)
{
return
expr_rule::Matches(p,pC);
}
};
|
using namespace peg_tree;
enum{eAtom,eMExpr,eInt,eExpr};
struct exprT{
typedef
TreeNTWithId<eAtom,
Or<
TreeCharsWithId<eInt,
PlusRepeat<In<'0', '9' > >
>,
And<Char<'('>,
exprT, Char<')'> >
>
> atom;
typedef
TreeNTWithId<eMExpr,
And<
atom,
OptRepeat<
TreeSafeAnd<TreeChars
<Char<'*'> >, atom >
>
>
> mexpr;
typedef
TreeNTWithId<eExpr,
And<
mexpr,
OptRepeat<
TreeSafeAnd<
TreeChars<Or<Char
<'+'>, Char<'-'> > >,
mexpr
>
>
>
> exprT_rule;
template<
typename It,
typename Context
>
static bool
Matches(It& p,Context* pC)
{
return
exprT_rule::Matches(p,pC);
}
}; |
Calling the Parser |
struct EmptyContext{};
bool CallRecognizingParser()
{
EmptyContext c;
typedef const
unsigned char* Iterator;
const char* sTest= "2*(3+5)";
Iterator it= (Iterator)sTest;
return expr::Matches(it,&c);
}
|
bool CallTreeBuildingParser()
{
typedef const
unsigned char* Iterator;
peg_tree::TreeContext
<Iterator> tc;
const char* sTest= "2*(3+5)";
Iterator it= (Iterator)sTest;
return exprT::Matches(it,&tc);
} |
|
The following chapters will introduce Parser Generators and Parser Wizards as examples of complex tree building parsers. A Parser Generator reads a grammar specification and outputs a Parser. A Parser generator typically builds a tree representation of the grammar specification provided by the user, because it must analyze and transform the user input before it can generate the appropriate parser.
Parser Generators
A well known product category in the parser business are the so called Parser Generators like e.g., YACC and its successors or to name a more recent development ANTLR. They take an annotated grammar in text form and generate a parser for it. These annotated grammars contain not only the grammar rules (using a notation close to what I introduced to explain PEG parsers), but also the code (the so called semantic actions), which shall be executed by the generated parser when it passes the annotation point. This is shown in the following excerpt from the ANTLR documentation [3] for a grammar which corresponds to the following PEG grammar:
//PEG grammar for expressions supporting
//the operations '+','-','*' on integers.
expr : S mexpr S(('+'/'-') S mexpr S)*;
mexpr: atom S('*' S atom S)*;
atom : [0-9]+ / '(' expr ')';
S : [ \t\v\r\n]*;
ANTLR parser specification for expression recognizer (semantic actions bold) |
ANTLR parser specification for expression parser with semantic actions (semantic actions bold) |
//Note, that ANTLR uses a separate scanner, PLUS below therefore corresponds to '+' which has been defined in the scanner specification.
class ExprParser extends Parser;
expr: mexpr ((PLUS|MINUS) mexpr)*
;
mexpr
: atom (STAR atom)*
;
atom: INT
| LPAREN expr RPAREN
|
class ExprParser extends Parser;
expr returns [int value=0]
{int x;}
: value=mexpr
( PLUS x=mexpr {value += x;}
| MINUS x=mexpr {value -= x;}
)*
;
mexpr returns [int value=0]
{int x;}
: value=atom
( STAR x=atom {value *= x;} )*
;
atom returns [int value=0]
: i:INT {value=
Integer.parseInt(i.getText());}
| LPAREN value=expr RPAREN
; |
Typically, a parser generator copies the code annotations (the bold parts in the ANTLR spec. above) to the appropriate places in the generated parser. The next step is to compile the generated code (using e.g., your Java compiler). In case of a compile error or an unexpected behaviour when testing the parser, you either go back editing the original grammar specification or you edit the generated parser code directly. In the second case, you'll break the relation between the specification and the generated parser. But sooner or later you will do so and when using ANTLR you will have the luck, that the generated code is quite readable. Unfortunately the same is not true for YACC type parsers, which generate really impenetrable code. Stroustrap, the inventor of C++, experienced this when using YACC for his Cfront C++ front-end. "In retrospect, for me and C++ it was a bad mistake." [to use YACC].. "Cfront has a YACC parser supplemented by much lexical trickery relying on recursive descent techniques."[4]
But I believe, that even with integrated parsers like boost::spirit
or Perl 6, something similar to a parser generator can still be useful. Such a tool, which generates a starter parsing application, has a similar purpose as e.g., the MFC application wizard, namely to generate the initial objects and functions, which then will be edited by the programmer.
The PegWizard Parser Wizard
The PegWizard parser wizard has a similar job profile as the MFC Application Wizard:
The MFC Application Wizard generates an application having built-in functionality, that, when compiled, implements the basic features of a Windows executable application. The MFC starter application includes C++ source (.cpp), resource files, header (.h) files, and a project (.vcproj) file. The code generated in these starter files is based on MFC.[5]
Translated to the case of the PegWizard Parser Wizard:
The PegWizard Parser Wizard generates an application having built-in functionality, that, when compiled, implements the basic features of a parser application. The PegWizard starter application includes C++ source (.cpp) and header (.h) files. The code generated in these starter files is based on the PEG parsing methodology.
The PegWizard parser wizard should do the following:
Meaningful Parser Wizard Functionality |
Implementation Status |
Check completeness of grammar: for each used Nonterminal, there should be a corresponding rule. |
Implemented in PegWizard 0.01 |
Generate Recognizers for arbitrary PEG grammar. |
Implemented in PegWizard 0.01 |
Check for left recursive grammar rules (serious error). |
Planned for version 0.02 |
Generate Tree Builder for arbitrary PEG grammar. |
Planned for version 0.02 |
Support error annotations in PEG grammar. |
Planned for version 0.02 |
Support File Iterators, Unicode Iterators.. |
Planned for version 1.00 |
Support for automatic error annotations. |
Planned for version 1.00 |
The following table shows a sample usage of the PegWizard parser wizard V0.01.00:
command line activation |
PegWizard -g TestGrammar\expr.gram -o TestGrammar |
program output on command line |
Peg0Wizard 0.01.00 (alpha) [20050508]
INFO: file content of 'TestGrammar\expr.gram' read in
INFO: saved the generated parser in 'TestGrammar\expr.cpp'
program ended successfully |
content of 'expr.gram' file |
//PEG grammar for expressions supporting
//the operations '+','-','*' on integers.
expr : S mexpr S(('+'/'-') S mexpr S)*;
mexpr: atom S('*' S atom S)*;
atom : [0-9]+ / '(' expr ')';
S : [ \t\v\r\n]*; |
code excerpt of generated parser |
#include "peg_templ.h"
#include <iostream>
using namespace peg_templ;
struct expr; //forward declarations
struct mexpr;
struct atom;
struct S;
struct expr{ //grammar rule implementation
typedef And<
S,
mexpr,
S,
... |
test run output |
'2*(3+5)' is matched by grammar |
The input grammar file must adhere to the following PEG grammar:
Grammar: S GrammarRule (S GrammarRule)* S ;
GrammarRule: Nonterminal S ':' S RSideOfRule ;
RSideOfRule: Alternative (S '/' S Alternative)* S ';' ;
Alternative: Term (S Term)* ;
Term: ('&' / '!')? S Factor S ( '*' / '+' / '?')? ;
Factor: (Nonterminal / Terminal / '(' S RSideOfRule S ')' ) ;
Nonterminal: [a-zA-Z0-9_]+ ;
Terminal: LiteralString | Charset | '.';
Charset: '[' (CharsetRange / EscapeChar / CharsetChar )+ ']' ;
LiteralString: '\'' ( EscapeChar / !'\'' PrintChar )+ '\'' ;
CharsetChar: !']' PrintChar ;
EscapeChar: '\\' ( HexChar / PrintChar ) ;
CharsetRange: ( EscapeChar / CharsetChar ) '-' ( EscapeChar / CharsetChar ) ;
HexChar: '(x|X)(' Hexdigit+ ')' ;
S: ([ \v\t\r\n] | '//' [ \v\r\t]* '\n')* ;
PrintChar: //determined by 'isprint(c)' ;
Why the PegWizard internally builds a Parse Tree
The PegWizard reads a PEG grammar file and generates the corresponding C++ source file. This would be difficult without having a parse tree, because a PEG grammar uses postfix notation (e.g. the quantifiers '*', '?', '+' appear at the end of the construct to which they apply), but the generated code uses prefix operators. Furthermore, code generation is only one of many tasks of the PegWizard. Other tasks are checking for left recursive rules, checking for completeness of the grammar, automatic insertion of error handling code and so on. Without having a tree, this would require reparsing the grammar file.
Generally speaking, a parse or syntax tree makes almost anything easier. The only disadvantages of a physical tree are the high runtime and memory costs. But in the case of the PegWizard this does not matter, because grammar files are in general quite small. These days a language compiler builds almost certainly a physical tree out of the source text. Cfront [4], Stroustrap's C front end, e.g., builds tree structures for all global elements and for the current function.
Future Directions
The PegWizard presented in this article as a tree building parser will be improved and shall soon be a useful tool. Currently, it only generates recognizing parsers.
Besides teaching the subject - namely parsing - a goal of this ongoing article series is to build highly efficient parsers. This will be investigated by comparing hand crafted parsers with schematically constructed ones.
Take-Home Points
It is easy to build syntax trees or parse trees during parsing. A syntax tree makes life (parsing and the later steps) much easier, but has its costs. When parsing speed matters and it seems feasible to go without a syntax tree, this should be preferred.
History
- 2005 May: initial version.
References
- Parsing Expression Grammars, Bryan Ford, MIT, January 2004 [^]
- Parsing Expression Grammars, Wikipedia, the free encyclopedia [^]
- ANTLR Parser Generator [^]
- Bjarne Stroustrap, The Design and Evolution of C++, Addison Wesley 1995.
- MFC Application Wizard [^]