Introduction
In this second article on building an interpreter, we discuss the implementation of an interpreter for a very simple language. We use a Parsing Expression Grammar [1] [2] to define the syntax and implement the parser. We show how evaluation can be done during parsing and how to extend the grammar, so that error messages can be issued in case of parsing errors. We show furthermore, that a Grammar sometimes must be slightly changed in order to alleviate direct evaluation during parsing.
The Calc0 toy language used for this purpose supports expressions, assignments, variables and print statements. Immediately after starting, the interpreter for Calc0 enters an infinite loop consisting of the following processing steps:
- It reads a line entered by the user.
- It parses the line, and during parsing, it evaluates the recognized expression, assignment or print statement.
- It updates an internal map of variable names and their associated values.
- It outputs the results of the evaluation.
The Calc0 Grammar
The Calc0 Interpreter expects one line of input. The most basic elements on such a line are spaces, identifiers and numbers. We allow only the white space character, tabulators and carriage return as spaces. We choose C/C++ identifiers and allow floating point or integer values as numbers. Expressions support addition, subtraction, multiplication, division and the power operator.
We need a context free grammar which is compatible with the parsing method used. Since we chose the Parsing Expression Grammar formalism, which is an LL parsing method, we must avoid left recursive rules. This results in the following Parsing Expression Grammar:
//first attempt to calc0_grammar
calc0_grammar: S (print_variables / assignement / expression) S '\0';
assignement: identifier S '=' S expression ;
print_variables: '!' S (identifier (S '-' S identifier)?)? ;
expression: multiplicative_expression
(S ('+' / '-') S multiplicative_expression)* ;
multiplicative_expression:
power_expression (S ('*' / '/') S power_expression)* ;
power_expression: primary_expression (S '^' S primary_expression)? ;
primary_expression: ('+' / '-')? S
(identifier / number / '(' S expression S ')') ;
number: ([0-9]+ '.'? / '.'[0-9])
[0-9]* (('e' / 'E') [0-9]+)? ;
identifier: [A-Za-z_][A-Za-z_0-9]* ;
S: [ \r\v\t]* ;
The start rule of this grammar is called calc0_grammar
. The PEG operational interpretation of this rule is the following: parse the input text by first matching against possible spaces, then try to match the remaining input against exactly one of the rules print_variables
, assignement
or expression
(try in this order). If successful, continue parsing by skipping possible spaces, and finally expect the input line terminator which is a zero. The other rules can be interpreted similarly.
Direct Evaluation
Direct evaluation during parsing means that we carry out the arithmetic operations, the variable assignments and variable accesses at the place where we parse them. Additions and subtractions, e.g., have to be done inside the grammar function responsible for parsing an expression
.
expression: multiplicative_expression
(S ('+' / '-') S multiplicative_expression)* ;
The evaluation of the value of expression
is done by adding/subtracting the result values of the components of this rule. Intermediate results are stored in the context structure passed to each grammar rule function. This is shown in the following table:
rule component of expression |
code executed |
effect on evaluation |
expression : multiplicative_expression |
call multiplicative_expression . Then copy the result from the context structure to a local variable. double result= pS->result; |
evaluation value of first rule component is in a local variable of the parsing function of expression |
(S ('+' / '-') S multiplicative_expression)* |
for each loop iteration, we add or subtract the value of the multiplicative_expression parsed in this loop to the local variable result |
result is accumulated in the loop |
The resulting code for actually carrying out the addition inside the parsing loop is shown in the following code fragment:
double result= pS->result;
for(;;){
S::Matches(p,pS);
if( And<Char<'+'>,S,multiplicative_expression>::Matches(p,pS)){
result+= pS->result;
}else if( And<Char<'-'>,S, multiplicative_expression>::Matches(p,pS) ){
result-= pS->result;
}else ...
The code for the rule multiplicative_expression
, where multiplications and divisions are carried out is similar. One major problem in direct evaluation arises for the grammar rule identifier
. Depending on the place where an identifier is used, we must evaluate completely different things as shown in the following table.
place of usage of identifier |
evaluation to be carried out |
identifier S '=' S expression ; |
store the name of the identifier in order to be able to store the value of expression later under this name. |
identifier / number / '(' S expression S ')' |
store the value of identifier as operand for later to be carried out additions, multiplications and assignments. |
'!' S (identifier (S '-' S identifier)?)? |
store the names of the identifiers to determine the lexicographical beginning and end of the variables to be printed out. |
In the PEG parsing method, there is a one to one correspondence between grammar rule and parsing function. But we have to do very different things inside the parsing function for identifier
depending on the caller. One solution would be to check for the caller inside the parsing function for identifier
. But there is a much simpler solution. Use different grammar rules for identifier
depending on the place where it is used. A similar problem arises for the rule expression
. When used as top element (one of the alternatives for calc0_grammar
) we should print out the calculated value, but not if the expression is on the left hand side of an assignment or is enclosed in parens ('(' expression ')')
. We can solve this problem by introducing two equivalent grammar_rules, namely expression
and full_expression
. In this way, we get the following revised calc0 grammar:
//Revised calc0_grammar
calc0_grammar: S (full_expression / assignement / print_variables) S '\0';
assignement: variable S '=' S expression ;
print_variables: '!' S (identFirst (S '-' S identLast)?)? ;
expression: multiplicative_expression
(S ('+' / '-') S multiplicative_expression)* ;
full_expression: expression ;
multiplicative_expression:
power_expression (S ('*' / '/') S power_expression)* ;
power_expression: primary_expression (S '^' S primary_expression)? ;
primary_expression: ('+' / '-')? S
(identifier / number / '(' S expression S ')') ;
number: ([0-9]+ '.'? / '.'[0-9])
[0-9]* (('e' / 'E') [0-9]+)? ;
identifier: [A-Za-z_][A-Za-z_0-9]* ;
variable: identifier ;
identFirst: identifier ;
identLast: identifier ;
S: [ \r\v\t]* ;
Handling of Parsing Errors in PEG's
An input for a parser is erroneous if an expected element is missing from the input and there is no alternative left. The PEG formalism allows to integrate error recognition into the grammar, as the following table shows:
Error Situation |
Error Recognizing Rule |
E is expected
S: A B E U ; |
S: A B (E / Error<eSymbol_E_expected>) U ; |
One of the alternatives B or C must be present
S: A ( B / C ) D ; |
S: A (B / C / Error<eOneOf_B_C_expected>) D ; |
The simplest kind of error handling in a PEG grammar is to stop the parsing process at the error position, give an error message, and throw an exception. We will use this technique here, because the input never exceeds one line. We enhance the grammar with indicators for error situations by appending an alternative Fatal
.
//calc0_grammar with error handling indicators
calc0_grammar: S
( print_variables /
assignement /
full_expression /
Fatal<eExprAssPrintExpected>;
)
S
( '\0' / Fatal<eEndOfInputExpected>) ;
assignement: variable S '=' S expression ;
print_variables: '!'
S
(
identFirst
S
( '-' S ( identLast / Fatal<eIdentExpected>) )?
)? ;
expression: multiplicative_expression
(S ('+' / '-') S multiplicative_expression)* ;
full_expression: expression ;
multiplicative_expression:
power_expression (S ('*' / '/') S power_expression)* ;
power_expression: primary_expression (S '^' S primary_expression)? ;
primary_expression: ('+' / '-')? S
( identifier /
number /
'(' S expression S ( ')' / Fatal<eRParenExpected> ) /
Fatal<eIdentNumOrExprExpected>
) ;
number: ([0-9]+ '.'? / '.' ([0-9] / Fatal<eFracDigitsExpected>) )
[0-9]*
( ('e'|'E')
( '+' / '-' )?
([0-9]+ / Fatal<eExponentDigitsExpected>)
)?;
identifier: [A-Za-z_][A-Za-z_0-9]* ;
variable: identifier ;
identFirst: identifier ;
identLast: identifier ;
S: [ \r\v\t]* ;
In the PEG implementation, there is a one to one relation between the grammar rules given above (inclusive error handling) and the template instantiation. This is shown for the start rule for both the template and procedural implementation:
calc0_grammar:
S (print_variables / assignement / full_expression / Fatal) S ('\0' / Fatal) ;
template implementation |
procedural implementation |
typedef
And<
S,
Or<
print_variables,
assignement,
full_expression,
Fatal<
eExprAssPrintExpected
>
>,
S,
Or<
Char<0>,
Fatal<eEndOfInputExpected>
>
> rule;
|
inline bool calc0_grammar_rule(
const uchar*& p, StateInfo* pS)
{ const uchar *p0,*p1;
return
PEG_AND(p0,p,
S (p,pS)
&& PEG_OR4(p1,p,
print_variables (p,pS),
assignement (p,pS),
full_expression (p,pS),
pS->Fatal(p0,
"expression, "
"v= expr or '!' expected")
)
&& S(p,pS)
&& ( Char(p,'\0')
||pS->Fatal(p,
"end of input expected")
)
);
} |
Maintaining Information across Runs
This interpreter runs in an eternal loop of waiting for input, and then reading, parsing and evaluating it. Each time the user enters an assignment, it not only calculates the new value of the variable, which is directly involved into the assignment, but it also looks for other formulas which use the changed value and recalculates them. This information is kept in a map with the following value type:
typedef std::set<std::string,CompareNoCase> set_istring;
struct Formula{
Formula(std::string e="",double v=0.0, const set_istring& sUsed=set_istring())
:expression(e),value(v),setUsedIdents(sUsed){}
std::string expression;
double value;
set_istring setUsedIdents;
};
The code responsible for calculating the new variable value and recalculating the dependent formulas is contained in the parser part for the rule assignement: variable S '=' S expression;
. The variable, which is the target of the assignment, is stored with its value and name, the whole input line, and the variable names which appear on the right hand side of the assignment as Formula
in the variables
map.
Handling of Runtime errors
Runtime error handling is one of the most underestimated subjects. The two main questions in runtime error handling are:
- Amount of runtime error checks.
- Kind of error action: program abortion, raising an exception, setting a flag, or autocorrection.
The expectations regarding the amount of runtime checks are different for interpreted and compiled languages and differs from language to language. Array bound checks, e.g., are standard in Pascal, but not available for C/C++. Even if the language specification defines a runtime error as severe, the programming language culture eventually may gently force the implementor to dispense with issuing a runtime error. A good example is integer overflow in C/C++, which results in undefined behaviour but will not be caught by most C/C++ implementations.
The answer to the second question regarding runtime error handling, namely what kind of error action should be carried out, is currently biased towards "raising an exception". C++ and Java even define a complete inheritance tree of runtime exceptions in their standard libraries. The only notable exception, where throwing an exception may not be the right thing, is floating point error handling. IEEE 754 (the standard for floating point arithmetic) supports both, setting an error flag and taking a trap. Once a floating point error flag is set, it will stay until it is explicitly cleared. The standard stream library recognizes the IEEE 754 error flags and prints them out as in the sample shown below:
3/0
>> = 1.#INF
1e300*-1e300
>> = -1.#INF
When a floating point overflow occurs, it may be meaningful to continue the computation by setting the floating point value to the maximum floating point value. This is done by the Calc0 interpreter when assigning to a variable in order to avoid illegal variable values in the variables map.
x= 3/0
>> x= 1.#INF
.. floating point overflow (=DBL_MAX)
>> x= 1.797693134862316e+308
The Calc0 interpreter recognizes floating point errors by using the non standard _fpclass
function which returns error indicators like _FPCLASS_NINF
(negative overflow), _FPCLASS_PINF
(positive overflow), and so on. C99 [2] standardizes floating point error handling with additions to the math.h header, but VC6 and VC7 do not yet support it.
Take-Home Points
Direct evaluation during parsing can be difficult because of the limited context information available during parsing. We can sometimes improve the situation by changing the grammar.
Parsing errors can be handled easily within the PEG framework, because the components of a PEG grammar rule are sequentially evaluated, which makes execution order predictable.
Runtime error handling must be adapted to the expectations of the target user group. Standards for runtime error handling are rare. An exception is floating point error handling.
History
- 2005 April: initial version.
References
- Parsing Expression Grammars, Bryan Ford, MIT, January 2004 [^]
- Parsing Expression Grammars, Wikipedia, the free encyclopedia[^]
- C99 Library Reference for <math.h>[^]