Introduction
Recursive descent parsing is probably the most common form of parsing. It is the standard choice for implementing hand-rolled parsers. This article aims to teach you how to build one.
Conceptualizing this Mess
Parsing is used in countless software products. It involves breaking structured text up into meaningful "phrases" that your code then uses to implement some sort of functionality. What that functionality is depends on what you're parsing. In our examples, we have a JSON parser whose functionality involves building a tree of normalized JSON data, as well as an expression parser, whose functionality involves evaluating an integer expression at runtime and returning the result.
The Design Phase: Building the Grammar and Planning the Parser
There are several steps to implementing a parser, from planning to coding. The first step is to design the "grammar" for what you'll be parsing. It can be helpful to define it using EBNF or a variant thereof. I write my grammars in XBNF, my own variant of EBNF, but it doesn't matter which form you use as long as it's understandable. We'll then use this grammar as our design template to guide us through the creation of the parser.
Our grammar for the JSON parser is as follows:
Json<start>= Object | Array;
Object= "{" [ Field { "," Field } ] "}";
Field= string ":" Value;
Array= "[" [ Value { "," Value } ] "]";
Value<collapsed>= string |
number |
Object |
Array |
Boolean |
null ;
Boolean= true|false;
number= '\-?(0|[1-9][0-9]*)(\.[0-9]+)?([Ee][\+\-]?[0-9]+)?';
string= '"([^\n"\\]|\\([btrnf"\\/]|(u[A-Fa-f]{4})))*"';
true= "true";
false= "false";
null= "null";
lbracket<collapsed>= "[";
rbracket<collapsed>= "]";
lbrace<collapsed>= "{";
rbrace<collapsed>= "}";
colon<collapsed>= ":";
comma<collapsed>= ",";
whitespace<hidden>= '[\n\r\t ]+';
Meanwhile, our grammar for the expression parser is as follows:
Term<start,type="int">= Factor { ("+"|"-") Factor };
Factor<type="int">= Unary { ("*"|"/") Unary };
Unary<type="int">= ("+"|"-") Unary | Leaf;
Leaf<type="int">= integer | "(" Term ")";
add="+";
sub="-";
mul="*";
div="/";
lparen="(";
rparen=")";
integer='[0-9]+';
whitespace<hidden>='\s+';
The JSON grammar should be fairly obvious if you understand EBNF (and perhaps XBNF). You can read about XBNF at any of my previous parser generator projects, which all use it as their input format. Unlike those projects, we will not be machine-generating code from an XBNF grammar. We'll be implementing the parser ourselves, once again simply using the grammar as a guide to help us code it.
Note that above, I've labeled the non-terminal elements of the grammar with TitleCase names and the terminal elements with camelCase names. If you don't understand terminals versus non-terminals, it doesn't matter much here. Basically, non-terminals are composed of terminals and other non-terminals, while terminals just represent individual "flat" lexemes in the input stream. Non-terminals are divisible into their component parts, while a terminal is atomic/indivisible. Again, you don't really need to understand this distinction, but it's nice if you do.
After we have the grammar, the next step should be to decide whether we'll be using a tokenizer/scanner or whether we'll be implementing a scannerless parser. Using a tokenizer/scanner can make it easier to parse, but requires either additional handwritten code or an additional build step in order to generate the scanner code.
You may be wondering what exactly a tokenizer/scanner does. Briefly, it takes an input stream of text and finds patterns in it. Each different pattern has a "symbol id" associated with it that tells you what pattern it is (e.g., is it a number, a string, an opening or closing parenthesis, etc.?) This information is returned as a series of "tokens" which hold the symbol id, the value of the text that was matched, and the line, column and position info of the token within the input stream. The parser then uses those symbol ids to decide what to parse next. Parsers don't really care about the token values, just the token symbol ids. To recap, the value is the literal text that was matched, while the symbol id is the numeric "code" associated with that type of text fragment.
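To make that concrete, here's a minimal sketch of what a token might look like. The exact type is supplied by whatever tokenizer you use, so treat this shape as illustrative; the fields mirror the ones used later in this article with the Rolex Token:
// Illustrative sketch only - the real Token type comes from your tokenizer.
struct Token
{
    public int SymbolId;  // numeric "code" identifying the kind of lexeme matched
    public string Value;  // the literal text that was matched
    public int Line;      // line where the match starts
    public int Column;    // column where the match starts
    public long Position; // absolute position of the match within the input
}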
Why should we use a tokenizer/scanner? A tokenizer/scanner can drastically ease parsing by separating the lexical analysis and syntax analysis as separate steps. Without a tokenizer/scanner, both steps are combined together in the parse routines, which can complicate them.
Why not use a tokenizer/scanner? Tokenizers/scanners don't have any context information, so they cannot conditionally return different symbols for the same text depending on where it appears in the input. Because of this, their matching power is less than that of a scannerless parser. Realistically however, this is almost never a problem. It's very rare that you'll run into a need to distinguish terminal tokens based on context. However, you will run into it with grammars like Python's, because of significant versus insignificant whitespace in that language's grammar. In those cases, you can either still hand-write a separate tokenizer/scanner, or simply integrate the scanner portions directly into the parsing code itself. The other reason you might not want to use one is in the case of very simple grammars.
Provided in the code for this article is a scannerless JSON parser as well as an integer expression parser, the latter of which does use a tokenizer. We could have just as easily used a scanner/tokenizer for JSON but I wanted to demonstrate both techniques.
If you decide to use a tokenizer, it is generally much easier to generate one using a tool like Rolex than to write one by hand. I've included the Rolex binary in the solution folder so that the ExpressionParser project can be rebuilt from the scanner/tokenizer specification file.
The JSON Parser
The JSON parser is a simple parser that reads JSON input and parses it into an object model, using IDictionary<string, object> to represent a JSON object and IList<object> to represent a JSON array.
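For example, here's a quick sketch of what comes back from the parser (numbers, as we'll see below, are parsed as double):
var root = (IDictionary<string, object>)JsonParser.Parse("{\"a\":[1,true]}");
var items = (IList<object>)root["a"]; // the JSON array becomes a list
Console.WriteLine(items[0]);          // 1 (boxed as a double)
Console.WriteLine(items[1]);          // True (boxed as a bool)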
The Expression Parser
The expression parser is another simple parser that reads an integer expression and evaluates the result, returning an integer. One thing to be aware of when parsing expressions is operator precedence. For example, a factor such as "multiply" has a higher precedence than a term such as "add." We must break out these different operators in the grammar in order to enforce precedence. Notice in the grammar how Term is made up of Factors, and Factor is made up of Unaries, which are made up of Leaves. This is how we establish precedence. If we simply put all the operators under the same non-terminal declaration there would be no operator precedence. This will become clearer when we visit the code in the next section.
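For example, consider 2+3*4. Since Term can only combine whole Factors with "+" and "-", the multiplication is forced to happen inside the second Factor:
2+3*4  =>  Term: Factor "+" Factor
       =>        2      "+" (3 "*" 4)    -- the "*" is consumed inside the second Factor
       =>  2 + (3 * 4) = 14, not (2 + 3) * 4 = 20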
Coding this Mess
In order to code a recursive descent parser, we're going to make a routine for nearly every construct that appears on the left-hand side of the "=" sign in the grammar. These routines will call the other routines as necessary in order to parse. This is how recursive descent parsing works.
public sealed class JsonParser : IDisposable
{
const int _TabWidth = 4;
readonly IEnumerator<char> _input;
bool _isEnd;
int _line;
int _column;
long _position;
JsonParser(IEnumerable<char> input)
{
_input = input.GetEnumerator();
_line = 1;
_column = 1;
_position = 0L;
}
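// Advances the cursor one character, updating _line, _column and _position,
// and returns false once the end of the input has been reached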
bool _NextInput()
{
if (_isEnd)
return false;
if(!_input.MoveNext())
{
_isEnd = true;
return false;
}
++_position;
switch(_input.Current)
{
case '\n':
_column = 1;
++_line;
break;
case '\r':
_column = 1;
break;
case '\t':
_column += _TabWidth;
break;
default:
++_column;
break;
}
return true;
}
public static object Parse(IEnumerable<char> input)
{
if (null == input)
throw new ArgumentNullException("input");
using (var p = new JsonParser(input))
{
if (!p._NextInput())
return null;
p._SkipWhiteSpace(); // tolerate leading whitespace before the root element
switch (p._input.Current)
{
case '{':
return p._ParseObject();
case '[':
return p._ParseArray();
}
p._ThrowExpecting("Expecting object or array");
}
return null;
}
public void Close()
{
_input.Dispose();
}
void IDisposable.Dispose()
{
Close();
}
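// Object= "{" [ Field { "," Field } ] "}"; (Field= string ":" Value; is inlined below)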
IDictionary<string,object> _ParseObject()
{
_SkipWhiteSpace();
if ('{' != _input.Current)
_ThrowExpecting("Expecting {");
if (!_NextInput())
_ThrowExpecting("Unterminated object");
_SkipWhiteSpace(); // tolerate an empty object with whitespace inside, like { }
var result = new Dictionary<string, object>();
while(!_isEnd && '}' != _input.Current)
{
_SkipWhiteSpace();
var key = _ParseString();
_SkipWhiteSpace();
if (':' != _input.Current)
_ThrowExpecting("Expecting :");
if (!_NextInput())
_ThrowExpecting("Unterminated field");
_SkipWhiteSpace();
result.Add(key, _ParseValue());
_SkipWhiteSpace();
if ('}' == _input.Current)
break;
else if (',' != _input.Current)
_ThrowExpecting("Expecting ,");
if (!_NextInput())
_ThrowExpecting("Unterminated object");
}
if ('}' != _input.Current)
_ThrowExpecting("Unterminated object");
_NextInput();
return result;
}
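// Array= "[" [ Value { "," Value } ] "]";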
IList<object> _ParseArray()
{
_SkipWhiteSpace();
if ('[' != _input.Current)
_ThrowExpecting("Expecting [");
if (!_NextInput())
_ThrowExpecting("Unterminated array");
_SkipWhiteSpace(); // tolerate an empty array with whitespace inside, like [ ]
var result = new List<object>();
while (!_isEnd && ']' != _input.Current)
{
_SkipWhiteSpace();
result.Add(_ParseValue());
_SkipWhiteSpace();
if (']' == _input.Current)
break;
else if (',' != _input.Current)
_ThrowExpecting("Expecting ,");
if (!_NextInput())
_ThrowExpecting("Unterminated array");
}
if (']' != _input.Current)
_ThrowExpecting("Unterminated array");
_NextInput();
return result;
}
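// Value= string | number | Object | Array | Boolean | null;
// dispatches on the first character of the value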
object _ParseValue()
{
_SkipWhiteSpace();
switch(_input.Current)
{
case '\"':
return _ParseString();
case '{':
return _ParseObject();
case '[':
return _ParseArray();
case 't':
case 'f':
return _ParseBoolean();
case 'n':
_ParseNull();
return null;
case '0':
case '1':
case '2':
case '3':
case '4':
case '5':
case '6':
case '7':
case '8':
case '9':
case '-':
case '.':
return _ParseNumber();
}
_ThrowExpecting("Unexpected character in JSON value");
return null;
}
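// Boolean= true|false; matched character by character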
bool _ParseBoolean()
{
if('t'==_input.Current)
{
if (!_NextInput())
_ThrowExpecting("Unterminated boolean");
if ('r' != _input.Current)
_ThrowExpecting("Unexpected character in boolean");
if (!_NextInput())
_ThrowExpecting("Unterminated boolean");
if ('u' != _input.Current)
_ThrowExpecting("Unexpected character in boolean");
if (!_NextInput())
_ThrowExpecting("Unterminated boolean");
if ('e' != _input.Current)
_ThrowExpecting("Unexpected character in boolean");
_NextInput();
return true;
}
if('f'==_input.Current)
{
if (!_NextInput())
_ThrowExpecting("Unterminated boolean");
if ('a' != _input.Current)
_ThrowExpecting("Unexpected character in boolean");
if (!_NextInput())
_ThrowExpecting("Unterminated boolean");
if ('l' != _input.Current)
_ThrowExpecting("Unexpected character in boolean");
if (!_NextInput())
_ThrowExpecting("Unterminated boolean");
if ('s' != _input.Current)
_ThrowExpecting("Unexpected character in boolean");
if (!_NextInput())
_ThrowExpecting("Unterminated boolean");
if ('e' != _input.Current)
_ThrowExpecting("Unexpected character in boolean");
_NextInput();
return false;
}
_ThrowExpecting("Expecting boolean");
return false;
}
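// null= "null"; matched character by character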
void _ParseNull()
{
if ('n' == _input.Current)
{
if (!_NextInput())
_ThrowExpecting("Unterminated null");
if ('u' != _input.Current)
_ThrowExpecting("Unexpected character in null");
if (!_NextInput())
_ThrowExpecting("Unterminated null");
if ('l' != _input.Current)
_ThrowExpecting("Unexpected character in null");
if (!_NextInput())
_ThrowExpecting("Unterminated null");
if ('l' != _input.Current)
_ThrowExpecting("Unexpected character in null");
_NextInput();
return;
}
_ThrowExpecting("Expecting null");
}
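// string= a double quoted string with backslash escapes, including \uXXXX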
string _ParseString()
{
if (_isEnd || '\"' != _input.Current)
_ThrowExpecting("Expecting \"");
if (!_NextInput())
_ThrowExpecting("Unterminated string");
var sb = new StringBuilder();
while(!_isEnd && '\"'!=_input.Current)
{
if('\n'==_input.Current)
_ThrowExpecting("Unterminated string");
if ('\\' == _input.Current)
{
if (!_NextInput())
_ThrowExpecting("Unterminated string");
switch (_input.Current)
{
case '/':
sb.Append('/');
break;
case '\\':
sb.Append('\\');
break;
case '\"':
sb.Append('\"');
break;
case 'b':
sb.Append('\b');
break;
case 't':
sb.Append('\t');
break;
case 'r':
sb.Append('\r');
break;
case 'n':
sb.Append('\n');
break;
case 'f':
sb.Append('\f');
break;
case 'u':
if (!_NextInput())
_ThrowExpecting("Unterminated string");
int ch = 0;
ch <<= 4;
ch |= _FromHexChar(_input.Current);
if (!_NextInput())
_ThrowExpecting("Unterminated string");
ch <<= 4;
ch |= _FromHexChar(_input.Current);
if (!_NextInput())
_ThrowExpecting("Unterminated string");
ch <<= 4;
ch |= _FromHexChar(_input.Current);
if (!_NextInput())
_ThrowExpecting("Unterminated string");
ch <<= 4;
ch |= _FromHexChar(_input.Current);
sb.Append((char)ch);
break;
default:
_ThrowExpecting("Unrecognized escape sequence");
break;
}
}
else
sb.Append(_input.Current);
_NextInput();
}
if (_isEnd || '\"' != _input.Current)
_ThrowExpecting("Unterminated string");
_NextInput();
return sb.ToString();
}
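// number= optional sign, integer part, optional fraction, optional exponent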
double _ParseNumber()
{
if (_isEnd)
_ThrowExpecting("Expecting number");
var sb = new StringBuilder();
if ('-' == _input.Current)
{
sb.Append(_input.Current);
if (!_NextInput())
_ThrowExpecting("Unterminated number");
}
if (char.IsDigit(_input.Current))
{
sb.Append(_ParseDigits());
}
if ('.' == _input.Current)
{
sb.Append('.');
if (!_NextInput())
_ThrowExpecting("Unterminated number");
sb.Append(_ParseDigits());
}
if ('E' == _input.Current || 'e' == _input.Current)
{
sb.Append(_input.Current);
if (!_NextInput())
_ThrowExpecting("Unterminated number");
if ('-' == _input.Current || '+' == _input.Current)
{
sb.Append(_input.Current);
if (!_NextInput())
_ThrowExpecting("Unterminated number");
}
sb.Append(_ParseDigits());
}
// use the invariant culture so "." is always treated as the decimal point
return double.Parse(sb.ToString(), System.Globalization.CultureInfo.InvariantCulture);
}
string _ParseDigits()
{
if (_isEnd || !char.IsDigit(_input.Current))
_ThrowExpecting("Expecting digits");
var sb = new StringBuilder();
sb.Append(_input.Current);
while (_NextInput() && char.IsDigit(_input.Current))
sb.Append(_input.Current);
return sb.ToString();
}
void _ThrowExpecting(string msg)
{
msg = string.Concat(msg, string.Format(" at line {0}, column {1}, position {2}", _line, _column, _position));
throw new Exception(msg);
}
byte _FromHexChar(char hex)
{
if ('0' <= hex && '9' >= hex) // '0'-'9' => 0-9
return (byte)(hex - '0');
if ('A' <= hex && 'F' >= hex) // 'A'-'F' => 10-15 ('7' is ten less than 'A')
return (byte)(hex - '7');
if ('a' <= hex && 'f' >= hex) // 'a'-'f' => 10-15 ('W' is ten less than 'a')
return (byte)(hex - 'W');
_ThrowExpecting("The value was not hex");
return 0;
}
void _SkipWhiteSpace()
{
while (!_isEnd && ('\n' == _input.Current ||
'\r' == _input.Current || '\t' == _input.Current || ' ' == _input.Current))
_NextInput();
}
}
The methods that deal with parsing a component of the grammar are the _ParseXXXX() methods; the rest of the code is support. Covering the support code: there's some bookkeeping that must be done to track when the end of the input is reached, and to record the line, column and position information. We also have a helper to report errors, a helper to skip whitespace, and a helper to convert a hexadecimal character to a byte value. Notice how we have a helper, _NextInput(), that tracks line, column, and position information as well as determining whether we're at the end of the input. This is important. When we throw errors, we need to report the position within the input where the error occurred, so we must always update our line, column and position information whenever we advance the input. We must also know when we're at the end so we don't try to read more input after we've reached it. Without this, an exception would be thrown when we move past the end of the input and attempt to keep parsing.
The static Parse() method kicks off our parsing by preparing a new parser instance over the specified input, and then delegating to the appropriate _ParseXXXX() method depending on which character is under the cursor.
Inside our _ParseXXXX() methods, we handle the grammar construct represented by the method. For example, _ParseObject() parses a JSON object { ... } while _ParseArray() parses a JSON array [ ... ].
There are some tricky bits to this, notably parsing strings and numbers, which gets somewhat complicated. This is where a scanner/tokenizer would make things easier, but we're not using one in this parser. Otherwise, parsing non-terminal JSON elements is simply a matter of delegating to the appropriate method(s) and skipping whitespace wherever it may occur. Tokenizers/scanners are particularly useful for ignoring things like whitespace and comments, but again, we're not using one here, so we have to skip whitespace manually (there are no comments in JSON).
Note that each _ParseXXXX() method starts by examining the current input character, and returns with the cursor positioned past the end of the construct it just parsed. This is important, and it must be consistent across every routine.
Now we're going to visit the ExpressionParser code, which uses a tokenizer/scanner:
public sealed class ExpressionParser : IDisposable
{
IEnumerator<Token> _input;
bool _isEnd;
void _ThrowExpecting(string msg)
{
msg = string.Concat(msg, string.Format(" at line {0}, column {1}, position {2}", _input.Current.Line, _input.Current.Column, _input.Current.Position));
throw new Exception(msg);
}
ExpressionParser(IEnumerator<Token> input)
{
_input = input;
}
bool _NextInput()
{
if(!_input.MoveNext())
{
_isEnd = true;
return false;
}
return true;
}
public static int Parse(IEnumerable<char> input)
{
using (var p = new ExpressionParser(new ExpressionTokenizer(input).GetEnumerator()))
{
if (!p._NextInput())
p._ThrowExpecting("Empty expression");
return p._ParseTerm();
}
}
public void Close()
{
_input.Dispose();
}
void IDisposable.Dispose()
{
Close();
}
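// Term= Factor { ("+"|"-") Factor };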
int _ParseTerm()
{
int result = _ParseFactor();
var done = false;
while (!_isEnd && !done)
{
switch (_input.Current.SymbolId)
{
case ExpressionTokenizer.add:
if (!_NextInput())
_ThrowExpecting("Unterminated expression");
result += _ParseFactor();
break;
case ExpressionTokenizer.sub:
if (!_NextInput())
_ThrowExpecting("Unterminated expression");
result -= _ParseFactor();
break;
default:
done = true;
break;
}
}
return result;
}
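// Factor= Unary { ("*"|"/") Unary };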
int _ParseFactor()
{
int result = _ParseUnary();
var done = false;
while (!_isEnd && !done)
{
switch (_input.Current.SymbolId)
{
case ExpressionTokenizer.mul:
if (!_NextInput())
_ThrowExpecting("Unterminated expression");
result *= _ParseUnary();
break;
case ExpressionTokenizer.div:
if (!_NextInput())
_ThrowExpecting("Unterminated expression");
result /= _ParseUnary();
break;
default:
done = true;
break;
}
}
return result;
}
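// Unary= ("+"|"-") Unary | Leaf;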
int _ParseUnary()
{
switch(_input.Current.SymbolId)
{
case ExpressionTokenizer.add:
if(!_NextInput())
_ThrowExpecting("Unterminated expression");
return _ParseUnary();
case ExpressionTokenizer.sub:
if(!_NextInput())
_ThrowExpecting("Unterminated expression");
return -_ParseUnary();
}
return _ParseLeaf();
}
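// Leaf= integer | "(" Term ")";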
int _ParseLeaf()
{
int result;
switch(_input.Current.SymbolId)
{
case ExpressionTokenizer.integer:
result= int.Parse(_input.Current.Value);
_NextInput();
return result;
case ExpressionTokenizer.lparen:
if (!_NextInput())
_ThrowExpecting("Unterminated expression");
result = _ParseTerm();
if (ExpressionTokenizer.rparen != _input.Current.SymbolId)
_ThrowExpecting("Unterminated expression");
_NextInput();
return result;
}
_ThrowExpecting("Invalid token in expression");
return 0;
}
}
Here, we are using a tokenizer to parse our expression. See how our Parse() method creates the tokenizer and "primes" it by moving the cursor to the first token. This is slightly different from the scannerless Parse() method for JSON we created earlier, but the overall concept is the same, except we're dealing with tokens instead of characters. The API for the tokenizer varies depending on what tool you've used to generate it. This example uses the Rolex tool, whose tokenizer/scanner API reports tokens through IEnumerable<Token>. If you were using a tool like GPLEX, your code for reading tokens would look different, but the overarching concept is the same.
In our _ParseXXXX() methods, you can see we switch on _input.Current.SymbolId a lot. Remember I said that scanners/tokenizers associate a symbol id with a text fragment? This symbol id is reported by the aforementioned SymbolId member. A lot of the time, we don't care what the actual text is - it can often be inferred from the symbol id. For example, we know that the text for the symbol id represented by "add" will always be "+". Rolex puts constants representing each symbol id on its tokenizer class, and we use those in the switch to determine what action to take.
The process for creating these methods is to take the appropriate grammar definition and basically "synthesize" code for it. I say "synthesize" because the code is very regular and in theory could be generated, but generating it, while possible, can get complicated. There is a commonly employed, less complicated way to machine-generate parser code without code synthesis, using a more rote and perhaps "primitive" technique than the one shown here, but that's beyond the scope of this article. All that having been said, we create this code by hand. It's relatively easy for humans but complicated for computers.
Basically, when we see an ...|... that's an alternation, so we figure out which one of the alternatives it is (usually based on the first token under the cursor) and parse appropriately.
When we see a { ... } (zero or more) or a { ... }+ (one or more), that indicates a loop. We typically use a while loop for this. Otherwise, when we see a non-terminal element such as Factor, we'd call _ParseFactor().
If we see a [ ... ] that indicates an optional expression. We treat this the same as if it were ...| (an alternation with nothing as the final alternative). It's basically just syntactic sugar.
Finally, of course, when we see a terminal or a non-terminal, we either parse the terminal directly or delegate to the _ParseXXXX() method that corresponds to the non-terminal. For example, when we see Factor in the grammar definition, we know to call _ParseFactor(). Putting these rules together yields a sketch like the one below.
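Here's how a hypothetical rule might be synthesized by hand. The rule itself, the comma terminal, and the _ParseItem() and _Expect() helpers are all made up for illustration - they are not part of the expression grammar above:
// Hypothetical rule: List= "(" [ Item { "," Item } ] ")";
void _ParseList()
{
    _Expect(ExpressionTokenizer.lparen);      // leading terminal "("
    // the [ ... ] part is optional - an alternation with "nothing"
    if (ExpressionTokenizer.rparen != _input.Current.SymbolId)
    {
        _ParseItem();                         // non-terminal => delegate to its method
        // { "," Item } means zero or more => a while loop
        while (!_isEnd && ExpressionTokenizer.comma == _input.Current.SymbolId)
        {
            _NextInput();                     // consume the ","
            _ParseItem();
        }
    }
    _Expect(ExpressionTokenizer.rparen);      // trailing terminal ")"
}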
One wrinkle particular to expressions involves negative numbers. Notice our integer is only positive. Expressing something like -10 is a unary expression, not a leaf; the - is what makes it a unary expression (see the grammar). The grammar also allows +, which is basically a non-operation - it does nothing, but is included for completeness, so +10 is legal. Because Unary invokes itself (again, see the grammar), this allows ---10 and +-+10, similar to C#.
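Given the code above, each of these should evaluate to -10:
Console.WriteLine(ExpressionParser.Parse("-10"));   // sub applied to the Leaf 10
Console.WriteLine(ExpressionParser.Parse("---10")); // Unary recursing into itself
Console.WriteLine(ExpressionParser.Parse("+-+10")); // the "+" operations do nothing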
Using these parsers is simple:
Console.WriteLine(ExpressionParser.Parse("(1+1+1)+5*2*2"));
var json = @"{ ""foo"":""bar"", ""fubar"":1.2, ""fubaz"": [ {},true,false,null] }";
var j = JsonParser.Parse(json);
var d = j as IDictionary<string, object>;
Console.WriteLine(d.Count);
For more information on generating Rolex tokenizers, see this article.
History
- 14th March, 2020 - Initial submission