Introduction
This paper demonstrates a technique of building parsers using grammar inheritance. Inheritance allows building new grammars based on existing grammars by adding new rules or overriding inherited rules, speeding up the development of new parsers. Any changes made to the base grammar become visible in inherited grammars automatically. This technique is especially helpful when a parser should support a number of language dialects or versions.
Grammar inheritance is supported in a number of parser toolkits (such as ANTLR or Nitra) and is inherently available in many tools that use object-oriented languages as DSL languages for describing grammars (such as Sprache and Irony).
Extensible expression evaluator library serves as an example code for this article. The evaluator supports arithmetic operations, custom functions and parameters. It takes string
representation of an expression and converts it to a structured LINQ expression instance which can easily be compiled to an executable delegate. In contrast with interpreted expression evaluators such as NCalc, compiled expressions perform just as fast as native C# methods. Here is an example usage of a ready evaluator library:
var s = "Sin(y/x)";
var expr = calc.ParseExpression(s, x => 2, y => 3.14);
var func = expr.Compile();
Console.WriteLine("Result = {0}", func());
calc.RegisterFunction("Mul", (a, b, c) => a * b * c);
expr = calc.ParseExpression("2 ^ Mul(PI, a, b)", a => 2, b => 10);
Console.WriteLine("Result = {0}", func.Compile()());
calc.RegisterFunction("Add", "a + b", "a", "b");
expr = calc.ParseExpression("Add(353, 181)");
Console.WriteLine("Result = {0}", expr.Compile()());
| Sprache.Calc library isn't just an example, it's a ready to use expression evaluator. To use it in your own projects, install Sprache.Calc NuGet package by typing the following command in the Package Manager Console:
PM> Install-Package Sprache.Calc
|
Sprache Toolkit Overview
Sprache toolkit is a lightweight functional library for constructing parsers directly in C# code. The authors claim that it fits somewhere in between regular expressions and full-featured toolsets like ANTLR and doesn't mean to compete with "industrial-strength" language workbenches.
I'd say that Sprache is an excellent tool, well suited for a wide range of parsing-related tasks. For me, the most attractive feature of Sprache is its ability to use LINQ query syntax to define parsers and strong-typed parsing results. Sprache stimulates step-by-step grammar design and encourages test-driven development of parsers. Of course, parser combinators aren't free from certain drawbacks (complicated diagnostic and poor error recovery capabilities), but these details are beyond the scope of this paper.
Parsers in Sprache are functions transforming input string into something else. Unlike traditional parser frameworks, Sprache doesn't use code generation. Parsers are defined directly in the source code, and can be used to parse text right away. It allows writing unit tests in parallel with parsers themselves. Here is an example of a simple parser that consumes a sequence of A characters:
var parseA = Parse.Char('A').AtLeastOnce();
Simple parsers are combined into complex ones using Sprache's built-in extension-methods (such as Or
, And
, Many
, etc.), often callable via LINQ query comprehensions:
Parser<string> identifier =
from leading in Parse.WhiteSpace.Many()
from first in Parse.Letter.Once()
from rest in Parse.LetterOrDigit.Many()
from trailing in Parse.WhiteSpace.Many()
select new string(first.Concat(rest).ToArray());
var id = identifier.Parse(" abc123 ");
Assert.AreEqual("abc123", id);
The complete rule set, or language grammar, typically looks like a static class with delegate fields. To learn more about Sprache toolkit, have a look at this introductory article by Nicholas Blumhardt.
Expression Evaluator Design
Expression evaluator supports three modes: simple, scientific and extensible. Simple evaluator supports arithmetic operations and decimal floating-point operands, unary minus and parentheses. Scientific mode adds support for hexadecimal and binary numbers, exponential notation and System.Math
functions. Extensible mode allows using parameterized expressions and custom functions with overloading support.
Every next mode supports all features of its predecessors and adds its own features, so the grammars describing the input language of expression evaluators are organized as hierarchy.
Expression parser is a function transforming a source string into a LINQ expression which can be easily compiled to a delegate and executed just as normal .NET method:
var expr = "4*(1/1-1/3+1/5-1/7+1/9-1/11+1/13-1/15+1/17-1/19+10/401)";
var func = calc.ParseExpression(expr).Compile();
var result = func();
Simple Evaluator
Simple evaluator is based on Sprache LinqyCalculator sample. The grammar is optimized in a way to simplify the instantiation of LINQ expressions during parsing:
Expr ::= Term ("+"|"-" Term)*
Term ::= InnerTerm ("*"|"/"|"%" InnerTerm)*
InnerTerm ::= Operand ("^" Operand)
Operand ::= NegativeFactor | Factor
NegativeFactor ::= "-" Factor
Factor ::= "(" Expr ")" | Constant
Constant ::= Decimal
Sprache parsers are normally defined as static lambda-functions. Static fields can't be redefined in derived classes, so we'll define them as virtual properties instead:
public static readonly Parser<Expression> ExpressionInParentheses =
from lparen in Parse.Char('(')
from expr in Expr
from rparen in Parse.Char(')')
select expr;
protected virtual Parser<Expression> ExpressionInParentheses =>
from lparen in Parse.Char('(')
from expr in Expr
from rparen in Parse.Char(')')
select expr;
After this fairly trivial transformation, any rule can be overriden in derived classes. To enable unit testing for parsers, declare properties as protected internal
.
The full grammar of the simple calculator is available here. It's effectively the same as Sprache's LinqyCalculator sample.
Scientific Mode
Scientific calculator inherits from simple calculator, so it supports all features of simple calculator automatically. For binary and hexadecimal numbers support, it introduces new grammar rules:
protected virtual Parser<string> Binary =>
Parse.IgnoreCase("0b").Then(x =>
Parse.Chars("01").AtLeastOnce().Text()).Token();
protected virtual Parser<string> Hexadecimal =>
Parse.IgnoreCase("0x").Then(x =>
Parse.Chars("0123456789ABCDEFabcdef").AtLeastOnce().Text()).Token();
Adding new rules is not enough because base grammar has no idea when to apply them. Binary and hexadecimal numbers are constants, so we'll add them to the Constant rule of base grammar. Note that Binary and Hexadecimal parsers return string while Constant parser returns LINQ expression, so we'll need helper methods to convert strings into LINQ expressions. Finally, Constant parser with decimal, binary and hexadecimal numbers support will look like this:
protected override Parser<Expression> Constant =>
Hexadecimal.Select(x => Expression.Constant((double)ConvertHexadecimal(x)))
.Or(Binary.Select(b => Expression.Constant((double)ConvertBinary(b))))
.Or(base.Constant);
Now let's add support for functions. We'll need two more rules:
protected virtual Parser<string> Identifier =>
Parse.Letter.AtLeastOnce().Text().Token();
protected virtual Parser<Expression> FunctionCall =>
from name in Identifier
from lparen in Parse.Char('(')
from expr in Expr.DelimitedBy(Parse.Char(',').Token())
from rparen in Parse.Char(')')
select CallFunction(name, expr.ToArray());
CallFunction
helper method simply creates a LINQ expression for calling System.Math static
methods:
protected virtual Expression CallFunction(string name, Expression[] parameters)
{
var methodInfo = typeof(Math).GetMethod(name, parameters.Select(e => e.Type).ToArray());
if (methodInfo == null)
{
throw new ParseException(string.Format("Function '{0}({1})' does not exist.",
name, string.Join(",", parameters.Select(e => e.Type.Name))));
}
return Expression.Call(methodInfo, parameters);
}
Again, we need to override a base grammar rule to suggest when to apply the new rules. Now, finding a rule to override is not as obvious as it was with binary and hexadecimal constants. The proper rule should be chosen based on function call operation priority. It can easily be seen that it should have the highest priority, same as parentheses. For example, when evaluating Sin(2) * Cos(3), we have to calculate function values and then perform multiplication.
In base grammar parentheses are featured in the Factor rule, so we'll override it. By calling base.Factor
, we indicate that we still support everything supported by the base grammar:
protected override Parser<Expression> Factor =>
base.Factor.XOr(FunctionCall);
Adding Custom Functions
The most advanced version of expression evaluator inherits from scientific calculator class. Adding custom functions doesn't introduce new syntax, so we don't need to add new grammar rules. All we need to modify is the method that generates LINQ expressions for function calls. Here is its pseudocode:
override Expression CallFunction(string name, Expression[] parameters)
{
if there is a custom function named 'name',
{
return an expression to call this function
with the given parameters;
}
return base.CallFunction(name, parameters);
}
Any custom function for our calculator can be presented as a Func<double[], double>
delegate. Named functions can be stored as a Dictionary<string, Func<double[], double>>
, and all we need to do to support overloaded functions is to add number of arguments to the function name, i.e.:
"Min:2" -- Min function with two arguments
"Min:5" -- Min function with five arguments, etc.
The pseudocode above turns into something like that:
protected override Expression CallFunction(string name, Expression[] parameters)
{
var mangledName = name + ":" + parameters.Length;
if (CustomFuctions.ContainsKey(mangledName))
{
return Expression.Call(...);
}
return base.CallFunction(name, parameters);
}
There's an issue with Expression.Call
expression we need to generate. The story is that Expression.Call
requires a MethodInfo
instance, so it only supports calling existing methods while custom functions are created at runtime. To work around it, we define an adapter method for calling named functions:
private double CallCustomFunction(string name, double[] parameters) =>
CustomFuctions[name](parameters);
This method exists at compile-time, and it can be used to create an Expression.Call
expression. To get a MethodInfo
instance for this method, we use a Func
delegate constructor (it's faster and safer than using System.Reflection
). List of parameters should be converted into an array using Expression.NewArrayInit
as follows:
protected override Expression CallFunction(string name, Expression[] parameters)
{
var mangledName = name + ":" + parameters.Length;
if (CustomFuctions.ContainsKey(mangledName))
{
var newParameters = new List<Expression>();
newParameters.Add(Expression.Constant(mangledName));
newParameters.Add(Expression.NewArrayInit(typeof(double), parameters));
var callCustomFunction = new Func<string, double>(CallCustomFunction).Method;
return Expression.Call(Expression.Constant(this), callCustomFunction, newParameters.ToArray());
}
return base.CallFunction(name, parameters);
}
Adding Parameters to Expressions
Adding parameters requires a new syntax. A parameter is just an identifier that can take place anywhere a constant or a function call can occur:
protected virtual Parser<Expression> Parameter => Identifier;
protected override Parser<Expression> Factor => Parameter.Or(base.Factor);
Here, we finally encounter a conflict in the grammar! Factor rule has two alternatives that both start with an identifier: Parameter and Function. When the parser finds an identifier, it has no idea what to parse next (a parameter or a function) unless it looks ahead of it. If the identifier is followed by a "(
" symbol, it must be a function, otherwise it's a parameter.
As far as I know, Sprache doesn't help finding such conflicts. Full-featured workbenches like ANTLR or Irony will automatically detect them while processing your grammar, but with Sprache, you'll only detect them via failed unit tests (that's why tests are so important for Sprache-based parsers).
Our conflict is quite easy to solve: we just define Parameter as "an identifier not followed by a "(
" symbol". This rule will not cause the conflict because it eliminates ambiguity. Here is how it looks like:
protected virtual Parser<Expression> Parameter =>
from id in Identifier
from n in Parse.Not(Parse.Char('('))
select GetParameterExpression(id);
Parse.Not
parser is similar to negative lookahead in regular expressions: it doesn't move the current position and it succeeds if the parser given as parameter, in this case, Parse.Char('(')
, fails.
Handling Parameters
In contrast with custom functions which are bound to the instance of evaluator, parameters are bound to expressions. An expression can be evaluated multiple times using different parameter values. Named parameters are passed as Dictionary<string, double>
.
To handle parameters, we need to change our generated expression type from Expression<Func<double>>
(parameterless function returning double
value) to Expression<Func<Dictionary<string, double>, double>>
(a function that takes a dictionary of parameters and returns double
value). Parameterized expressions are used as follows:
var function = calc.ParseFunction("y/x").Compile();
var parameters = new Dictionary<string, double>{ { "x", 2 }, { "y", 4 } };
var result = function(parameters);
Let's generate a LINQ expression for accessing named parameter. If parameter name is unknown, we can try looking at System.Math
constants. This trick allows using built-in math constants such as PI and E automatically:
protected virtual Expression GetParameterExpression(string id)
{
var systemMathConstants = typeof(System.Math).GetFields(BindingFlags.Public | BindingFlags.Static);
var constant = systemMathConstants.FirstOrDefault(c => c.Name == id);
if (constant == null)
{
throw new ParseException(string.Format("Parameter or constant '{0}' does not exist.", id));
}
return Expression.Constant(constant.GetValue(null));
}
Generating an indexer access expression such as "parameters[name]" might look tricky unless you remember that C# compiler treats an indexer as a property named "Item
". By convention, property getters are named like "get_PropertyName
" and setters like "set_PropertyName
", so reading a named parameter is just calling "get_Item
" method:
var getItemMethod = typeof(Dictionary<string, double>).GetMethod("get_Item");
return Expression.Call(ParameterExpression, getItemMethod, Expression.Constant(name));
To keep things simple, our generated expression won't check whether the parameter with the given name exists. Dictionary
class will check it anyway and throw an exception in case of emergency. Here is a full body of parameter access generator code:
protected virtual Expression GetParameterExpression(string name)
{
var systemMathConstants = typeof(System.Math).GetFields(BindingFlags.Public | BindingFlags.Static);
var constant = systemMathConstants.FirstOrDefault(c => c.Name == name);
if (constant != null)
{
return Expression.Constant(constant.GetValue(null));
}
var getItemMethod = typeof(ParameterList).GetMethod("get_Item");
return Expression.Call(ParameterExpression, getItemMethod, Expression.Constant(name));
}
Syntactic Sugar for Parameters
Usage example in the beginning of this article features this syntactic sugar for specifying parameter values:
var expr = calc.ParseExpression("Sin(y/x)", x => 2, y => System.Math.PI);
This syntax is much more compact and easier to read than creating and populating a Dictionary<string, double>
instance. This syntax comes in handy when the list of available parameters is fixed. Although it's a bit off-topic, here's how it works:
public Expression<Func<double>> ParseExpression(string text,
params Expression<Func<double, double>>[] parameters)
{
var paramList = new Dictionary<string, double>();
foreach (var p in parameters)
{
var paramName = p.Parameters.Single().Name;
var paramValue = p.Compile()(0);
paramList[paramName] = paramValue;
}
return ParseExpression(text, paramList);
}
A Word About Unit Tests
Grammar inheritance adds extra burden for unit-tested Sprache parsers. Unit tests must ensure that every parser class still works with new and overridden rules. We have to test not only methods added in each class, but also all inherited methods.
The straightforward solution to this problem is using inheritance in unit tests. Base test class defines a virtual method CreateCalculator
that creates an instance of expression evaluator used in unit tests. Derived test class overrides CreateCalculator
method and returns a superclass instance instead, so all test methods inherited from base test class automatically test this superclass instance.
Conclusion
Grammar inheritance is a powerful technique that helps to keep complexity under control and to speed up the development of new parsers. Thanks to inheritance, most rules defined in base grammar are reused in derived grammars which allows to focus developer's efforts on new rules and features different from the base version. In Sprache-based parsers, grammar inheritance stimulates defining smaller reusable rules, step-by-step and test-driven development of parsers. Here is a quick summary:
- Declare parsers as virtual properties instead of static fields
- Decompose the grammar into small and reusable rules
- Write unit tests for every single atomic parser
- Use "
protected internal
" access modifier to enable unit testing - Unit tests must be organized in the same hierarchy as parser classes
References
History
- 10.07.2014: Initial post
- 20.02.2016: Updated the article to use C# 7.0 syntax.
- 10.03.2019: Added custom functions defined as
string
s (thanks to Andrey Leskov)