Click here to Skip to main content
65,938 articles
CodeProject is changing. Read more.
Articles / Languages / C#4.0

TinyLisp: A Language and Parser to See LINQ Expressions in Action

4.95/5 (10 votes)
12 Jun 2010CPOL11 min read 34.4K   344  
A small program that parses expressions and evaluates them using LINQ Expressions

Introduction

Lately, I've been playing around some more with Linq Expression trees. They're pretty awesome, but for some reason the examples don't seem to move beyond creating a simple hardcoded expression and executing it. Nice from an academic point of view, but since the expression was hardcoded, it's never going to change its behaviour and might as well have been written in plain C#. So I tried a little project do create them on the fly.

To be able to create expressions dynamically, we need some way to change the definition they're created from. This could be an XML document, but I thought it would be nice to write a very simple parser, because that's where ExpressionTrees usually come from. (Yes kids, really! They don't come from the stork, time to man up now and face parsers ;-).)

In this article, I'll present a very simple expression language that I'll call TinyLisp. The program with this article evaluates expressions written in it and displays the result. It uses Linq.Expressions under the hood to make all that work. You can enter an expression and see how it's parsed into an abstract syntax tree, then converted to a Linq expression tree and finally executed.

TinyLisp.jpg

Here's what we're going to cover:

  • First, we will design our little language. We'll cut some smart corners here, so the rest can be done within one article.
  • Then, we'll write the code to parse these expressions. This will be a lot easier than you might expect. Because of how we designed our language, we can actually write this parser in one method.
  • Now that we have our parser that creates abstract syntax trees, we need something that will transform them into Linq.Expression trees. There's a little difference between our language and Linq that we need to bridge, so we'll need to rewrite that tree.
  • Now we're home free! Compiling and executing Linq Expressions is something kids and grannies do nowadays, so we'll only briefly touch on that.

So without further ado, let's design that little language.

The Language

The main design goal of our language is that it should be extremely easy to parse. We'll do that by writing down expressions in a bit of an unusual way. Here's an expression written in our language, that we will call TinyLisp.

C#
(+ 1 2 3)

It looks a little strange at first because it is written in prefix notation that probably needs a little explanation. In school, we were taught to write that same expression like this:

C#
1 + 2 + 3

This is called infix notation. The plusses are the operators, and 1, 2 and 3 are the operands that they operate on. With infix notation, the operator sits between two operands. In our prefix notation, the operator is in front of the operands, and they're all surrounded by parentheses.

Note that with prefix, we only needed 1 operator for 3 operands. Because they are within parentheses, there can be as many operands as you like. To do that in infix, you need to chain the operators between each pair of operands. That's why we needed 2 plusses for infix. This is the little difference between the two I mentioned that we'll have to bridge later on.

Now see another expression written in infix:

C#
5 + 4 * 3 + 2 / 1

This is a bit of a pickle with infix notation. To work out that the answer is 19, you need to know that division comes before multiplication, and that multiplication comes before addition. There's plenty of theory on how to parse infix notation, but it would complicate our parser way too much. This article is more about Linq Expressions, so we want to simplify our job. We'll apply the precedence rules ourselves and add parentheses to indicate the order of evaluation:

C#
5 + (4 * 3) + (2 / 1)

That's better already, but when we'd try to parse this it would still be little complicated because of the order in which the expressions appear. When we move the operator to the front of each (sub)expression, we get this, which is the prefix notation syntax we're after:

C#
(+ 5 (* 4 3) (/ 2 1))

Now the user has put in the parentheses himself which saves the parser a lot of thinking. Apart from that, the order is very convenient. We first get our operator, then the operands, and the entire expression is neatly enclosed within parentheses. This sequential flow from left to right allows our parser to be very simple.

Our language definition in EBNF now looks like this:

expression = '('   ( '+' | '-'  | '*' | '/'  )    [  ['0'..'9'] | expression ]+    ')'

In normal words: an expression consists of an opening parenthesis, then one of these operands +,-,*,/ then one or more integers or expressions, and finally a closing parenthesis.

As an extra corner we can cut, we will only deal with the numbers [0 .. 9], and only with operators: [+, -, *, /]. This way, every thing we will encounter is just 1 character long. Or, every single character is a token. In short, a token is like a word in a sentence. I won't elaborate on these topics here, if you're interested, check here. The point here is that we only need to enumerate the chars in the string to get a stream of tokens. We can do that.

Parsing Our Language

Parsing means to take a string and convert that into something a computer can understand: an abstract syntax tree. To revisit our example, this expression...

C#
"(+ 5 (* 4 3) (/ 2 1))"

... is just a bunch of chars to a computer. It needs to be turned into a tree structure that a program can work on:

     +
 /  |    \
5   *    '/'
   / \   /  \
  4   3 2    1

That's what our parser will do. But how you ask, aren't parsers difficult? Well, that's where that prefix notation really comes in handy. The language design we chose basically IS a tree notation. Each opening parenthesis '(' marks the beginning of an expression. The first character after that is the operator. Then come one or more operands that can either be an integer or a new subexpression. Finally the expression is closed with a closing parenthesis. Then this (sub)expression is done and we can return it. Let's write that down in a bit of pseudo code:

C#
Expression GetExpression:                                
   Check opening parenthesis                           
   Get Operator                                        
   Loop:                                               
       Get next token          
       If token = '(' then Operands.Add GetExpression  // here we recurse for the 
						// subexpression
       If token = ')' break loop                        
       If token in [0..9] then Operands.Add token    

This is basically all we need to do to read these prefix expressions into a tree. Sounds doable. Below is the whole thing written down in C#, after this short introduction.

The method in the listing is actually the constructor of the class ExpressionNode. To use it, call something along these lines:

C#
ExpressionNode n = new ExpressionNode("(+ 5 4)")

Instead of a normal string, the real thing takes an argument of type CharStream. This CharStream is a simple wrapper class I wrote that takes a string and then exposes the individual chars one by one. The reason this class is used over a normal string is that we use recursion to parse the string into a tree. When we return from a subcall, we want to continue on the string where the subcall left off, not where it started. By passing this CharStream by reference, we get exactly that behaviour. It also handles some simple chores like skipping whitespace.

So here's our parser:

C#
// constructor to create expression tree by parsing a string
public ExpressionNode(CharStream Stream) {

    // keep track of start of current expression within whole expression
    this.Start = Stream.Position;
    this.StringExpression = Stream.Stream;
    
    // assert that we have a opening parenthesis to start this expression 
    if (Stream.Current != '(') throw new Exception("a (sub)expression 
    	must always start with a '(' at position " + Stream.Current);
    
    // skip to next char 
    Stream.MoveNext(); 
    
    // get the operator which should be here now
    if (Operators.Contains(Stream.Current)) this.Operator = Stream.Current;
    else throw new Exception("Unexpected character '" + 
    	Stream.Current + "' at position " + Stream.Position + 
    		".\n\nThe first character after the '(' at position " + 
    		this.Start + " that started this (sub)expression 
    		must be its operator [+, -, *, /].");
    
    // loop over the characters in the stream and try to figure them out one by one
    while (true) {
        // skip to next character
        Stream.MoveNext(); 
        
        // is it a '(' marking the start of new sub expression?
        if (Stream.Current == '(') {
            // start a new sub expression from here and add that (recurse)
            this.Operands.Add(new ExpressionNode(Stream));
            continue;
        }
        
        //  is it a ')' marking end of current expression?
        if (Stream.Current == ')') {
            // this expression is done, mark ending and break loop 
            this.End = Stream.Position;
            break;
        }
        
        // is it an int?                
        if (Ints.Contains(Stream.Current)) {
            // add int and loop to go to next char
            this.Operands.Add(new IntNode(Stream.Current));
            continue;
        }                
        
        // If none of the above was true, we have an error. 
        // Our language does not allow this situation.            
        throw new Exception("Error, expecting either '(', ')' or [0 .. 9]");
    }
}

That's it. It checks that the first character is a '(', then it reads the operator. Then it loops until the closing parenthesis looking for either integers or subexpressions. When it comes across a ')', it's done and returns the (sub)expression as a tree structure. I encourage you to step through this code in the debugger to see how it all fits together.

Rewriting our Tree

Here's our tree again. If you look at it closely, you can see that we have a little problem because the top '+' operator has three arguments: 5, * and /. That's a no-no in Linq Expressions, because that only has BinaryExpressions that take 2 operands.

     +
 /  |    \
5   *    '/'  <<< 3 operands for the + operator on this line...
   / \   /  \
  4   3 2    1

The problem here is that we have 3 operands and each operator can only accept 2. What we can do is use 2 operators, where one is the child of the other. The top one has 1 operator left and the bottom 2, totalling the 3 operators we needed. Let's put that to practise on our tree:

The top + operator with 3 operands, splits off a new + operator. This new one takes the last two operands and becomes the righthand operand of the original. Now each operator has two operands and we're good.

    +
 /   \
5     +   <<< this is the new node that was split off
    /   \
   *    '/'
  / \   /  \
 4   3 2    1

Before we go to the code for that, however, there's another issue we need to tackle.

Converting our Syntax Tree to a Linq Expression Tree

Besides the need to have a definition for our expressions that we could change easily, there's another reason why we needed our own syntax tree. The reason is that you can't just put together a Linq expression tree together any way you like. Because of the way the constructors are defined, you're forced to create them bottom up. Take, for instance, the BinaryOperator that we need to evaluate our expressions, we can't just go and say:

C#
BinaryOperator b = new BinaryOperator()

because you'll get this message:

Error	56	The type 'System.Linq.Expressions.BinaryExpression' 
		has no constructors defined

No constructors at all .. but .. then .. where DO they come from? Again, no storks involved, Expressions come from static methods on the Expression base class. To create a BinaryExpression that adds 2 numbers, here's the static method on Expression you need:

C#
public static BinaryExpression Add(Expression left, Expression right);

The problem with this however is that you need to have left and right ready before you can create the BinaryExpression. By having our own abstract syntax tree, we can first construct that any way we like and then convert it into a Linq Expression tree by traversing it depth first. When we hit the leaves and come back up, we'll have the operands we need ready and we can create the Linq Expression tree bottom up.

Here's the code on the OperatorNode of our syntax tree that does just that. This code kills two birds with one stone: First, it rewrites our tree so every operator only has two operands. It does that by recursing down and inserting nodes where needed. Then, on the way back up, it creates Linq.Expression nodes from that binary tree and returns the whole expression, rewritten as a Linq Expression.

The reason the two are combined here is that they fit together very nicely since they both use depth first recursion to walk the tree. First converting our own syntax to a binary tree and then convert that to a Linq tree would just be double work. Here it is:.

C#
    private Expression ToExpressionInternal(int Child) {
    // if this is the last (or only) child, just return that. 
    // Can't have a binary operator with just 1 operand.
    if (Child == Operands.Count - 1) {
        return Operands[Child].ToExpression();
    } else {
        // so we have more than one, let's deal with that case:
        
        // assign the current node to the left operand
        Expression left = Operands[Child].ToExpression();
        
        // Recurse down for the remaining child(ren) for the righthand operand. 
        // If there's more than 1 remaining they'll need to be nested into a tree also
        Expression right = ToExpressionInternal(Child + 1);
        
        // we're done recursing down and are now coming back up
        // Put left and right together in a BinaryOperator and return that subtree
        switch (Operator) {
            case '+': return Expression.Add(left, right);
            case '-': return Expression.Subtract(left, right);
            case '*': return Expression.Multiply(left, right);
            case '/': return Expression.Divide(left, right);
            default: throw new Exception();
        }
    }
}

Let me talk you through a simple case where the operator has 2 operands.

  1. The code gets called externally for the first time with Child = 0, to deal with the first operand.
  2. This means that (0 == 2 -1) is false and it goes into the 'else' branch.
  3. Child 0 is put into the left operand.
  4. To work out the righthand operand, the function calls itself with Child + 1.
    1. (1 == 2 - 1) is true, so it goes into the true branch
    2. The second operand is returned as an int node
  5. The right hand operand now is that Int node
  6. Both get put into a BinaryExpression when we call return Expression.Add(left, right);

For three and more operands, it will keep nesting the expressions into the righthand operand until it has turned them into a binary tree. This is better watched in the debugger than written out here, it wouldn't make a lot of sense writing that down.

When there are no more nodes to go down the tree, the right hand operand is assigned, and both get put into a BinaryOperator which is returned. For deep trees, this will return to where the recursive call was made. For the top node, the complete Linq.Expression is returned. Et voila, the tree, nicely converted!

Compiling and Executing our Linq Expression

We're almost there! Now all we need to do is put our expression into a lambda, and compile and execute that. Randomly punching some knobs on my laptop resulted in this little snippet:

C#
// convert to Linq.Expression
Expression eExpression = nExpression.ToExpression();
                
// put it into a Lambda
Expression<Func<int>> lExpression = Expression.Lambda<Func<int>>(eExpression);
                
// Compile the lambda
Func<int> fExpression = lExpression.Compile();  

The first step calls the code we described above and converts our syntax tree into a Linq Expression tree. Then we put that into a lambda expression. The expression defines what needs to be done, but the lambda lets you actually do that. That's what the lambda does, it's the boilerplate stuff that lets call your expression trees. It handles stuff like parameters. When we have that Lambda, we call Compile() on it, that returns a Func<int>.

Func<int> is shorthand for a delegate int. So now we have a delegate that we can call to execute our expression. After all this work, it would be silly not to, so here goes:

C#
 // and execute the compiled lambda to see the result
MessageBox.Show(fExpression());

Using the Code

Using the program is pretty easy. When you run it, you can enter an expression in the textbox on the left and press the button. The program will then parse that expression, convert it to a Linq Expression tree and execute that. It also displays the two expression trees side by side, so the user can compare the approaches. Just set a breakpoint on button_click and step through the whole parser en conversion routines should give you a good start on parsers and syntax trees.

Here are a few sample expressions to get you started:

(+ 1 2 3)       		see how the prefix expression can have any 
			number of operands, where the Linq tree has only 
			two and needs to be nested
 
(+ 5 (* 4 3) (/ 2 1))   	the expression we used as example
 
(+123(*456))            	see tree rewriting in action again
 
(+123456789)    		really see how binary expression are a pretty blunt tool. 
			Also note that the digits are seen as separate numbers, 
			not one large number. The parser only looks at separate chars.
 
(+ 1 2 (* 3 4))         	an expression with a subexpression
 
(+(-(*(/12)34(*56))7)(+89)1)2)  some deep expression

That's all folks. I hope you have fun and learn a little playing around with this. If you like it, please let me know in the comments or rate my article.

History

  • 12th June, 2010: Initial post

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)