Building a Simple .NET Compiler

Jason R. Fruit

5.00/5 (25 votes)

26 Jan 2018MIT9 min read

36.7K

645

Implement a compiling calculator to learn about .NET CIL compilation

Download source - 3.4 KB

Introduction

Microsoft's .NET framework is designed for interoperability among many languages. To that end, its designers created a "one language to rule them all," a fairly low-level language that can access all the capabilities of the .NET CLR. It's called the Common Intermediate Language (CIL from here on); all other .NET languages compile to it. If the .NET virtual machine's bytecode is its machine code, CIL is its assembly language. It is human-readable without great effort and is more expressive than most assembly languages, making it an excellent compilation target, even for beginners: it's almost like writing a translator between two high-level languages.

In this article, we build the most basic .NET compiler: an integer calculator.

Why a Calculator?

There is no reason a calculator should compile to .NET CIL, or anything else; the time spent typing in expressions dwarfs the time even the most naïve interpreter would take to evaluate them. A compiling calculator is perfect for us, however, because it is a microcosm of a full compiler; we can follow the same steps as a compiler for a real language with much less complexity than even the most basic programming language would entail:

Tokenizing (breaking the code into its component symbols)
Parsing (converting those symbols into an internal representation of the semantic structures)
Compiling (converting the internal representation to the executable format, in this case CIL)

Calculator Overview

First, let's describe the calculator's input format. Ours is simpler than any pocket calculator: our input is limited to positive integers, the operators +, -, *, ÷ or /, ^, and %, and parentheses. We combine them to make expressions, like:

3 * (4 - 1) + 2 ^ 3
2 + 1
15 % 3 + 2

The grammar is tiny (see Wikipedia's article on Backus-Naur Form for more information on how it's expressed here):

expression → ['('] integer [ { operator expression } ] [')']
integer → {digit}
digit → '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
operator → '+' | '-' | '*' | '/' | '÷' | '^' | '%'

Our calculator needs to handle the order of operations correctly so the user doesn't have to fuss with rearranging or over-parenthesizing their expressions: the operations run in the order () → % or ^ → * or / → + or -. That is, exactly the order you learned in grade school (if you were paying attention).

The Implementation

Our input expressions will be processed by the constructor of a Program object which has three private methods:

Tokenize()
Parse()
Execute()

Tokenize() breaks the expression into its components; Parse() processes those components to create an internal postfix representation of the expression; and Execute() compiles that to CIL, which is immediately run.

Tokenizing

Our tokenizer will go through the code character by character, breaking it into tokens. Tokens are either (potentially multi-digit) numbers or single-character operators or parentheses; we know we're done reading a number when we see anything that is not a digit. We'll store the resulting tokens in a list of strings.

Our tokenizing algorithm is exceedingly simple:

for each character c in the code:
    if c is an operator:
        if there's a number being built, add it as a token
        add c as a token
    else if c is whitespace:
        if there's a number being built, add it as a token
    else if c is a digit:
        append c to the number accumulator
    else:
        throw a parsing error
end for

if there's a number left in the accumulator:
    add it as a token

Its implementation is likewise simple:

private void Tokenize(string expression) {
    // place to accumulate digits
    string number = "";

    // loop by char through code
    foreach (char c in expression) {

        // operators can end tokens as well as being one
        if (IsOperator(c.ToString()) ||
            "()".Contains(c.ToString())) {

            if (number != "") {
                Tokens.Add(number);
                number = "";
            }
            Tokens.Add(c.ToString());
        }
        // spaces end numbers
        else if (c == ' ') {
            if (number != "") {
                Tokens.Add(number);
                number = "";
            }
        }
        // digits can combine, so store each one
        else if ("0123456789".Contains(c.ToString())) {
            number += c;
        }
        else {
                // barf if you encounter something not a digit,
                // space, or operator
                throw new ParseException(
                    "Unexpected character '" + c.ToString() + "'.");
        }
    }

    // add the last token if there is one
    if (number != "") {
        Tokens.Add(number);
    }

    return;
}

Note that we've used the most naïve implementation of a token, each of which is just a string. If we wanted to be able to preserve information on the line and column numbers at which parsing or compilation error occurred, we'd have to use a token object that carried those numbers with it. For a real programming language, we'd need to, but our expressions are short enough that there's little need for that; the ParseException that's thrown if an unexpected character is encountered should be enough to point out to the user their error.

Parsing

.NET CIL uses a stack for most operations; that is, you push operands onto the stack and then issue instructions that operate on them. The most natural math representation for a stack machine is postfix notation, where the operators follow their operands; i.e. 3 + 4 → 3 4 +, or 12 / (3 + 1) → 12 3 1 + /. This notation allows you to push the operands onto the stack in the order they occur, applying operations as they are encountered. The only trick is to rearrange the tokens into the correct order, handling the order of operations and parenthesized expressions.

As in so many cases, there is an efficient and easy-to-implement algorithm for this that was conceived by the brilliant Edsger W. Dijkstra; I would likely not have arrived at it myself. It's called the shunting-yard algorithm, and it goes a little something like this:

Shunting-yard algorithm

Given a stream of tokens, create a stack of operators and an output list of tokens.

for each token in the stream
    if token is operator:
        while (stack.peek has higher precedence or
               (stack.peek has equal precedence and
                stack.peek is left-associative)) and
              stack.peek is not '(':
            pop from stack to end of output list
        push token to stack
    else if token is '(':
        push token to stack
    else if token is ')':
        while stack.peek is not '(':
            pop from stack to end of output list
        pop '(' from stack and discard
    else if token is numeric:
        append to output list
end for

while stack has operators:
    pop from stack to end of output list

Our implementation is almost as direct as the pseudocode above. For legibility, we move to a separate method, PopOperators(string token) all the logic that pops operators from the stack onto the output list when a new operator is encountered.

private void Parse() {

     foreach (string token in Tokens) {
         if (IsOperator(token)) {
             PopOperators(token);
             OperatorStack.Push(token);
         }
         else if (token == "(") {
             // parens only go on stack temporarily,
             // as they're not ops
             OperatorStack.Push(token);
         }
         else if (token == ")") {
             while (OperatorStack.Peek() != "(") {
                 Postfix.Add(OperatorStack.Pop());
             }
             OperatorStack.Pop();
         }
         else if (int.TryParse(token, out int throwaway)) {
             Postfix.Add(token);
         }
         else {
             throw new ParseException(
                 "Unrecognized token: " + token + ".");
         }
     }

     while (OperatorStack.Count > 0) {
         Postfix.Add(OperatorStack.Pop());
     }
 }

The PopOperators(token) method is a little ugly, because our algorithm pushes parentheses onto the operator stack; if we try to parse them as operators, it will fail. We'll make a virtue of necessity and use that condition to detect parentheses:

private static bool GreaterPrecedence(Operator a, Operator b) {
    return (a.Precedence > b.Precedence) ||
        ((a.Precedence == b.Precedence) &&
         (b.Associativity == Associativities.Left));
}

private void PopOperators(string token) {
    Operator cur = OperatorFromString(token);

    // if there are no operators, get out
    if (OperatorStack.Count == 0) {
        return;
    }

    try {
        for (Operator top = OperatorFromString(OperatorStack.Peek());
             (OperatorStack.Count > 0 && GreaterPrecedence(top, cur));
             top = OperatorFromString(OperatorStack.Peek())) {
            Postfix.Add(OperatorStack.Pop());
        }
    }
    catch (ParseException) {
        // it's a parenthesis, which can't be parsed as an operator
        return;
    }
}

I'm not sure I'm proud of that, but it's brief and effective.

At any rate, once Parse() has finished, we have a list of operands and operators in postfix notation, ready to be compiled and run when the consuming program calls the Execute() method.

Compilation

Compilation in our case is not at all difficult. Since we're already working with postfix notation, we will encounter the operands and operators in the correct order; we can translate each to CIL as it comes:

for each token in postfix:
    if token is an integer:
        emit CIL to push it to the stack
    else if token is an operator:
        emit CIL to consume operands and push result to the stack
    else:
        throw a compilation exception about an unrecognized token

So, as an example, if a user enters the expression:

3 * (4 + 2)

that will be parsed into postfix notation as 3 4 2 + *. When we compile it, we will issue instructions to:

Push 3 onto the stack
Push 4 onto the stack
Push 2 onto the stack
Add the top two numbers (2 and 4) and push the result (6) onto the stack — the stack now is [3 6].
Multiply the top two numbers (3 and 6) and push the result (18) onto the stack

The only number left on the stack after those instructions are executed is the result, 18. In CIL, the instructions will be:

JavaScript

ldc.i4 3
ldc.i4 4
ldc.i4 2
add
mul

Those instructions may look a little cryptic, but they're both easy to understand and well-documented. ldc.i4 loads its numeric argument to the stack as a 32-bit integer; add and mul add and multiply respectively.

Pre-compilation Housekeeping

We'll use the contents of the Reflection.Emit namespace to emit CIL. It's a clean, efficient, readable way to create an assembly on the fly. It also removes responsibility for running ilasm on saved CIL and executing an external executable, as you'll see: we can load and run the assembly without ever writing it to disk.

The first thing we need to do is create an assembly that executes in the same application domain as the current assembly:

JavaScript

// create everything needed to start writing CIL
AppDomain domain = AppDomain.CurrentDomain;
AssemblyName name = new AssemblyName();
name.Name = "CalculatorExpression";
// make a run-only assembly (can't save as .exe)
AssemblyBuilder asm = domain.DefineDynamicAssembly(
    name, AssemblyBuilderAccess.Run);

(Note that we could easily create a persistent assembly that we could save to disk as a separate executable — for that we'd use AssemblyBuilderAccess.Save. For our purposes, it's easiest to create, execute, and discard the assembly.)

Now that we have an assembly, we'll add a module to it, create a type to represent our expression, and add a WriteValue() method that will calculate the expression's value and write it to the console.

JavaScript

ModuleBuilder module = asm.DefineDynamicModule(
    "CalculatorExpressionModule");
TypeBuilder tpb = module.DefineType(
    "ExpressionExecutor", TypeAttributes.Public);
// the method that will hold our expression code
MethodBuilder meth = tpb.DefineMethod(
    "WriteValue", MethodAttributes.Public | MethodAttributes.Static);

The names of everything except the type and the method are irrelevant to us; we'll not refer to them again. Notice that the method is static; there is no need to instantiate a class for one calculation.

To write the CIL comprising our new method, we create a new ILGenerator:

JavaScript

// thing that writes CIL to the method
ILGenerator il = meth.GetILGenerator();

We can now use our ILGenerator to write any CIL we wish to the WriteValue method.

Compiling the Expression

We'll loop through the tokens in our postfix representation, handling them as they come. We try to parse an integer into an output variable; if we succeed, we emit CIL to push the result to the stack:

JavaScript

// push ints onto the stack
if (int.TryParse(token, out intToken)) {
    il.Emit(OpCodes.Ldc_I4, intToken);
}

If the token is not an integer, it should be an operator. In that case, we emit one or more CIL instructions based on what operator it is. In the case of the basic four instructions, i.e. +, -, *, and / or ÷, as well as the modulo operator %, the operation is a single instruction:

JavaScript

else if (IsOperator(token)) {
    switch (token) {
        case "+":
            il.Emit(OpCodes.Add);
            break;
        case "-":
            il.Emit(OpCodes.Sub);
            break;
        case "*":
            il.Emit(OpCodes.Mul);
            break;
        case "/":
        case "÷":
            il.Emit(OpCodes.Div);
            break;
        case "%":
            il.Emit(OpCodes.Rem);
            break;
        …

The power operator, ^, is different; there is no simple opcode to raise an integer to an integer power. In fact, there's no mscorlib method to do so directly, either; we need to use Math.Pow, which expects two float64's and returns one. That means we need to first define a local variable to hold one of the stack values, convert the other to a float64, push the variable's value on to the stack, and convert that to float64. Then we can call Math.Pow. Finally, we need to convert its return value to an int32. It's clearer in code:

JavaScript

case "^":

    // convert top two values to float64, since that's
    // what Math.Pow expects

    // create a temp variable
    LocalBuilder tmp = il.DeclareLocal(typeof(int));

    il.Emit(OpCodes.Stloc, tmp); // pop into temp variable
    il.Emit(OpCodes.Conv_R8);    // convert remaining value
    il.Emit(OpCodes.Ldloc, tmp); // push temp variable to stack
    il.Emit(OpCodes.Conv_R8);    // convert pushed value

    // call Math.Pow
    il.EmitCall(OpCodes.Call,
                typeof(Math).GetMethod("Pow"), null);

    // convert the float64 return value to int32
    il.Emit(OpCodes.Conv_I4);

    break;
    …

If the operator is not represented above, we'll throw an error: we've neglected to implement it. Finally, if the token is somehow neither an integer nor an operator, the expression probably contains unbalanced parentheses:

            …
            default:
                throw new CompileException(
                    "Unknown operator: '" + token + "'.");
        }

    }
    else {
        // if the token's not an integer or an operator, 
        // barf appropriately
        if ("()".Contains(token)) {
            throw new CompileException("Unbalanced parentheses.");
        }
        else {
            throw new CompileException(
                "Unable to compile expression; unknown token '" +
                token + "'.");
        }
    }
}

When the loop completes, all the code has been written to figure the expression's value, but we need to write it out. Our ILGenerator has a convenience method for WriteLine which requires us to assign the value to a variable:

JavaScript

// result needs a variable so we can WriteLine it
LocalBuilder result = il.DeclareLocal(typeof(int));
il.Emit(OpCodes.Stloc, result);

// print and return
il.EmitWriteLine(result);

Finally, we can return from the method:

JavaScript

il.Emit(OpCodes.Ret); // this will err if values left on stack

Housekeeping and Execution

That's all there is to compilation. There's a little housekeeping left, though. We need to finalize the type definition, now that we've added all the instructions it needs:

JavaScript

// bake the type definition
tpb.CreateType();

Once we've done that, we can no longer alter our assembly.

We've created our assembly, with a type carrying a method that does our calculation. Now, we need to run it. We don't have to instantiate a class to do so, since it's static; we just create an empty array of arguments, since it takes none, and call it:

JavaScript

// empty argument array
object[] noArgs = new object[0];

// call the method we wrote (on a null object because we don't
// need one for static method)
asm.GetType("ExpressionExecutor").GetMethod(
    "WriteValue").Invoke(null, noArgs);

That's it for our low-functioning compiler. Let's hook it up to some user input.

Putting Everything Together

The user can enter expressions to be compiled and executed. The program exits if they enter a blank line or stdin closes; it will report an unparseable or uncompileable expression.

JavaScript

class MainClass {

    private static string PromptExpression() {
        Console.WriteLine("Enter an integer arithmetic expression: ");
        return Console.ReadLine();
    }

    public static void Main(string[] args) {

        try {

            string expr;

            while ((expr = PromptExpression()) != "") {
                Program program = new Program(expr);
                program.Execute();
            }

        }
        catch (NullReferenceException) {
            // stdin EOF
            return;
        }
        catch (ParseException e) {
            Console.WriteLine(
                "Unparseable expression: " + e.Message);
        }
        catch (CompileException e) {
            Console.WriteLine(
                "Unable to compile expression: " + e.Message);
        }
    }
}

Here's a sample run:

Enter an integer arithmetic expression: 32 - 2 ^ 4
16
Enter an integer arithmetic expression: 16 - 2 ^ (2 + 2)
0
Enter an integer arithmetic expression: 12 / 2 ^ 2
3
Enter an integer arithmetic expression: 18 % 4
2
Enter an integer arithmetic expression: 18 % (2 + 3)
3
Enter an integer arithmetic expression: 
12 * (2 + 1
Unable to compile expression: Unbalanced parentheses.

What Is Not Done

There are several problems with our calculator that would prevent using it for any real purpose:

Even within our limited input syntax, it's possible to slip invalid input past the parser, which should not happen. For example, the meaningless expression 3 3 + 4 will be converted to meaningless postfix notation, and passed to the compiler, which emits CIL that will throw an exception when it tries to return with a value still on the stack. (The fix is trivial: ensure that all expressions start with an integer, and that integers and operators alternate (ignoring parentheses). Adding that, however, would have obscured the more educational features of our implementation.)
/ is almost useless without floats.
Even if we don't allow floats, signed integers are necessary.
Several classes of error are not handled.
Integer overflow is accepted unquestioningly, which is almost never the right decision.

But as we've already agreed that ours is not a practical way to write a calculator, let's just be pleased with what it taught us about compilers.

License

This article, along with any associated source code and files, is licensed under The MIT License