Introduction
Microsoft's .NET framework is designed for interoperability among many languages. To that end, its designers created a "one language to rule them all," a fairly low-level language that can access all the capabilities of the .NET CLR. It's called the Common Intermediate Language (CIL from here on); all other .NET languages compile to it. If the .NET virtual machine's bytecode is its machine code, CIL is its assembly language. It is human-readable without great effort and is more expressive than most assembly languages, making it an excellent compilation target, even for beginners: it's almost like writing a translator between two high-level languages.
In this article, we build the most basic .NET compiler: an integer calculator.
Why a Calculator?
There is no reason a calculator should compile to .NET CIL, or anything else; the time spent typing in expressions dwarfs the time even the most naïve interpreter would take to evaluate them. A compiling calculator is perfect for us, however, because it is a microcosm of a full compiler; we can follow the same steps as a compiler for a real language with much less complexity than even the most basic programming language would entail:
- Tokenizing (breaking the code into its component symbols)
- Parsing (converting those symbols into an internal representation of the semantic structures)
- Compiling (converting the internal representation to the executable format, in this case CIL)
Calculator Overview
First, let's describe the calculator's input format. Ours is simpler than any pocket calculator: our input is limited to positive integers, the operators +, -, *, ÷ or /, ^, and %, and parentheses. We combine them to make expressions, like:
3 * (4 - 1) + 2 ^ 3
2 + 1
15 % 3 + 2
The grammar is tiny (see Wikipedia's article on Backus-Naur Form for more information on how it's expressed here):
expression → ['('] integer [ { operator expression } ] [')']
integer → {digit}
digit → '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
operator → '+' | '-' | '*' | '/' | '÷' | '^' | '%'
Our calculator needs to handle the order of operations correctly so the user doesn't have to fuss with rearranging or over-parenthesizing their expressions: the operations run in the order () → % or ^ → * or / → + or -. That is, exactly the order you learned in grade school (if you were paying attention).
The Implementation
Our input expressions will be processed by the constructor of a Program
object which has three private
methods:
Tokenize()
Parse()
Execute()
Tokenize()
breaks the expression into its components; Parse()
processes those components to create an internal postfix representation of the expression; and Execute()
compiles that to CIL, which is immediately run.
Tokenizing
Our tokenizer will go through the code character by character, breaking it into tokens. Tokens are either (potentially multi-digit) numbers or single-character operators or parentheses; we know we're done reading a number when we see anything that is not a digit. We'll store the resulting tokens in a list of string
s.
Our tokenizing algorithm is exceedingly simple:
for each character c in the code:
if c is an operator:
if there's a number being built, add it as a token
add c as a token
else if c is whitespace:
if there's a number being built, add it as a token
else if c is a digit:
append c to the number accumulator
else:
throw a parsing error
end for
if there's a number left in the accumulator:
add it as a token
Its implementation is likewise simple:
private void Tokenize(string expression) {
// place to accumulate digits
string number = "";
// loop by char through code
foreach (char c in expression) {
// operators can end tokens as well as being one
if (IsOperator(c.ToString()) ||
"()".Contains(c.ToString())) {
if (number != "") {
Tokens.Add(number);
number = "";
}
Tokens.Add(c.ToString());
}
// spaces end numbers
else if (c == ' ') {
if (number != "") {
Tokens.Add(number);
number = "";
}
}
// digits can combine, so store each one
else if ("0123456789".Contains(c.ToString())) {
number += c;
}
else {
// barf if you encounter something not a digit,
// space, or operator
throw new ParseException(
"Unexpected character '" + c.ToString() + "'.");
}
}
// add the last token if there is one
if (number != "") {
Tokens.Add(number);
}
return;
}
Note that we've used the most naïve implementation of a token, each of which is just a string
. If we wanted to be able to preserve information on the line and column numbers at which parsing or compilation error occurred, we'd have to use a token object that carried those numbers with it. For a real programming language, we'd need to, but our expressions are short enough that there's little need for that; the ParseException
that's thrown if an unexpected character is encountered should be enough to point out to the user their error.
Parsing
.NET CIL uses a stack for most operations; that is, you push operands onto the stack and then issue instructions that operate on them. The most natural math representation for a stack machine is postfix notation, where the operators follow their operands; i.e. 3 + 4 → 3 4 +, or 12 / (3 + 1) → 12 3 1 + /. This notation allows you to push the operands onto the stack in the order they occur, applying operations as they are encountered. The only trick is to rearrange the tokens into the correct order, handling the order of operations and parenthesized expressions.
As in so many cases, there is an efficient and easy-to-implement algorithm for this that was conceived by the brilliant Edsger W. Dijkstra; I would likely not have arrived at it myself. It's called the shunting-yard algorithm, and it goes a little something like this:
Shunting-yard algorithm
Given a stream of tokens, create a stack of operators and an output list of tokens.
for each token in the stream
if token is operator:
while (stack.peek has higher precedence or
(stack.peek has equal precedence and
stack.peek is left-associative)) and
stack.peek is not '(':
pop from stack to end of output list
push token to stack
else if token is '(':
push token to stack
else if token is ')':
while stack.peek is not '(':
pop from stack to end of output list
pop '(' from stack and discard
else if token is numeric:
append to output list
end for
while stack has operators:
pop from stack to end of output list
Our implementation is almost as direct as the pseudocode above. For legibility, we move to a separate method, PopOperators(string token)
all the logic that pops operators from the stack onto the output list when a new operator is encountered.
private void Parse() {
foreach (string token in Tokens) {
if (IsOperator(token)) {
PopOperators(token);
OperatorStack.Push(token);
}
else if (token == "(") {
// parens only go on stack temporarily,
// as they're not ops
OperatorStack.Push(token);
}
else if (token == ")") {
while (OperatorStack.Peek() != "(") {
Postfix.Add(OperatorStack.Pop());
}
OperatorStack.Pop();
}
else if (int.TryParse(token, out int throwaway)) {
Postfix.Add(token);
}
else {
throw new ParseException(
"Unrecognized token: " + token + ".");
}
}
while (OperatorStack.Count > 0) {
Postfix.Add(OperatorStack.Pop());
}
}
The PopOperators(token)
method is a little ugly, because our algorithm pushes parentheses onto the operator stack; if we try to parse them as operators, it will fail. We'll make a virtue of necessity and use that condition to detect parentheses:
private static bool GreaterPrecedence(Operator a, Operator b) {
return (a.Precedence > b.Precedence) ||
((a.Precedence == b.Precedence) &&
(b.Associativity == Associativities.Left));
}
private void PopOperators(string token) {
Operator cur = OperatorFromString(token);
// if there are no operators, get out
if (OperatorStack.Count == 0) {
return;
}
try {
for (Operator top = OperatorFromString(OperatorStack.Peek());
(OperatorStack.Count > 0 && GreaterPrecedence(top, cur));
top = OperatorFromString(OperatorStack.Peek())) {
Postfix.Add(OperatorStack.Pop());
}
}
catch (ParseException) {
// it's a parenthesis, which can't be parsed as an operator
return;
}
}
I'm not sure I'm proud of that, but it's brief and effective.
At any rate, once Parse()
has finished, we have a list of operands and operators in postfix notation, ready to be compiled and run when the consuming program calls the Execute()
method.
Compilation
Compilation in our case is not at all difficult. Since we're already working with postfix notation, we will encounter the operands and operators in the correct order; we can translate each to CIL as it comes:
for each token in postfix:
if token is an integer:
emit CIL to push it to the stack
else if token is an operator:
emit CIL to consume operands and push result to the stack
else:
throw a compilation exception about an unrecognized token
So, as an example, if a user enters the expression:
3 * (4 + 2)
that will be parsed into postfix notation as 3 4 2 + *. When we compile it, we will issue instructions to:
- Push 3 onto the stack
- Push 4 onto the stack
- Push 2 onto the stack
- Add the top two numbers (2 and 4) and push the result (6) onto the stack — the stack now is [3 6].
- Multiply the top two numbers (3 and 6) and push the result (18) onto the stack
The only number left on the stack after those instructions are executed is the result, 18. In CIL, the instructions will be:
ldc.i4 3
ldc.i4 4
ldc.i4 2
add
mul
Those instructions may look a little cryptic, but they're both easy to understand and well-documented. ldc.i4
loads its numeric argument to the stack as a 32-bit integer; add
and mul
add and multiply respectively.
Pre-compilation Housekeeping
We'll use the contents of the Reflection.Emit namespace
to emit CIL. It's a clean, efficient, readable way to create an assembly on the fly. It also removes responsibility for running ilasm
on saved CIL and executing an external executable, as you'll see: we can load and run the assembly without ever writing it to disk.
The first thing we need to do is create an assembly that executes in the same application domain as the current assembly:
AppDomain domain = AppDomain.CurrentDomain;
AssemblyName name = new AssemblyName();
name.Name = "CalculatorExpression";
AssemblyBuilder asm = domain.DefineDynamicAssembly(
name, AssemblyBuilderAccess.Run);
(Note that we could easily create a persistent assembly that we could save to disk as a separate executable — for that we'd use AssemblyBuilderAccess.Save
. For our purposes, it's easiest to create, execute, and discard the assembly.)
Now that we have an assembly, we'll add a module to it, create a type to represent our expression, and add a WriteValue()
method that will calculate the expression's value and write it to the console.
ModuleBuilder module = asm.DefineDynamicModule(
"CalculatorExpressionModule");
TypeBuilder tpb = module.DefineType(
"ExpressionExecutor", TypeAttributes.Public);
MethodBuilder meth = tpb.DefineMethod(
"WriteValue", MethodAttributes.Public | MethodAttributes.Static);
The names of everything except the type and the method are irrelevant to us; we'll not refer to them again. Notice that the method is static
; there is no need to instantiate a class for one calculation.
To write the CIL comprising our new method, we create a new ILGenerator
:
ILGenerator il = meth.GetILGenerator();
We can now use our ILGenerator
to write any CIL we wish to the WriteValue
method.
Compiling the Expression
We'll loop through the tokens in our postfix representation, handling them as they come. We try to parse an integer into an output variable; if we succeed, we emit CIL to push the result to the stack:
if (int.TryParse(token, out intToken)) {
il.Emit(OpCodes.Ldc_I4, intToken);
}
If the token is not an integer, it should be an operator. In that case, we emit one or more CIL instructions based on what operator it is. In the case of the basic four instructions, i.e. +, -, *, and / or ÷, as well as the modulo operator %, the operation is a single instruction:
else if (IsOperator(token)) {
switch (token) {
case "+":
il.Emit(OpCodes.Add);
break;
case "-":
il.Emit(OpCodes.Sub);
break;
case "*":
il.Emit(OpCodes.Mul);
break;
case "/":
case "÷":
il.Emit(OpCodes.Div);
break;
case "%":
il.Emit(OpCodes.Rem);
break;
…
The power operator, ^, is different; there is no simple opcode to raise an integer to an integer power. In fact, there's no mscorlib
method to do so directly, either; we need to use Math.Pow
, which expects two float64
's and returns one. That means we need to first define a local variable to hold one of the stack values, convert the other to a float64
, push the variable's value on to the stack, and convert that to float64
. Then we can call Math.Pow
. Finally, we need to convert its return value to an int32
. It's clearer in code:
case "^":
LocalBuilder tmp = il.DeclareLocal(typeof(int));
il.Emit(OpCodes.Stloc, tmp);
il.Emit(OpCodes.Conv_R8);
il.Emit(OpCodes.Ldloc, tmp);
il.Emit(OpCodes.Conv_R8);
il.EmitCall(OpCodes.Call,
typeof(Math).GetMethod("Pow"), null);
il.Emit(OpCodes.Conv_I4);
break;
…
If the operator is not represented above, we'll throw an error: we've neglected to implement it. Finally, if the token is somehow neither an integer nor an operator, the expression probably contains unbalanced parentheses:
…
default:
throw new CompileException(
"Unknown operator: '" + token + "'.");
}
}
else {
if ("()".Contains(token)) {
throw new CompileException("Unbalanced parentheses.");
}
else {
throw new CompileException(
"Unable to compile expression; unknown token '" +
token + "'.");
}
}
}
When the loop completes, all the code has been written to figure the expression's value, but we need to write it out. Our ILGenerator
has a convenience method for WriteLine
which requires us to assign the value to a variable:
LocalBuilder result = il.DeclareLocal(typeof(int));
il.Emit(OpCodes.Stloc, result);
il.EmitWriteLine(result);
Finally, we can return from the method:
il.Emit(OpCodes.Ret);
Housekeeping and Execution
That's all there is to compilation. There's a little housekeeping left, though. We need to finalize the type definition, now that we've added all the instructions it needs:
tpb.CreateType();
Once we've done that, we can no longer alter our assembly.
We've created our assembly, with a type carrying a method that does our calculation. Now, we need to run it. We don't have to instantiate a class to do so, since it's static
; we just create an empty array of arguments, since it takes none, and call it:
object[] noArgs = new object[0];
asm.GetType("ExpressionExecutor").GetMethod(
"WriteValue").Invoke(null, noArgs);
That's it for our low-functioning compiler. Let's hook it up to some user input.
Putting Everything Together
The user can enter expressions to be compiled and executed. The program exits if they enter a blank line or stdin closes; it will report an unparseable or uncompileable expression.
class MainClass {
private static string PromptExpression() {
Console.WriteLine("Enter an integer arithmetic expression: ");
return Console.ReadLine();
}
public static void Main(string[] args) {
try {
string expr;
while ((expr = PromptExpression()) != "") {
Program program = new Program(expr);
program.Execute();
}
}
catch (NullReferenceException) {
return;
}
catch (ParseException e) {
Console.WriteLine(
"Unparseable expression: " + e.Message);
}
catch (CompileException e) {
Console.WriteLine(
"Unable to compile expression: " + e.Message);
}
}
}
Here's a sample run:
Enter an integer arithmetic expression: 32 - 2 ^ 4
16
Enter an integer arithmetic expression: 16 - 2 ^ (2 + 2)
0
Enter an integer arithmetic expression: 12 / 2 ^ 2
3
Enter an integer arithmetic expression: 18 % 4
2
Enter an integer arithmetic expression: 18 % (2 + 3)
3
Enter an integer arithmetic expression:
12 * (2 + 1
Unable to compile expression: Unbalanced parentheses.
What Is Not Done
There are several problems with our calculator that would prevent using it for any real purpose:
- Even within our limited input syntax, it's possible to slip invalid input past the parser, which should not happen. For example, the meaningless expression
3 3 + 4
will be converted to meaningless postfix notation, and passed to the compiler, which emits CIL that will throw an exception when it tries to return with a value still on the stack. (The fix is trivial: ensure that all expressions start with an integer, and that integers and operators alternate (ignoring parentheses). Adding that, however, would have obscured the more educational features of our implementation.) /
is almost useless without floats. - Even if we don't allow floats, signed integers are necessary.
- Several classes of error are not handled.
- Integer overflow is accepted unquestioningly, which is almost never the right decision.
But as we've already agreed that ours is not a practical way to write a calculator, let's just be pleased with what it taught us about compilers.