Writing Interpreters and Compilers
Writing an interpreter or a compiler is one of the most educational tasks in programming because you can become familiarized with the details of the code interpretation and evaluation process. You can obtain much deeper knowledge of what sorts of things are going on behind the scenes and gain some insights into the decisions behind language design.
This article will perform a basic overview of this process by showing how to write an interpreter for a simple language that we can use for a calculator application.
This article assumes that you have intermediate programming experience and basic familiarity with JavaScript.
Some Background
The Difference Between Interpreters, Compilers and Transpilers
In this article, we will be creating an interpreter, as opposed to a compiler. Our program will take some code as input and immediately execute it.
If we were writing a compiler, we would instead transform the inputted code in the source language into code in some lower-level target language, such as MSIL, or an assembly language, or even machine code.
A transpiler is similar to a compiler, except that the source language and target language are about the same level of abstraction. The canonical example of this in the client side web programming world is CoffeeScript, which transcompiles into JavaScript.
The Traditional Language Code Interpretation Process
Most compilers will interpret code in a multi-step process that involves processes that include lexical analysis, preprocessing, parsing, optimization, and finally code generation, or, in the case of an interpreter, evaluation.
In this article, we will only focus on lexing, parsing, and evaluation.
Lexing
A lexer takes a input a string of characters, and outputs a list of tokens. For example, if we had some code like this...
(12 + 4) / 6
...the lexer would divide it up into the individual parts, called tokens, and output a list that might look something like this
[
["operator", "("],
["number", 12],
["operator", "+"],
["number", 4],
["operator", ")"],
["operator", "/"],
["number", 6]
]
This process is usually relatively simple. Somebody with a little bit of experience with string manipulation can work their way through building a lexer fairly quickly.
Parsing
The parser takes the list of tokens produced by the lexer as input, parses it according to some syntax rules, and outputs a representation of the syntactic structure called a parse tree.
{
operation: "/",
left: {
operation: "+",
left: 12,
right: 4
}
right: 6
}
This tends to be the most difficult part of the process.
Evaluation
The evaluator takes the parse tree produced by the parser and evaluates it. Evaluators usually traverse the parse tree recursively. Given the syntax tree above, the evaluator might first evaluate the left operand of the top-level /
operation, then the right operand, and then return the result of the division.
Evaluating the left operand would simplify the syntax tree to this:
{
operation: "/",
left: 16,
right: 6
}
which in turn will evaluate to the final result:
2.6666666666666665
AEL: The Calculator Language
Before we can start writing our interpreter, we need to understand the language that we will be interpreting, which I just made up and will refer to as AEL, short for Arithmetic Expression Language.
The language is written as a series of arithmetic expressions composed of numbers and the arithmetic operators +
(addition), -
(subtraction and negation), *
(multiplication), /
(division), %
(modulo), ^
(exponentiation), as well as parentheses ()
for grouping.
The interpreter will evaluate each expression and print the result of each in the order that they were presented. For example, given the following input...
3
2 ^ 8
(12 % 7) * (3 + 2)
19 / -9
...the interpreter will output:
3
256
25
-2.111111111111111
AEL also will allow you to define variable assignments using the =
operator. The left hand side will be some identifier, and the right hand side will be an arithmetic expression. A statement with an assignment has no return value, so the interpreter will not print out a corresponding line.
For example:
hoursPerDay = 24
minutesPerHour = 60
minutesPerDay = minutesPerHour * hoursPerDay
minutesPerDay
minutesPerDay * 60
outputs:
1440
86400
AEL will have two pre-defined variables: pi
and e
, which will correspond to the values 3.141592653589793
and 2.718281828459045
, respectively.
Finally, AEL will allow you to define functions. A function assignment also uses the =
operator, but the left hand argument must be an identifier followed by parentheses, which will contain zero or more argument names separated by commas. Once defined, a function can be called by writing an identifier name followed by parentheses containing zero or more arithmetic expressions separated by commas.
For example:
toDegrees(radians) = radians * 180 / pi
toDegrees(2 * pi)
cylinderVolume(r, h) = pi * r ^ 2 * h
cylinderVolume(2, 4)
outputs:
360
50.26548245743669
Now that we know what the language is like, we can start writing the interpreter.
For reference, I have put an implementation of an AEL interpreter online.
Step 1: The Lexer
The lexer takes text input and returns a list of tokens. The skeleton of our lex
function thus should look like this:
var lex = function (input) {
var tokens = [];
return tokens;
};
We have several different types of tokens. There are operators, which are defined by the set of characters +-*/^%=(),
there are numbers composed of numeral digits, there is whitespace, which our lexer will ignore, and there are identifiers, which will be defined as any string of characters that does not contain operators, digits, or whitespace.
var isOperator = function (c) { return /[+\-*\/\^%=(),]/.test(c); },
isDigit = function (c) { return /[0-9]/.test(c); },
isWhiteSpace = function (c) { return /\s/.test(c); },
isIdentifier = function (c)
{ return typeof c === "string" && !isOperator(c) && !isDigit(c) && !isWhiteSpace(c); };
We will create a simple loop to scan the input text. We will have a variable c
which corresponds to the current character, create a function called advance
to step forward one character in the input and return the next character, and a function called addToken
which appends a token to the list of tokens.
var tokens = [], c, i = 0;
var advance = function () { return c = input[++i]; };
var addToken = function (type, value) {
tokens.push({
type: type,
value: value
});
};
while (i < input.length) {
c = input[i];
}
If c
is whitespace, we just ignore it and move on. If c
is an operator, add an operator token to the list and move on.
if (isWhiteSpace(c)) advance();
else if (isOperator(c)) {
addToken(c);
advance();
}
For simplicity, we will define a number as a series of digits, optionally followed by a decimal point and another series of digits.
else if (isDigit(c)) {
var num = c;
while (isDigit(advance())) num += c;
if (c === ".") {
do num += c; while (isDigit(advance()));
}
num = parseFloat(num);
if (!isFinite(num)) throw "Number is too large or too small for a 64-bit double.";
addToken("number", num);
}
Every other character is an identifier.
else if (isIdentifier(c)) {
var idn = c;
while (isIdentifier(advance())) idn += c;
addToken("identifier", idn);
}
And, just in case something weird gets thrown at us:
else throw "Unrecognized token.";
After we're done scanning the input, we'll add a token to demarcate the end. Then we're done.
addToken("(end)");
return tokens;
Step 2: The Parser
There are lots of different strategies for writing a parser. One of the most common is writing a Backus-Naur grammar and using recursive descent. In this article, we will be using a somewhat less common strategy called operator-precedence parsing, using techniques described Douglas Crockford's Top Down Operator Precedence article.
Operator precedence parsing begins with the following process:
- Associate every operational token with a left binding power, and an operational function.
- If the operator manipulates tokens to its left (such as
+
), associate it with a left denotative function (hereafter abbreviated as led
). If the operator does not manipulate the tokens on its left (such as the unary -
), associate it with a null
denotative function (hereafter abbreviated as nud
). Identifiers and numbers also have a nud
function associated with them.
After this is done, it can start to generate the parse tree. To show how this works, I'll use the arithmetic expression below as an example.
12 + 4 / 6
The parser function starts by executing the nud
of the first token (12
), which returns itself.
{ type: "number", value: 12 }
This become the left hand side that we pass into the led
of the next token (+
), which will recursively call the parser function to parse the remaining tokens.
{
type: "+",
left: { type: "number", value: 12 },
right:
}
We start over by calling the nud
of the first of the remaining tokens (4
), which will return itself.
{ type: "number", value: 4 }
This becomes the left hand side that we pass into the led of the next token (/), which will recursively call the parser function to parse the remaining tokens.
{
type: "+",
left: { type: "number", value: 12 },
right: {
type: "/"
left: { type: "number", value: 4 },
right:
}
}
We call the nud
of the first of the remaining tokens (6
), which will return itself. This is the end of the list of tokens, so the parse tree is complete.
{
type: "+",
left: { type: "number", value: 12 },
right: {
type: "/"
left: { type: "number", value: 4 },
right: { type: "number", value: 6 }
}
}
If we now switch the +
and /
, we will see how binding power comes into play. /
has a higher binding power than +
, so it will bind more tightly to the operands around it.
12 / 4 + 6
We start out the same way as before:
{
type: "/",
left: { type: "number", value: 12 },
right:
}
But this time, after we call the nud
of the 4
, we will look ahead one token, and see that +
has lower precedence than the /
. This means that we will stick the 4
in the right hand side, and then pass the entire current tree into the led of +
, which makes us end up with this parse tree:
{
type: "+",
left: {
type: "/",
left: { type: "number", value: 12 },
right: { type: "number", value: 4 }
},
right: { type: "number", value: 6 },
}
Now that we have some understanding of the parsing process, we can begin writing our parser. The parser accepts the tokens that the lexer produced, and returns a parse tree, so the skeleton of our parse
function will look like this:
var parse = function (tokens) {
var parseTree = [];
return parseTree;
};
We need to have some sort of symbol table that associates symbol with a binding power, nud or led, and we need a function that will associate a token with the corresponding symbol.
var symbols = {},
symbol = function (id, nud, lbp, led) {
var sym = symbols[id] || {};
symbols[id] = {
lbp: sym.lbp || lbp,
nud: sym.nud || nud,
led: sym.led || led
};
};
var interpretToken = function (token) {
var sym = Object.create(symbols[token.type]);
sym.type = token.type;
sym.value = token.value;
return sym;
};
We need some way to see what the current token is and a way to advance to the next token.
var i = 0, token = function () { return interpretToken(tokens[i]); };
var advance = function () { i++; return token(); };
Now we can write an expression function which will generate the parse tree of an expression according to the way it was described above:
var expression = function (rbp) {
var left, t = token();
advance();
if (!t.nud) throw "Unexpected token: " + t.type;
left = t.nud(t);
while (rbp < token().lbp) {
t = token();
advance();
if (!t.led) throw "Unexpected token: " + t.type;
left = t.led(left);
}
return left;
};
We can now create the infix and prefix functions, which we will be able to use to define operators:
var infix = function (id, lbp, rbp, led) {
rbp = rbp || lbp;
symbol(id, null, lbp, led || function (left) {
return {
type: id,
left: left,
right: expression(rbp)
};
});
},
prefix = function (id, rbp) {
symbol(id, function () {
return {
type: id,
right: expression(rbp)
};
});
};
Now we can define all of arithmetic operators declaratively. Note that the exponentiation operator (^
) has a right binding power that is smaller than its left binding power. This is because exponentiation is right associative.
prefix("-", 7);
infix("^", 6, 5);
infix("*", 4);
infix("/", 4);
infix("%", 4);
infix("+", 3);
infix("-", 3);
There are some operators that just exist as separation or end demarcation. They are not prefixes or infixes so it is sufficient to add them to the symbol table.
symbol(",");
symbol(")");
symbol("(end)");
The parentheses nud
just returns what is inside of it, and the number nud
just returns itself.
symbol("(", function () {
value = expression(2);
if (token().type !== ")") throw "Expected closing parenthesis ')'";
advance();
return value;
});
symbol("number", function (number) {
return number;
});
The identifier nud
and the =
operator's nud
are more complicated because they have context-dependent behaviour.
symbol("identifier", function (name) {
if (token().type === "(") {
var args = [];
if (tokens[i + 1].type === ")") advance();
else {
do {
advance();
args.push(expression(2));
} while (token().type === ",");
if (token().type !== ")") throw "Expected closing parenthesis ')'";
}
advance();
return {
type: "call",
args: args,
name: name.value
};
}
return name;
});
infix("=", 1, 2, function (left) {
if (left.type === "call") {
for (var i = 0; i < left.args.length; i++) {
if (left.args[i].type !== "identifier") throw "Invalid argument name";
}
return {
type: "function",
name: left.name,
args: left.args,
value: expression(2)
};
} else if (left.type === "identifier") {
return {
type: "assign",
name: left.value,
value: expression(2)
};
}
else throw "Invalid lvalue";
});
Now that all the hard stuff is out of the way, we can just put on the finishing glue.
var parseTree = [];
while (token().type !== "(end)") {
parseTree.push(expression(0));
}
return parseTree;
Step 3: The Evaluator
The evaluator accepts the parse tree generated by the parser, evaluates each item, and produces the evaluated output. The skeleton of our evaluate function will look like this:
var evaluate = function (parseTree) {
var parseNode = function (node) {
};
var output = "";
for (var i = 0; i < parseTree.length; i++) {
var value = parseNode(parseTree[i]);
if (typeof value !== "undefined") output += value + "\n";
}
return output;
};
It is straightforward to define the behavior of the operators and all of the predefined variables and functions:
var operators = {
"+": function(a, b) {
return a + b;
},
"-": function(a, b) {
if (typeof b === "undefined") return -a;
return a - b;
},
"*": function(a, b) {
return a * b;
},
"/": function(a, b) {
return a / b;
},
"%": function(a, b) {
return a % b;
},
"^": function(a, b) {
return Math.pow(a, b);
}
};
var variables = {
pi: Math.PI,
e: Math.E
};
var functions = {
sin: Math.sin,
cos: Math.cos,
tan: Math.cos,
asin: Math.asin,
acos: Math.acos,
atan: Math.atan,
abs: Math.abs,
round: Math.round,
ceil: Math.ceil,
floor: Math.floor,
log: Math.log,
exp: Math.exp,
sqrt: Math.sqrt,
max: Math.max,
min: Math.min,
random: Math.random
};
var args = {};
Now we can put in the meat of our evaluator, and we're done. The parseNode
function recursively traverses and evaluates the parse tree.
var parseNode = function (node) {
if (node.type === "number") return node.value;
else if (operators[node.type]) {
if (node.left) return operators[node.type](parseNode(node.left), parseNode(node.right));
return operators[node.type](parseNode(node.right));
}
else if (node.type === "identifier") {
var value = args.hasOwnProperty(node.value) ? args[node.value] : variables[node.value];
if (typeof value === "undefined") throw node.value + " is undefined";
return value;
}
else if (node.type === "assign") {
variables[node.name] = parseNode(node.value);
}
else if (node.type === "call") {
for (var i = 0; i < node.args.length; i++) node.args[i] = parseNode(node.args[i]);
return functions[node.name].apply(null, node.args);
}
else if (node.type === "function") {
functions[node.name] = function () {
for (var i = 0; i < node.args.length; i++) {
args[node.args[i].value] = arguments[i];
}
var ret = parseNode(node.value);
args = {};
return ret;
};
}
};
Putting It All Together
Now we can wrap it all up and put the thee pieces together in one single calculate
function.
var calculate = function (input) {
try {
var tokens = lex(input);
var parseTree = parse(tokens);
var output = evaluate(parseTree);
return output;
} catch (e) {
return e;
}
};
What To Do Next
There are many things you could add to the language to make it more useful or just to see how things work. Some additions would be fairly easy, some could be very hard.
Easy Stuff
- Play around with adding your own predefined functions.
- Fiddle with the binding power numbers to see how that changes the way expressions are evaluated.
- Allow scientific notation in number literals or make it possible to use binary or hexidecimal number literals.
Intermediate Stuff
- Allow manipulation of more types of data, such as strings, booleans and arrays.
- Allow functions to have multiple statements and conditionally return values.
- Replace the evaluator with a compiler or transpiler that takes the parse tree and outputs code in another language.
- Make a package manager that allows you to import code from other files.
Harder Stuff
- Implement optimizations to allow calculations to be performed more quickly.
- Make function first-class citizens and allow closures.
- Make it so that all the data is evaluated lazily.
If you know how to do some of this stuff, you are prepared to start making interpreters and compilers for your own domain specific languages.
That's All, Folks!
If there is an error in this article or you see some way to make either the code or the explanations of the code easier to understand, let me know, and I can update it accordingly.
History
- 30th October, 2014: Initial version