Writing Your First Domain Specific Language, Part 1 of 2

26 Apr 2009, CPOL, 8 min read
A guide to writing a compiler in .NET for beginners, using Irony.

Introduction

This two-part article is aimed at experienced C# .NET programmers who wish to write their own little computer languages (see part two here). Historically, this has been reasonably difficult, because it required in-depth knowledge of compilation theory and/or the use of one or more tools, each with its own learning curve. Recently though, there has been something of a revolution in this area, with tools being developed which greatly simplify the writing of compilers. The Irony Compiler Construction Toolkit for .NET is used in this tutorial because it requires no configuration files (just drop the Irony DLL into your project) and it simplifies many aspects of compiler construction.

The Sample Problem

Imagine you are writing a content-management system for websites where your users can upload their own images. Now, imagine you have a client who wants to upload a large image, and then display only part of the image, but have the image scrolling around, like so:

Hong_Kong_at_night.jpg

And oh, here's the next thing: they want to be able to set up the way the "camera" scrolls across the image, and change it themselves. So, it is decided that the users will type instructions into a textbox to control the scrolling of the images. The language will look something like this:

Set camera size: 400 by 300 pixels.
Set camera position: 100, 100.
Move 200 pixels right.
Move 100 pixels up.
Move 250 pixels left.
Move 50 pixels down.

It's a far cry from C#, but having a language which basically reads like English will (hopefully) be appreciated by the users. This language is so simple that you could easily write your own parser that extracts the data from the string; however, as soon as the language gets a little more complicated (for example, if you introduce "if" statements and variables, as in part two of this article), writing a bona fide compiler will pay dividends.
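To make the "write your own parser" option concrete, here is a minimal sketch of a hand-written parser for a single "Move" line, using a regular expression. The MoveLineParser class and its TryParse method are hypothetical names invented for this illustration; they are not part of Irony or of the demo project.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of Irony or the article's demo project.
public static class MoveLineParser
{
    // Matches e.g. "Move 200 pixels right." in any letter case,
    // allowing any whitespace between the words.
    private static readonly Regex MovePattern = new Regex(
        @"^\s*move\s+([0-9]+)\s+pixels\s+(up|down|left|right)\s*\.\s*$",
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

    // Returns true and fills the out parameters when the line is a valid move command.
    public static bool TryParse(string line, out int pixels, out string direction)
    {
        pixels = 0;
        direction = null;

        Match match = MovePattern.Match(line);
        if (!match.Success)
        {
            return false;
        }

        // int.TryParse guards against numbers too large for an int.
        if (!int.TryParse(match.Groups[1].Value, out pixels))
        {
            return false;
        }

        direction = match.Groups[2].Value.ToLowerInvariant();
        return true;
    }
}
```

This works for one fixed line shape, but every new statement type needs another pattern, and nesting (such as an "if" containing commands) is awkward to express this way. That is the point at which a grammar-driven approach starts to pay off.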

So, how do we write a compiler for this language? Well, the first step is to formally describe the language.

Defining the Grammar

Looking at the language example above, we can say that a program starts with a line to set the camera size, then a line to set the initial position, and then one or more lines to move the camera. Each line ends with a '.' character. Most of the text is just fixed keywords (such as "Move" and "camera"), with the exception of numbers and directions ("up", "down", "left", "right"). The language can be written in upper-case or lower-case (or a combination).

The paragraph above explains the grammar; however, it is always a good idea to write this description in a more formal manner. A very common way to describe a grammar is using Backus–Naur Form, and the following is a slightly non-standard version of BNF describing the program:

<Program> ::= <CameraSize> <CameraPosition> <CommandList>
<CameraSize> ::= "set" "camera" "size" ":" <number> "by" <number> "pixels" "."
<CameraPosition> ::= "set" "camera" "position" ":" <number> "," <number> "."
<CommandList> ::= <Command>+
<Command> ::= "move" <number> "pixels" <Direction> "."
<Direction> ::= "up" | "down" | "left" | "right"

You can read "::=" as something like "is made up of", and "|" as "or". So, for example, a "Direction" is made up of one of four strings.

If you have written BNF previously, you may be surprised to see "<Command>+", which borrows the regular-expression convention for "one or more commands". This is not part of standard BNF; however, Irony lets you express it directly (I will not go into the more complicated traditional way to achieve the same thing).

Parsing a valid program with this grammar always produces a tree with a single root. For any (valid) program that we input into the compiler, a tree will be created with "Program" as the root node. This will have three children: "CameraSize", "CameraPosition", and "CommandList". This continues down the tree until the leaf nodes, which are the keywords, numbers, and directions. The most important job of the compiler, then, is to convert the source code into a tree called an "Abstract Syntax Tree" (AST). So, based on the grammar above, with the following program...

Set camera size: 400 by 300 pixels.
Set camera position: 100, 100.
Move 200 pixels right.
Move 100 pixels up.

...we should get the following Abstract Syntax Tree:

CameraTree.jpg

Note that the blue nodes are called "Non-terminals" whereas the orange nodes are called "Terminals". The terminals correspond to the actual code the user has entered. While conventionally terminals such as "set", "camera", and "size" would appear as children of the "CameraSize" node, when using Irony you can leave these unneeded terminals out of the tree (more on this later).
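If it helps to see the same tree as data, the following sketch builds it by hand from plain C# objects. TreeNode and SampleTree are hypothetical stand-ins written for this illustration; Irony supplies its own node types. The sketch also assumes that keywords are left out and that each direction appears directly as a leaf, which is how the code later in this article treats it.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in for a parse tree node; Irony supplies its own node types.
public sealed class TreeNode
{
    public string Name { get; }
    public List<TreeNode> Children { get; } = new List<TreeNode>();

    public TreeNode(string name, params TreeNode[] children)
    {
        Name = name;
        Children.AddRange(children);
    }

    // A node without children is a terminal (leaf).
    public bool IsTerminal => Children.Count == 0;

    // Renders the tree as indented text, one node per line.
    public string Render(int depth = 0)
    {
        var text = new StringBuilder();
        text.Append(' ', depth * 2).Append(Name).Append('\n');
        foreach (TreeNode child in Children)
        {
            text.Append(child.Render(depth + 1));
        }
        return text.ToString();
    }
}

public static class SampleTree
{
    // Builds the tree for the four-line sample program, keywords omitted.
    public static TreeNode Build()
    {
        return new TreeNode("Program",
            new TreeNode("CameraSize", new TreeNode("400"), new TreeNode("300")),
            new TreeNode("CameraPosition", new TreeNode("100"), new TreeNode("100")),
            new TreeNode("CommandList",
                new TreeNode("Command", new TreeNode("200"), new TreeNode("right")),
                new TreeNode("Command", new TreeNode("100"), new TreeNode("up"))));
    }
}
```

Calling SampleTree.Build().Render() gives an indented outline with "Program" on the first line and its three children one level below it.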

Once we have a tree, we can then generate the code to execute the program. So, how do we do all this with Irony?

Writing the Grammar in Irony

After you have downloaded Irony, create a new project in Visual Studio, and add a reference to Irony.dll. Then, add a new class to define the grammar: CameraControlGrammar.cs will do.

After referencing the correct Irony namespace (using Irony.Compiler;), make your class inherit from the abstract "Grammar" class:

C#
public class CameraControlGrammar : Grammar

The grammar of the language is defined in the constructor. The first part is to do some preparation for the actual grammar, and set a few options, for example, making it case-insensitive:

C#
this.CaseSensitive = false;

Then, define all the terminals and non-terminals needed:

C#
var program = new NonTerminal("program");
var cameraSize = new NonTerminal("cameraSize");
var cameraPosition = new NonTerminal("cameraPosition");
var commandList = new NonTerminal("commandList");
var command = new NonTerminal("command");
var direction = new NonTerminal("direction");
var number = new NumberLiteral("number");

And finally, specify which non-terminal is the root node in the abstract syntax tree:

C#
this.Root = program;

OK, with the ingredients all cut up and prepared, we can start cooking. Defining the grammar in Irony is surprisingly similar to writing it in BNF:

C#
// <Program> ::= <CameraSize> <CameraPosition> <CommandList>
program.Rule = cameraSize + cameraPosition + commandList;

// <CameraSize> ::= "set" "camera" "size" ":" <number> "by" <number> "pixels" "."
cameraSize.Rule = Symbol("set") + "camera" + "size" + ":" + 
                  number + "by" + number + "pixels" + ".";

// <CameraPosition> ::= "set" "camera" "position" ":" <number> "," <number> "."
cameraPosition.Rule = Symbol("set") + "camera" + "position" + 
                      ":" + number + "," + number + ".";

// <CommandList> ::= <Command>+
commandList.Rule = MakePlusRule(commandList, null, command);

// <Command> ::= "move" <number> "pixels" <Direction> "."
command.Rule = Symbol("move") + number + "pixels" + direction + ".";

// <Direction> ::= "up" | "down" | "left" | "right"
direction.Rule = Symbol("up") | "down" | "left" | "right";

Irony has overloaded the "+" and "|" operators so that you can join together the different parts of the grammar in a very natural way. The only exception is where the first part of the rule is a string: C# treats "+" between two plain strings as ordinary string concatenation, so, in general, you should wrap the first string in a Symbol() method call so that the C# compiler knows that the string is actually a special Irony construct. Also note that, rather than having "set camera size:" in a single string, we add each word separately, which allows any whitespace (including new lines) to go between each word.
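The reason the first string matters can be shown with a tiny operator-overloading sketch. The Expr class and the Sym helper below are hypothetical and written only for this illustration; they imitate the general technique and are not Irony's actual implementation.

```csharp
using System;

// Hypothetical miniature of an operator-based grammar API; this is NOT Irony's code.
public sealed class Expr
{
    public string Text { get; }

    public Expr(string text)
    {
        Text = text;
    }

    // Lets a plain string stand in wherever an Expr is expected.
    public static implicit operator Expr(string text)
    {
        return new Expr(text);
    }

    // Sequence: "a then b".
    public static Expr operator +(Expr left, Expr right)
    {
        return new Expr(left.Text + " " + right.Text);
    }

    // Alternative: "a or b".
    public static Expr operator |(Expr left, Expr right)
    {
        return new Expr(left.Text + " | " + right.Text);
    }
}

public static class OperatorDemo
{
    // Plays the role of the Symbol() call: it turns the first string into an Expr.
    public static Expr Sym(string text)
    {
        return new Expr(text);
    }

    public static string Wrapped()
    {
        // The first operand is an Expr, so every "+" is the overloaded operator
        // and the three words stay separate parts.
        Expr rule = Sym("set") + "camera" + "size";
        return rule.Text;
    }

    public static string Unwrapped()
    {
        // Both operands are strings, so C# concatenates them first
        // and only then converts the single result to an Expr.
        Expr rule = "set" + "camera";
        return rule.Text;
    }
}
```

In this sketch, Wrapped() yields three separate words, while Unwrapped() silently yields the single word "setcamera". With "|" the mistake is louder: C# has no "|" operator for two strings, so an unwrapped alternative does not compile at all.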

Finally - and this is optional, but recommended - you can specify certain strings as "punctuation". On each line, there are lots of keywords that we don't actually care about once we are using the generated AST (for example, on each "move" line, we only care about the number of pixels and the direction). We can tell Irony that we don't want the AST cluttered with keywords by registering all the "punctuation":

C#
this.RegisterPunctuation("set", "camera", "size", ":", 
        "by", "pixels", ".", "position", ",", "move");

This means a "Command" will have only two children (number and direction) instead of five children ("move", number, "pixels", direction, and ".").

Compiling the Source Code

We are now ready to parse some code. This is done by passing an instance of your grammar to a LanguageCompiler object and getting a reference to the root node of the AST:

C#
CameraControlGrammar grammar = new CameraControlGrammar();
LanguageCompiler compiler = new LanguageCompiler(grammar);
AstNode program = compiler.Parse("the source code as a string");

Assuming the source code successfully compiled, you can run this in debug mode and browse through the ChildNodes property of each node in the tree to view it. (See the attached project to see one way to handle invalid source code.)

Using an Abstract Syntax Tree

An elegant way to generate code from an AST is to write a class for each non-terminal node in the tree, so that each node generates the piece of code that it is responsible for. This is a very good pattern to use, especially when dealing with more complex languages; however, to keep things relatively simple, I will act directly on the returned AST (for an example which uses classes to generate code, see JSBasic which converts from BASIC to JavaScript).

Examining the tree above, we know that the first child of the root node is the camera size declaration, which should itself have two children: the width and the height of the camera:

C#
AstNode cameraSizeNode = program.ChildNodes[0];
Token widthToken = (Token)cameraSizeNode.ChildNodes[0];
int width = (int)widthToken.Value;
Token heightToken = (Token)cameraSizeNode.ChildNodes[1];
int height = (int)heightToken.Value;

Note that the actual number entered is considered a Token (in fact, each piece of text in the source code is a Token); however, not all ChildNodes are Tokens (e.g., "CommandList" is a ChildNode of Program, but it is a non-terminal, so not a Token). This is why we must cast the children to Token objects before accessing their values.

The same approach is used to get the initial position. These camera sizes and positional values are used in the demo project to create some JavaScript which initialises the photo. (The actual JavaScript is not explained here as it is outside the scope of this article.)

Next, we know that the third child of the program is a list of Command nodes, which we can then loop through:

C#
// loop through the movement commands
foreach (AstNode commandNode in program.ChildNodes[2].ChildNodes)
{
    // get the number of pixels to move
    Token pixelsToken = (Token)commandNode.ChildNodes[0];
    int pixelsToMove = (int)pixelsToken.Value;
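    // get the direction, lower-cased independently of the current culture
    // (ToLower() would turn "RIGHT" into "rıght" under a Turkish culture)
    Token directionToken = (Token)commandNode.ChildNodes[1];
    string directionText = directionToken.Text.ToLowerInvariant();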
}

Using the pixelsToMove and directionText values, JavaScript is generated for each command in the program. Again, this is left out as it isn't important in terms of how to write your language.
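As a rough illustration of what can be done with these two values, the sketch below computes the camera position after a move instead of generating JavaScript. CameraMath is a hypothetical helper, not code from the demo project, and it assumes screen coordinates in which x grows to the right and y grows downwards.

```csharp
using System;

// Hypothetical helper; the article's demo project generates JavaScript instead.
public static class CameraMath
{
    // Returns the camera position after one move command.
    // Assumes screen coordinates: x grows to the right, y grows downwards.
    public static (int X, int Y) Move(int x, int y, int pixelsToMove, string directionText)
    {
        switch (directionText)
        {
            case "up": return (x, y - pixelsToMove);
            case "down": return (x, y + pixelsToMove);
            case "left": return (x - pixelsToMove, y);
            case "right": return (x + pixelsToMove, y);
            default:
                throw new ArgumentException(
                    "Unknown direction: " + directionText, nameof(directionText));
        }
    }
}
```

Under that coordinate assumption, the six-line sample program from the start of the article begins at (100, 100) and, after its four moves, ends at (50, 50).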

And We're Done

And that's all there is to it, really. Of course, once you introduce variables, branches, conditional statements, etc., you would probably want to look into creating your own AST node classes and look at the AstNode.Evaluate method; however, that is for another tutorial, along with many other Irony features which have been left out of this one. Hopefully though, this should give you enough information to write your own simple little languages.

And don't forget to try the online demo.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)