Click here to Skip to main content
65,938 articles
CodeProject is changing. Read more.
Articles / Languages / ASM

How to create your own virtual machine

4.90/5 (78 votes)
10 May 2012CPOL10 min read 230.6K   7.7K  
This article takes you through a step-by-step process of creating your own virtual machine.

 

Attention People

I am working on a new version of this, a B33 virtual machine. I will post it on here when its done. This one will not have a PDF and the full tutorial will be contained in the code project article. I will make sure I let everyone know when its done! 

Introduction 

Have you ever wondered how the Microsoft .NET Framework or how a Java virtual machine works? Have you ever wondered how to create your own emulator to target a specific computer you own? If you answered yes to any of these questions, then you will find my tutorial helpful. B32 is a complete virtual machine created in C#. This is part 1 of a 2 part series. The tutorial presented here on CodeProject is a condensed version of the complete PDF tutorial. In this part of the tutorial, I will explain the fundamentals, enough to get us started, and we will code the assembler portion of our virtual machine. Part 2 will build the actual virtual machine. The tutorial starts from scratch with a new project and solution. It will teach you how to write both, an assembler for our B32 intermediate language code, and it will teach you how to write the virtual machine itself.

Intended Audience

This tutorial assumes you are pretty familiar with C# and Visual Studio 2005 or 2008. This tutorial was written with Visual Studio 2008, targeting the .NET Framework 2.0. It should also work with Visual Studio 2005. Some of the concepts introduced in this tutorial may be a little hard for a beginner C# programmer to understand. I have tried to make it as simple to understand as possible. Keep in mind that, after reading this tutorial, no one will be able to walk away from it ready to build the next Java or something. Virtual machines are very complex tasks, and even simple ones can take many years to make. If you need help or have questions, you are welcome to post them here or email me direct at icemanind@yahoo.com. If I lose any of you, I suggest reading the PDF and using it as a reference to supplement this article. The PDF goes into more details.

Virtual Machine Fundamentals

To build a virtual machine, you first need to understand what a virtual machine is. A virtual machine is a software implementation of either a real physical machine or a fictional machine. To create a virtual machine, we first need to define an intermediate language (sometimes called an assembly language). This intermediate language will be the lowest level, human readable code, that will eventually get assembled into bytecode. Bytecode consists of one or more bytes of data that is readable only by our virtual machine (in real world machines, this would be called the "machine language"). The "assembler", which converts Intermediate Language into bytecode, will be the first tool we will develop. This will be developed in this part of the tutorial.

Planning Out our Virtual Machine

Before getting our hands dirty with any code, we need to plan out our virtual machine and how it will work. I believe in trying to stick with the KISS (Keep It Simple Stupid) principle whenever possible. B32, the virtual machine created in this tutorial, is going to be a 16-bit machine. It will have access to 64K of memory, and B32 bytecode programs will have access to any of this memory ranging from $0000 - $FFFF. 64K may not seem like a lot of memory compared to today's modern computers; however, as you will see, 64K will be more than enough to produce some interesting B32 programs.

I said that programs would have access to 64K of memory; however, not all of this will be for storage. From $A000 - $AFA0, this 4K area of memory will be devoted to our B32 screen output. Screen output in our virtual machine will work very similar to older DOS programs. There will be 1 byte for the character and 1 byte for the attribute on an 80x25 screen. The attribute byte comes after the character byte, and it defines the foreground and background color of the character.

Our virtual machine will have two 8-bit registers called 'A' and 'B'. Registers are kind of like variables in programming. We can store any number between 0 and 255 in either register 'A', 'B', or both. In addition, we are going to have two 16-bit registers called 'X' and 'Y'. These registers can hold any value from 0 to 65,535. As if that's not enough, we are going to have one more 16-bit register called 'D'. This register is unique, as 'D' represents the high order bits and low order bits of registers 'A' and 'B'. In other words, if we were to store a value of $12F9 in our D register, then it would automatically change register 'A' to have the value $12 and register 'B' to have the value $F9. The opposite is also true. If we store $C1 in register 'A' and $78 in register 'B', then register 'D' would have the value $C178.

Before diving into creating our assembler, we need to define at least three mnemonics for now to make any kind of useful program. Mnemonics, by the way, are the building blocks of our intermediate language. They are human readable characters, usually 3-5 characters long, that tell the assembler what we want to do; e.g., "The verb". Most (but not all) mnemonics have an operand. An operand is some kind of data or information for the mnemonic to act on. For now, our assembler will understand the following mnemonics:

MnemonicBytecodeDescription
LDA$01Loads an 8-bit value into our 'A' register
LDX$02Loads a 16-bit value into our 'X' register
STA$03Stores the byte value loaded in the 'A' register into the address pointed to by the 'X' register
END$04Terminates execution of our program.

If you are confused by what 'STA' actually does, think of it like this. If 'X' register contained $1234 and 'A' register contained $9F, then executing 'STA' would store the value $9F at memory location $1234. The rest of the mnemonics should make sense.

Here is our first B32 assembly test file, which will compile with our assembler, once it is done:

START:
 LDA #65
 LDX #$A000
 STA ,X
 END START

You may be wondering what the pound sign '#' means? In our B32 intermediate language, the '#' will always mean "use this data", or rather, use the value immediately following the pound sign. The first line in our program is called a label. A label is a human readable point of reference in our program. Labels must start with a letter and end with a colon ':', followed by a new line. For simplicity, our assembler will not have any error checking, so we must play by some rules. All mnemonics must be preceded by one or more spaces. This is how our assembler knows whether the text is a label or a mnemonic. Labels, obviously, should not have a space preceding them. The second line assigns the value 65 to our 'A' register. The third line assigns the hexadecimal number $A000 to our 'X' register. The fourth line says to store the value of the 'A' register into the memory location pointed to by our 'X' register. In other words, store 65 at $A000. The final line in the program tells our program to end. The 'START' preceding END tells the assembler where the execution of this program should begin. In this case, it starts at the 'START' label, which is the beginning of our program.

You might be able to guess at what this will do. Remember, we said earlier that our screen will start at $A000? This program should put a capital letter 'A' (65 is ASCII for the letter 'A') in the upper left hand corner of our B32 screen. Pretty dull, I know. But it will demonstrate the workings of our assembler and virtual machine.

The last thing we need to do before we start building the assembler is to define a B32 binary file format. Our file format should start with a header with some magic numbers so that our virtual machine will know if our program is a valid B32 binary file. Our B32 binaries will start with the letters 'B32'. That will be our magic number. After that will be the 16-bit starting address. This defines where in our 64K memory our program is to be loaded. Our assembler will default our origin to $1000, but we can change that to anything. Following that will be the 16-bit execution address. This is where the execution of our binary program will start. Finally, our bytecode will follow.

Building our Assembler

You can download the source code to both the assembler and the virtual machine from up above. Keep in mind that the source code was built using the PDF tutorial. Therefore, our assembler will have a lot more functionality and mnemonics built into it than what is explained in this section. Rather than explaining a step-by-step process for building the assembler, I will just explain the important parts. If you want a step-by-step, read the PDF file attached above.

The assembler works by first reading in the source file into a string buffer, and then by opening a BinaryWriter stream, pointing to our output B32 binary program:

C#
private void btnAssemble_Click(object sender, EventArgs e)
{
    AsLength = Convert.ToUInt16(this.txtOrigin.Text, 16);
    System.IO.BinaryWriter output;
    System.IO.TextReader input;
    System.IO.FileStream fs = new System.IO.FileStream(
        this.txtOutputFileName.Text, System.IO.FileMode.Create);

    output = new System.IO.BinaryWriter(fs);

    input = System.IO.File.OpenText(this.txtSourceFileName.Text);
    SourceProgram = input.ReadToEnd();
    input.Close();

    output.Write('B');
    output.Write('3');
    output.Write('2');
    output.Write(Convert.ToUInt16(this.txtOrigin.Text, 16));
    output.Write((ushort)0);
    Parse(output);

    output.Seek(5, System.IO.SeekOrigin.Begin);
    output.Write(ExecutionAddress);
    output.Close();
    fs.Close();
    MessageBox.Show("Done!");
}

Notice our B32 magic bytes are written followed by the origin (or start address), followed by a 16-bit zero. This zero will be replaced later by the execution address. For now, we have no way of knowing what that will be. A call to Parse() is made, which is where most of our work is done. Once it's done, we seek back to that zero we placed earlier and replace it with the execution address. Finally, we close the streams and show a "Done!" box.

Our assembler works in two passes. The first pass simply scans the assembly file for all the labels and stores those labels in a hash table. This first pass is crucial in case there is a reference to a future label. The second pass does the hard part of converting the mnemonics into bytecode. Here is our Parse() function:

C#
private void Parse(System.IO.BinaryWriter OutputFile)
{
    CurrentNdx = 0;
    while (IsEnd == false)
        LabelScan(OutputFile, true);

    IsEnd = false;
    CurrentNdx = 0;
    AsLength = Convert.ToUInt16(this.txtOrigin.Text, 16);

    while (IsEnd == false)
        LabelScan(OutputFile, false);
}

This function simply goes into a loop, calling another function called LabelScan. The first loop is the first pass. After resetting the index, it goes into a second loop. This is our second pass. All LabelScan() does is detect if the first character is a space. If it is, it assumes it is not a label, and it reads in the mnemonic. This mnemonic is passed off to another function that uses a series of 'if' statements on the mnemonic, calling the appropriate function to process the mnemonic:

C#
private void ReadMneumonic(System.IO.BinaryWriter OutputFile, bool IsLabelScan)
{
    string Mneumonic = "";

    while (!(char.IsWhiteSpace(SourceProgram[CurrentNdx])))
    {
        Mneumonic = Mneumonic + SourceProgram[CurrentNdx];
        CurrentNdx++;
    }
    if (Mneumonic.ToUpper() == "LDX") InterpretLDX(OutputFile, IsLabelScan);
    if (Mneumonic.ToUpper() == "LDA") InterpretLDA(OutputFile, IsLabelScan);
    if (Mneumonic.ToUpper() == "STA") InterpretSTA(OutputFile, IsLabelScan);
    if (Mneumonic.ToUpper() == "END")
    { IsEnd = true; DoEnd(OutputFile, IsLabelScan); EatWhiteSpaces(); 
      ExecutionAddress = (ushort)LabelTable[(GetLabelName())]; return; }

}

Each of the Interpret() functions pretty much do the same thing. They read the operand, if there is one, then writes the appropriate byte code to the output file.

If you download the assembler from the link above, go ahead and try to compile our test program. Remember to precede each mnemonic with at least one space (mnemonics only, not labels). Also, remember to end the last line with a new line. Keep those two rules in mind, and the test program should compile and produce a .B32 binary file.

If you want more details on how the compiler works, reference the PDF tutorial. It goes into a lot more depth.

History

This is my first PDF tutorial. I wanted to get my feet wet a little bit and see how people responded to my writing and tutorial skills. This is only part 1 of the virtual machine tutorial. Part 2 will go into how to build the actual virtual machine. Please leave feedback, good or bad, and let me know if this was a help to you.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)