(untagged)

A simple hand-coded script parser

arussell

0.00/5 (No votes)

30 Nov 2004

A simple script parser and engine to demonstrate coding a parser, and recursive descent statement evaluation.

Download source files - 33.2 Kb

Introduction

This is the third parser I have written, and it sucks somewhat less than the first two. The goal of this one was an easily expandable parser that would be easy to understand by other programmers, and would be easy to embed into other projects. I also wanted it to be simple enough for an intermediate programmer to understand how the code works. I�ve been partially successful.

It still isn�t that easy to add more operators or built in functions, but the code is quite a bit more transparent than my previous efforts. Why would I want to do this when there are very good parsers out there, and tools for generating C code for parsers given a description of a language? Those tools are often a bit much for someone just getting started with writing a parser to work with. This code is small enough to see how a simple parser works, yet has enough features to demand a better design than a very simple parser. It is suitable for students to study how a parser can be put together.

Features

User defined functions
Supports recursion and nested calls of user defined functions
Global and local variables
Rich set of operators
Must declare all variables
Generates simple byte code.
Small, easy to insert into existing projects.

Deficiencies

No headers � cannot call a function defined later in the script. A function can call itself.
Cannot call a function in complex statements. Function values can be assigned to variables, but must be the only term of the statement, e.g.: y=foo{}; is OK, y=x + foo{}; is not OK.
Uses a simple recursive descent method to evaluate statements. This makes it easy to see how this part of the parser works, but it generates inefficient byte code. Having the parser generate reverse Polish statements would have made the engine more efficient and flexible.
Arrays limited to one dimension.
Functions limited to 20 parameters.
Functions cannot have array parameters

Notes on the Parser and Engine

This is a one pass parser. It reads the whole script text into memory and parses it from beginning to end in one pass. There is one extra pass to process the directives before the main parser pass. Except for comments, white-space is ignored.

The parser reads in one �token� at a time. A token is one valid piece of text, for example, "int", "+", "==", a variable name, or a literal value. The parser is hard-coded to check syntax. For example, if it encounters a literal string that is an integer, it expects an assignment to follow it. This logic is hard-coded, not driven by a description of the language.

Parsing Variables

When variables are declared, their type, name, and location are saved. When variables are later encountered in statements, the parser generates code that uses the variables type and location to get or set its value.

Global variables start at the beginning of the byte stream and simply get added as more global variables are declared. In the byte code, a global variable will be a byte indicating the type (int, float, etc�) followed by an integer that is the variable's offset from the start of the byte code.

Local variables are quite different. The code supports recursion, so local variables' location must change depending on how many function calls are nested. When a function is called, a pointer to the current local variable stack is adjusted. Local variables are always located relative to the current position of this pointer, not to a fixed location, like global variables. When a function call is parsed, the current local variable pointer is increased by the size of the function�s local variables, and on exit, this same amount is removed from the local variable pointer. Local variables are then retrieved relative to this moving pointer.

If / when

This is a single pass parser. When an if is parsed, it is replaced by code that evaluates the logic statement, then jumps past the end of the if block if the logical statement is false. The parser does not know where to jump to yet, so we leave the goto�s address blank, and save where this blank address is in a temporary stack that only exists while parsing. Then, when we reach an end statement, we can fill in the missing goto address, and remove the goto from the temporary stack.

while, break, and else statements are handled in a similar fashion. Using a stack of addresses allows the parser to support nested flow control.

Start Up Code

The parser is hard-coded to run the user function "main" first. A few bytes of code are placed in the start of the byte stream to initialize the data pointers and call the main function.

What I Would Have Done Differently

Despite this being the third parser I have written over the last decade, I made a number of poor design decisions that caused me to have to re-write large portions of the parser. My first attempt at handling variables worked fine for globals, but did not handle local variables at all. Re-Write. The method I used to evaluate statements worked fine as long as there were no function calls. When I added functions calls, I had to do another re-write, and the parser is still limited to having function calls only by themselves in statements.

If I have time to re-do this, I would create a much more firm design document � especially for the byte code that the parser emits. I started out with only a list of features and a vague design in my head. This is not ideal programming practice! I would have worked out a much better and general method for parsing and evaluating complex statements if I were to start over. The method I used, while easy to implement initially without worrying about function calls, has a number of limits and workarounds to handle function calls. I would also carefully study how existing compilers handle variables and function calls. A lot of wisdom exists in how C and other simple languages handle these tasks.

Evaluating Statements

The parser does not do much re-ordering of statements. It does change how assignments are done, but the statement on the right of an assignment is pretty much just replaced with byte code of the same form. At run time, statements are evaluated using a simple recursive descent method. This is an easy method to hand code, but generates less than optimal byte code. A more ambitious parser might generate reverse-Polish statements, or otherwise re-order the statement so that it can be run faster with a simpler statement evaluator.

Let�s look at a simple statement:

Plain Text

4 + 3 * 5

To be evaluated correctly, the 3*5 must be done first. Recursive descent does this by scanning the statement for higher precedent operators, evaluating them, and replacing the evaluated terms with the result. This scanning and replacing is repeated until a single value is left.

Plain Text

1^st pass 
4+15
2^nd pass
19 � all done

A second example
4 + (3*2+4) � 2
1^st pass
4+(6+4)-2
2^nd
4+(10)-2
3^rd
4+10-2
4^th
14-2
5^th
12 � all done

The byte code engine actually copies over the result of a pass overtop of the original statement (padding with null operators where required) until only a literal value is left. A copy of the statement is evaluated as the statement actually gets destroyed when parsed. This destructive evaluation is another reason why reordering statements is a better idea. A copy is required as the script does support looping.

In other parsers, I converted the statement to a linked list of statements which made the code easier to read, but even slower. The linked list was then evaluated, and evaluated nodes were replaced and deleted as required.

In the function evaluate_int_exp(), you will see this implemented by scanning for higher-precedent statements, then calling evaluate_in_exp() on a sub-part of the statement recursively until the whole statement is reduced to a single literal value.

Data Types

Integers, 1, 2, 3
Character �a�, �b�, �c�
Float, 1.2, 1.0, 9.9
Arrays
Integer
Character "abcded", "\n", "\t"
Float
Single dimension only
pointers? � no, but can assign string to char arrays

Declare like C.

int
float
char
int [ ]
float [ ]
char [ ]

Operators

Operator	Precedence	Integer Function	Float Function	Character String Function
`[ ]`	0	Array element	Array element	Array element
`=`	0	Assignment	Assignment	Assignment
`( )`	1	Change precedence	Change precedence	NA
`!`	2	Logical NOT	Logical NOT	NA
`-`	2	Negate	Negate	NA
`~`	2	Bit-wise NOT	NA	NA
`*`	3	Multiple	Multiple	NA
`/`	3	Divide	Divide	NA
`%`	3	Modulo DIV	Modulo DIV	NA
`+`	4	Add	Add	Concatenate
`-`	4	Subtract	Subtract	NA
`^`	5	Bit-wise exclusive OR	NA	NA
`&`	5	Bit-wise AND	NA	NA
`\|`	5	Bit-wise OR	NA	NA
`==`	6	Test for equality	Test for equality	Test for equality
`<=`	6	Test for less than equal	Test for less than equal	Less than or equal
`<`	6	Less than	Less than	Less than
`>=`	6	Greater than or equal	Greater than or equal	Greater than or equal
`>`	6	Greater than	Greater than	Greater than
`&&`	6	Logical AND	Logical AND	NA
`\|\|`	6	Logical OR	Logical OR	NA
`//`	NA	Comment	Comment	Comment
`atoi`	2	NA	NA	Convert to `int`
`atof`	2	NA	NA	Convert to `float`
`itoa`	2	Convert to `str`	NA	NA
`ftoa`	2	NA	Convert to `str`	NA

Flow control

while
if
if else
break
continue
end

Built-In Functions

print

print followed by comma delimited list of statements. E.g.:

print "hi world ", x, " that was x\n";

User Function Definition

func type name {parms};
end

Return type must be int or float. Arrays cannot be parameters. Maximum of 20 parameters. E.g.:

func int fubar {int x, int y, float b};
  body;
end;

Calling a function:

fubar {1, 2, 3, aa};

x=fubar {1, 2, 3, aa};

Cannot put a function in a complex statement:

X=fubar(1,2,3, aa} + 99;  // NOT allowed

Scope

global � all variables declared outside of functions.
local � variables declared within a function.
Parameters are treated like local variables.

Directives

Directives must go in comments. They are used to set the size reserved for variables and the call stack.

// $globalsize = size in bytes of static global vars (not a stack)
// $localsize = size in bytes of local var stack
// $parmsize = size in bytes of parameter stack
// $stacksize = size in bytes of code stack

Default is 64Kb for each.

Statements

Statements can contain literals, vars, and operators. Statements are always ended by a semi-colon �;�.

Statements with functions can have no other terms, i.e., many terms, or a single function call. E.g.:

i=1+j*5;

i=f1{};

Array index may be a statement.

Sample Script

Shrink ▲

// test file


int x;
int i,j,k;
int ii[100],ij, ik;
float f, g;
float ff[10];
char s [ 100 ] , b, #9; #9; c;
char s2[100];
// ---------------------------------------------------------

func int test{};
  int y;
  print "at top of test\n";
  s[0]='a';
  s[1]=0;
  print "s = ", s, " s[0] = ", s[0], "\n";
  s="I am the cat " + "Hi World\n";
  s2=s;
  print "s = ", s, " s2 = ", s2, " newline\n";
  print "Hi world\n";
  print "Hi world tab\t cr\r world\n";
  x=43;
  y=5;
  ii[34]=x + y;
  ii[x+y]=5;
  print "ii[34] = ", ii[34], "\n";
  print ">x +10 = < ", x + 10, " y="y, "ii[2]=", ii[2], " ii[34] =", ii[34], "\n";
  print "(4-2)*5 = ", (4-2) *5, "\n";
  b='a';
  print "b = ", b, "\n";
  x=5;
  y=10;
  print "x=", x, " y=",y,"\n";
  if 1;
    print "before if break\n";
    break;
    print "after if break\n";
  end
  while x > 0;
    x=x-1;
    print "x=", x, "\n";
    // continue;

    if x > 2;
      print " x > 2\n";
    else
      print " x is not > 2\n";
    end
    print "before the break\n";
    break;
    print "past the break\n";
  end
  print "at the end, past the while\n";
end

// ----------------------------------------------------

func int foo1{};
  set 5;
end
// ----------------------------------------------------

func int foo2{int x, int y};
  set x*y;
end
// ----------------------------------------------------

func int main{};
  // locals

  int y, z;
  float m;
  print "top of main\n";
  print "call test next\n";
  test{};
  m=(1.0 + 0.5) * 20.0;
  print "m=",m,"\n";
  x=foo1{};
  y=10;
  z= (10 + y)*x;
  print "z = ", z, "\n";
  print "x=", x, " y=",y,"\n";
  while x > 0;
    x=x-1;
    if x > 2;
      print " x > 2\n";
    else
      print " x is not > 2\n";
    end
    print "x=", x, "\n";
  end
  print "at the end, past the if/while\n";
  x=5;
  y=6;
  x=foo2{x,y};
  print "x of foo2=", x, "\n";
end

More Information

The famous "Let's build a Compiler" by Jack Crenshaw.
Google keywords: Recursive descent parser "source code".
CodeProject: Spirit.
CodeProject: MFC Parser code.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

A list of licenses authors might use can be found here