Introduction
This is the third parser I have written, and it sucks somewhat less than the first two. The goal of this one was an easily expandable parser that would be easy to understand by other programmers, and would be easy to embed into other projects. I also wanted it to be simple enough for an intermediate programmer to understand how the code works. I�ve been partially successful.
It still isn�t that easy to add more operators or built in functions, but the code is quite a bit more transparent than my previous efforts. Why would I want to do this when there are very good parsers out there, and tools for generating C code for parsers given a description of a language? Those tools are often a bit much for someone just getting started with writing a parser to work with. This code is small enough to see how a simple parser works, yet has enough features to demand a better design than a very simple parser. It is suitable for students to study how a parser can be put together.
Features
- User defined functions
- Supports recursion and nested calls of user defined functions
- Global and local variables
- Rich set of operators
- Must declare all variables
- Generates simple byte code.
- Small, easy to insert into existing projects.
Deficiencies
- No headers � cannot call a function defined later in the script. A function can call itself.
- Cannot call a function in complex statements. Function values can be assigned to variables, but must be the only term of the statement, e.g.:
y=foo{};
is OK, y=x + foo{};
is not OK.
- Uses a simple recursive descent method to evaluate statements. This makes it easy to see how this part of the parser works, but it generates inefficient byte code. Having the parser generate reverse Polish statements would have made the engine more efficient and flexible.
- Arrays limited to one dimension.
- Functions limited to 20 parameters.
- Functions cannot have array parameters
Notes on the Parser and Engine
This is a one pass parser. It reads the whole script text into memory and parses it from beginning to end in one pass. There is one extra pass to process the directives before the main parser pass. Except for comments, white-space is ignored.
The parser reads in one �token� at a time. A token is one valid piece of text, for example, "int
", "+
", "==
", a variable name, or a literal value. The parser is hard-coded to check syntax. For example, if it encounters a literal string that is an integer, it expects an assignment to follow it. This logic is hard-coded, not driven by a description of the language.
Parsing Variables
When variables are declared, their type, name, and location are saved. When variables are later encountered in statements, the parser generates code that uses the variables type and location to get or set its value.
Global variables start at the beginning of the byte stream and simply get added as more global variables are declared. In the byte code, a global variable will be a byte indicating the type (int
, float
, etc�) followed by an integer that is the variable's offset from the start of the byte code.
Local variables are quite different. The code supports recursion, so local variables' location must change depending on how many function calls are nested. When a function is called, a pointer to the current local variable stack is adjusted. Local variables are always located relative to the current position of this pointer, not to a fixed location, like global variables. When a function call is parsed, the current local variable pointer is increased by the size of the function�s local variables, and on exit, this same amount is removed from the local variable pointer. Local variables are then retrieved relative to this moving pointer.
If / when
This is a single pass parser. When an if
is parsed, it is replaced by code that evaluates the logic statement, then jumps past the end of the if
block if the logical statement is false. The parser does not know where to jump to yet, so we leave the goto
�s address blank, and save where this blank address is in a temporary stack that only exists while parsing. Then, when we reach an end
statement, we can fill in the missing goto
address, and remove the goto
from the temporary stack.
while
, break
, and else
statements are handled in a similar fashion. Using a stack of addresses allows the parser to support nested flow control.
Start Up Code
The parser is hard-coded to run the user function "main
" first. A few bytes of code are placed in the start of the byte stream to initialize the data pointers and call the main
function.
What I Would Have Done Differently
Despite this being the third parser I have written over the last decade, I made a number of poor design decisions that caused me to have to re-write large portions of the parser. My first attempt at handling variables worked fine for globals, but did not handle local variables at all. Re-Write. The method I used to evaluate statements worked fine as long as there were no function calls. When I added functions calls, I had to do another re-write, and the parser is still limited to having function calls only by themselves in statements.
If I have time to re-do this, I would create a much more firm design document � especially for the byte code that the parser emits. I started out with only a list of features and a vague design in my head. This is not ideal programming practice! I would have worked out a much better and general method for parsing and evaluating complex statements if I were to start over. The method I used, while easy to implement initially without worrying about function calls, has a number of limits and workarounds to handle function calls. I would also carefully study how existing compilers handle variables and function calls. A lot of wisdom exists in how C and other simple languages handle these tasks.
Evaluating Statements
The parser does not do much re-ordering of statements. It does change how assignments are done, but the statement on the right of an assignment is pretty much just replaced with byte code of the same form. At run time, statements are evaluated using a simple recursive descent method. This is an easy method to hand code, but generates less than optimal byte code. A more ambitious parser might generate reverse-Polish statements, or otherwise re-order the statement so that it can be run faster with a simpler statement evaluator.
Let�s look at a simple statement:
4 + 3 * 5
To be evaluated correctly, the 3*5 must be done first. Recursive descent does this by scanning the statement for higher precedent operators, evaluating them, and replacing the evaluated terms with the result. This scanning and replacing is repeated until a single value is left.
1st pass
4+15
2nd pass
19 � all done
A second example
4 + (3*2+4) � 2
1st pass
4+(6+4)-2
2nd
4+(10)-2
3rd
4+10-2
4th
14-2
5th
12 � all done
The byte code engine actually copies over the result of a pass overtop of the original statement (padding with null operators where required) until only a literal value is left. A copy of the statement is evaluated as the statement actually gets destroyed when parsed. This destructive evaluation is another reason why reordering statements is a better idea. A copy is required as the script does support looping.
In other parsers, I converted the statement to a linked list of statements which made the code easier to read, but even slower. The linked list was then evaluated, and evaluated nodes were replaced and deleted as required.
In the function evaluate_int_exp()
, you will see this implemented by scanning for higher-precedent statements, then calling evaluate_in_exp()
on a sub-part of the statement recursively until the whole statement is reduced to a single literal value.
Data Types
- Integers, 1, 2, 3
- Character �a�, �b�, �c�
- Float, 1.2, 1.0, 9.9
- Arrays
- Integer
- Character "abcded", "\n", "\t"
- Float
- Single dimension only
- pointers? � no, but can assign string to char arrays
Declare like C.
int
float
char
int [ ]
float [ ]
char [ ]
Operators
Operator |
Precedence |
Integer Function |
Float Function |
Character String Function |
[ ] |
0 |
Array element |
Array element |
Array element |
= |
0 |
Assignment |
Assignment |
Assignment |
( ) |
1 |
Change precedence |
Change precedence |
NA |
! |
2 |
Logical NOT |
Logical NOT |
NA |
- |
2 |
Negate |
Negate |
NA |
~ |
2 |
Bit-wise NOT |
NA |
NA |
* |
3 |
Multiple |
Multiple |
NA |
/ |
3 |
Divide |
Divide |
NA |
% |
3 |
Modulo DIV |
Modulo DIV |
NA |
+ |
4 |
Add |
Add |
Concatenate |
- |
4 |
Subtract |
Subtract |
NA |
^ |
5 |
Bit-wise exclusive OR |
NA |
NA |
& |
5 |
Bit-wise AND |
NA |
NA |
| |
5 |
Bit-wise OR |
NA |
NA |
== |
6 |
Test for equality |
Test for equality |
Test for equality |
<= |
6 |
Test for less than equal |
Test for less than equal |
Less than or equal |
< |
6 |
Less than |
Less than |
Less than |
>= |
6 |
Greater than or equal |
Greater than or equal |
Greater than or equal |
> |
6 |
Greater than |
Greater than |
Greater than |
&& |
6 |
Logical AND |
Logical AND |
NA |
|| |
6 |
Logical OR |
Logical OR |
NA |
|
NA |
Comment |
Comment |
Comment |
atoi |
2 |
NA |
NA |
Convert to int |
atof |
2 |
NA |
NA |
Convert to float |
itoa |
2 |
Convert to str |
NA |
NA |
ftoa |
2 |
NA |
Convert to str |
NA |
Flow control
while
if
if else
break
continue
end
Built-In Functions
print
print
followed by comma delimited list of statements. E.g.:
print "hi world ", x, " that was x\n";
User Function Definition
func type name {parms};
end
Return type must be int
or float
. Arrays cannot be parameters. Maximum of 20 parameters. E.g.:
func int fubar {int x, int y, float b};
body;
end;
Calling a function:
fubar {1, 2, 3, aa};
or
x=fubar {1, 2, 3, aa};
Cannot put a function in a complex statement:
X=fubar(1,2,3, aa} + 99;
Scope
- global � all variables declared outside of functions.
- local � variables declared within a function.
- Parameters are treated like local variables.
Directives
Directives must go in comments. They are used to set the size reserved for variables and the call stack.
= size in bytes of static global vars (not a stack)
= size in bytes of local var stack
= size in bytes of parameter stack
= size in bytes of code stack
Default is 64Kb for each.
Statements
Statements can contain literals, vars, and operators. Statements are always ended by a semi-colon �;
�.
Statements with functions can have no other terms, i.e., many terms, or a single function call. E.g.:
i=1+j*5;
OR
i=f1{};
Array index may be a statement.
Sample Script
int x;
int i,j,k;
int ii[100],ij, ik;
float f, g;
float ff[10];
char s [ 100 ] , b, #9; #9; c;
char s2[100];
func int test{};
int y;
print "at top of test\n";
s[0]='a';
s[1]=0;
print "s = ", s, " s[0] = ", s[0], "\n";
s="I am the cat " + "Hi World\n";
s2=s;
print "s = ", s, " s2 = ", s2, " newline\n";
print "Hi world\n";
print "Hi world tab\t cr\r world\n";
x=43;
y=5;
ii[34]=x + y;
ii[x+y]=5;
print "ii[34] = ", ii[34], "\n";
print ">x +10 = < ", x + 10, " y="y, "ii[2]=", ii[2], " ii[34] =", ii[34], "\n";
print "(4-2)*5 = ", (4-2) *5, "\n";
b='a';
print "b = ", b, "\n";
x=5;
y=10;
print "x=", x, " y=",y,"\n";
if 1;
print "before if break\n";
break;
print "after if break\n";
end
while x > 0;
x=x-1;
print "x=", x, "\n";
if x > 2;
print " x > 2\n";
else
print " x is not > 2\n";
end
print "before the break\n";
break;
print "past the break\n";
end
print "at the end, past the while\n";
end
func int foo1{};
set 5;
end
func int foo2{int x, int y};
set x*y;
end
func int main{};
int y, z;
float m;
print "top of main\n";
print "call test next\n";
test{};
m=(1.0 + 0.5) * 20.0;
print "m=",m,"\n";
x=foo1{};
y=10;
z= (10 + y)*x;
print "z = ", z, "\n";
print "x=", x, " y=",y,"\n";
while x > 0;
x=x-1;
if x > 2;
print " x > 2\n";
else
print " x is not > 2\n";
end
print "x=", x, "\n";
end
print "at the end, past the if/while\n";
x=5;
y=6;
x=foo2{x,y};
print "x of foo2=", x, "\n";
end
More Information