Articles About Bird
Table of Contents
Overview of Bird's Structure
- Expressions are parsed by different recognizers, and the node is passed to the plugins.
- Plugins can modify the recognized expression including type managing, constant evaluation
- Architectures are for generating code for a platform. Currently the only is x86.
Basic Insight into Code Parsing
The language consists of recognizers, and supporting new languages can be done by replacing recognizers. The functions, types, variables will be available between Bird and the new languages. I don't use parser generators, because I think using other software could prevent from implementing some of the functionalities or I would have to do things in a different way.
The CompilerState
class has a Language
object that contains all recognizers. Here is the Language
class:
public abstract class Language : LanguageNode
{
public IList<IExprRecognizer> ExprRecognizers;
public IList<IIdRecognizer> IdRecognizers;
public IList<ICommRecognizer> CommRecognizers;
public IList<IModRecognizer> ModRecognizers;
public IList<INameGenerator> NameGenerators;
public IList<IResultSkippingHandler> GlobalHandlers;
public IArgRecognizer ArgRecognizer;
public IGenericRecognizer GenericRecognizer;
public IGroupRecognizer GroupRecognizer;
public IRetValLessRecognizer RetValLessRecognizer;
public IInnerScopeRecognizer InnerScopeRecognizer;
public IVarDeclRecognizer VarDeclRecognizer;
public ITypeDeclRecognizer TypeDeclRecognizer;
public IConstDeclRecognizer ConstDeclRecognizer;
public INamespaceDeclRecognizer NamespaceDeclRecognizer;
public IDeclarationRecognizer DeclarationRecognizer;
public IGlobalScopeProcessor GlobalScopeProcessor;
public ICodeFileProcessor CodeFileProcessor;
public ICodeProcessor CodeProcessor;
...
}
String Handling
Bird uses StringSlice
to store string
s to avoid a new string
allocation when it calls string.Substring
. This structure also has functions for bracket handling and other operations. The CodeString
structure consists of a StringSlice
and a reference to CodeFile
class that can also manage lines and indents of them. CodeString
has a Line
field and when its substring is needed, it updates the Line
member. Most of the CodeString
's methods are just wrapper around StringSlice
's functions.
The ISkippingHandler
interface is used to skip results that are in a string
constant or bracket. The recognizers can store data in ResultSkippingManager.Datas
, if it's needed. In some cases, it's necessary to make sure that the brackets are not skipped. A bracket recognizer shouldn't skip brackets if ResultSkippingManager.DoNotSkipBrackets
is true
.
public class BracketRecognizer : LanguageNode, IExprRecognizer,
IIdRecognizer, IResultSkippingHandler
{
...
public int SkipResult(ref ResultSkippingManager RSM)
{
if (!RSM.DoNotSkipBrackets && Helper.GetBracket(RSM.CurrentChar, RSM.Back))
{
var Handlers = RSM.SkippingHandlers;
if (RSM.Back)
{
var NewString = RSM.String.Substring(0, RSM.Current + 1);
var Pos = NewString.GetBracketPos(true, Handlers);
if (Pos != -1) return Pos;
}
else
{
var NewString = RSM.String.Substring(RSM.Current);
var Pos = NewString.GetBracketPos(false, Handlers);
if (Pos != -1) return Pos + RSM.Current;
}
}
return -1;
}
}
So this is the BracketRecognizer
class. It has to return an index where the string
search should be continued or -1
. The GetBracketPos
method returns the position of the bracket's pair which the string
ends or starts with. If the first parameter is false
, the string
has to begin with a bracket that looks to the right.
LanguageNode Class and Recognizers
The LanguageNode
class contains an array of string
s called Operators
that it recognizes and another array of string
s (LanguageNode.Skip
) that it should skip, because the operators, that it has to recognise, is the part of them. E.g., an addition recognizer has +, - operators then the skip array should contain ++, --. The NewLine
string
s contain which operators need the previous or the next line. E.g., if the line ends with +, then it will continue with the next:
public abstract class LanguageNode
{
public DataList Data = new DataList();
public Language Root;
public LanguageNode Parent;
public IList<LanguageNode> Children;
public IList<IResultSkippingHandler> SkippingHandlers;
public string[] Operators;
public string[] SkipFind;
public string[] Skip;
public string[] NewLineLeft;
public string[] NewLineRight;
public string[] OnlyLeft;
public string[] OnlyRight;
...
}
Recognizations are done in a recursive way. I think it's simple but maybe not the fastest way to parse a code file. To decrease the time spent on string
operations, I use my own functions which are optimized for these things and significantly faster, although far more time spent on code generation. Currently my compiler can process 10 000 lines of code in about 1 second.
An expression recognizer should inherit from this class and IExprRecognizer
interface. Recognizers can use Find2
function to determine which operators aren't unary. If the left side of the result ends with an operator or there's nothing, it will ignore that result. When it's not enough to use the LanguageNode.Operators
, for example the cast recognizer needs to implement the IFind2Handler
interface. The following code is the addition recognizer:
public class AdditiveRecognizer : LanguageNode, IExprRecognizer
{
public AdditiveRecognizer()
{
Operators = new string[] { "+", "-" };
NewLineLeft = NewLineRight = Operators;
}
public ExprRecognizerRes Recognize
(CodeString Code, ExprPlugin Plugin, ref ExpressionNode Ret)
{
var Result = ExprRecognizerHelper.Find2(this, Plugin.Container, Code.String);
if (Result.Position != -1)
{
var Ch = ExprRecognizerHelper.TwoParamOpNode(Code, Result, Plugin);
if (Ch == null) return ExprRecognizerRes.Failed;
Operator Op;
if (Result.Index == 0) Op = Operator.Add;
else if (Result.Index == 1) Op = Operator.Subract;
else throw new ApplicationException();
Ret = new OpExpressionNode(Op, Ch, Code);
return ExprRecognizerRes.Succeeded;
}
return ExprRecognizerRes.Unknown;
}
}
This still wouldn't be compatible with the NumberRecognizer
, because floating point numbers can have e notation (2e+2 = 2 * 102). To solve this problem, the NumberRecognizer
adds a result skipper to LanguageNode.SkippingHandlers
. ExprRecognizerHelper.Find2
takes a LanguageNode
parameter first, which is used to get the skipping handlers from.
The Code Generator
This is most complex part of Bird. It is about the third of the whole project. These are major steps it does to compile a function:
- Processes expressions
- Calculates the expression result storage locations
- Calculates the locals' locations
- Checks whether there is a need for temporary register (in x86 assembly most instructions can only have one memory operand, some must be either memory location or general register, etc.)
- If a temporary register needed restarts the data location allocation, allocates the temporary registers and continues at 2nd point.
- Generates assembly (without jumps)
- Jump optimizations
- Jump replacements
The Jump Optimizations and Replacements
int Fib(int x)
if x < 2: return 1
else return Fib(x - 1) + Fib(x - 2)
This function would have the following assembly without jump optimizations:
_Fib:
push ebx
push ebp
mov ebp, eax
cmp ebp, 2
jge _26
mov eax, 1
jmp _2
jmp _24
_26:
mov eax, ebp
dec eax
call _Fib
mov ebx, eax
mov eax, ebp
sub eax, 2
call _Fib
add eax, ebx
jmp _2
_24:
_2:
pop ebp
pop ebx
ret
The label _26
and _24
are for the condition and the label _2
is for function return. There are some unnecessary jumps and labels that should be deleted:
_Fib:
push ebx
push ebp
mov ebp, eax
cmp ebp, 2
jge _26
mov eax, 1
jmp _2
_26:
mov eax, ebp
dec eax
call _Fib
mov ebx, eax
mov eax, ebp
sub eax, 2
call _Fib
add eax, ebx
_2:
pop ebp
pop ebx
ret
It could save a jump if it does the same as where it jumps to. Unless it is too long, it replaces the jump:
_Fib:
push ebx
push ebp
mov ebp, eax
cmp ebp, 2
jge _26
mov eax, 1
pop ebp
pop ebx
ret
_26:
mov eax, ebp
dec eax
call _Fib
mov ebx, eax
mov eax, ebp
sub eax, 2
call _Fib
add eax, ebx
pop ebp
pop ebx
ret
Temporary Register Handling
The IsEqual
function needs 3 temporary register to fulfil that the mov
instruction can only have one memory operand, and the memory location can't have a memory address:
long* g_Pointers
int g_Index
long g_Value
bool IsEqual()
return g_Value == g_Pointers[g_Index]
_IsEqual:
mov eax, dword[_g_Pointers]
mov edx, dword[_g_Index]
mov ecx, dword[eax + edx * 8]
cmp dword[_g_Value], ecx
jne _27
mov ecx, dword[eax + edx * 8 + 4]
cmp dword[_g_Value + 4], ecx
jne _27
mov al, 1
ret
_27:
xor al, al
ret
Condition Assembly Generation
This is the main function for condition assembly generation (Bird.CodeGenerator.GetConditionCode)
:
void GetConditionCode(ExpressionNode Node, int Then, int Else, bool ElseAfterCondition)
The Then
and Else
are the index of labels. If ElseAfterCondition
is
True
, then the condition is followed by the else scope and the jump's condition is the same as the operator and it jumps to the Then
labelFalse
, then the condition is followed by the then scope and the jump's condition is negated and it jumps to the Else
label
Here is an example:
void WhileTest(bool Loop)
var a = 0
while Loop
a++
void DoWhileTest(bool Loop)
var a = 0
do
a++
while Loop
; This is the assembly without jump replacements
_WhileTest:
xor edx, edx
_90:
test eax, eax
jnz _18
inc edx
jmp _90
_18:
ret
_DoWhileTest:
xor edx, edx
_93:
inc edx
test eax, eax
jz _93
ret
; After jump replacements
_WhileTest:
xor edx, edx
test eax, eax
jnz _18
_89:
inc edx
test al, al
jnz _89
_18:
ret
_DoWhileTest:
xor edx, edx
_93:
inc edx
test eax, eax
jz _93
ret
In the WhileTest
function, the condition is followed by the then scope, ElseAfterCondition
is false
. In the DoWhileTest
it's followed by the return command (which happens if the condition is false
), the ElseAfterCondition
is true
.
This is how the and
and or
operators work:
void While_OrTest(bool x, y, z)
var a = 0
while x or y or z
a++
void DoWhile_OrTest(bool x, y, z)
var a = 0
do
a++
while x or y or z
void While_AndTest(bool x, y, z)
var a = 0
while x and y and z
a++
void DoWhile_AndTest(bool x, y, z)
var a = 0
do
a++
while x and y and z
_While_OrTest:
xor ecx, ecx
_45:
test al, al
jnz _44
test ah, ah
jnz _44
test dl, dl
jz _7
_44:
inc ecx
jmp _45
_7:
ret
_DoWhile_OrTest:
xor ecx, ecx
_50:
inc ecx
test al, al
jnz _50
test ah, ah
jnz _50
test dl, dl
jnz _50
ret
_While_AndTest:
xor ecx, ecx
_57:
test al, al
jz _9
test ah, ah
jz _9
test dl, dl
jz _9
inc ecx
jmp _57
_9:
ret
_DoWhile_AndTest:
xor ecx, ecx
_62:
inc ecx
test al, al
jz _10
test ah, ah
jz _10
test dl, dl
jnz _62
_10:
ret
In all the while
and do
-while
tests, the last condition remains the same, the ElseAfterCondition
is only changed for the first two conditions. For the and
conditions, the ElseAfterCondition
is false
, for the or
conditions it's true
.
Here is an example that contains both operators:
void While_AndOrTest(bool x, y, z, w)
var a = 0
while (x or y) and (z or w)
a++
void DoWhile_AndOrTest(bool x, y, z, w)
var a = 0
do
a++
while (x or y) and (z or w)
_While_AndOrTest:
xor ecx, ecx
_71:
test al, al
jnz _72
test ah, ah
jz _11
_72:
test dl, dl
jnz _70
test dh, dh
jz _11
_70:
inc ecx
jmp _71
_11:
ret
_DoWhile_AndOrTest:
xor ecx, ecx
_77:
inc ecx
test al, al
jnz _79
test ah, ah
jz _12
_79:
test dl, dl
jnz _77
test dh, dh
jnz _77
_12:
ret