Bird Programming Language: Part 3

Dávid Kocsis

4.88/5 (5 votes)

1 Jan 2013GPL35 min read

30.9K

282

A new general purpose language that aims to be fast, high level and simple to use.

Download Bird.zip - 3.8 MB

Articles About Bird

Overview of Bird's Structure
Basic Insight into Code Parsing
- String Handling
- LanguageNode Class and Recognizers
The Code Generator

Overview of Bird's Structure

Expressions are parsed by different recognizers, and the node is passed to the plugins.
Plugins can modify the recognized expression including type managing, constant evaluation
Architectures are for generating code for a platform. Currently the only is x86.

Basic Insight into Code Parsing

The language consists of recognizers, and supporting new languages can be done by replacing recognizers. The functions, types, variables will be available between Bird and the new languages. I don't use parser generators, because I think using other software could prevent from implementing some of the functionalities or I would have to do things in a different way.

The CompilerState class has a Language object that contains all recognizers. Here is the Language class:

public abstract class Language : LanguageNode
{
    public IList<IExprRecognizer> ExprRecognizers;
    public IList<IIdRecognizer> IdRecognizers;
    public IList<ICommRecognizer> CommRecognizers;
    public IList<IModRecognizer> ModRecognizers;
    public IList<INameGenerator> NameGenerators;
    public IList<IResultSkippingHandler> GlobalHandlers;

    public IArgRecognizer ArgRecognizer;
    public IGenericRecognizer GenericRecognizer;
    public IGroupRecognizer GroupRecognizer;

    public IRetValLessRecognizer RetValLessRecognizer;
    public IInnerScopeRecognizer InnerScopeRecognizer;
    public IVarDeclRecognizer VarDeclRecognizer;
    public ITypeDeclRecognizer TypeDeclRecognizer;
    public IConstDeclRecognizer ConstDeclRecognizer;
    public INamespaceDeclRecognizer NamespaceDeclRecognizer;
    public IDeclarationRecognizer DeclarationRecognizer;

    public IGlobalScopeProcessor GlobalScopeProcessor;
    public ICodeFileProcessor CodeFileProcessor;
    public ICodeProcessor CodeProcessor;
    ...
}

String Handling

Bird uses StringSlice to store strings to avoid a new string allocation when it calls string.Substring. This structure also has functions for bracket handling and other operations. The CodeString structure consists of a StringSlice and a reference to CodeFile class that can also manage lines and indents of them. CodeString has a Line field and when its substring is needed, it updates the Line member. Most of the CodeString's methods are just wrapper around StringSlice's functions.

The ISkippingHandler interface is used to skip results that are in a string constant or bracket. The recognizers can store data in ResultSkippingManager.Datas, if it's needed. In some cases, it's necessary to make sure that the brackets are not skipped. A bracket recognizer shouldn't skip brackets if ResultSkippingManager.DoNotSkipBrackets is true.

public class BracketRecognizer : LanguageNode, IExprRecognizer,
                IIdRecognizer, IResultSkippingHandler
{
    ...

    public int SkipResult(ref ResultSkippingManager RSM)
    {
        if (!RSM.DoNotSkipBrackets && Helper.GetBracket(RSM.CurrentChar, RSM.Back))
        {
            var Handlers = RSM.SkippingHandlers;
            if (RSM.Back)
            {
                var NewString = RSM.String.Substring(0, RSM.Current + 1);
                var Pos = NewString.GetBracketPos(true, Handlers);
                if (Pos != -1) return Pos;
            }
            else
            {
                var NewString = RSM.String.Substring(RSM.Current);
                var Pos = NewString.GetBracketPos(false, Handlers);
                if (Pos != -1) return Pos + RSM.Current;
            }
        }

        return -1;
    }
}

So this is the BracketRecognizer class. It has to return an index where the string search should be continued or -1. The GetBracketPos method returns the position of the bracket's pair which the string ends or starts with. If the first parameter is false, the string has to begin with a bracket that looks to the right.

LanguageNode Class and Recognizers

The LanguageNode class contains an array of strings called Operators that it recognizes and another array of strings (LanguageNode.Skip) that it should skip, because the operators, that it has to recognise, is the part of them. E.g., an addition recognizer has +, - operators then the skip array should contain ++, --. The NewLine strings contain which operators need the previous or the next line. E.g., if the line ends with +, then it will continue with the next:

public abstract class LanguageNode
{
    public DataList Data = new DataList();
    public Language Root;
    public LanguageNode Parent;
    public IList<LanguageNode> Children;
    public IList<IResultSkippingHandler> SkippingHandlers;

    public string[] Operators;
    public string[] SkipFind;
    public string[] Skip;

    public string[] NewLineLeft;
    public string[] NewLineRight;
    public string[] OnlyLeft;
    public string[] OnlyRight;

    ...
}

Recognizations are done in a recursive way. I think it's simple but maybe not the fastest way to parse a code file. To decrease the time spent on string operations, I use my own functions which are optimized for these things and significantly faster, although far more time spent on code generation. Currently my compiler can process 10 000 lines of code in about 1 second.

An expression recognizer should inherit from this class and IExprRecognizer interface. Recognizers can use Find2 function to determine which operators aren't unary. If the left side of the result ends with an operator or there's nothing, it will ignore that result. When it's not enough to use the LanguageNode.Operators, for example the cast recognizer needs to implement the IFind2Handler interface. The following code is the addition recognizer:

public class AdditiveRecognizer : LanguageNode, IExprRecognizer
{
    public AdditiveRecognizer()
    {
        Operators = new string[] { "+", "-" };
        NewLineLeft = NewLineRight = Operators;
    }

    public ExprRecognizerRes Recognize
        (CodeString Code, ExprPlugin Plugin, ref ExpressionNode Ret)
    {
        var Result = ExprRecognizerHelper.Find2(this, Plugin.Container, Code.String);
        if (Result.Position != -1)
        {
            var Ch = ExprRecognizerHelper.TwoParamOpNode(Code, Result, Plugin);
            if (Ch == null) return ExprRecognizerRes.Failed;

            Operator Op;
            if (Result.Index == 0) Op = Operator.Add;
            else if (Result.Index == 1) Op = Operator.Subract;
            else throw new ApplicationException();

            Ret = new OpExpressionNode(Op, Ch, Code);
            return ExprRecognizerRes.Succeeded;
        }

        return ExprRecognizerRes.Unknown;
    }
}

This still wouldn't be compatible with the NumberRecognizer, because floating point numbers can have e notation (2e+2 = 2 * 10²). To solve this problem, the NumberRecognizer adds a result skipper to LanguageNode.SkippingHandlers. ExprRecognizerHelper.Find2 takes a LanguageNode parameter first, which is used to get the skipping handlers from.

The Code Generator

This is most complex part of Bird. It is about the third of the whole project. These are major steps it does to compile a function:

Processes expressions
Calculates the expression result storage locations
Calculates the locals' locations
Checks whether there is a need for temporary register (in x86 assembly most instructions can only have one memory operand, some must be either memory location or general register, etc.)
If a temporary register needed restarts the data location allocation, allocates the temporary registers and continues at 2nd point.
Generates assembly (without jumps)
Jump optimizations
Jump replacements

The Jump Optimizations and Replacements

int Fib(int x)
    if x < 2: return 1
    else return Fib(x - 1) + Fib(x - 2)

This function would have the following assembly without jump optimizations:

ASM

_Fib:
    push ebx
    push ebp
    mov ebp, eax
    cmp ebp, 2
    jge _26
    mov eax, 1
    jmp _2
    jmp _24
_26:
    mov eax, ebp
    dec eax
    call _Fib
    mov ebx, eax
    mov eax, ebp
    sub eax, 2
    call _Fib
    add eax, ebx
    jmp _2
_24:
_2:
    pop ebp
    pop ebx
    ret

The label _26 and _24 are for the condition and the label _2 is for function return. There are some unnecessary jumps and labels that should be deleted:

ASM

_Fib:
    push ebx
    push ebp
    mov ebp, eax
    cmp ebp, 2
    jge _26
    mov eax, 1
    jmp _2
_26:
    mov eax, ebp
    dec eax
    call _Fib
    mov ebx, eax
    mov eax, ebp
    sub eax, 2
    call _Fib
    add eax, ebx
_2:
    pop ebp
    pop ebx
    ret

It could save a jump if it does the same as where it jumps to. Unless it is too long, it replaces the jump:

ASM

_Fib:
    push ebx
    push ebp
    mov ebp, eax
    cmp ebp, 2
    jge _26
    mov eax, 1
    pop ebp
    pop ebx
    ret
_26:
    mov eax, ebp
    dec eax
    call _Fib
    mov ebx, eax
    mov eax, ebp
    sub eax, 2
    call _Fib
    add eax, ebx
    pop ebp
    pop ebx
    ret

Temporary Register Handling

The IsEqual function needs 3 temporary register to fulfil that the mov instruction can only have one memory operand, and the memory location can't have a memory address:

C++

long* g_Pointers
int g_Index
long g_Value

bool IsEqual()
    return g_Value == g_Pointers[g_Index]


_IsEqual:
    mov eax, dword[_g_Pointers]
    mov edx, dword[_g_Index]
    mov ecx, dword[eax + edx * 8]
    cmp dword[_g_Value], ecx
    jne _27
    mov ecx, dword[eax + edx * 8 + 4]
    cmp dword[_g_Value + 4], ecx
    jne _27
    mov al, 1
    ret
_27:
    xor al, al
    ret

Condition Assembly Generation

This is the main function for condition assembly generation (Bird.CodeGenerator.GetConditionCode):

void GetConditionCode(ExpressionNode Node, int Then, int Else, bool ElseAfterCondition)

The Then and Else are the index of labels. If ElseAfterCondition is

True, then the condition is followed by the else scope and the jump's condition is the same as the operator and it jumps to the Then label
False, then the condition is followed by the then scope and the jump's condition is negated and it jumps to the Else label

Here is an example:

C++

void WhileTest(bool Loop)
    var a = 0
    while Loop
        a++

void DoWhileTest(bool Loop)
    var a = 0

    do
        a++
    while Loop
    
    
; This is the assembly without jump replacements
_WhileTest:
    xor edx, edx
_90:
    test eax, eax
    jnz _18
    inc edx
    jmp _90
_18:
    ret

_DoWhileTest:
    xor edx, edx
_93:
    inc edx
    test eax, eax
    jz _93
    ret
	
	
; After jump replacements
_WhileTest:
    xor edx, edx
    test eax, eax
    jnz _18
_89:
    inc edx
    test al, al
    jnz _89
_18:
    ret

_DoWhileTest:
    xor edx, edx
_93:
    inc edx
    test eax, eax
    jz _93
    ret

In the WhileTest function, the condition is followed by the then scope, ElseAfterCondition is false. In the DoWhileTest it's followed by the return command (which happens if the condition is false), the ElseAfterCondition is true.

This is how the and and or operators work:

C++

void While_OrTest(bool x, y, z)
    var a = 0
    while x or y or z
        a++

void DoWhile_OrTest(bool x, y, z)
    var a = 0

    do
        a++
    while x or y or z

void While_AndTest(bool x, y, z)
    var a = 0
    while x and y and z
        a++

void DoWhile_AndTest(bool x, y, z)
    var a = 0

    do
        a++
    while x and y and z
    
    
_While_OrTest:
    xor ecx, ecx
_45:
    test al, al
    jnz _44
    test ah, ah
    jnz _44
    test dl, dl
    jz _7
_44:
    inc ecx
    jmp _45
_7:
    ret

_DoWhile_OrTest:
    xor ecx, ecx
_50:
    inc ecx
    test al, al
    jnz _50
    test ah, ah
    jnz _50
    test dl, dl
    jnz _50
    ret

_While_AndTest:
    xor ecx, ecx
_57:
    test al, al
    jz _9
    test ah, ah
    jz _9
    test dl, dl
    jz _9
    inc ecx
    jmp _57
_9:
    ret

_DoWhile_AndTest:
    xor ecx, ecx
_62:
    inc ecx
    test al, al
    jz _10
    test ah, ah
    jz _10
    test dl, dl
    jnz _62
_10:
    ret

In all the while and do-while tests, the last condition remains the same, the ElseAfterCondition is only changed for the first two conditions. For the and conditions, the ElseAfterCondition is false, for the or conditions it's true.

Here is an example that contains both operators:

void While_AndOrTest(bool x, y, z, w)
    var a = 0
    while (x or y) and (z or w)
        a++

void DoWhile_AndOrTest(bool x, y, z, w)
    var a = 0

    do
        a++
    while (x or y) and (z or w)
    
    
_While_AndOrTest:
    xor ecx, ecx
_71:
    test al, al
    jnz _72
    test ah, ah
    jz _11
_72:
    test dl, dl
    jnz _70
    test dh, dh
    jz _11
_70:
    inc ecx
    jmp _71
_11:
    ret

_DoWhile_AndOrTest:
    xor ecx, ecx
_77:
    inc ecx
    test al, al
    jnz _79
    test ah, ah
    jz _12
_79:
    test dl, dl
    jnz _77
    test dh, dh
    jnz _77
_12:
    ret

License

This article, along with any associated source code and files, is licensed under The GNU General Public License (GPLv3)