
Designing a Compiler

This article discusses the design and implementation of a compiler.

Introduction

This was my project during my bachelor's degree program. I designed a partial C compiler. Although it is a C compiler, the concepts apply to almost all compilers. I used the LEX and YACC tools to generate the lexical and syntax analyzers. I implemented a code optimization phase as well, generated the final code for the x86 machine, and used the MASM assembler to run it. The project was very successful.

A compiler is a program that reads a program written in one language, called the source language, and translates it into an equivalent program in another language, called the target language. It reports errors detected during the translation of source code to target code. The source program can be in any programming language; here our source program is an ANSI C program. The target program can be either assembly language code or machine code; here we produce 8086 assembly level code for the given source code. The first phase of the compiler, the Lexical Analyzer, is implemented using the LEX tool provided with Linux. The second phase, the Syntax Analyzer, is implemented using the YACC tool provided with Linux. The third phase, Semantic Analysis, and the fourth phase, Intermediate Code Generation, are carried out as part of the actions corresponding to the production rules in the parser. The intermediate code so produced is 8086 assembly code, which is converted into an executable using MASM.

Phases of Compiler

There are six main phases in a compiler. The first four, Lexical Analysis, Syntax Analysis, Semantic Analysis and Intermediate Code Generation, are the analysis phases, which are machine independent, while the other two, Code Optimization and Code Generation, are the synthesis phases, which are highly machine dependent.

The phases of a compiler are:

I) Lexical Analysis

Lexical Analysis, also called Linear Analysis or Scanning, is the phase in which the stream of characters making up the source program is read from left to right and grouped into tokens: sequences of characters having a collective meaning. The blanks separating the characters of the tokens and the comments appearing within the program are normally eliminated during lexical analysis. There are several reasons for separating the analysis part of compiling into lexical analysis and parsing.

1) Simpler design is perhaps the most important consideration. The separation of lexical analysis from syntax analysis often allows us to simplify one or the other of these phases.

For example, a parser embodying the conventions for comments and white space is significantly more complex than one that can assume comments and white space have already been removed by a lexical analyzer. If we are designing a new language, separating the lexical and syntactic conventions can lead to a cleaner overall language design.

2) Compiler efficiency is improved. A separate lexical analyzer allows us to construct a specialized and potentially more efficient processor for the task. A large amount of time is spent reading the source program and partitioning it into tokens. Specialized buffering techniques for reading input characters and processing tokens can significantly speed up the performance of a compiler.

3) Compiler portability is enhanced. Peculiarities of the input alphabet and other device-specific anomalies can be restricted to the lexical analyzer. The representation of special or non-standard symbols can be isolated in the lexical analyzer.

Specialized tools have been designed to help automate the construction of lexical analyzers and parsers when they are separated. LEX is a widely used tool for specifying lexical analyzers for a variety of languages. We refer to the tool as the LEX compiler, and to its input specification as the LEX language. A lexical analyzer is prepared by creating a program lex.l in the LEX language. Then, lex.l is run through the LEX compiler to produce a C program, lex.yy.c. The program lex.yy.c consists of a tabular representation of a transition diagram constructed from the regular expressions of lex.l, together with a standard routine that uses the table to recognize lexemes. The actions associated with the regular expressions in lex.l are pieces of C code and are carried over directly to lex.yy.c. Finally, lex.yy.c is run through the C compiler to produce an object program, a.out, which is the lexical analyzer that transforms an input stream into a sequence of tokens.
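For example, the typical command sequence on Linux looks like the following (the file names are the conventional defaults; input.c stands for whatever source file is being scanned):

    lex lex.l               # produce lex.yy.c from the LEX specification
    cc lex.yy.c -ll         # compile and link with the LEX library, producing a.out
    ./a.out < input.c       # run the lexical analyzer over a source file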

Please refer to the figures interaction_with_lex and lexical_analyzer.

LEX Specifications    

    A LEX program consists of three parts:

    declarations
    %%
    translation rules
    %%
    auxiliary procedures

    The declarations section includes declarations of variables, manifest constants, and regular definitions. (A manifest constant is an identifier that is declared to represent a constant.) The regular definitions are used as components of the regular expressions appearing in the translation rules. The translation rules of a LEX program are statements of the form:

    P1          {action1}
    P2          {action2}
    ….          ………..
    Pn          {actionn}
   
    where each Pi is a regular expression and each actioni is a program fragment describing what action the lexical analyzer should take when pattern Pi matches a lexeme. In LEX, the actions are written in C; in general, however, they can be in any implementation language.
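As a minimal sketch of this format, the following lex.l recognizes numbers and identifiers (the token codes and rules are illustrative, not the full set used in this project):

    %{
    /* Declarations: C code copied verbatim into lex.yy.c. */
    #define NUMBER 256
    #define ID     257
    %}
    digit    [0-9]
    letter   [a-zA-Z_]
    %%
    [ \t\n]+                      { /* skip white space */ }
    {digit}+                      { return NUMBER; }
    {letter}({letter}|{digit})*   { return ID; }
    %%
    int yywrap(void) { return 1; }   /* auxiliary procedure: stop at end of input */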
 

II) Syntax Analysis (I have attached our syntax analyzer rules and YACC usage)


Syntax Analysis or Hierarchical Analysis is the phase in which characters or tokens are grouped hierarchically into nested collections with collective meaning. Hierarchical analysis, also termed Parsing, involves grouping the tokens of the source program into grammatical phrases that are used by the compiler to synthesize output. Usually, the grammatical phrases of the source program are represented by a parse tree.
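As a minimal sketch of how such grammatical phrases are specified to YACC, the following production rules describe arithmetic expressions (illustrative only; the project's full grammar covers ANSI C declarations and statements):

    %token NUMBER ID
    %left '+' '-'
    %left '*' '/'
    %%
    expr : expr '+' expr
         | expr '-' expr
         | expr '*' expr
         | expr '/' expr
         | '(' expr ')'
         | NUMBER
         | ID
         ;
    %%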

III) Semantic Analysis


Semantic Analysis is the phase in which certain checks are performed to ensure that the components of a program fit together meaningfully. The semantic analysis phase checks the source program for semantic errors and gathers type information for the subsequent code-generation phase. It uses the hierarchical structure determined by the syntax-analysis phase to identify the operators and operands of expressions and statements.

An important component of semantic analysis is type checking. Here the compiler checks that each operator has operands that are permitted by the source language specification.

Moreover, a compiler must check that the source program follows both the syntactic and semantic conventions of the source language. This checking, called static checking, ensures that certain kinds of programming errors will be detected and reported. The checking done when the target program runs is termed dynamic checking. In principle, any check can be done dynamically if the target code carries the type of an element along with the value of that element. A sound type system eliminates the need for dynamic checking for type errors, because it allows us to determine statically that these errors cannot occur when the target program runs. That is, if a sound type system assigns a type other than type_error to a program part, then type errors cannot occur when the target code for that program part is run. A language is strongly typed if its compiler can guarantee that the programs it accepts will execute without type errors.
  

Static checks include:


1.    Type checks: A compiler should report an error if an operator is applied to an incompatible operand; a sketch of such a check follows this list.
2.    Flow-of-control checks: Statements that cause the flow of control to leave a construct must have some place to which to transfer the flow of control.
3.    Uniqueness checks: There are situations in which an object must be defined exactly once.
4.    Name-related checks: Sometimes, the same name must appear two or more times. The compiler must check that the same name is used in all those places.
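As a minimal sketch of the first kind of check, assuming hypothetical type codes (this is not the project's actual checker), a semantic action for a binary arithmetic operator might call a helper such as:

    /* Hypothetical type codes. */
    enum type { T_ERROR, T_INT, T_FLOAT };

    /* Return the result type of an arithmetic operator applied to two
       operands, or T_ERROR if the operands are incompatible. */
    enum type check_arith(enum type left, enum type right)
    {
        if (left == T_ERROR || right == T_ERROR)
            return T_ERROR;                /* propagate earlier errors */
        if (left == T_INT && right == T_INT)
            return T_INT;
        if ((left == T_INT || left == T_FLOAT) &&
            (right == T_INT || right == T_FLOAT))
            return T_FLOAT;                /* int operand widened to float */
        return T_ERROR;                    /* incompatible operand types */
    }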
 

IV) Intermediate Code Generation


After the syntax and semantic analysis, some compilers generate an explicit intermediate representation of the source program. We can think of this intermediate representation as a program for an abstract machine. This intermediate representation should have two important properties: it should be easy to produce and easy to translate into the target program.

The intermediate representation can have a variety of forms. We consider an intermediate form of 8086 assembly language, which is like three-address code. This intermediate form has several properties, illustrated in the example after this list:
    1) Each instruction has at most one operator in addition to the assignment. Thus, in generating these instructions, the compiler has to decide on the order in which operations are to be done.
    2) The compiler must generate a temporary name to hold the value computed by each instruction.
    3) Some instructions may have fewer than three operands.
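For example, the statement a = b + c * d might be broken down as follows; t1 and t2 are compiler-generated temporaries, and the final copy is an instruction with fewer than three operands:

    t1 = c * d
    t2 = b + t1
    a  = t2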

In the analysis–synthesis model of a compiler, the front end translates a source program into an intermediate representation from which the back end generates target code.

Details of the target language are confined to the back end, as far as possible. Although a source program can be translated directly into the target language, some benefits of using a machine-independent intermediate form are:
1. Retargeting is facilitated; a compiler for a different machine can be created by attaching a back end for the new machine to an existing front end.
2. A machine-independent code optimizer can be applied to the intermediate representation.

V) Code Optimization


The code optimization phase attempts to improve the intermediate code, so that faster-running machine code will result. There is great variation in the amount of code optimization different compilers perform. In those that do the most, called "optimizing compilers," a significant fraction of the compiler's time is spent on this phase. However, there are simple optimizations that significantly improve the running time of the target program without slowing down compilation too much.
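For example, a simple optimizer can remove the copy through t2 in the intermediate code shown earlier (a form of copy propagation, shown purely for illustration):

    Before:                After:
    t1 = c * d             t1 = c * d
    t2 = b + t1            a  = b + t1
    a  = t2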

VI) Machine Code Generation


The final phase of the compiler is the generation of target code, consisting normally of relocatable machine code or assembly code. Memory locations are selected for each of the variables used by the program. Then, the intermediate instructions are each translated into a sequence of machine instructions that perform the same task. A crucial aspect is the assignment of variables to registers.
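As a sketch of this phase, the optimized intermediate code above might be translated into 8086 assembly as follows, with the register AX chosen to hold the intermediate values (MASM syntax; the DW data declarations for a, b, c, d and t1 are omitted):

    MOV AX, c      ; load c into AX
    MUL d          ; AX = c * d (unsigned; the high word left in DX is ignored here)
    MOV t1, AX     ; store the temporary
    MOV AX, b      ; load b
    ADD AX, t1     ; AX = b + t1
    MOV a, AX      ; store the result into a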

Symbol Table Management


    An essential function of a compiler is to record the identifiers used in the source program and collect information about various attributes of each identifier. These attributes may provide information about the storage allocated for an identifier (if applicable), its type, its scope and, in the case of procedure names, such things as the number and types of the arguments, the method of passing each argument and the type of the return value, if the function returns one.

    A symbol table is a data structure containing a record for each identifier, with fields for the attributes of the identifier. The data structure allows us to find the record for each identifier quickly and to store or retrieve data from that record quickly.

    When an identifier in the source program is detected by the lexical analyzer, the identifier is entered into the symbol table. However, the attributes of an identifier cannot normally be determined during lexical analysis. The remaining phases enter information about identifiers into the symbol table and then use this information in various ways.
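As a minimal sketch in C, the symbol table can be kept as an array of records with a lookup routine that inserts identifiers on first sight (a real compiler would more likely use a hash table; the field names here are illustrative):

    #include <string.h>

    #define MAXSYMS 256

    struct symbol {
        char name[32];   /* the identifier's lexeme */
        int  type;       /* type code, filled in by later phases */
        int  offset;     /* storage assigned to the identifier */
    };

    static struct symbol symtab[MAXSYMS];
    static int nsyms;

    /* Return the index of name in symtab, inserting it if not yet present.
       (Overflow checking is omitted for brevity.) */
    int lookup(const char *name)
    {
        int i;
        for (i = 0; i < nsyms; i++)
            if (strcmp(symtab[i].name, name) == 0)
                return i;
        strncpy(symtab[nsyms].name, name, sizeof symtab[nsyms].name - 1);
        return nsyms++;
    }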

Error Detection and Reporting

    Each phase can encounter errors. However, after detecting an error, a phase must somehow deal with that error, so that compilation can proceed and further errors in the source program can be detected. A compiler that stops when it finds the first error is not as helpful as it could be.
    The syntax and semantic analysis phases usually handle a large fraction of the errors detectable by the compiler. The lexical phase can detect errors where the characters remaining in the input do not form any token of the language. Errors where the token stream violates the structure rules (syntax) of the language are detected by the syntax analysis phase. During semantic analysis, the compiler tries to detect constructs that have the right syntactic structure but no meaning for the operation involved.
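For example, the YACC-generated parser reports each syntax error through a user-supplied yyerror routine, and a grammar rule can use YACC's special error token to skip ahead and resume parsing (a minimal sketch; yylineno is the line counter maintained by LEX):

    #include <stdio.h>

    /* Called by the YACC-generated parser when a syntax error is detected. */
    void yyerror(const char *msg)
    {
        extern int yylineno;                      /* maintained by LEX */
        fprintf(stderr, "line %d: %s\n", yylineno, msg);
    }

    /* In the grammar, a rule such as
           stmt : error ';'   { yyerrok; }
       discards tokens up to the next ';' and lets compilation proceed. */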

Sources:

1.    Compilers: Principles, Techniques and Tools by Alfred V. Aho, Ravi Sethi and Jeffrey D. Ullman
2.    Compiler Design in C by Allen I. Holub
3.    Linux manual pages for LEX and YACC
 

 

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)