Introduction
GOLD Parser (see [1]) is a partial, open-source parsing system that can be used to parse existing programming languages, scripts or interpreted languages. It can even be used to develop your own programming language. GOLD Parser was developed by Devin Cook for his Masters Report in Computer Science at California State University, Sacramento. Devin still supports an active community through his web site and Yahoo! news groups.
There are many other fine articles on the Code Project web site about what a parser and lexer are (see [2]) and how they work. In this introductory article, I will focus, instead, on the basic features of the GOLD Parser system.
Background
The first time you learn about a parser or lexer, you are immediately linked to the classic Lex and YACC tools developed many years ago. You may also be referred to the open-source, Bison tool provided with the GNU distribution or ANTLR, a commercial parsing tool or Spirit, an open-source parsing system based on C++ templates. What's common to all these tools is that the code generated for parsing is always C or C++.
For those using .NET, there are really only two choices; Grammatica, which was recently ported from its original Java implementation to C# and GOLD Parser.
The GOLD Parsing System
The GOLD Parsing system consists of three phases;
Phase 1: the developer creates the grammar of the programming language, script or interpreter and compiles it into an intermediate, binary file using the GOLD Parser Builder tools.
Phase 2: the developer decides on a programming language to implement the parser. Besides the C# and VB.NET languages, implementations of the GOLD parser engine are provided for VB 6.0, Java, Delphi, Python and, of course, C and C++. The developer chooses the programming language template of choice from the the GOLD Parser Builder and generates a parser engine shell in that language. Note that some templates generate a more complete parser engine than others.
Phase 3: the developer takes each parsed token and, in the case of an interpreter, executes the command immediately or, in the case of a compiled language, begins to build an abstract syntax tree for later optimization, storage and/or execution of the input language.
Phase 1: Creating the Grammar
A grammar can be as simple as a basic calculator or as complex as the C# language or Structured Query Language (SQL). To get you started, the GOLD Parser Builder provides a wizard that helps you to identify the common features that your language might include such as line based or free-flowing, C++ or COBOL style identifiers, string literal representation and basic mathematical/comparison expressions. Clicking Finish, opens the generated grammar in the Grammar Editor window of the builder.
Like most other tools that convert grammars into working code, the input grammar allows you to enter meta data about the author, version, details about the grammar and the starting rule.
"Name" = 'Enter the name of the grammar'
"Author" = 'Enter your name'
"Version" = 'The version of the grammar and/or language'
"About" = 'A short description of the grammar'
"Start Rule" = <Start>
A grammar will typically include definitions for how to recognize a floating point number, hex integer or string while parsing. Regular expressions are used to build the definition for these terminals.
{ID Head} = {Letter} + [_]
{ID Tail} = {Alphanumeric} + [_]
{String Chars} = {Printable} - ["]
Identifier = {ID Head}{ID Tail}*
StringLiteral = '"' {String Chars}* '"'
GOLD has several built-in character-sets from which to build a terminal plus other examples on the web site.
The final and most interesting part of the grammar is the rules. Rules in a grammar are declared using Backus-Naur Form (BNF) statements. This notation consists of a series of 0 or more symbols where non-terminals are delimited by the angle brackets '<' and '>' and terminals are not delimited. For instance, the following declares the common if-statement.
<Statement> ::= if <Expression> then <Statements> end if
The symbols 'if
', 'then
', 'end
', and 'if
' are terminals and <Expression>
and <Statements>
are non-terminals. If you are declaring a series of rules that derive the same non-terminal (i.e. different versions of a rule), you can use a single pipe character '|' in the place of the rule's name and the "::=" symbol. The following declares a series of three different rules that define a 'Statement'.
<Statement> ::= if <Expression> then <Statements> end if
| while <Expression> do <Statements> end while
| for Id = <Range> loop <Statements> end for
Once the grammar is ready, it's time to compile it into a binary form. The Builder provides a simple 4-step, single button wizard to compile the grammar. Clicking the initial "1. Enter Grammar" button checks the grammar for syntax errors and warns of any unreachable rules or unused terminals. If all goes well the button caption changes to "2. Compute Parser". Clicking this button, computes the Look-ahead Left to Right Parsing (LALR) state tables. Again, if no errors are generated, the button caption changes to "3. Compute Tokenizer". Clicking this button computes the Deterministic Finite Automata (DFA) tables used by the tokenizer of the resulting parser engine. Finally, if there were no errors, the button's caption changes to "4. Save the Tables". Clicking this button allows the compiled LALR and DFA tables to be saved to a .CGT file for later use by the parser engine.
Phase 2: Choose a Language and Generate Parser
When the GOLD parsing system was initially released, the Visual Basic parser engine code for reading a .CGT file and parsing an input stream was included. Before long, contributors took it on themselves to port this code to their favorite programming language. Today, there are many language implementations, all available in source code form for inclusion in your project.
Implementing a parser is a coding intensive task. Manually typing in each symbol and rule constant can be tedious. Fortunately, the GOLD Parser Builder includes a template mechanism that provides a powerful approach to generating parser code. A simple pre-processor provides tags for emitting attributes of the grammar's symbols and rules. The tags can be used to create lists of constants, case statements, or whatever the engine developer needs. Templates are available for each parser engine language that has been contributed, so some of the work has already been done for you.
From within the Builder, select the Tools -> Create Skeleton Program menu option to see the list of templates available.
Clicking Create allows you to enter the name and extension of the file into which the code is generated. At this point, you can open the generated code in your favorite development environment and reference the corresponding parser engine component or include its source code.
Phase 3: Implementing the Grammar Specific Parser
Most language templates include some kind of function into which you pass the string or file to be parsed, and either through callbacks, delegates, overrides or inline code allow you to intercept each state as the input is being parsed. For example:
Public Sub DoParse(ByVal Source As String)
If Not Parser Then Parser = LoadParser("grammar.cgt")
Parser.Source = Source
Done = False
Do Until Done
Select Case Parser.Parse()
Case LexicalError
Done = True
Case SyntaxError
Done = True
Case Reduction
With .CurrentReduction
Select Case .ParentRule.TableIndex
Case 1
Case 2
End Select
End With
Case Accept
Done = True
Success = True
Case CommentError
Done = True
End Select
Loop
End Sub
The function loads the .CGT file if it is not already and tells the parser about the the input file or string to be parsed. Then, in a loop, the parser parses the input and returns when something significant happens. There are distinct error conditions for lexical, syntax and comment errors. When a rule is encountered (reduction), the index and tokens are available for the current state. This is where you add code to either execute the meaning of the rule (in the case of a command language) or create an object representing the state for later optimization and/or execution (in the case of an interpreted language). When the end of input is encountered (accept), the loop breaks and the function returns. For an interpreted language, the developer can now traverse the tree of objects that was created and perform the meaning of the input grammar.
About the Code
The GOLD parser web-site provides a simple interpreted language to demonstrate how to implement a working parser. The sample is written in Visual Basic 6.0. Since I was interested in a solution for VB.NET, I took this code and created the attached VB.NET version. The demo includes a modified copy of the Klimstra C# Parser engine and a working version of the simple-interpreter sample, all compiled against .NET 1.1.
Simple Interpreter Grammar
The simple interpreter grammar allows basic variable declarations, execution flow, data input and output, and expression evaluation. The following grammar shows these basic rules:
<Statement> ::= display <Expression>
| display <Expression> read ID
| assign ID '=' <Expression>
| while <Expression> do <Statements> end
| if <Expression> then <Statements> end
| if <Expression> then <Statements> else <Statements> end
This was compiled into the simple2.cgt file using the GOLD Parser Builder. Next, using the included Visual Basic .NET - Template Parser.pgt file, I used the GOLD Parser Builder to generate the file Template.vb. Then, I used the Visual Basic .NET - Sample Parser to generate the file Simple.vb.
I used Visual Studio 2003's Visual Basic Convert tool to convert the project to VB.NET, added a reference to the Klimstra Parser engine DLL, and included the Template and simple.vb files. Then, in Simple.vb, for each rule, I created the corresponding class for each rule. The following code snippet shows associating a class with a rule:
Public Class SimpleParser
Inherits TemplateParser
Protected Overrides Function _
CreateRule_Statement_If_Then_Else_End(ByVal Tokens _
As ArrayList) As Reduction
Dim Obj As New SimpleIfStm
Obj.IfClause = Tokens(1).Data
Obj.ThenClause = Tokens(3).Data
Obj.ElseClause = Tokens(5).Data
return Obj
End Function
End Class
The following code snippet shows the class that stores information about the rule and implements the method to execute the rule:
Private Class SimpleIfStm
Inherits SimpleReduction
Implements IContextMethod
Public IfClause As IContextValue
Public ThenClause As IContextMethod
Public ElseClause As IContextMethod
Public Function Execute() As Object Implements IContextMethod.Execute
If IfClause.Value = "True" Then
ThenClause.Execute()
Else
If Not ElseClause Is Nothing Then
ElseClause.Execute()
End If
End If
End Function
End Class
Finally, in the MainForm.vb file, in the method that handles the Parse button click event, the string containing the program to execute is passed into the parser as follows:
Parser.DoParse(New StringReader(txtProgram.Text))
If the input does not generate errors, the Execute button is enabled. The Click
event handler for the Execute button simply takes the root node of the tree generated during parsing and calls the corresponding Execute
method.
If Not Parser.CurrentReduction Is Nothing Then
CType(Parser.CurrentReduction, IContextMethod).Execute()
End If
This walks the internal tree and calls each object's corresponding Execute
or Value
method depending on the rule.
The sample source code includes several example programs using the simple2 language grammar.
Points of Interest
I should mention that none of the templates provided with GOLD, take into consideration the lifecycle of developing a parser. When you are building a language, you typically never completely finish it before writing the parser. More typically, you get something basic working, then you can start to add language features. After code is generated from a template, it's easy to fall into the trap of adding your code into the generated file. However, the next time the grammar changes and you need to regenerate, you hopefully realize it would wipe out your code.
When developing the VB.NET version of the simple interpreter, I created two templates. The first template generates an abstract class with the DoParse
function and must-override methods to handle each rule. The second template generates a concrete class that provides an overridden implementation for each rule and is where my custom code is added. The second template is typically only used once. This way, when the grammar changes, the first template updates the abstract definition without wiping out my changes and also makes it easy to identify any new overridable methods added to the parser.
Oh!, and in case you were wondering, GOLD stands for Grammar Oriented Language Developer. But I'm sure it has more to do with the Golden Gate Bridge in San Francisco near where Devin lives.
History
- 26-05-2005 - Initial release.
Reference
- GOLD Parser web site
- Other Parser articles on Code Project.