Parsing text is a tedious, error-prone process, so rather than code a parser from scratch, I prefer to use tools that generate parsers. There are many options available, including ANTLR, bison, and yacc. But for my current project, I need a compact C/C++ parser that executes quickly. Therefore, I've chosen two tools: re2c to generate the lexer and lemon to generate the parser.
It's important to understand the difference between them. re2c's job is to generate a lexer, which analyzes incoming text and divides it into meaningful elements called tokens. lemon's job is to generate a parser, which accepts tokens from the lexer and constructs a model of the text's hierarchical structure.
The good news is that re2c and lemon are freely available and both generate compact, high-speed C code. The bad news is that they both have significant learning curves. In writing these articles, my goal is to lessen the pain associated with learning re2c and lemon.
I'm not an expert with either tool, and I won't pretend that these articles provide a thorough treatment. But I have successfully used re2c and lemon to generate quality lexers/parsers, and I hope these articles will help you do the same. The first article focuses on re2c and the second article discusses lemon.
1. Preliminary Requirements
To explain how re2c can be used to generate a lexer, this article adopts a hands-on approach. I don't expect you to know the theory of lexical analysis, but I do expect that you can program in C.
The example code presented in this article has been compiled with gcc. However, it should be usable with any C/C++ build system.
2. Lexer Overview
This article focuses on using re2c to generate a lexer capable of scanning C code. This lexer can't recognize all of C's syntax, but it can scan the following code:
int main() {
  printf("Hello World");
  return 0;
}
The lexer reads each character and uses regular expression matching to return a sequence of tokens. For example, a left parenthesis corresponds to the LPAREN token and the return keyword corresponds to the RETURN token. The tokens.h file associates each token name with a number.
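To make this concrete, here is a minimal sketch of what such a tokens.h might contain. The specific numeric values (and END_TOKEN, which the scan loop uses later) are assumptions for illustration; the header in the download may use different numbers or define additional tokens.
/* tokens.h -- hypothetical excerpt; the real header may differ */
#define END_TOKEN      0   /* returned when the buffer is exhausted */
#define INT_TYPE       1   /* the 'int' keyword */
#define NAME           2   /* identifiers such as 'main' and 'printf' */
#define LPAREN         3   /* ( */
#define RPAREN         4   /* ) */
#define LBRACE         5   /* { */
#define RBRACE         6   /* } */
#define SEMICOLON      7   /* ; */
#define STRING_LITERAL 8   /* string literals such as "Hello World" */
#define INT_LITERAL    9   /* integer literals such as 0 */
#define RETURN         10  /* the 'return' keyword */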
The lexer discussed in this article splits the text into tokens and then prints the type, line number, and position of each token. For the above code, this output is given as follows:
INT_TYPE found on Line 3 at Position 0
NAME ('main') found on Line 3 at Position 4
LPAREN found on Line 3 at Position 8
RPAREN found on Line 3 at Position 9
LBRACE found on Line 3 at Position 11
NAME ('printf') found on Line 4 at Position 2
LPAREN found on Line 4 at Position 8
STRING_LITERAL ('"Hello World"') found on Line 4 at Position 9
RPAREN found on Line 4 at Position 22
SEMICOLON found on Line 4 at Position 23
RETURN found on Line 5 at Position 2
INT_LITERAL ('0') found on Line 5 at Position 9
SEMICOLON found on Line 5 at Position 10
RBRACE found on Line 6 at Position 0
The test.dat file actually begins with a multi-line comment. The lexer doesn't create a token for it, but it does count the comment's lines, which is why the first token is reported on Line 3 rather than Line 1.
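For reference, the top of test.dat presumably looks something like the sketch below. The comment text is only a placeholder, but the layout (a two-line comment followed by the function) matches the line and position numbers reported above.
/* A short test file
   for the re2c lexer */
int main() {
  printf("Hello World");
  return 0;
}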
3. Writing the Token File
The name re2c stands for regular expressions to C and that's exactly what it accomplishes. It generates code that checks text against a series of regular expressions. When it finds a match, it executes the C code associated with the expression. For example, the following text tells re2c to return the predefined INT_TYPE token whenever it encounters the int string in code:
"int" { return(INT_TYPE); }
The associations between text and tokens are defined in a file called a token file (*.re). The hard part of using re2c is writing this file so that it can be converted into lexer code. For the lexer discussed in this article, the token file is simple_c.re. The process of writing this file consists of four steps:
- Define a Scanner structure
- Provide code to execute the scanner
- Associate regular expressions with code
- Assign values to re2c-specific macros
This section describes how each of these steps can be accomplished for a simple lexer. Where possible, I'll keep to the conventions set by the example code in the re2c distribution.
3.1 Define the scanner
For the sake of convenience, the example code stores re2c-specific data in a structure called Scanner. This contains the current line number in the text (line) and four pointers that reference memory locations in the text. The following code shows what this structure looks like:
typedef struct Scanner {
    char *top, *cur, *ptr, *pos;
    int line;
} Scanner;
Table 1 lists each pointer and the purpose it serves.
Table 1: The Four Pointers of the Scanner Structure
Pointer | Purpose
top     | Points to the start of the current token
cur     | Points to the character being examined
ptr     | Used by re2c for back-tracking
pos     | Points to the start of the current line
In the example code, each pointer references the position of a char. This tells the lexer that each input symbol occupies one byte. Depending on the size of the character set, this data type could also be set to short, int, or an unsigned integer type.
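As a purely hypothetical illustration, a lexer built for a two-byte character set might declare its pointers as follows (the matching YYCTYPE macro, covered in Section 3.4, would change as well); simple_c.re itself sticks with char:
/* Hypothetical variant for two-byte input symbols -- not part of simple_c.re */
typedef struct WideScanner {
    unsigned short *top, *cur, *ptr, *pos;
    int line;
} WideScanner;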
3.2 Provide code to execute the scanner
The token file needs a function to start the scanning process. If the program's only purpose is to scan text, this would be the main function. Whatever the name is, this function has three responsibilities:
- Create the Scanner structure
- Provide the scanner with data
- Create a scan loop to process the incoming text
The first two steps are easy. The following code reads input text from test.dat into a buffer. Then it sets three of the Scanner's pointers to point to the start of the buffer.
FILE *fp = fopen("test.dat", "r");   /* error checks omitted for brevity */
fseek(fp, 0, SEEK_END);              /* measure the file so the whole     */
long size = ftell(fp);               /* text can be read into one buffer  */
rewind(fp);
char *buff = calloc(size + 1, 1);
fread(buff, 1, size, fp);
fclose(fp);

Scanner scanner;
scanner.top = buff;                  /* all three pointers start at the */
scanner.cur = buff;                  /* beginning of the buffer         */
scanner.pos = buff;
scanner.line = 1;
The third step is more involved. The scan loop cycles through the buffer's content, performing one iteration for each token it finds. The following code gives an idea of how this works:
int token;
char *buff_end = buff + size;        /* one past the last character in the buffer */
while ((token = scan(&scanner, buff_end)) != END_TOKEN) {
    /* ... display output for the returned token ... */
}
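The elided loop body is where simple_c prints its results. A minimal sketch is shown below; print_token_name is a hypothetical helper that maps the token's integer value back to a printable string, and the position is assumed to be the offset of the token from the start of its line:
printf("%s found on Line %d at Position %d\n",
       print_token_name(token),           /* hypothetical helper: token number -> name */
       scanner.line,
       (int)(scanner.top - scanner.pos)); /* token offset from the start of the line */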
The real work of re2c is performed in the scan function, which returns an integer that uniquely identifies the token. In general, the scan function contains a little customization code and a lot of re2c configuration.
If each input symbol is a char, the scan function might look like this:
int scan(Scanner* s, char *buff_end) {
regular:
    if (s->cur >= buff_end) {
        return END_TOKEN;
    }
    s->top = s->cur;
    /*!re2c
        ... named definitions and rules for regular code ...
    */
comment:
    /*!re2c
        ... rules for skipping comment text ...
    */
}
The scan function starts by checking if the current character (s->cur) is at the end of the text buffer (buff_end). If so, the function returns END_TOKEN. If not, the start of the token (s->top) is set equal to the current character (s->cur).
This code has two labels: regular and comment. They're needed because the goto statements in the lexer code repeat processing starting from one of the two labels. Like all proper programmers, I abhor the use of goto statements, but in lexical analysis, there's no getting around them.
Both comments in the code start with /*!re2c. This tells the re2c executable that the comment block contains instructions related to generating the lexer code. The following discussion explains how this works.
3.3 Associate regular expressions with code
In the scan function, instructions to re2c are provided inside comment blocks that start with /*!re2c. These instructions can take several forms, and the three most common statements are:
- name = regexp; - associates an identifier with a regular expression (called a named definition)
- regexp {code} - associates a regular expression with C code to be executed (called a rule)
- re2c:name = value; - configures re2c's name parameter (called an inplace configuration)
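As an illustrative sketch (not the exact contents of simple_c.re), a block mixing all three forms might look like this; the first line is an inplace configuration, the second a named definition, and the third a rule that returns INT_LITERAL for a run of digits:
/*!re2c
    re2c:yyfill:enable = 0;
    dig = [0-9];
    dig+ { return INT_LITERAL; }
*/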
Named definitions clarify the use of regular expressions. For example, instead of using [a-zA-Z] in a rule, a named definition such as 'letter = [a-zA-Z];' allows the name letter to be used in its place.
re2c recognizes a variety of regular expressions. Table 2 lists each of them with a description.
Table 2: Accepted Regular Expression Formats
Expression Format | Description
"foo"     | Literal string (case-sensitive)
'foo'     | Literal string (case-insensitive)
[xyz]     | Character class (matches 'x', 'y', or 'z')
[^class]  | Inversion of character class
a-z       | Range (any character from 'a' to 'z')
r \ s     | Any character matched by class r but not by class s
r*        | Zero or more r expressions
r+        | One or more r expressions
r?        | Zero or one r expression
name      | The named definition
(r)       | Overrides precedence for r
r s       | Expression r followed by s
r|s       | Either r or s
r/s       | r only if followed by s
r{n}      | Matches r exactly n times
r{n,}     | Matches r at least n times
r{n,m}    | Matches r at least n times, but no more than m times
.         | Any character except a newline
The first two expressions are particularly important. The following statement tells re2c to return a RETURN token when the lexer encounters the return string:
"return" { return RETURN; }
If the double-quotes are replaced with single-quotes, the pattern matching will be case-insensitive. That is, the token will be returned when the lexer encounters return, RETURN, or any other combination of upper- and lower-case letters.
To make the rules easier to read and write, simple_c.re uses the following named definitions:
dig = [0-9];
let = [a-zA-Z_];
These names are used in the pattern for matching identifiers. An identifier can contain numbers, but the first character must be a letter or underscore. Therefore, the rule for identifiers is given as follows:
let (let|dig)* { return NAME; }
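The sample output in Section 2 also contains INT_LITERAL and STRING_LITERAL tokens. The rules below are only a rough sketch of how those patterns could be written (this string rule, for instance, ignores escaped quotes); the actual rules in simple_c.re may be more elaborate:
dig+          { return INT_LITERAL; }
["] [^"]* ["] { return STRING_LITERAL; }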
Whitespace consists of one or more spaces, \t, \v, or \f characters. This definition is given as:
whitespace = [ \t\v\f]+;
When the lexer encounters whitespace in regular code, it should ignore the characters and restart processing from the regular label. This is given in the following rule:
whitespace { goto regular; }
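Two behaviors shown in the output aren't covered by the rules above: line counting and comment skipping. The following is a hedged sketch of how they might be handled; the exact rules in simple_c.re may differ. In the block under the regular label, a newline advances the line counter and records where the new line begins, while the opening "/*" jumps to the comment label:
"\r\n" | "\n"  { s->pos = s->cur; s->line++; goto regular; }
"/*"           { goto comment; }
The block under the comment label then consumes characters until it finds the closing "*/", counting any newlines along the way:
"*/"           { goto regular; }
"\r\n" | "\n"  { s->pos = s->cur; s->line++; goto comment; }
[^]            { goto comment; }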
In addition to rules and named definitions, the re2c-specific statements can include inplace configuration statements. These associate an re2c property with a value. For example, the re2c:yyfill:enable property identifies whether the lexer should refill the buffer. This isn't necessary in simple_c.re because the buffer contains all of the text. Therefore, the following configuration statement is used:
re2c:yyfill:enable = 0;
re2c has many other properties that can be set. The full list is provided in the re2c manual.
3.4 Assign values to macros
The Scanner structure clarifies memory operations for developers, but the generated lexer code doesn't know anything about it. It interacts with the application using macros whose names start with YY. Table 3 lists three important macros:
Table 3: Important re2c Macros
Macro    | Description                                          | Usual Value
YYCTYPE  | Data type of each input symbol                       | char, short, int, or an unsigned type
YYCURSOR | The current cursor location                          | A Scanner pointer (s->cur)
YYMARKER | Used by re2c to store the position for back-tracking | A Scanner pointer (s->ptr)
In simple_c.re, the following code assigns values to these macros:
#define YYCTYPE char
#define YYCURSOR s->cur
#define YYMARKER s->ptr
When re2c generates code for the lexer, it accesses the input text using macros like YYCTYPE and YYCURSOR. With these #define statements in place, the compiler will associate re2c's macros with the developer's Scanner structure.
4. Generating the Lexer
The token file consists of C code, but it's not meant to be directly compiled. Before compiling, it should be processed with the re2c executable, which can be downloaded from the re2c project site. This command-line utility accepts a number of helpful flags, including the following:
- -d - dumps information about the lexer's current position and state
- -u - supports the full Unicode character set
- -o - sets the name of the file to contain the lexer code
- -D - produces a Graphviz dot view of the deterministic finite automaton (DFA)
By default, re2c prints its results to standard output. But the following command tells re2c to convert simple_c.re into a lexer source file called simple_c.c:
re2c -o simple_c.c simple_c.re
If the conversion is successful, the simple_c.c file can be compiled into a lexer with a regular build command such as the following:
gcc -O3 -o simple_c simple_c.c
As simple_c executes, it reads text from test.dat into a buffer and calls the scan function in a loop. Each iteration of scan returns a token, and in the case of simple_c, each iteration prints the type of token and the line and position where it was encountered.
5. Using the Code
The code for this article generates a simple application that reads simple C code. It's provided in an archive called re2c_demo.zip, which contains four files:
- simple_c.re - the token file that re2c uses to generate the lexer code
- tokens.h - associates each token name with an integer value
- test.dat - contains the C code to be scanned
- Makefile - contains instructions for generating the lexer and building the application
If you have GNU's make and gcc utilities, you can execute the make command. This reads the Makefile instructions and automatically builds the application.
If you want to build the application manually, you need to perform two steps:
- Use re2c to generate the lexer code (re2c -o simple_c.c simple_c.re)
- Compile the lexer code into an executable (gcc -o simple_c simple_c.c)
As the application runs, it reads characters from test.dat and prints information about each token it recognizes.
History
11/21/2015 - The article was completed and released for the first time