Introduction
With Bison, the GNU version of Yacc, one is able to generate a C program that
parses a given LALR(1) context-free grammar. Often when creating the grammar and
running it through Bison, one gets warnings about shift/reduce and reduce/reduce
conflicts. Although a functional parser is created, there is a possibility that
any given text is parsed incorrectly as the parser may make incorrect
choices/operations when it comes to resolving the conflicts. If you are an
experienced user of Flex and Bison then stop reading and click on the
Back button. If you are a beginner then continue reading.
To help resolve these conflicts, one can make use of the -v or --verbose
option when invoking Bison. This results in a *.output file being
created, which can be used to find and resolve these conflicts. I have looked
for documentation that describes the output file in more detail but haven't been
able to find anything decent. In this article I am putting forward my
interpretation of the contents of the *.output file.
Just a warning!!! - The grammar found in the structgrammar.y
file is very badly written but is done so that the output file can contain all
the relevant sections, especially those dealing with conflicts.
Background
This wasn't what I was hoping to be my first article for Code Project. I was
hoping my first article would be an introduction to one of the applications that
I have been working on and off for the past three years. But at least this way I
start writing. OK, enough procrastination.
I have been trying on and off to create a parser with the help of GNU tools
Flex (Lex) and Bison (Yacc). My attempts have all failed so far, that is until I
bought the book Lex & Yacc. Die Profitools zur lexikalischen und syntaktischen
Textanalyse by Helmut Herold.
Contents of the output file
The output file provides descriptions of the parser states and can consist of
up to 7 sections.
This section lists all of the non-terminals that have been defined but that
are not used nor referred to in any other rules. That is to say that the
non-terminal doesn't occur on the right side of any of the other rules.
Useless nonterminals:
template
This section lists the rules that are not used in the definition of any other
rules and is similar to the Useless non-terminals
section above.
Useless rules:
#45 template : TEMPLATE;
#46 template : epsilon;
The conflicts section says where the shift/reduce conflict
s and
reduce/reduce conflict
s occur in the various parser states,
e.g.
State 43 contains 4 reduce/reduce conflicts.
The fourth section contains the grammar which is divided up into rules,
e.g.
rule 1 class_specifier -> class_head OPENBRACE
member_specification CLOSEBRACE
while the corresponding line from the context-free grammar is as follows:
class_specifier : class_head OPENBRACE member_specification CLOSEBRACE
The fifth section provides a list of all terminals that have been defined in
the grammar. As can be seen in the example below, the name of each terminal is
followed by two numbers, one that is enclosed in brackets and one that is not.
STRUCT (258) 9
The first number, 258, refers to the number generated by the Bison program
and is used in the C definition found in the structgrammar.tab.c
file.
#define STRUCT = 258
while the second number, 9, is the number of the rule where the terminal is
used.
rule 9 class_key -> STRUCT
The sixth section provides a list of all the non-terminals that have been
defined in the grammar.
class_head (50)
on left: 3 4 5 6 7 8, on right: 1 2
Here again there are a few numbers that are displayed. I think the number
enclosed in brackets directly after the non-terminal is the index into the
yytname
array found in the structgrammar.tab.c file. The
numbers found after on left
indicate the rules where the
non-terminal occurs on the left side of the rule, while the numbers found after
on right
indicate the rules where the non-terminal occurs on the
right side of the rule.
on left:
rule 3 class_head -> class_key
rule 4 class_head -> class_key IDENTIFIER
rule 5 class_head -> class_key base_clause
rule 6 class_head -> class_key IDENTIFIER base_clause
rule 7 class_head -> class_key nested_name_specifier IDENTIFIER
rule 8 class_head -> class_key nested_name_specifier
IDENTIFIER base_clause
on right:
rule 1 class_specifier -> class_head OPENBRACE
member_specification CLOSEBRACE
rule 2 class_specifier -> class_head
OPENBRACE CLOSEBRACE
The seventh section contains details about each parser state. The
structgrammar.output file contains a total of 219 states. A lot of
information is contained in each state description. First, the rules that the
parser has worked through to get to the current state are listed. Some rules
listed haven't been parsed totally. How far the parser has worked through each
rule is indicated by the full stop (.
). The rest of the state
description shows the states to which the parser will proceed or which
operations will occur, depending on the value of the lookahead token. Conflicts
are demarcated by square brackets.
state 43
class_name -> IDENTIFIER . (rule 17)
template_id -> IDENTIFIER . LESSTHAN
template_argument_list GREATERTHAN (rule 59)
type_name -> IDENTIFIER . (rule 92)
LESSTHAN shift, and go to state 76
COMMA reduce using rule 17 (class_name)
COMMA [reduce using rule 92 (type_name)]
OPENBRACKET reduce using rule 17 (class_name)
OPENBRACKET [reduce using rule 92 (type_name)]
SEMICOLON reduce using rule 17 (class_name)
SEMICOLON [reduce using rule 92 (type_name)]
OPERATOR reduce using rule 17 (class_name)
OPERATOR [reduce using rule 92 (type_name)]
$default reduce using rule 92 (type_name)
As an example, state 43 (the conflicts section
above, refers to state 43 having 4 reduce/reduce
conflicts) is
used. To reach state 43 the token IDENTIFIER
has been parsed and is
situated on top of the stack, as is seen in all three rules by the position of
the full stop after the IDENTIFIER
token. If the lookahead token is
LESSTHAN
then a shift
operation occurs and the parser
goes to state 76. If the lookahead token is COMMA
then a
reduce
operation will occur, and it is here that a
reduce/reduce conflict
arises. The COMMA
lookahead
token can be reduced either by rule 17 or rule 92 and this is where the conflict
arises, the parser doesn't know which rule to use in the reduction. The
following is a guess but I think that the line beginning with $default indicates
the action taken when the lookahead token doesn't match any of the given tokens.
An example of a shift/reduction
conflict as it appears in the state
description is as follows:
state 79
...
IDENTIFIER shift, and go to state 70
IDENTIFIER [reduce using rule 14 (nested_name_specifier)]
If the lookahead token is IDENTIFIER
then either a
shift
or reduce
operation can occur. The conflict here
again is that the parser doesn't know which operation to perform.
Comments
I will gladly appreciate any comments on this article.
History
- October 2003 - First public release.
I am a qualified Veterinary Surgeon who prefers treating computers with viruses than animals with viruses. I have recently completed a MEng German Informatics degree at the University of Reading with a 2:1. I also have the ISEB Foundation Certificate in Software Testing.
Currently I am umemployed and desparately looking for a job in the IT industry.