Introduction
In part 1 of this series I discussed the basics of using lex to create a lexical analyser (token recogniser) and yacc to create a parser. In this, the second part, I'll look at using these tools to create a parser capable of reading Visual Studio 6 resource files, concentrating on the lexical analyser.
However, considering how closely intertwined the two tools are, it will be necessary to switch back and forth between the parser and the lexical analyser.
The project download files contain both the lexical analyser and the parser source code.
Lex 201
As discussed in the first article, the lexical analyser is responsible for reading a stream of input from somewhere and breaking it up into a stream of tokens. Each token represents a lowest-level building block such as a string, a number or a keyword.
The lexical analyser achieves this by matching the input data to a set of regular expressions (rules). When it finds a match to a specific rule it executes some action code written by the programmer and then returns a token to the parser indicating which rule was matched. It may also return some data associated with the rule. For example, the regular expression
-?[0-9]+ {
    yylval.ival = atoi(yytext);
    return NUMBER;
}
matches a sequence of 1 or more decimal digits optionally preceded by a minus sign. (For a discussion of the format and meaning of regular expressions you can see http://www.monmouth.com/~wstreett/lex-yacc/flex_1.html and search for "Patterns".) When it sees the match it calls atoi, passing it the string yytext, which just happens to be the string of text which matched the rule. The return value from atoi in turn is assigned to the ival member of the yylval union defined by the parser. The action then returns the NUMBER token to the parser and terminates.
As with most regular expression matchers, lex is 'greedy'. It will try and match the longest possible string of input to a rule. That's what you would expect, of course. Given an input of digits, 123, you want it to match 123, returning just one NUMBER token and not a 1 followed by a 2 followed by a 3, returning three NUMBER tokens, one per digit. But sometimes this presents a problem. Suppose you try and code a rule to recognise strings like this
\".*"\" {
yylval.pcs = new CString(yytext);
return STRING;
}
You're saying that when the lexer sees a double quote it should then consume all input (the .*) until it sees a closing double quote. And it will do just that. Unfortunately, it will also happily consume double quote characters until it reaches the very last one in the input. The 'greedy' algorithm ensures this.
At this point you might be tempted to try and write a more complex regular expression that takes this into account. Perhaps something like
\"[^"]*\" {
yylval.pcs = new CString(yytext);
return STRING;
}
This rule will stop at the first double quote following the opening double quote. But what it won't do is let you embed an escaped double quote. At some point you have to decide how complex you want your regular expressions to be, or whether it might be simpler (and more maintainable) to write some C/C++ code to help the lexer out. Don't forget that you might be looking at the regular expressions six months from now. I chose the latter approach.
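For comparison, a pattern along these lines (an untested sketch, not taken from the project) would accept escaped double quotes, but it still wouldn't translate the escape sequences into the characters they represent, so some C++ code ends up being needed anyway:

\"(\\.|[^\\"\n])*\" {
    yylval.pcs = new CString(yytext);
    return STRING;
}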
So my string matcher looks like this
\" {
/* Seen start of a string */
int ch;
BOOL bExit = FALSE;
yylval.pcs = new CString;
while (!bExit && (ch = yyinput()))
{
switch (ch)
{
case '\\':
switch (ch = yyinput())
{
case 'r':
ch = '\r';
break;
case 'n':
ch = '\n';
break;
case 't':
ch = '\t';
break;
}
break;
case '"':
bExit = TRUE;
continue;
}
*yylval.pcs += ch;
}
return STRING;
}
This combination of regular expression and procedural code neatly solves the 'greed' problem and also allows for the translation of escape sequences into the characters they represent. The yyinput() function simply gets the next character from the input stream feeding lex.
Lex 202
Visual Studio 6 resource scripts use an awful lot of keywords and things that look like keywords. There are keywords such as DIALOG and MENU, and then there are window styles such as WS_TABSTOP or WS_BORDER that also look like keywords.
My first draft of the lex script was written to parse a very simple DIALOG template with only one or two window styles. It made sense (and was quick and dirty) to simply code a regular expression per window style and return a token to the parser for each window style.
After I had the basics of the parser working I started copying and pasting other DIALOG templates from other projects into my test file. As each new template was added I'd note the errors the parser encountered and add yet another rule to lex to cover the new window style. It began to look like this
WS_TABSTOP          { return L_WS_TABSTOP; }
WS_BORDER           { return L_WS_BORDER; }
    .
    .
    .
TCS_OWNERDRAWFIXED  { return L_TCS_OWNERDRAWFIXED; }
TCS_RAGGEDRIGHT     { return L_TCS_RAGGEDRIGHT; }
TCS_RIGHT           { return L_TCS_RIGHT; }
TCS_RIGHTJUSTIFY    { return L_TCS_RIGHTJUSTIFY; }
    .
    .
    .
Have you actually counted how many window styles are defined in the various Platform SDK Headers? I make it about 225 window styles and a further 40 or so extended styles. That makes for about 265 regular expressions just to cope with window and extended window styles.
In addition to having the styles defined here I also had to define a token for each style over in the parser source file and add another parser rule to deal with the token. It all added up to significant cutting and pasting and was rapidly starting to look like a maintainer's nightmare (to say nothing of RSI from all that mousing).
At about this point I realised that a window style looks exactly like an identifier. I already had a rule in place in the lexer to recognise identifiers (things like IDOK and IDCANCEL)
[[:alpha:]_][[:alnum:]_]* {
    yylval.pcs = new CString(yytext);
    return IDENTIFIER;
}
so it wasn't that much of a step to define two static tables of window styles, one for regular and one for extended styles and then to dynamically build up maps relating the name of the window style to the value of the style. Once that was done I changed the identifier rule to this
[[:alpha:]_][[:alnum:]_]* {
    DWORD data;

    if (m_styleMap.Lookup(yytext, data))
    {
        yylval.dval = data;
        return STYLE;
    }
    else if (m_extStyleMap.Lookup(yytext, data))
    {
        yylval.dval = data;
        return EXTENDEDSTYLE;
    }
    yylval.pcs = new CString(yytext);
    return IDENTIFIER;
}
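The two static tables and the maps built from them might look something like this. This is only a sketch: the StyleEntry layout, the BuildStyleMap helper and the use of MFC's CMap are assumptions on my part; only the m_styleMap and m_extStyleMap names come from the rule above.

// Sketch only - the real project may declare these differently.
struct StyleEntry
{
    LPCTSTR pszName;    // style name as it appears in the .rc file
    DWORD   dwValue;    // corresponding WS_* / WS_EX_* value
};

static const StyleEntry g_windowStyles[] =
{
    { _T("WS_TABSTOP"), WS_TABSTOP },
    { _T("WS_BORDER"),  WS_BORDER  },
    // ... the remaining window styles ...
};

static const StyleEntry g_extendedStyles[] =
{
    { _T("WS_EX_CLIENTEDGE"), WS_EX_CLIENTEDGE },
    { _T("WS_EX_TOPMOST"),    WS_EX_TOPMOST    },
    // ... the remaining extended styles ...
};

// Called once when the lexer is constructed.
static void BuildStyleMap(CMap<CString, LPCTSTR, DWORD, DWORD> &map,
                          const StyleEntry *pTable, int nEntries)
{
    for (int i = 0; i < nEntries; i++)
        map.SetAt(pTable[i].pszName, pTable[i].dwValue);
}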
This bought me a few things. Firstly, the parser itself was simplified. Instead of 265 tokens it had 2 tokens, one for STYLE and one for EXTENDEDSTYLE, where the exact style was passed as the data associated with the token. I could add new window styles as and when they were defined (or I became aware of their existence) simply by adding them to the static table, without needing any changes whatsoever to the parser source file.
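On the parser side those two tokens carry the style value as their data. Here's a sketch of what the declarations and a style-combining rule might look like; the rule itself is my own illustration, not the project's actual grammar:

/* Sketch only - not the project's actual grammar. */
%token <dval> STYLE EXTENDEDSTYLE
%type  <dval> stylelist

stylelist : STYLE                   { $$ = $1; }
          | stylelist '|' STYLE     { $$ = $1 | $3; }
          ;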
Incidentally, I use dynamically created maps (created when the lexer is created) for performance reasons. A linear search of a 265-entry table may not take very long in these days of super Pentiums, but the time does add up when you're parsing a typical resource script and, in any case, there's no excuse for inefficient code.
Secondly, this change cut the time lex required to process the source file and create the output C++ file from 20 or so seconds to less than 1 second. My time is also important.
Thirdly, this change cut the size of the lex-generated file from well over 12,000 lines to around 2,000 lines. This makes for a significant change in both compile time and in the size of the compiled executable.
Finally, the lexical analyser itself, at runtime, is significantly faster.
Lex 203
The entire purpose of this parser is to parse dialog templates in Visual Studio 6 resource scripts. I wanted my parser to be able to handle the .rc files created and maintained by AppStudio without any other processing. This meant that it had to be able to handle any data constructs edited by AppStudio. On the other hand, my project isn't particularly interested in MENU or TOOLBAR or STRINGTABLE or any of the dozens of other things to be found in a .rc file.
In order to avoid having to write parser rules that could handle everything in a resource script I used a feature of lex which allows us to create sub-lexers within a lex script.
When a lexer begins running it's in what is called the INITIAL state. However, at any time, it can be switched into another state which causes a different set of rules to be followed. For example:
%exclusive comment
%%
"/*" BEGIN comment;
<comment>"*/" BEGIN INITIAL;
<comment>. |
<comment>\n { ; }
This simple set of rules allows the lexer to switch into the comment state when it sees a /* sequence. Once it's in the comment state it matches all further input against rules prefixed with <comment>. Start states can be exclusive or not exclusive. If they're exclusive you use the %exclusive keyword, otherwise use the %start keyword in the first part of the lex script. When you're using exclusive start states the input will only be matched against the rules defined for that state. Non-exclusive states match input against rules without a <state> prefix and also against those rules with the corresponding start state prefix.
In the case of the preceding code fragment, if the lexer is in the INITIAL state and it sees a /* sequence it matches that rule and executes the action. This action changes the lexer state to comment and continues to read input. However, now the lexer only considers rules commencing with <comment> (because we declared the comment state as exclusive). If the input is anything other than */ the input is discarded. When, however, the lexer input contains */ it matches that rule and the lexer switches back to the INITIAL state. Obvious! But note that none of the actions within the comment state return anything to the parser. And since they return no tokens to the parser, the parser sees nothing between the start of the comment and the end. Comments effectively vanish from the input stream as far as the parser is concerned.
If we can make comments vanish from the input, what else can vanish? How about code in #ifdef blocks where the symbol wasn't defined? This is a little harder to do because the parser is the rightful place for the code deciding whether or not an #ifdef condition was satisfied. But not that much harder to do. Consider this fragment from the parser.
ifdef   : L_IFDEF IDENTIFIER
        {
            if (!m_symbols.IsDefined(*$2))
            {
                LEXER->m_ifNesting++;
                LEXER->yybegin(ifnest);
            }
            delete $2;
        }
        ;
Here the parser sees the token for #ifdef and an IDENTIFIER. It checks if the IDENTIFIER is in its symbol table and, if not, it moves the lexer into the ifnest state. It also increments the lexer's m_ifNesting counter. Now look at the rules for the ifnest state in the lexer.
<ifnest>#ifdef  |
<ifnest>#if     |
<ifnest>#ifndef { m_ifNesting++; }
<ifnest>#endif  {
                    m_ifNesting--;
                    if (m_ifNesting == 0)
                        BEGIN INITIAL;
                }
<ifnest>.       |
<ifnest>\n      { ; }
Firstly, the ifnest state is declared as exclusive. Secondly, the ifnest state needs to count how many #if(def) statements it ignores. This is so it can handle such things as
#if !defined(AFX_RESOURCE_DLL) || defined(AFX_TARG_ENU)
#ifdef _WIN32
LANGUAGE 9, 1
#pragma code_page(1252)
#endif
#include "res\Prober.rc2" // non-Microsoft Visual C++ edited resources
#include "afxres.rc" // Standard components
#endif
where, if the very first #if fails, it will stay in the ifnest state until it reaches the last #endif statement (the one after the #include "afxres.rc" statement).
The net effect of these constructs is that once the parser determines that it's not interested in an input stream, it hands over to the lexer the responsibility for ignoring all further input until a specified state is reached. Once the responsibility has been handed over, the parser sees no more tokens until the end state is reached.
I used the same techniques to allow the parser to ignore, for example, MENU statements in the resource script. The parser relies on the knowledge that a MENU statement consists of an IDENTIFIER followed by the MENU keyword, then a bunch of stuff followed by the BEGIN keyword, then more stuff, all terminated with an END keyword. Within the MENU statement there may be nested BEGIN/END blocks. By counting BEGIN and END keyword pairs we can determine when the MENU statement has finished.
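A sketch of how such a skip state might look in the lexer. The state name (menuskip) and the counter (m_menuNesting) are my own invention for illustration, not the names used in the project; the parser would switch the lexer into this state when it sees an IDENTIFIER followed by the MENU keyword, much as it does for the ifnest state above.

%exclusive menuskip
%%
<menuskip>BEGIN { m_menuNesting++; }
<menuskip>END   {
                    m_menuNesting--;
                    if (m_menuNesting == 0)
                        BEGIN INITIAL;
                }
<menuskip>.     |
<menuskip>\n    { ; }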
Many other constructs within the resource file follow a similar pattern and are ignored in the same way.
Lex 204
Resource script files use #include statements to include the contents of another file at the point where the #include statement occurs. Implementing this functionality takes a little sleight of hand (and some knowledge of the exact method the lexer uses to get its input). It's pretty easy to establish that the lexer calls a function somewhere to get its input. Normally that input comes from a file, so you need to store that file handle somewhere, substitute a new one and return to the lexer. Then, when the lexer hits end of file on the newly substituted file handle, you close that file and restore the original file handle. That's the theory. Exactly how you do this depends on the lexer you're using. I used Parser Generator (see Notes below) as my lexer so I did it this way. First the parser needs to recognise that there's a #include statement to process.
include : L_INCLUDE STRING
        {
            DoInclude(*$2);
            delete $2;
        }
        | L_INCLUDE BRACESTRING
        {
            DoInclude(*$2);
            delete $2;
        }
        ;
BRACESTRING is a token that recognises filenames of the form <filename.ext>. DoInclude() does this:
void YYPARSERNAME::DoInclude(LPCTSTR szFileName)
{
    ASSERTSTRING(szFileName);

#ifdef NDEBUG
    try
#endif
    {
        FILE *f = fopen(szFileName, "rb");

        if (f != (FILE *) NULL)
        {
            if (yydebug)
                TRACE("#including '%s'\n", szFileName);

            sInclude *include = new sInclude;

            include->file = LEXER->yyin;
            include->line = LEXER->yylineno;
            LEXER->m_csCurrentFile = include->m_csFileName = szFileName;
            m_includeStack.AddHead(include);
            LEXER->yyin = f;
            LEXER->yylineno = 0;
            LEXER->yyreset();
        }
    }
#ifdef NDEBUG
    catch(...)
    {
        CString csMessage;

        csMessage.Format(_T("Error processing %s\nin #include"), szFileName);
        AfxMessageBox(csMessage, MB_OK);
    }
#endif
}
If this code succeeds in opening the file it allocates a new sInclude structure and stores away the existing input file handle, which is held in the yyin member of the lexer. It also stores away the current line number and the filename. Then it substitutes the new file handle for the existing one in the lexer, resets the line number to zero (you want error reporting to tell you the correct line number for the current file) and resets the lexer. We also add the sInclude structure to a stack. We then return to normal parsing.
Notice I said if this code succeeds in opening the file. There's no path logic included here. For the purposes of my project I'm not interested in symbols or dialog templates defined in the Microsoft supplied directories. Thus, if an attempt to #include a file fails I don't consider that to be an error - I simply ignore the situation and continue parsing at the following line.
In the lexer we continue reading input as normal. Eventually the lexer reaches end of file and the yywrap() function is called.
int YYLEXERNAME::yywrap()
{
    sInclude *include;

    if (!PARSER->m_includeStack.IsEmpty())
    {
        fclose(yyin);
        include = (sInclude *) PARSER->m_includeStack.RemoveHead();

        if (PARSER->yydebug)
            TRACE("Popping parser include stack '%s'\n", include->m_csFileName);

        yyin = include->file;
        yylineno = include->line;
        m_csCurrentFile = include->m_csFileName;
        delete include;
        return 0;
    }
    return 1;
}
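For reference, the sInclude structure needs nothing more than the three fields the code stores into it. The following is inferred from DoInclude() and yywrap() above; the project's actual declaration may differ.

// Inferred from the fields used in DoInclude() and yywrap();
// the project's actual declaration may differ.
struct sInclude
{
    FILE   *file;          // the lexer's previous input handle (yyin)
    int     line;          // the previous line number (yylineno)
    CString m_csFileName;  // the name of the file being #included
};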
This works well for our simple parser. It may not work properly for more complex parsers. The problem is that the lexer may need to read forward a long way to determine which rule to match, and it may also need to backtrack a long way if it fails to match a particular rule. If your grammar is such that the lexer needs to do this across a file boundary this technique won't work. Fortunately a resource file is simple enough that it's unlikely such a situation will arise.
Notes on the sample files
This project was developed using Parser Generator from http://www.bumblebeesoftware.com, which has versions of lex and yacc (actually called alex and ayacc) that can generate C++ classes. I don't expect that classical lex and yacc, or flex and bison, or flex++ or bison++, will be able to compile the source files. However, Bumble-Bee Software provide a 30 day evaluation of Parser Generator (I used most of the 30 days developing and tweaking this project). I liked it enough to consider the US$60 for a license a bargain.
I also rebuilt the libraries Bumble-Bee provide to use classic C-style FILE pointers instead of C++ iostreams. You don't care that I did this except that the code to implement #include uses FILE pointers. If you attempt to rebuild my project files using Parser Generator you need either to change my code to use C++ iostreams or rebuild their libraries to use FILE pointers.
The files include the first draft of the project for which I wrote the parser. It currently has only the most basic functionality. To play with the code, open a new project and select a .rc file. It should parse the script and present a list of dialogs. Double-clicking on a dialog name should launch the dialog.
As has been pointed out in the first comment below - the project won't compile. Well it will actually, but I didn't think to point out that in addition to using files passed through the Parser Generator alex and ayacc it also links with boilerplate libraries and header files provided by Parser Generator. I'm not licensed to distribute those files so they are not included. You can get your own copy by downloading and installing the evaluation version of Parser Generator from the link above.
Acknowledgements
This project contains code written by Herbert Menke (h.menke@gmx.de) but modified by me. The code in question allows resizable controls on dialogs. (Aha, so that's what this project is about).