Click here to Skip to main content
65,938 articles
CodeProject is changing. Read more.
Articles / Languages / C#

LL: A Parser Generator Playground

5.00/5 (4 votes)
25 Jul 2019CPOL3 min read 6.4K  
A parser generator and tool for exploring LL parsing algorithms

The source is too big to post here, because of the Visual Studio Integration project, but here's a link to the GitHub repository.

Introduction

This project was an accident. Or an experiment, rather. Sometimes, there's very little difference between them.

It started here with the codebase I used to teach you gentle readers how to write an LL parser.

My Everest is LL(k) parsing, and this is a step toward that. What I found with the source I used to demonstrate parsing is that it's also a pretty great source for experimenting.

It uses strings in the DebugLL1Parser and DebugTokenizer so it's really easy to examine in a debug session.

I've been using it as a platform to experiment with the LL(k) algorithm.

This project is useful as its own parser generator as well, since it generates efficient table parsers for LL(1) grammars. It's superior to Newt, and is more reliable. The grammar is slightly improved and the attribute system has been expanded. Overall, it uses the same grammar system though. More on Newt here but don't use Newt. Use this.

Note that this parser is still LL(1).

Background

In order to use this codebase to do more than generate parsers you can use, you will need to familiarize yourself with the codebase I used here, as this is simply an evolution of that.

The grammar format is as I said, similar to Newt. Here is the structure.

/****************************************************************\
*  ebnf.ebnf: a self describing grammar for describing grammars  *
*   by codewitch honey crisis, july 2019                         *
\****************************************************************/

grammar<start>= productions;

//
// productions
//
// marking a non-terminal with collapse removes it from 
// the tree and propagates its children to its parent
// marking a terminal with it similarly hides it from
// the result but not from the grammar, unlike hidden
//
productions<collapsed> = production productions | production;
production= identifier [ "<" attributes ">" ] "=" expressions ";";

//
// expressions
//
expressions<collapsed>= expression "|" expressions | expression;
expression= { symbol }+ |;
symbol= literal | regex | identifier | 
    "(" expressions ")" | 
    "[" expressions "]" |
    "{" expressions ("}"|"}+");

//
// attributes
//
// recognized attributes are hidden, collapsed, terminal, start, 
// followsConflict (which can be "error", "first" or "last")
// and blockEnd (string)
attributes<collapsed>= attribute "," attributes | attribute;
attribute<color="red">= identifier [ "=" attrvalue ];
attrvalue<color="orange">= literal | integer | identifier;

//
// terminals
//
literal= '"([^"\\]|\\.)*"';
regex= '\'([^\'\\]|\\.)*\'';
identifier= '[A-Z_a-z][\-0-9A-Z_a-z]*';
integer= '\-?[0-9]+';

// hide comments and whitespace
//
// if you mark a production with the terminal attribute, you
// can use ebnf to define it. Anything that's text or regex 
// is automatically a terminal
whitespace<hidden,terminal>= {" "|"\v"|"\f"|"\t"|"\r"|"\n"}+; 
// equiv regex would be the much shorter '[ \v\f\t\r\n]+';
// otherwise they are exactly the same

// single quotes denote regex. make sure to escape single
// quotes in the regex with \'
// attributes get passed through to the parser, and you can define your own
lineComment<hidden,color="green">= '//[^\n]*[\n]';
blockComment<hidden,blockEnd="*/",color="green">= "/*";

// defining these isn't required, but gives
// us better constant names
// double quotes denote a literal. the escapes are nearly the
// same as C#. \U and \x are not supported
or="|";
lt="<";
gt=">";
eq="=";
semi=";";
comma=",";
lparen="(";
rparen=")";
lbracket="[";
rbracket="]";
lbrace="{";
rbrace="}";
rbracePlus="}+"; 

Like I said, similar to Newt. The "color" attributes aren't defined by LL. They are just passed through to the parser. In this case, I use them for syntax highlighting. You can make whatever attributes you like, and then retrieve them during the parse. It's nice.

Using the Code

This repository contains several projects:

  • ll - the main runtime library, includes entire API and DOM for creating and rendering parsers and tables
  • llrt - the minimal runtime necessary to use the generated parsers. Do not include both this and ll
  • llgen - a tool to generate a parser (VB generation is broken due to a bug in Microsoft's VBCodeProvider, C# works great)
  • lltree - a tool to render an ASCII parse tree for a given grammar and input file
  • llvstudio - a custom tool "LL" used in Visual Studio 2017 (and 2019?) that can render like llgen does
  • llgui - a work in progress GUI for editing EBNF. Doesn't quite work yet
  • lltest - one of my tests written as a command line utility.

The DebugLL1Parser and DebugTokenizer classes use only strings for internal information so seeing what they do in the debugger is easy. I use these to prototype changes to the parser and lexer algorithms before baking those changes into the LL1TableParser and TableTokenizer classes.

Cfg and Ebnf are both pretty enormous and contain APIs for manipulating CFGs and EBNF documents. The latter exposes a full DOM object model.

The Main() function of lltree shows us how to perform a basic parse from an arbitrary grammar without generating a parser. It's all at runtime.

C#
/// <summary>
/// Usage: lltree $grammarfile $inputfile
/// </summary>
/// <param name="args">The grammar file and the input file to parse</param>
/// <returns></returns>
static int Main(string[] args)
{
    if (2!=args.Length)
    {
        _PrintUsage();
        return 1;
    }
    // read the ebnf document from the file.
    var ebnf = EbnfDocument.ReadFrom(args[0]);
    var hasErrors = false;
    // here we validate the document and print any
    // validation errors to the console.
    foreach (var msg in ebnf.Validate(false))
    {
        if (EbnfErrorLevel.Error == msg.ErrorLevel)
        {
            hasErrors = true;
            Console.Error.WriteLine(msg);
        }
    }
    // even if we have errors, we keep going.

    // create a CFG from the EBNF document
    var cfg = ebnf.ToCfg();

    // we have to prepare a CFG to be parsable by an LL(1)
    // parser. This means removing left recursion, and 
    // factoring out first-first and first-follows conflicts
    // where possible.
    // here we do that, and print any errors we encounter.
    foreach(var msg in cfg.PrepareLL1(false))
    {
        if(CfgErrorLevel.Error==msg.ErrorLevel)
        {
            hasErrors = true;
            Console.Error.WriteLine(msg);
        }
    }
    // if we don't have errors let's set up our parse.
    if(!hasErrors)
    {
        // the tokenizer is created from the EBNF document because
        // it has the terminal definitions, unlike the CFG,
        // see https://www.codeproject.com/Articles/5162249/How-to-Make-an-LL-1-Parser-Lesson-1

        // The FileReaderEnumerable takes a filename and exposes IEnumerable<char> from
        // them. Tokenizers expect IEnumerable<char> (typically a string or char array)
        var tokenizer = ebnf.ToTokenizer(cfg, new FileReaderEnumerable(args[1]));

        // now create our parser. and since the parser *might* return multiple parse trees
        // in some cases, we keep reading until the end of document, calling ParseSubtree()
        // each time to get the result back as a ParseNode tree. We then take those nodes and
        // write them to the console via an implicit call to their ToString method
        using (var parser = cfg.ToLL1Parser(tokenizer))
            while (ParserNodeType.EndDocument != parser.NodeType)
                Console.WriteLine(parser.ParseSubtree());
                
        return 0;
                
    }
    return 1;
}

Generating a parser is a bit more involved, but using a generated parser is even easier than using the runtime ones. To initialize one, all you do is the following:

C#
var parser = new MyParser(new MyTokenizer(inputSource));

While using them is exactly the same regardless, other than obvious performance differences. The runtime parser however, runs about as fast as the generated one, but the debug parsers are much slower, obviously.

Points of Interest

There's a huge API to explore here, and I honestly can't be bothered to document it all yet, but if you follow this page, and this GitHub, it will be expanded over time. This project is still research.

History

  • 25th July, 2019 - Initial submission

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)