The source is too big to post here, because of the Visual Studio Integration project, but here's a link to the GitHub repository.
Introduction
This project was an accident. Or an experiment, rather. Sometimes, there's very little difference between them.
It started here with the codebase I used to teach you gentle readers how to write an LL parser.
My Everest is LL(k) parsing, and this is a step toward that. What I found with the source I used to demonstrate parsing is that it's also a pretty great source for experimenting.
It uses string
s in the DebugLL1Parser
and DebugTokenizer
so it's really easy to examine in a debug session.
I've been using it as a platform to experiment with the LL(k) algorithm.
This project is useful as its own parser generator as well, since it generates efficient table parsers for LL(1) grammars. It's superior to Newt, and is more reliable. The grammar is slightly improved and the attribute system has been expanded. Overall, it uses the same grammar system though. More on Newt here but don't use Newt. Use this.
Note that this parser is still LL(1).
Background
In order to use this codebase to do more than generate parsers you can use, you will need to familiarize yourself with the codebase I used here, as this is simply an evolution of that.
The grammar format is as I said, similar to Newt. Here is the structure.
/****************************************************************\
* ebnf.ebnf: a self describing grammar for describing grammars *
* by codewitch honey crisis, july 2019 *
\****************************************************************/
grammar<start>= productions;
//
// productions
//
// marking a non-terminal with collapse removes it from
// the tree and propagates its children to its parent
// marking a terminal with it similarly hides it from
// the result but not from the grammar, unlike hidden
//
productions<collapsed> = production productions | production;
production= identifier [ "<" attributes ">" ] "=" expressions ";";
//
// expressions
//
expressions<collapsed>= expression "|" expressions | expression;
expression= { symbol }+ |;
symbol= literal | regex | identifier |
"(" expressions ")" |
"[" expressions "]" |
"{" expressions ("}"|"}+");
//
// attributes
//
// recognized attributes are hidden, collapsed, terminal, start,
// followsConflict (which can be "error", "first" or "last")
// and blockEnd (string)
attributes<collapsed>= attribute "," attributes | attribute;
attribute<color="red">= identifier [ "=" attrvalue ];
attrvalue<color="orange">= literal | integer | identifier;
//
// terminals
//
literal= '"([^"\\]|\\.)*"';
regex= '\'([^\'\\]|\\.)*\'';
identifier= '[A-Z_a-z][\-0-9A-Z_a-z]*';
integer= '\-?[0-9]+';
// hide comments and whitespace
//
// if you mark a production with the terminal attribute, you
// can use ebnf to define it. Anything that's text or regex
// is automatically a terminal
whitespace<hidden,terminal>= {" "|"\v"|"\f"|"\t"|"\r"|"\n"}+;
// equiv regex would be the much shorter '[ \v\f\t\r\n]+';
// otherwise they are exactly the same
// single quotes denote regex. make sure to escape single
// quotes in the regex with \'
// attributes get passed through to the parser, and you can define your own
lineComment<hidden,color="green">= '//[^\n]*[\n]';
blockComment<hidden,blockEnd="*/",color="green">= "/*";
// defining these isn't required, but gives
// us better constant names
// double quotes denote a literal. the escapes are nearly the
// same as C#. \U and \x are not supported
or="|";
lt="<";
gt=">";
eq="=";
semi=";";
comma=",";
lparen="(";
rparen=")";
lbracket="[";
rbracket="]";
lbrace="{";
rbrace="}";
rbracePlus="}+";
Like I said, similar to Newt. The "color
" attributes aren't defined by LL. They are just passed through to the parser. In this case, I use them for syntax highlighting. You can make whatever attributes you like, and then retrieve them during the parse. It's nice.
Using the Code
This repository contains several projects:
ll
- the main runtime library, includes entire API and DOM for creating and rendering parsers and tables llrt
- the minimal runtime necessary to use the generated parsers. Do not include both this and ll
llgen
- a tool to generate a parser (VB generation is broken due to a bug in Microsoft's VBCodeProvider
, C# works great) lltree
- a tool to render an ASCII parse tree for a given grammar and input file llvstudio
- a custom tool "LL
" used in Visual Studio 2017 (and 2019?) that can render like llgen
does llgui
- a work in progress GUI for editing EBNF. Doesn't quite work yet lltest
- one of my tests written as a command line utility.
The DebugLL1Parser
and DebugTokenizer
classes use only string
s for internal information so seeing what they do in the debugger is easy. I use these to prototype changes to the parser and lexer algorithms before baking those changes into the LL1TableParser
and TableTokenizer
classes.
Cfg
and Ebnf
are both pretty enormous and contain APIs for manipulating CFGs and EBNF documents. The latter exposes a full DOM object model.
The Main()
function of lltree
shows us how to perform a basic parse from an arbitrary grammar without generating a parser. It's all at runtime.
static int Main(string[] args)
{
if (2!=args.Length)
{
_PrintUsage();
return 1;
}
var ebnf = EbnfDocument.ReadFrom(args[0]);
var hasErrors = false;
foreach (var msg in ebnf.Validate(false))
{
if (EbnfErrorLevel.Error == msg.ErrorLevel)
{
hasErrors = true;
Console.Error.WriteLine(msg);
}
}
var cfg = ebnf.ToCfg();
foreach(var msg in cfg.PrepareLL1(false))
{
if(CfgErrorLevel.Error==msg.ErrorLevel)
{
hasErrors = true;
Console.Error.WriteLine(msg);
}
}
if(!hasErrors)
{
var tokenizer = ebnf.ToTokenizer(cfg, new FileReaderEnumerable(args[1]));
using (var parser = cfg.ToLL1Parser(tokenizer))
while (ParserNodeType.EndDocument != parser.NodeType)
Console.WriteLine(parser.ParseSubtree());
return 0;
}
return 1;
}
Generating a parser is a bit more involved, but using a generated parser is even easier than using the runtime ones. To initialize one, all you do is the following:
var parser = new MyParser(new MyTokenizer(inputSource));
While using them is exactly the same regardless, other than obvious performance differences. The runtime parser however, runs about as fast as the generated one, but the debug parsers are much slower, obviously.
Points of Interest
There's a huge API to explore here, and I honestly can't be bothered to document it all yet, but if you follow this page, and this GitHub, it will be expanded over time. This project is still research.
History
- 25th July, 2019 - Initial submission