Introduction
This is a topic I've covered before, but the code was getting rusty, and a bit bloated, so I've decided to streamline, rewrite, and produce an accompanying article.
I write a lot of parsers and lexers and generators for those, and in the course of doing this, I've developed some simple tools to cover common pitfalls in doing so. The class presented here is absolutely foundational to a lot of my code. I use it or one of its variants in many of my projects, including Parsley. It provides a unified facade over disparate input sources, managing lifetime, location information such as line and column info, capturing input, error reporting, and a better cursor than TextReader
or IEnumerator<char>
provides, all in a lightweight class.
I'm going to recover ground I've covered in my previous article mentioned in the introduction. If you've followed along so far from the previous article, the main changes I've made are removing lookahead to eke a little bit more performance out of the class, and separating some of the helper methods into a separate file, making the core API extremely spartan, while keeping the extended functionality as a separate included file.
Conceptualizing this Mess
There are several pitfalls to lexing and parsing this class covers, despite the API being small, so we'll cover them in sections.
Unified Input and Lifetime Management
LexContext
provides Create()
and CreateFrom()
which take an IEnumerator<char>
or a TextReader
respectively. There is also a CreateFrom()
overload that takes a filename. Furthermore, there is CreateFromUrl()
which allows you to create a LexContext
over a remote or local source indicated by URL.
In addition, LexContext
provides a Close()
method and is disposable so that you can close the underlying file or network stream, if any.
Regardless of how the instance was created, the interface for accessing it is the same.
Cursor Management
When writing lexers or lexerless parsers, you often need to keep the current character under the cursor around, and pass it to the various parse or lex functions you've created. I'll demonstrate later, but basically, you'll find yourself using Peek()
on TextReader
if you don't, and that's Bad(TM) as it prevents your code from working over network streams or other non-seekable sources, like Console input.
IEnumerator<char>
is better for this as it provides the Current
property which provides access to that character under the cursor that we need. However, unlike TextReader
, IEnumerator<char>
will throw if the cursor is positioned past the end of the stream, and does not give you any indication of that except while you MoveNext()
. This is extremely undesirable as it means you have to track the ending condition yourself, which can complicate the parsing code considerably if you have to incorporate that bookkeeping into your parsing routines.
LexContext
, like TextReader
will report the current cursor value even if the cursor is not over a character. Like IEnumerator<char>
, it reports the current character through the Current
property and does so without seeking. It will report the end of input with -1/EndOfInput
or before the beginning with -2/BeforeInput
when the cursor is in the initial position. It will also report -3/Disposed
if the LexContext
has been closed/disposed of.
Advance()
works like TextReader.Read()
in that it advances the cursor and returns the character that was read. After Advance()
, that character can be accessed again through the Current
property. You can use whichever mechanism for retrieving the current character that suits your code in the moment.
Using these makes the above pitfalls vanish, and in doing so, makes your code to execute a lex/parse almost trivial while preserving the ability to examine non-seekable streams.
Location Information
LexContext
tracks position, line and column info, and the current file or URL where the document resides, if any. This is often used for error reporting, but can be used for whatever it's needed for. Tab width is 4 unless otherwise specified using the TabWidth
property. Setting this to the input decide's tab width allows column information to be accurately tracked. For the console, as well as most underlying input sources, the default should be appropriate. The Line
and Column
properties are one based, while the Position
property is zero based. FileOrUrl
holds the source location of the current document.
Error Handling
ExpectingException
is the exception that is thrown when the lexer/parser encounters an error. It reports error location information, a message, and an optional collection of expected characters or symbols. You can throw it manually, but typically it will be raised as a result of calling Expecting()
.
While parsing, you call Expecting()
to indicate that you expect Current
to be one of a series of characters. If Current
is not one of those characters, an ExpectingException
is automatically filled with the current context information and then thrown.
Capturing Input
As you parse or lex, you'll typically need to capture the input you're examining to be retrieved later during the parse/lex, as part of a subparse, or simply to report from your lex or parse functions. LexContext
provides a CaptureBuffer
to help do just that.
CaptureBuffer
itself is a StringBuilder
, while LexContext
provides Capture()
to capture the current character under the cursor, if any, GetCapture()
to retrieve all or part of the captured buffer, and ClearCapture()
to clear the capture buffer.
If you need to execute a subparse, simply capture what you need to subparse into the capture buffer, and then create another LexContext
over the captured string! You can then feed that LexContext
to your subparse functions.
Using this Mess
I've provided a simply JSON minifier by way of example. It is about half of a JSON parser as well in that it loads the parsed text into dictionaries and lists depending on whether they are JSON objects or JSON arrays. What it doesn't do is normalize scalar values like strings. Strings are returned with quotes and embedded escapes, numbers, bools and nulls are returned as strings.
static void Main(string[] args)
{
IDictionary<string, object> json = null;
using (var pc = LexContext.CreateFrom(@"..\..\Burn Notice.2919.tv.json"))
json = _ParseJsonObject(pc);
_WriteJsonTo(json, Console.Out);
}
static object _ParseJson(LexContext pc)
{
pc.TrySkipWhiteSpace();
switch (pc.Current)
{
case '{':
return _ParseJsonObject(pc);
case '[':
return _ParseJsonArray(pc);
default:
return _ParseJsonValue(pc);
}
}
static IDictionary<string, object> _ParseJsonObject(LexContext pc)
{
var result = new Dictionary<string, object>();
pc.TrySkipWhiteSpace();
pc.Expecting('{');
pc.Advance();
pc.Expecting();
while ('}' != pc.Current && -1 != pc.Current)
{
pc.TrySkipWhiteSpace();
pc.Expecting('"');
var fn = _ParseJsonValue(pc);
pc.TrySkipWhiteSpace();
pc.Expecting(':');
pc.Advance();
result.Add(fn, _ParseJson(pc));
pc.TrySkipWhiteSpace();
pc.Expecting('}', ',');
if (',' == pc.Current) pc.Advance();
}
pc.Expecting('}');
pc.Advance();
return result;
}
static IList<object> _ParseJsonArray(LexContext pc)
{
var result = new List<object>();
pc.TrySkipWhiteSpace();
pc.Expecting('[');
pc.Advance();
pc.Expecting();
while (-1 != pc.Current && ']' != pc.Current)
{
pc.TrySkipWhiteSpace();
result.Add(_ParseJson(pc));
pc.TrySkipWhiteSpace();
pc.Expecting(']', ',');
if (',' == pc.Current) pc.Advance();
}
pc.Expecting(']');
pc.Advance();
return result;
}
static string _ParseJsonValue(LexContext pc)
{
pc.TrySkipWhiteSpace();
pc.Expecting();
pc.ClearCapture();
if ('\"' == pc.Current)
{
pc.Capture();
pc.Advance();
pc.TryReadUntil('\"', '\\', true);
return pc.GetCapture();
}
pc.TryReadUntil(false, ',', '}', ']', ' ', '\t', '\r', '\n', '\v', '\f');
return pc.GetCapture();
}
You can see for starters, that there are API functions like TrySkipWhiteSpace()
. Those are in LexContext.BaseExtensions.cs. All these do is drive the LexContext
in order to fulfill the requested operation, like skipping over whitespace. TrySkipUntil()
/TryReadUntil()
are very helpful as they are escape aware. Above, we use TryReadUntil()
to capture a string with \
as the escape character.
After we parse the JSON into the appropriate objects, we simply write them to the console, as shown here:
static void _WriteJsonTo(object json, TextWriter writer)
{
var d = json as IDictionary<string, object>;
if (null != d)
_WriteJsonObjectTo(d, writer);
else
{
var l = json as IList<object>;
if (null != l)
_WriteJsonArrayTo(l, writer);
else
writer.Write(json);
}
}
static void _WriteJsonObjectTo(IDictionary<string, object> json, TextWriter writer)
{
var delim = "{";
foreach (var field in json)
{
writer.Write(delim);
_WriteJsonTo(field.Key, writer);
writer.Write(":");
_WriteJsonTo(field.Value, writer);
delim = ",";
}
if ("{" == delim)
writer.Write(delim);
writer.Write("}");
}
static void _WriteJsonArrayTo(IList<object> json, TextWriter writer)
{
var delim = "[";
foreach (var item in json)
{
writer.Write(delim);
_WriteJsonTo(item, writer);
delim = ",";
}
if ("[" == delim)
writer.Write(delim);
writer.Write("]");
}
Note that this is suitable for minification, but it is not a standards compliant JSON parser. In order to keep things simple for the demo, it's a bit "looser" than the JSON spec in that it allows more than the JSON spec does. Namely, it doesn't enforce valid string escapes, accepts more kinds of whitespace, and doesn't enforce termination of a string on a newline. Despite that, as long as the JSON is valid to begin with, this will parse it.
History
- 17th January, 2020 - Initial submission