In this article, we walk through implementing the parsing component of a simple JSON engine as an exercise in using Visual FA to lex/tokenize real-world text content.
Introduction
We've covered a lot of ground with Visual FA, but in all that territory we have yet to explore a concrete example. Here we're going to implement the parsing component of a simple JSON engine using Visual FA.
You might be wondering why we're parsing JSON when .NET ships with a perfectly serviceable JSON parser. The reason is that JSON is very simple to parse, yet complicated enough to lex/tokenize that it makes a nearly perfect scenario for demonstrating Visual FA, without the parsing portion taking up so much space that it gets in the way and becomes a distraction. I deemed avoiding extraneous code (for the purposes of demonstration, anyway) more valuable than avoiding reinventing the wheel.
I've shipped a Json project with the Visual FA solution. We'll be exploring the JsonParser.cs file therein and some project settings that facilitate building the tokenizer/lexer runners as part of the build process.
Background
You'll want to read at least part 1 before diving in here.
There are several ways to generate lexers with Visual FA. Probably the simplest is to use the VisualFA.SourceGenerator NuGet package, but that only works with C# 9 or later.
The option we'll be using is the LexGen tool as a pre-build step. This is compatible with more .NET versions and more .NET languages (though we're using C#), but requires a little more legwork to set up initially.
We're going to generate JsonStringRunner to lex JSON content from a string, and JsonTextReaderRunner to lex JSON from a TextReader.
Using the code
Setting up the project with the lexer build steps
Whenever you use LexGen, you'll need it accessible from somewhere you can invoke it as a build step. To keep everything together, I usually put build tools at the root of the solution folder, but if you find that sloppy or otherwise undesirable, you can alter the build step to pull the executables from wherever you like.
Let's take a look at the pre-build step for the Json project:
dotnet "$(SolutionDir)LexGen.dll" "$(ProjectDir)json.rl" /class JsonRunner /dual /namespace Json /nospans /output "$(ProjectDir)JsonRunners.cs"
It is relatively self-explanatory except for a couple of arguments. /nospans indicates that we will not be generating code that uses spans; this keeps the output compatible with more .NET versions. /dual indicates that we want both JsonStringRunner and JsonTextReaderRunner to be generated; their names are mangled from the /class argument when /dual is specified. You can see we're using json.rl as the input file. We'll cover that next.
Defining the lexer(s) using an .rl file
The json.rl file is a lexer specification file in the Rolex lexer format. It extends the format slightly by allowing the blockEnd attribute to be a regular expression if you use single quotes instead of double quotes for the value. Otherwise it is identical. Here are our JSON lexer symbol names and definitions:
Object = "{"
ObjectEnd = "}"
Array = "["
ArrayEnd = "]"
FieldSeparator = ":"
Comma = ","
Number = '-?(?:0|[1-9][0-9]*)(?:\.[0-9]+)?(?:[eE][+-]?[0-9]+)?'
Boolean = 'true|false'
Null = "null"
String = '"([^\n"\\]|\\([btrnf"\\/]|(u[0-9A-Fa-f]{4})))*"'
WhiteSpace = '[ \t\r\n]+'
Note that in some places we used single quotes, which denote a regular expression, while in other places we used double quotes, which denote a string literal.
This will ultimately create two classes for us: JsonStringRunner, which accepts a string and returns a series of FAMatch instances, and JsonTextReaderRunner, which accepts a TextReader and returns a series of FAMatch instances. We've generated both only because the string runner is slightly faster. Obviously, we could have avoided generating that code if we just created a wrapper using StringReader, but here we demonstrate the full Monty.
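Before we move on to the parser, here's a rough sketch of what using one of the generated runners directly looks like. It leans only on the members JsonParser.cs uses later (Set(), the enumerable matches, SymbolId, and Value), so treat it as illustrative rather than as documentation of the full runner API:
var runner = new JsonStringRunner();
runner.Set("{ \"id\": 42 }");
foreach (FAMatch match in runner)
{
    // filter out the whitespace tokens so only the meaningful symbols print
    if (match.SymbolId == JsonStringRunner.WhiteSpace) continue;
    Console.WriteLine("{0}: {1}", match.SymbolId, match.Value);
}
// prints the symbol id and text for {, "id", :, 42 and } in turn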
Using the lexers to parse code
While it's possible to generate parser code from a context-free grammar specification, I don't recommend it in most cases. The reason is that generated parser code tends to be heavy-handed, both in terms of size and in terms of how it parses. Rolling a parser by hand allows one to switch between "lazy" and "greedy" matching and is ultimately more flexible, compact, and performant. Often, massaging a grammar to work with a particular parsing algorithm is just as difficult as hand-tooling the code. That's not the case for JSON, but JSON was designed to be parsed simply. This isn't true of a lot of languages.
We will be using recursive descent parsing to process the incoming FAMatch tokens. The JsonParser class (JsonParser.cs) handles this.
We'll start exploring the code with a core function used to skip whitespace:
static void _SkipWS(IEnumerator<FAMatch> cursor)
{
while (cursor.Current.SymbolId == JsonStringRunner.WhiteSpace
&& cursor.MoveNext()) ;
}
Okay, this is fairly simple. All we do is keep advancing until we no longer have whitespace under the cursor, or until there is no more input. Note that there's no common lexer base class with the symbols defined on it, but the symbol ids for both JsonStringRunner and JsonTextReaderRunner are the same, so we just access them via the former class.
Next we have the array parsing routine. This is actually pretty simple because most of what it's doing is delegating to _ParseValue():
static JsonArray _ParseArray(IEnumerator<FAMatch> cursor)
{
var position = cursor.Current.Position;
var line = cursor.Current.Line;
var column = cursor.Current.Column;
var result = new JsonArray();
_SkipWS(cursor);
if (cursor.Current.SymbolId != JsonStringRunner.Array)
throw new Exception("Expected an array");
if (!cursor.MoveNext())
throw new JsonException("Unterminated array", position, line, column);
while (cursor.Current.SymbolId != JsonStringRunner.ArrayEnd)
{
result.Add(_ParseValue(cursor));
_SkipWS(cursor);
if (cursor.Current.SymbolId == JsonStringRunner.Comma)
{
cursor.MoveNext();
_SkipWS(cursor);
} else if (cursor.Current.SymbolId == JsonStringRunner.ArrayEnd)
{
break;
}
}
return result;
}
Here we store the cursor position information, and then parse each value until we find a ], skipping over commas.
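To make that concrete, here's roughly what falls out of the public API (covered at the end of this article) for a small array. The conversions of the individual elements happen in _ParseValue(), which we'll get to shortly; note that JsonArray's API beyond Add() isn't covered here, so the cast is all we rely on:
// yields a JsonArray holding 1 (as a double), true (as a bool), null and "text" (as a string)
var array = (JsonArray)JsonParser.Parse("[ 1, true, null, \"text\" ]");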
Next we handle parsing fields. A field is a string, followed by a field separator, followed by a value:
static KeyValuePair<string,object> _ParseField(IEnumerator<FAMatch> cursor)
{
var position = cursor.Current.Position;
var line = cursor.Current.Line;
var column = cursor.Current.Column;
if (cursor.Current.SymbolId != JsonStringRunner.String)
throw new JsonException("Expecting a field name", position, line, column);
var name = JsonUtility.DeescapeString(
cursor.Current.Value.Substring(1, cursor.Current.Value.Length - 2));
// advance past the field name, then skip any whitespace before the separator
if (!cursor.MoveNext())
throw new JsonException("Unterminated JSON field", position, line, column);
_SkipWS(cursor);
if (cursor.Current.SymbolId != JsonStringRunner.FieldSeparator)
throw new JsonException("Expecting a field separator", position, line, column);
// advance past the separator, then skip any whitespace before the value
if (!cursor.MoveNext())
throw new JsonException("JSON field missing value", position, line, column);
_SkipWS(cursor);
var value = _ParseValue(cursor);
return new KeyValuePair<string, object>(name, value);
}
Now we deal with parsing objects. An object is just { followed by zero or more fields separated by commas, followed by a }:
static JsonObject _ParseObject(IEnumerator<FAMatch> cursor)
{
var position = cursor.Current.Position;
var line = cursor.Current.Line;
var column = cursor.Current.Column;
var result = new JsonObject();
_SkipWS(cursor);
if (cursor.Current.SymbolId != JsonStringRunner.Object)
throw new JsonException("Expecting a JSON object", position, line, column);
if (!cursor.MoveNext())
throw new JsonException("Unterminated JSON object", position, line, column);
while (cursor.Current.SymbolId != JsonStringRunner.ObjectEnd)
{
_SkipWS(cursor);
var kvp = _ParseField(cursor);
result.Add(kvp.Key, kvp.Value);
_SkipWS(cursor);
if (cursor.Current.SymbolId == JsonStringRunner.Comma)
{
cursor.MoveNext();
} else if (cursor.Current.SymbolId == JsonStringRunner.ObjectEnd)
{
break;
}
}
return result;
}
_ParseValue() is a relatively meaty function in that it can handle any situation where a JSON value is expected. A value can be anything: either a scalar value like a boolean or a string, or an object or an array. We already have functions for some of this, so basically what's left is numbers, booleans, strings, and null:
static object _ParseValue(IEnumerator<FAMatch> cursor)
{
var position = cursor.Current.Position;
var line = cursor.Current.Line;
var column = cursor.Current.Column;
object? result = null;
_SkipWS(cursor);
switch (cursor.Current.SymbolId)
{
case JsonStringRunner.Object:
result = _ParseObject(cursor);
break;
case JsonStringRunner.Array:
result = _ParseArray(cursor);
break;
case JsonStringRunner.Number:
result = double.Parse(
cursor.Current.Value,
CultureInfo.InvariantCulture.NumberFormat);
break;
case JsonStringRunner.Boolean:
result = cursor.Current.Value[0] == 't';
break;
case JsonStringRunner.Null:
break;
case JsonStringRunner.String:
result = JsonUtility.DeescapeString(
cursor.Current.Value.Substring(1,
cursor.Current.Value.Length - 2));
break;
default:
throw new JsonException("Expecting a value",
position,
line,
column);
}
cursor.MoveNext();
return result!;
}
Note that we're not validating anything here. For example, we just check the first character of the match for a "t" to indicate true. Upon initial consideration this may not seem very robust, but it is, because the lexer already handled validation as part of the tokenizing process: the Boolean rule only matches the exact strings true and false, so any Boolean match starting with "t" must be true. This simplifies your code and actually makes it more robust, since it reduces the chance of errors and you're not duplicating effort that would then have to be maintained in two places.
The root parsing functions themselves are trivial by comparison:
static object? _Parse(FARunner runner)
{
var e = runner.GetEnumerator();
if (e.MoveNext())
{
return _ParseValue(e);
}
throw new JsonException("No content", 0, 0, 0);
}
public static object? Parse(string json)
{
var runner = new JsonStringRunner();
runner.Set(json);
return _Parse(runner);
}
public static object? Parse(TextReader json)
{
var runner = new JsonTextReaderRunner();
runner.Set(json);
return _Parse(runner);
}
The main thing here is taking a runner, getting an enumerator off of it, and moving to the first FAMatch token. We can then pass the enumerator to our parse functions from earlier.
The public functions just wrap that, spinning off the appropriate runner based on the type of input.
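Putting it all together, typical usage looks something like the following sketch. The concrete types you get back follow from _ParseValue(): JsonObject for objects, JsonArray for arrays, and double, bool, string, or null for scalars:
// parse from a string (uses the generated JsonStringRunner under the hood)
object? fromString = JsonParser.Parse("{ \"name\": \"Ada\", \"age\": 36 }");
Console.WriteLine(fromString is JsonObject); // True

// parse from any TextReader (uses the generated JsonTextReaderRunner)
using (var reader = new StringReader("[ 1, 2, 3 ]"))
{
    object? fromReader = JsonParser.Parse(reader);
    Console.WriteLine(fromReader is JsonArray); // True
}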
Hopefully I've illustrated how using Visual FA can reduce your effort, increase robustness, and make your parsing code more comprehensible.
Points of Interest
This library may not be entirely academic. Visual Studio Code uses "JSON"-like files that are JSON but with C-style comments in them. This parser could be trivially extended to support comments by augmenting json.rl to lex them and _SkipWS() to move past them, if parsing such files is necessary.
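As a rough sketch of that extension, json.rl might gain a rule along these lines (the Comment symbol name and its regular expression are hypothetical, not part of the shipped file, and untested):
Comment = '//[^\n]*|/\*([^*]|\*+[^*/])*\*+/'
_SkipWS() would then skip the new symbol as well, since LexGen generates a constant for each symbol defined in the specification, just as it did for WhiteSpace and friends:
static void _SkipWS(IEnumerator<FAMatch> cursor)
{
    // treat comments the same way as whitespace: advance until something meaningful is under the cursor
    while ((cursor.Current.SymbolId == JsonStringRunner.WhiteSpace
        || cursor.Current.SymbolId == JsonStringRunner.Comment)
        && cursor.MoveNext()) ;
}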
History
- 14th April, 2024 - Initial submission