In this article, we walk through implementing the parsing component of a simple JSON engine as an exercise in using Visual FA to lex/tokenize real-world text content.
Introduction
We've covered a lot of ground with Visual FA, but in all that territory we have yet to explore a concrete example. Here we're going to implement the parsing component of a simple JSON engine using Visual FA.
You might be wondering why we're parsing JSON when .NET ships with a perfectly serviceable JSON parser. The reason is that JSON is very simple to parse, yet complicated enough to lex/tokenize that it makes a nearly perfect scenario for demonstrating Visual FA, without the parsing portion taking up so much space that it gets in the way and becomes a distraction. I deemed avoiding extraneous code (for the purposes of demonstration, anyway) more valuable than avoiding reinventing the wheel.
I've shipped a Json project with the Visual FA solution. We'll be exploring the JsonParser.cs file therein and some project settings that facilitate building the tokenizer/lexer runners as part of the build process.
Background
You'll want to read at least part 1 before diving in here.
There are several ways to generate lexers with Visual FA. Probably the simplest is to use the VisualFA.SourceGenerator NuGet package, but that only works with C# 9 or later.
The option we'll be using is the LexGen tool as a pre-build step. This is compatible with more .NET versions and more .NET languages (though we're using C#), but requires a little more legwork to set up initially.
We're going to generate JsonStringRunner to lex JSON content from a string, and JsonTextReaderRunner to lex JSON from a TextReader.
Using the code
Setting up the project with the lexer build steps
Whenever you use LexGen, you'll need it accessible from somewhere you can invoke it as a build step. To keep everything together, I usually put build tools at the root of the solution folder, but if you find that sloppy or otherwise undesirable, you can alter the build step to pull the executables from wherever you like.
Let's take a look at the pre-build step for the Json project:
dotnet "$(SolutionDir)LexGen.dll" "$(ProjectDir)json.rl" /class JsonRunner /dual /namespace Json /nospans /output "$(ProjectDir)JsonRunners.cs"
It is relatively self-explanatory except for a couple of arguments. /nospans indicates that we will not be generating code that uses spans; this keeps the output compatible with more .NET versions. /dual indicates that we want both JsonStringRunner and JsonTextReaderRunner to be generated; their names are mangled from the /class argument when /dual is specified. You can see we're using json.rl as the input file. We'll cover that next.
Defining the lexer(s) using an .rl file
The json.rl file is a lexer specification file in the Rolex lexer format. It extends the format slightly by allowing the blockEnd attribute to be a regular expression if you use single quotes instead of double quotes for the value. Otherwise it is identical. Here are our JSON lexer symbol names and definitions:
Object = "{"
ObjectEnd = "}"
Array = "["
ArrayEnd = "]"
FieldSeparator = ":"
Comma = ","
Number = '-?(?:0|[1-9][0-9]*)(?:\.[0-9]+)?(?:[eE][+-]?[0-9]+)?'
Boolean = 'true|false'
Null = "null"
String = '"([^\n"\\]|\\([btrnf"\\/]|(u[0-9A-Fa-f]{4})))*"'
WhiteSpace = '[ \t\r\n]+'
Note that in some places we used single quotes, which denote a regular expression, while in other places we used double quotes, which denote a string literal.
This will ultimately create two classes for us: JsonStringRunner, which accepts a string and returns a series of FAMatch instances, and JsonTextReaderRunner, which accepts a TextReader and returns a series of FAMatch instances. We've generated both only because the string runner is slightly faster. Obviously, we could have avoided generating that code if we just created a wrapper using StringReader, but here we demonstrate the full Monty.
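Before we move on to the parser, here's a rough sketch of what using one of the generated runners directly looks like. It leans only on the members JsonParser.cs uses later (Set(), the enumerable matches, SymbolId, and Value), so treat it as illustrative rather than as documentation of the full runner API:
var runner = new JsonStringRunner();
runner.Set("{ \"id\": 42 }");
foreach (FAMatch match in runner)
{
    // filter out the whitespace tokens so only the meaningful symbols print
    if (match.SymbolId == JsonStringRunner.WhiteSpace) continue;
    Console.WriteLine("{0}: {1}", match.SymbolId, match.Value);
}
// prints the symbol id and text for {, "id", :, 42 and } in turn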
Using the lexers to parse code
While it's possible to generate parser code from a context-free grammar specification, I don't recommend it in most cases. The reason is that generated parser code tends to be heavy-handed, both in terms of size and in terms of how it parses. Rolling a parser by hand allows one to switch between "lazy" and "greedy" matching and is ultimately more flexible, compact, and performant. Often, massaging a grammar to work with a particular parsing algorithm is just as difficult as hand-tooling the code. That's not the case for JSON, but JSON was designed to be parsed simply. This isn't true of a lot of languages.
We will be using recursive descent parsing to process the incoming FAMatch tokens. The JsonParser class (JsonParser.cs) handles this.
We'll start exploring the code with a core function used to skip whitespace:
static void _SkipWS(IEnumerator<FAMatch> cursor)
{
while (cursor.Current.SymbolId == JsonStringRunner.WhiteSpace
&& cursor.MoveNext()) ;
}
Okay, this is fairly simple. All we do is keep advancing until we no longer have whitespace under the cursor, or until there is no more input. Note that there's no common lexer base class with the symbols defined on it, but the symbol ids for both JsonStringRunner and JsonTextReaderRunner are the same, so we just access them via the former class.
Next we have the array parsing routine. This is actually pretty simple because most of what it's doing is delegating to _ParseValue():
static JsonArray _ParseArray(IEnumerator<FAMatch> cursor)
{
var position = cursor.Current.Position;
var line = cursor.Current.Line;
var column = cursor.Current.Column;
var result = new JsonArray();
_SkipWS(cursor);
if (cursor.Current.SymbolId != JsonStringRunner.Array)
throw new Exception("Expected an array");
if (!cursor.MoveNext())
throw new JsonException("Unterminated array", position, line, column);
while (cursor.Current.SymbolId != JsonStringRunner.ArrayEnd)
{
result.Add(_ParseValue(cursor));
_SkipWS(cursor);
if (cursor.Current.SymbolId == JsonStringRunner.Comma)
{
cursor.MoveNext();
_SkipWS(cursor);
} else if (cursor.Current.SymbolId == JsonStringRunner.ArrayEnd)
{
break;
}
}
return result;
}
Here we store the cursor position information, and then parse each value until we find a ], skipping over commas.
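To make that concrete, here's roughly what falls out of the public API (covered at the end of this article) for a small array. The conversions of the individual elements happen in _ParseValue(), which we'll get to shortly; note that JsonArray's API beyond Add() isn't covered here, so the cast is all we rely on:
// yields a JsonArray holding 1 (as a double), true (as a bool), null and "text" (as a string)
var array = (JsonArray)JsonParser.Parse("[ 1, true, null, \"text\" ]");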
Next we handle parsing fields. A field is a string, followed by a field separator, followed by a value:
static KeyValuePair<string,object> _ParseField(IEnumerator<FAMatch> cursor)
{
var position = cursor.Current.Position;
var line = cursor.Current.Line;
var column = cursor.Current.Column;
if (cursor.Current.SymbolId != JsonStringRunner.String)
throw new JsonException("Expecting a field name", position, line, column);
var name = JsonUtility.DeescapeString(
cursor.Current.Value.Substring(1, cursor.Current.Value.Length - 2));
// advance past the field name, then skip any whitespace before the separator
if (!cursor.MoveNext())
throw new JsonException("Unterminated JSON field", position, line, column);
_SkipWS(cursor);
if (cursor.Current.SymbolId != JsonStringRunner.FieldSeparator)
throw new JsonException("Expecting a field separator", position, line, column);
// advance past the separator, then skip any whitespace before the value
if (!cursor.MoveNext())
throw new JsonException("JSON field missing value", position, line, column);
_SkipWS(cursor);
var value = _ParseValue(cursor);
return new KeyValuePair<string, object>(name, value);
}
Now we deal with parsing objects. An object is just { followed by zero or more fields separated by commas, followed by a }:
static JsonObject _ParseObject(IEnumerator<FAMatch> cursor)
{
var position = cursor.Current.Position;
var line = cursor.Current.Line;
var column = cursor.Current.Column;
var result = new JsonObject();
_SkipWS(cursor);
if (cursor.Current.SymbolId != JsonStringRunner.Object)
throw new JsonException("Expecting a JSON object", position, line, column);
if (!cursor.MoveNext())
throw new JsonException("Unterminated JSON object", position, line, column);
while (cursor.Current.SymbolId != JsonStringRunner.ObjectEnd)
{
_SkipWS(cursor);
var kvp = _ParseField(cursor);
result.Add(kvp.Key, kvp.Value);
_SkipWS(cursor);
if (cursor.Current.SymbolId == JsonStringRunner.Comma)
{
cursor.MoveNext();
} else if (cursor.Current.SymbolId == JsonStringRunner.ObjectEnd)
{
break;
}
}
return result;
}
_ParseValue() is a relatively meaty function in that it can handle any situation where a JSON value is expected. A value can be anything: either a scalar value like a boolean or a string, or an object or an array. We already have functions for some of this, so basically what's left is numbers, booleans, strings, and null:
static object _ParseValue(IEnumerator<FAMatch> cursor)
{
var position = cursor.Current.Position;
var line = cursor.Current.Line;
var column = cursor.Current.Column;
object? result = null;
_SkipWS(cursor);
switch (cursor.Current.SymbolId)
{
case JsonStringRunner.Object:
result = _ParseObject(cursor);
break;
case JsonStringRunner.Array:
result = _ParseArray(cursor);
break;
case JsonStringRunner.Number:
result = double.Parse(
cursor.Current.Value,
CultureInfo.InvariantCulture.NumberFormat);
break;
case JsonStringRunner.Boolean:
result = cursor.Current.Value[0] == 't';
break;
case JsonStringRunner.Null:
break;
case JsonStringRunner.String:
result = JsonUtility.DeescapeString(
cursor.Current.Value.Substring(1,
cursor.Current.Value.Length - 2));
break;
default:
throw new JsonException("Expecting a value",
position,
line,
column);
}
cursor.MoveNext();
return result!;
}
Note that we're not validating anything here. For example, we just check the first character of the match for a "t" to indicate true. Upon initial consideration this may not seem very robust, but it is, because the lexer already handled validation as part of the tokenizing process: the Boolean rule only matches the exact strings true and false, so any Boolean match starting with "t" must be true. This simplifies your code and actually makes it more robust, since it reduces the chance of errors and you're not duplicating effort that would then have to be maintained in two places.
The root parsing functions themselves are trivial by comparison:
static object? _Parse(FARunner runner)
{
var e = runner.GetEnumerator();
if (e.MoveNext())
{
return _ParseValue(e);
}
throw new JsonException("No content", 0, 0, 0);
}
public static object? Parse(string json)
{
var runner = new JsonStringRunner();
runner.Set(json);
return _Parse(runner);
}
public static object? Parse(TextReader json)
{
var runner = new JsonTextReaderRunner();
runner.Set(json);
return _Parse(runner);
}
The main thing here is taking a runner, getting an enumerator off of it, and moving to the first FAMatch token. We can then pass the enumerator to our parse functions from earlier.
The public functions just wrap that, spinning off the appropriate runner based on the type of input.
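Putting it all together, typical usage looks something like the following sketch. The concrete types you get back follow from _ParseValue(): JsonObject for objects, JsonArray for arrays, and double, bool, string, or null for scalars:
// parse from a string (uses the generated JsonStringRunner under the hood)
object? fromString = JsonParser.Parse("{ \"name\": \"Ada\", \"age\": 36 }");
Console.WriteLine(fromString is JsonObject); // True

// parse from any TextReader (uses the generated JsonTextReaderRunner)
using (var reader = new StringReader("[ 1, 2, 3 ]"))
{
    object? fromReader = JsonParser.Parse(reader);
    Console.WriteLine(fromReader is JsonArray); // True
}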
Hopefully I've illustrated how using Visual FA can reduce your effort, increase robustness, and make your parsing code more comprehensible.
Points of Interest
This library may not be entirely academic. Visual Studio Code uses "JSON"-like files that are JSON but with C-style comments in them. This parser could be trivially extended to support comments by augmenting json.rl to lex them and _SkipWS() to move past them, if parsing such files is necessary.
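As a rough sketch of that extension, json.rl might gain a rule along these lines (the Comment symbol name and its regular expression are hypothetical, not part of the shipped file, and untested):
Comment = '//[^\n]*|/\*([^*]|\*+[^*/])*\*+/'
_SkipWS() would then skip the new symbol as well, since LexGen generates a constant for each symbol defined in the specification, just as it did for WhiteSpace and friends:
static void _SkipWS(IEnumerator<FAMatch> cursor)
{
    // treat comments the same way as whitespace: advance until something meaningful is under the cursor
    while ((cursor.Current.SymbolId == JsonStringRunner.WhiteSpace
        || cursor.Current.SymbolId == JsonStringRunner.Comment)
        && cursor.MoveNext()) ;
}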
History
- 14th April, 2024 - Initial submission