ParseContext 2.0: Easier Hand Rolled Parsers

honey the codewitch

21 Jul 2019, CPOL

Quickly build simple parsers with this drop-in source

Introduction

I write a lot of parsers, and I write parser generators. Despite having access to some amazing tools for generating parsers programmatically, I still write about 80% of my parsing code by hand, because for simple documents, it's just quicker. We're about to make that process even faster.

Background

If you've ever tried to use TextReader or IEnumerator<char> to parse a document, you've probably run into these common frustrations: The enumerator doesn't have a way to check if you're at the end of the enumeration after-the-fact, and the text reader's Peek() function is unreliable on certain sources, like NetworkStream. Both of these limitations require extra bookkeeping to overcome, complicating the parsing code and distracting from the core responsibility - parsing some input!

Another significant limitation is the lack of lookahead. Typically, you can peek at the one character in front of you and that's it, but sometimes you need to look further ahead to complete a parse.

There's also nothing to help with error handling and reporting during a parse.

We're going to address those limitations with a class called ParseContext.

Using the Code

ParseContext works quite a bit like TextReader in that it returns input characters as an int. It signals end of input with -1. Unlike the TextReader, it also signals before start of input with -2, and disposed with -3.

Advance() works like the Read() function does on the TextReader class. It advances the input by one character and returns the result as an int.

Most often, however, we'll read the current character from the Current property, an int that always holds the character under the cursor (or one of the negative signal values outlined above), and use Advance() simply to move through the text. Either one works fine. Use whichever suits you in the moment.

To peek ahead in the input without advancing the cursor, we have the Peek() method which takes an optional int lookAhead parameter that indicates how many characters to look forward. Specifying zero simply retrieves the character under the cursor. In any case, the cursor position remains unchanged. This method is safe for non-seekable sources like NetworkStream.
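To make the idea concrete, here is a minimal sketch of how multi-character lookahead can be layered over a plain TextReader without ever calling its Peek(). This is not the ParseContext source: the LookAheadReader name, the list-based buffer, and the default lookAhead of 1 are all assumptions made for illustration. Only the -1 and -2 signal values are borrowed from the description above.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical illustration only; not the actual ParseContext implementation.
sealed class LookAheadReader
{
    readonly TextReader _reader;
    // characters already pulled from the reader but not yet consumed
    readonly List<int> _buffer = new List<int>();

    public LookAheadReader(TextReader reader)
    {
        _reader = reader ?? throw new ArgumentNullException(nameof(reader));
        Current = -2; // before start of input
    }

    // the character under the cursor, or -1 (end) / -2 (before start)
    public int Current { get; private set; }

    // moves the cursor forward one character and returns it
    public int Advance()
    {
        if (-1 == Current) return -1; // already at end of input
        if (_buffer.Count > 0)
        {
            // serve previously peeked characters first
            Current = _buffer[0];
            _buffer.RemoveAt(0);
        }
        else
            Current = _reader.Read();
        return Current;
    }

    // returns the character lookAhead positions past the cursor
    // without moving the cursor (the default of 1 is an assumption)
    public int Peek(int lookAhead = 1)
    {
        if (lookAhead < 0) throw new ArgumentOutOfRangeException(nameof(lookAhead));
        if (0 == lookAhead) return Current;
        if (-1 == Current) return -1;
        // only Read() is used, so this works on non-seekable sources
        while (_buffer.Count < lookAhead)
        {
            int ch = _reader.Read();
            if (-1 == ch) return -1; // ran out of input while looking ahead
            _buffer.Add(ch);
        }
        return _buffer[lookAhead - 1];
    }
}
```

With a reader over "abc", calling Advance() once and then Peek(2) yields 'c' while Current stays on 'a'. The peeked characters are handed back by later Advance() calls in order, so nothing is lost.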

For tracking the position of the input cursor, we have Line, Column, Position, and TabWidth. The first three report the location of the cursor, while the last should be set to the width of a tab on the input device. Doing so ensures that Column will be reported properly in the event of a tab. The default is 8. Tabs work like tab stops do on a console window, with the screen laid out in virtual columns, meaning an actual tab isn't always the same width.
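The tab stop arithmetic can be sketched as follows. This is an illustration under stated assumptions, not code from ParseContext: the ColumnTracker class and NextColumn method are hypothetical names, and the sketch assumes 1-based columns with tab stops at columns 1, 1 + TabWidth, 1 + 2 * TabWidth, and so on.

```csharp
using System;

// Hypothetical helper for illustration; not taken from the ParseContext source.
static class ColumnTracker
{
    // Returns the 1-based column the cursor lands on after consuming ch
    // while sitting at the given 1-based column.
    public static int NextColumn(int column, char ch, int tabWidth = 8)
    {
        if (tabWidth < 1) throw new ArgumentOutOfRangeException(nameof(tabWidth));
        switch (ch)
        {
            case '\t':
                // jump to the next tab stop rather than adding a fixed width
                return ((column - 1) / tabWidth + 1) * tabWidth + 1;
            case '\n':
            case '\r':
                // line breaks send the cursor back to the first column
                return 1;
            default:
                return column + 1;
        }
    }
}
```

Under these assumptions, a tab at column 1 and a tab at column 4 both land on column 9 with a tab width of 8, which is why a tab does not always span the same number of columns.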

For error reporting, we have the Expecting() method which takes a variable list of arguments that represent the list of inputs allowed to be present under the cursor. This can include -1 to signify that end of input is one of the allowable values, while passing no parameters is a way to indicate that anything except end of input will be accepted. In the event that the current character is not accepted, an ExpectingException is thrown that reports a detailed error message.
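As a rough illustration of the checking logic described above, here is a standalone sketch. It is an assumption about how such a check might look, not the real implementation: ExpectingSketch and ExpectingSketchException are hypothetical stand-ins, the current character is passed in explicitly because there is no ParseContext here, and the message wording is invented.

```csharp
using System;
using System.Linq;

// Hypothetical stand-in; the real ExpectingException may look quite different.
sealed class ExpectingSketchException : Exception
{
    public ExpectingSketchException(string message) : base(message) { }
}

static class ExpectingSketch
{
    // Throws unless current is one of the expected values.
    // With no expected values, anything except end of input (-1) is accepted.
    public static void Expecting(int current, params int[] expecting)
    {
        if (0 == expecting.Length)
        {
            if (-1 == current)
                throw new ExpectingSketchException("Unexpected end of input");
            return;
        }
        if (Array.IndexOf(expecting, current) >= 0) return;
        // build a readable list of what would have been accepted
        string allowed = string.Join(", ", expecting.Select(c => Describe(c)));
        throw new ExpectingSketchException(
            "Expecting " + allowed + " but found " + Describe(current));
    }

    // renders a character or signal value for the error message
    static string Describe(int ch)
    {
        if (-1 == ch) return "end of input";
        if (ch < 0) return "no character";
        return "'" + (char)ch + "'";
    }
}
```

Because char converts implicitly to int, a call such as ExpectingSketch.Expecting(current, '}', ',') reads naturally, and -1 can be mixed into the same argument list to allow end of input.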

We have an internal capture buffer based around a StringBuilder which we can use to hang on to the input we've parsed so far. This is represented by the CaptureBuffer property. On the ParseContext, we also have CaptureCurrent() which captures the current character under the cursor, if there is one, ClearCapture() which clears the capture buffer, and GetCapture() which retrieves all or part of the capture buffer as a string.

Finally, for creating the ParseContext, we have several static methods: Create(), which takes a string or a char array; CreateFrom(), which takes a filename or a TextReader; and CreateFromUrl(), which takes a URL.

Remember that ParseContext implements IDisposable, and it's very important to dispose of it when you're finished if you loaded text from anything other than a string or a char array. A Close() method would be the usual way to do so, but I removed it because there are too many members that start with "C" and it was making IntelliSense arduous. Use Dispose() or the using keyword in C#.

In addition, I've included a tear-off partial class with several helpers, including TrySkipWhiteSpace(), TryReadUntil(), and various other methods for common parsing tasks.

This is in ParseContext.Helpers.cs. It is not required for the base functionality, but in practice, it can probably make itself useful in almost any parser.

I wanted to demonstrate the capabilities of the class without too much extra stuff getting in the way, so the demo project simply parses and minifies a large JSON file.

The JSON grammar is pretty simple, and you can see all the particulars at json.org.

The parsing implementation is below. Roughly, it's divided into 3 major parts to represent the major components of a JSON tree: A JSON object, a JSON array, and a JSON value.

Notice how easily one can call subfunctions in this parse without having to worry about transferring the current character - something that's difficult to do with an enumerator or a text reader. Overall, this leads to much cleaner separation of the various parsing functions, and an easier time writing them. The comments should explain the particulars:

static object _ParseJson(ParseContext pc)
{
    pc.TrySkipWhiteSpace();
    switch(pc.Current)
    {
        case '{':
            return _ParseJsonObject(pc);
        case '[':
            return _ParseJsonArray(pc);
        default:
            return _ParseJsonValue(pc);
    }
}
static IDictionary<string, object> _ParseJsonObject(ParseContext pc)
{
    // a JSON {} object - our objects are dictionaries
    var result = new Dictionary<string, object>();
    pc.TrySkipWhiteSpace();
    pc.Expecting('{');
    pc.Advance();
    pc.Expecting(); // expecting anything other than end of input
    while ('}' != pc.Current && -1 != pc.Current) // loop until } or end
    {
        pc.TrySkipWhiteSpace();
        // _ParseJsonValue parses any scalar value, but we only want 
        // a string so we check here that there's a quote mark to 
        // ensure the field will be a string.
        pc.Expecting('"');
        var fn = _ParseJsonValue(pc);
        pc.TrySkipWhiteSpace();
        pc.Expecting(':');
        pc.Advance();
        // add the next value to the dictionary
        result.Add(fn, _ParseJson(pc));
        pc.TrySkipWhiteSpace();
        pc.Expecting('}', ',');
        // skip commas
        if (',' == pc.Current) pc.Advance();
    }
    // make sure we're positioned on the end
    pc.Expecting('}');
    // ... and read past it
    pc.Advance();
    return result;
}
static IList<object> _ParseJsonArray(ParseContext pc)
{
    // a JSON [] array - our arrays are lists
    var result = new List<object>();
    pc.TrySkipWhiteSpace();
    pc.Expecting('[');
    pc.Advance();
    pc.Expecting(); // expect anything but end of input
    // loop until end of array or input
    while (-1!=pc.Current && ']'!=pc.Current) 
    {
        pc.TrySkipWhiteSpace();
        // add the next item
        result.Add(_ParseJson(pc));
        pc.TrySkipWhiteSpace();
        pc.Expecting(']', ',');
        // skip the comma
        if (',' == pc.Current) pc.Advance();
    }
    // ensure we're on the final position
    pc.Expecting(']');
    // .. and read past it
    pc.Advance();
    return result;
}
static string _ParseJsonValue(ParseContext pc)
{
    // parses a scalar JSON value, represented as a string
    // strings are returned quotes and all, with escapes 
    // embedded
    pc.TrySkipWhiteSpace();
    pc.Expecting(); // expect anything but end of input
    pc.ClearCapture();
    if ('\"' == pc.Current)
    {
        pc.CaptureCurrent();
        pc.Advance();
        // reads until it finds a quote
        // using \ as an escape character
        // and consuming the final quote 
        // at the end
        pc.TryReadUntil('\"', '\\', true);
        // return what we read
        return pc.GetCapture();
    }
    pc.TryReadUntil(false,',', '}', ']', ' ', '\t', '\r', '\n', '\v', '\f');
    return pc.GetCapture();
}

History

21st July, 2019 - Initial submission

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)