Easier Hand Rolled Parsers

honey the codewitch

18 Mar 2019 · CPOL · 3 min read

Download source - 28.4 KB

Introduction

Parsing is a common need in modern software. With so many data interchange formats in use, most developers will need to write a parser at some point.

Most commercial products with complex parsers use hand-rolled parsing for at least a portion of their parsing process.

The code here aims to provide a small, light solution for easing the creation of hand-rolled parsers.

Background

Hand-rolled parsers can be difficult to write and maintain. One of the main problems is proper factoring, and factoring is made more difficult by lookahead.

Lookahead, quite simply, is the set of symbols or characters the parser needs to read ahead of the cursor in order to choose the next branch.

Many parsers get by on one character of lookahead. The grammar for JSON, for example, requires only one character of lookahead to parse. More complex grammars may require more at certain points, but generally the bulk of the grammar needs only one character.
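To make the JSON claim concrete, here is a minimal sketch of choosing a branch from a single character. The class and method names are hypothetical and are not part of the article's download; the sketch only illustrates that the first character of a JSON value is enough to decide what to parse next.

```csharp
using System;

static class JsonLookahead
{
    // Hypothetical helper: names the kind of JSON value that begins with
    // the given character. No further input needs to be inspected.
    public static string ClassifyJsonValue(char current)
    {
        switch (current)
        {
            case '{': return "object";
            case '[': return "array";
            case '"': return "string";
            case 't': case 'f': return "boolean";
            case 'n': return "null";
            case '-': return "number";
            default:
                // JSON numbers start with '-' or an ASCII digit.
                return (current >= '0' && current <= '9') ? "number" : "error";
        }
    }
}
```

A real parser would then dispatch to a routine for that kind of value instead of returning a name.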

One way to handle lookahead is the TextReader Peek() method, but it is not dependable on a non-seekable source such as a NetworkStream, because it requires some amount of "seeking" in order to work. That means "backtracking" (going over the same character more than once), which should be unnecessary.

Using the Code

Enter the ParseContext:

The ParseContext is a class that wraps an underlying TextReader or IEnumerable<char> (including a string) and provides several methods for parsing and capturing content.

It provides one character of "lookahead" by keeping the stream advanced by one character. Instead of reading the next character from the stream, every Advance() call changes the Current member to reflect the current input, such that Current always contains the character under the cursor.

This allows for reading from input and parsing very easily.

Capture contains the current contents of the Capture buffer, and CaptureCurrent() stores the current character (if any) into the Capture buffer. CaptureBuffer accesses the underlying StringBuilder used for storing the capture.
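The idea can be sketched in a few lines. The following is a simplified, hypothetical stand-in (MiniContext) and not the article's ParseContext: it assumes Current is an int that becomes -1 at the end of input, and it primes the cursor in the constructor rather than through an EnsureStarted() call.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical, simplified illustration of a one-character-lookahead cursor.
sealed class MiniContext
{
    readonly IEnumerator<char> _input;
    readonly StringBuilder _capture = new StringBuilder();

    public MiniContext(IEnumerable<char> input)
    {
        _input = input.GetEnumerator();
        Advance(); // prime the cursor so Current is valid immediately
    }

    // The character under the cursor, or -1 at the end of input (assumption).
    public int Current { get; private set; }

    // The text captured so far.
    public string Capture { get { return _capture.ToString(); } }

    // Moves the cursor forward one character and returns the new Current.
    public int Advance()
    {
        Current = _input.MoveNext() ? _input.Current : -1;
        return Current;
    }

    // Appends the character under the cursor (if any) to the capture buffer.
    public void CaptureCurrent()
    {
        if (Current != -1) _capture.Append((char)Current);
    }
}

static class MiniContextDemo
{
    // Captures a leading run of ASCII digits and returns it.
    public static string ReadDigits(MiniContext pc)
    {
        while (pc.Current >= '0' && pc.Current <= '9')
        {
            pc.CaptureCurrent();
            pc.Advance();
        }
        return pc.Capture;
    }
}
```

Because the source is only ever moved forward, the same loop works over a string, any other IEnumerable<char>, or a forward-only stream wrapped in an enumerator, and no character is read twice.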

Implementing one of the TryParseXXXXX methods:

partial class ParseContext
...
public bool TryParseCSharpLiteral(out object result)
{
    result = null;
    EnsureStarted(); // make sure we've moved the cursor to a valid position.
    switch (Current)
    {
        case '@':case '\"':
            string s; // declared here, but in scope for the later case sections too
            if(TryParseCSharpString( out s))
            {
                result = s;
                return true;
            }
            break;
        case '\'':
            if(TryParseCSharpChar(out s))
            {
                if (1 == s.Length)
                    result = s[0];
                else
                    result = s;
                return true;
            }
            break;
        case '0':case '1':case '2':case '3':case '4':case '5':case '6':
              case '7':case '8':case '9':case '.':case '-':case '+':
            if (TryParseCSharpNumeric(out result))
                return true;
            break;
        case 't':case 'f':
            bool b;
            if(TryParseCSharpBool(out b))
            {
                result = b;
                return true;
            }
            break;
        case 'n':
            // result is already null, which is the parsed value for "null"
            if (TryReadLiteral("null"))
                return true;
            break;
    }
    return false;
}

And using it:

object v;
var val =
        //@"""\U00000022'foobar'\U00000022""";
        //@"""\U00000022\U00000022\t\""""";
        //"@\"\"\"foobar\"\"\"";
        //"null";
        "-"+(long.MaxValue);

Console.WriteLine("Original value: {0}", val);

var pc = ParseContext.Create(val);
Console.WriteLine("TryRead:");
if (pc.TryReadCSharpLiteral())
    Console.WriteLine("\tCapture: {0}", pc.Capture);
else
    Console.WriteLine("\tError: {0}", pc.Capture);
Console.WriteLine();

pc = ParseContext.Create(val); // reset
Console.WriteLine("TryParse:");
if (pc.TryParseCSharpLiteral(out v))
{
    Console.WriteLine("\tCapture: {0}", pc.Capture);
    Console.WriteLine("\tValue: {0}, Type {1}", 
    v??"<null>",(null!=v)?v.GetType().Name:"<void>");
} else
    Console.WriteLine("\tError: {0}", pc.Capture);
Console.WriteLine();

Console.WriteLine("Parse:");
pc = ParseContext.Create(val); // reset
v = pc.ParseCSharpLiteral();
Console.WriteLine("\tValue: {0}, Type {1}",
    v ?? "<null>", (null != v) ? v.GetType().Name : "<void>");

In addition, the sample code contains a merge-minifier for C# source code, and some methods for parsing C# literals and identifiers.

Points of Interest

In this code, note several things:

ParseContext is factored into a partial class so that it is easy to create additional parse methods written on the ParseContext, as a separate file which can be included only if needed.

ParseContext contains methods like TryReadXXXXX, TrySkipXXXXX, and sometimes TryParseXXXXX and ParseXXXXX.

These methods do the following: TryReadXXXXX tries to read with capture, TrySkipXXXXX tries to skip with no capture, TryParseXXXXX tries to parse with capture, and ParseXXXXX parses with no capture.

Only the ParseXXXXX methods throw exceptions. The other methods return false if the parse was unsuccessful. For the capturing methods, the capture contains all the text parsed so far. On an error, Current contains the character that the parse failed on, and Capture contains the characters parsed up to that point.

It's recommended that you follow this pattern when you create new parsing methods, but it's not required.
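As a rough illustration of that pattern, here is a minimal sketch of a Try/Parse pair. The names (IntegerParsing, TryParseDigits, ParseDigits) are hypothetical, and the methods work on a plain string rather than on a ParseContext so that the example stays self-contained. It only mirrors the convention that the Try variant reports failure through its return value while the Parse variant throws.

```csharp
using System;

static class IntegerParsing
{
    // Try-style: returns false on bad input instead of throwing.
    public static bool TryParseDigits(string input, out int value)
    {
        value = 0;
        if (string.IsNullOrEmpty(input)) return false;

        int acc = 0;
        foreach (char ch in input)
        {
            if (ch < '0' || ch > '9') return false; // not an ASCII digit
            int digit = ch - '0';
            // Reject values that would overflow an int.
            if (acc > (int.MaxValue - digit) / 10) return false;
            acc = acc * 10 + digit;
        }
        value = acc;
        return true;
    }

    // Parse-style: same logic, but failure becomes an exception.
    public static int ParseDigits(string input)
    {
        int value;
        if (!TryParseDigits(input, out value))
            throw new FormatException("Expected an unsigned decimal integer.");
        return value;
    }
}
```

Writing the throwing method as a thin wrapper over the Try method keeps the parsing logic in one place.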

History

18th March, 2019: Initial release

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)