Introduction
Parsing is a common need in modern software. With so many data interchange formats in use, most developers will need to write a parser at some point.
Most commercial products with complex parsers use hand-rolled parsing for at least a portion of their parsing process.
The code here aims to provide a small, lightweight solution that eases the creation of hand-rolled parsers.
Background
Hand-rolled parsers can be difficult to write and maintain. One of the main problems is proper factoring, and factoring is made more difficult by lookahead.
Lookahead, quite simply, is the set of symbols or characters the parser needs to read ahead of the cursor in order to choose the next branch.
Many parsers get by on one character of lookahead. The grammar for JSON, for example, requires only one character of lookahead to parse. More complex grammars may require more at certain points, but generally the bulk of the grammar needs only one character.
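To make the idea of one-character lookahead concrete, here is a minimal sketch of how a JSON value's first character is enough to choose the parse branch. The JsonLookahead class and ClassifyValue method are hypothetical names used only for illustration; they are not part of the article's code, and whitespace handling is left out.

```csharp
using System;

// Hypothetical helper for illustration only; not part of ParseContext.
static class JsonLookahead
{
    // Picks the branch a JSON value parser would take, based solely on
    // the single character under the cursor.
    public static string ClassifyValue(char lookahead)
    {
        switch (lookahead)
        {
            case '{': return "object";
            case '[': return "array";
            case '"': return "string";
            case 't': case 'f': return "boolean";
            case 'n': return "null";
            case '-': return "number";
            default:
                // JSON numbers start with '-' or a digit; anything else is an error.
                return (lookahead >= '0' && lookahead <= '9') ? "number" : "error";
        }
    }
}
```

Each branch can then be handed to a dedicated method without ever looking at a second character.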
One of the ways to handle lookahead would be to use the TextReader Peek() function, but this will break on a NetworkStream. It requires some amount of "seeking" in order to work. That means "backtracking" (going over the same character more than once), which should be unnecessary.
Using the Code
Enter the ParseContext:
The ParseContext is a class that wraps an underlying TextReader or IEnumerable<char> (including a string) and provides several methods for parsing and capturing content.
It provides one character of "lookahead" by keeping the stream advanced by one character. Instead of reading the next character from the stream, every Advance() call changes the Current member to reflect the current input, such that Current always contains the character under the cursor.
This allows for reading from input and parsing very easily.
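The Advance()/Current mechanism described above can be sketched roughly as follows. This is not the article's ParseContext; LookaheadCursor is a hypothetical stand-in, and the int-typed Current with -1 and -2 sentinel values is an assumption made for the sketch. Only the member names Advance, Current, and EnsureStarted are borrowed from the article.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration of a forward-only cursor; not the article's ParseContext.
sealed class LookaheadCursor : IDisposable
{
    public const int BeforeInput = -2; // assumed sentinel: nothing read yet
    public const int EndOfInput = -1;  // assumed sentinel: input exhausted

    readonly IEnumerator<char> _input;

    // The character under the cursor, or one of the sentinels above.
    public int Current { get; private set; } = BeforeInput;

    public LookaheadCursor(IEnumerable<char> input)
    {
        _input = input.GetEnumerator();
    }

    // Reads the first character if nothing has been read yet.
    public void EnsureStarted()
    {
        if (Current == BeforeInput) Advance();
    }

    // Moves forward exactly one character. Nothing is ever re-read,
    // so no seeking or peeking on the underlying source is needed.
    public int Advance()
    {
        if (Current == EndOfInput) return EndOfInput;
        Current = _input.MoveNext() ? _input.Current : EndOfInput;
        return Current;
    }

    public void Dispose() => _input.Dispose();
}
```

Because the cursor only moves forward, the same approach would work over any source that can hand out characters one at a time.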
Capture contains the current contents of the Capture buffer, and CaptureCurrent() stores the current character (if any) into the Capture buffer. CaptureBuffer accesses the underlying StringBuilder used for storing the capture.
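As a rough sketch of how such a capture buffer might behave, the following stand-in class reuses the member names Capture, CaptureCurrent, and CaptureBuffer from the description above. Everything else is an assumption: CaptureSketch, the string-indexed cursor, the -1 end marker, and the ReadLetters helper are hypothetical and exist only to keep the example short.

```csharp
using System;
using System.Text;

// Hypothetical stand-in for illustration; not the article's ParseContext.
sealed class CaptureSketch
{
    readonly string _text;
    int _pos;

    public CaptureSketch(string text) { _text = text; }

    // Character under the cursor, or -1 at end of input (assumed convention).
    public int Current => _pos < _text.Length ? _text[_pos] : -1;

    // The underlying buffer that accumulates captured characters.
    public StringBuilder CaptureBuffer { get; } = new StringBuilder();

    // The captured text so far.
    public string Capture => CaptureBuffer.ToString();

    // Appends the current character to the buffer, if there is one.
    public void CaptureCurrent()
    {
        if (Current != -1) CaptureBuffer.Append((char)Current);
    }

    public void Advance()
    {
        if (_pos < _text.Length) _pos++;
    }

    // Hypothetical helper: captures a run of letters starting at the cursor.
    public string ReadLetters()
    {
        while (Current != -1 && char.IsLetter((char)Current))
        {
            CaptureCurrent();
            Advance();
        }
        return Capture;
    }
}
```

After a call such as ReadLetters() on the input "abc123", the capture holds the letters and the cursor rests on the first character that did not match.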
Implementing one of the TryParseXXXXX methods:
partial class ParseContext
{
    // ...
    public bool TryParseCSharpLiteral(out object result)
    {
        result = null;
        EnsureStarted();
        // The character under the cursor selects the branch.
        switch (Current)
        {
            case '@': case '\"':
                string s;
                if (TryParseCSharpString(out s))
                {
                    result = s;
                    return true;
                }
                break;
            case '\'':
                // s is still in scope here: all switch sections share one declaration space.
                if (TryParseCSharpChar(out s))
                {
                    // A one-character result is returned as a char, anything longer as a string.
                    if (1 == s.Length)
                        result = s[0];
                    else
                        result = s;
                    return true;
                }
                break;
            case '0': case '1': case '2': case '3': case '4': case '5': case '6':
            case '7': case '8': case '9': case '.': case '-': case '+':
                if (TryParseCSharpNumeric(out result))
                    return true;
                break;
            case 't': case 'f':
                bool b;
                if (TryParseCSharpBool(out b))
                {
                    result = b;
                    return true;
                }
                break;
            case 'n':
                // result is already null, so a successful read is all that is needed.
                if (TryReadLiteral("null"))
                    return true;
                break;
        }
        return false;
    }
}
And using it:
object v;
var val = "-" + (long.MaxValue);
Console.WriteLine("Original value: {0}", val);

// TryRead: reads the literal into the capture buffer without converting it
var pc = ParseContext.Create(val);
Console.WriteLine("TryRead:");
if (pc.TryReadCSharpLiteral())
    Console.WriteLine("\tCapture: {0}", pc.Capture);
else
    Console.WriteLine("\tError: {0}", pc.Capture);
Console.WriteLine();

// TryParse: reads the literal and converts it to a value
pc = ParseContext.Create(val);
Console.WriteLine("TryParse:");
if (pc.TryParseCSharpLiteral(out v))
{
    Console.WriteLine("\tCapture: {0}", pc.Capture);
    Console.WriteLine("\tValue: {0}, Type {1}",
        v ?? "<null>", (null != v) ? v.GetType().Name : "<void>");
}
else
    Console.WriteLine("\tError: {0}", pc.Capture);
Console.WriteLine();

// Parse: returns the value directly and throws on failure
Console.WriteLine("Parse:");
pc = ParseContext.Create(val);
v = pc.ParseCSharpLiteral();
Console.WriteLine("\tValue: {0}, Type {1}",
    v ?? "<null>", (null != v) ? v.GetType().Name : "<void>");
In addition, the sample code contains a merge-minifier for C# source code, and some methods for parsing C# literals and identifiers.
Points of Interest
In this code, note several things:
- ParseContext is factored into a partial class so that it is easy to add parse methods to ParseContext in a separate file, which can be included only if needed.
- ParseContext contains methods like TryReadXXXXX, TrySkipXXXXX, and sometimes TryParseXXXXX and ParseXXXXX. These methods, respectively, try to read (with capture), try to skip (no capture), try to parse (with capture), and parse (with no capture).
- Only the ParseXXXXX method will throw exceptions. The other methods will return false if the parse was unsuccessful. In the case of methods with capturing, the capture will contain all the text currently parsed. In the case of an error, Current will contain the character that the parse failed on and Capture will contain the characters parsed up to that point.
It's recommended that you follow this pattern when you create new parsing methods, but it's not required.
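One way the four-method pattern might look for a new rule is sketched below for a run of decimal digits. MiniContext and all of its members are hypothetical and written only for this illustration; the sketch assumes a string input, an int-typed Current with -1 at end of input, and a Parse variant that accumulates the value directly rather than capturing.

```csharp
using System;
using System.Text;

// Hypothetical mini context used only to illustrate the naming pattern.
sealed class MiniContext
{
    readonly string _text;
    int _pos;
    readonly StringBuilder _capture = new StringBuilder();

    public MiniContext(string text) { _text = text; }

    // Character under the cursor, or -1 at end of input (assumed convention).
    public int Current => _pos < _text.Length ? _text[_pos] : -1;
    public string Capture => _capture.ToString();

    bool IsDigit => Current >= '0' && Current <= '9';

    // TrySkip: consume digits without capturing them.
    public bool TrySkipDigits()
    {
        if (!IsDigit) return false;
        while (IsDigit) _pos++;
        return true;
    }

    // TryRead: consume digits and capture them.
    public bool TryReadDigits()
    {
        if (!IsDigit) return false;
        while (IsDigit) { _capture.Append((char)Current); _pos++; }
        return true;
    }

    // TryParse: read with capture, then convert the newly captured text.
    public bool TryParseDigits(out int result)
    {
        result = 0;
        int start = _capture.Length;
        if (!TryReadDigits()) return false;
        string digits = _capture.ToString(start, _capture.Length - start);
        return int.TryParse(digits, out result); // false on overflow
    }

    // Parse: no capture; the only variant that throws.
    public int ParseDigits()
    {
        if (!IsDigit)
            throw new FormatException("Expected a digit at position " + _pos);
        int value = 0;
        while (IsDigit)
        {
            value = checked(value * 10 + (Current - '0')); // throws OverflowException
            _pos++;
        }
        return value;
    }
}
```

On failure, the Try variants leave the cursor on the offending character, so the caller can report the position or try another branch.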
History
- 18th March, 2019: Initial release