LookAheadEnumerator: Implement Backtracking in Your Parsers

honey the codewitch

5.00/5 (13 votes)

23 Dec 2019MIT3 min read

18.2K

124

Easily implement efficient backtracking capabilities over any enumeration

Download source files - 8.3 KB

Introduction

Complex parsers often need to support backtracking, which is a way to revisit items you've already encountered. The trick with this is twofold, doing it efficiently, and doing it transparently. The LookAheadEnumerator<T> class provides both.

Update: Bug fix, more robust, doc comments, and 90% slangified the code (removed switch/nameof and the like) so that Slang can cook it and I can use it in my generated parser code.

Update: LookAheadEnumerator is now used extensively in my Parsley parser generator which can parse C# (an example of parsing a large subset is at the link)

Background

It's often desirable in parser code to use some sort of "streaming" interface for its input like a TextReader class or an class implementing IEnumerator<T>. I prefer enumerators because of their ubiquity and simplicity. However, it can be difficult to backtrack on streaming sources without preloading it into memory before parsing. This is fine for small text, but not reams of say, bulk JSON.

Normally with an enumerator, all you can do is MoveNext() and sometimes Reset() if you're lucky. There is no way to seek back to a particular previous position, and even if there was, it probably wouldn't work of a true streaming source, like an HTTP response, or console input.

A backtracking parser on the other hand, needs to "bookmark" its current position, before trying several alternatives until it finds one that parses. That means revisiting the same sequence of text several times.

Backtracking parsers are inherently less efficient but far more flexible than non-backtracking parsers. I've made a fair effort to optimize this class to make it as efficient as possible for this purpose.

Conceptualizing This Mess

I've embedded an array backed queue into this class, which it uses to back the the lookahead buffer. The queue starts with 16 elements and grows as needed (almost doubling in size each time to avoid too many reallocations - heap in .NET is cheaper than CPU) depending on how much lookahead is needed. When a LookAheadEnumeratorEnumerator<T> (the lookahead cursor) is advanced, it often requires the primary class to read more data into the queue in order to satisfy it. When the main cursor is advanced, it will discard items in the queue (simply incrementing _queueHead which is really fast) . It's not a good idea to advance or reset the main cursor while using the lookahead cursor. The results in this case, are undefined, as I haven't implemented versioning in these enumerators.

Using This Mess

You use the code like a standard IEnumerator<T> with an additional property - LookAhead that allows you to foreach from your current position without advancing the cursor. There's also Peek() and TryPeek() which look ahead a specified number of positions, and DiscardLookAhead() which simply moves the cursor to the physical position and clears the buffer.

var text = "fubarfoobaz";
var la = new LookAheadEnumerator<char>(text.GetEnumerator());
la.MoveNext(); // can't look ahead until we're already over the position 
               // we want to start look at.
foreach (var ch in la.LookAhead)
    Console.Write(ch);
Console.WriteLine();
while (la.MoveNext())
{
    foreach (var ch in la.LookAhead)
        Console.Write(ch);
    Console.WriteLine();
}

This would print the following to the console:

fubarfoobaz
ubarfoobaz
barfoobaz
arfoobaz
rfoobaz
foobaz
oobaz
obaz
baz
az
z

As you can see, we're incrementing the primary cursor by one in each iteration, and then we're enumerating over LookAhead from there. Enumerating over LookAhead does not affect the primary cursor*.

* The underlying physical read cursor is advanced, as it must be, but a facade is presented using a queue that buffers the already read input for you, presenting it as the next input.

Typically in a parser, you'll use it over tokens, like LookAheadEnumerator<Token> and then each time you need to do backtracking, you parse along LookAhead instead of the primary cursor. When you find a match, you'll have to discard all the tokens you matched, either by reparsing along the primary cursor or by counting and advancing if you know how many tokens you parsed. If you're only parsing one alternative to the main parse, you can simply use DiscardLookAhead() once you've matched the alternate.

That's about all. This is an extremely specialized class, but when you need it, you really need it.

History

19^thDecember, 2019: Initial submission
23^rd December, 2019: Update

License

This article, along with any associated source code and files, is licensed under The MIT License