Click here to Skip to main content
65,938 articles
CodeProject is changing. Read more.
Articles / Languages / C#

Introduction to Text Parsing in C# using Parakeet

5.00/5 (6 votes)
19 Mar 2024MIT4 min read 9.2K  
An introduction to building recursive descent parsers in C# using the Parakeet parsing library
This article describes how to use the Parakeet parsing library in C#. Using a recursive-descent parsing library can make parsing complex patterns in text much easier than writing code from hand, or having to learn special languages invented by parsing tools. Parakeet allows grammars to be defined directly in C# using operator overloading to map rule combinators to PEG operations.

Parsing Text in C# Using Parakeet

Parakeet is an open-source C# recursive descent parsing library that uses operator overloading to make it easy to declare grammars in a readable format. It is an attempt to combine the advantages of parser generator tools and hand-rolled parsers.

Parakeet was designed particularly for the parsing of non-trivial programming languages, and supports error recovery, tokenization, and parse tree generation. The Plato language implementation is being built using Parakeet.

To get you started, the following is an example of a grammar for parsing CSV files:

C#
public class CsvGrammar : BaseCommonGrammar
{
    public static readonly CsvGrammar Instance = new CsvGrammar();
    public override Rule StartRule => File;

    public Rule StringChar => Named(AnyChar.Except('"') | "\"\"");
    public Rule String => Node(DoubleQuotedString(StringChar));
    public Rule Text => Node(AnyChar.Except(",\n\r\"").OneOrMore());
    public Rule Field => Node(Text | String);
    public Rule Row => Node(Field.ZeroOrMore() + Optional('\r') + '\n');
    public Rule File => Node(Row.ZeroOrMore());
}

Parsing Rules

Parsers are defined as set of parsing rules defined in terms of each other. Rules are combined together using "combinators". In some contexts, a single rule might also be known as a "parser". A rule is an instance of a class derived from Rule which provides a member function:

C#
ParserState Match(ParserState state)

If the rule matches the input at the position represented by the ParserState, a new ParserState instance will be returned, otherwise, the function returns null.

Rule Combinator

A rule combinator is a function that creates a rule from other rules, similar to the core operations of a PEG grammar. For example: Choice, Sequence, ZeroOrMore, OneOrMore, NotAt, etc.

In Parakeet, there are several ways to create rules from other rules:

  • Creating instances of the combinator classes (i.e., using new)
  • Calling one of the extension methods on Rule (e.g., Optional(), NotAt(), etc.)
  • Using operator overloading
    • + => Sequence
    • | => Choice
    • ! => NotAt

For more information on rule combinators, see the Wikipedia Article on Parsing Expression Grammars.

ParserState - The Input and Output of Rule Matching

A ParserState is an immutable class that contains:

  • A pointer to the input - a string combined with look-up tables for quickly determining line and columns from indexes)
  • An offset from the beginning of the input string
  • A pointer to a linked list of parse nodes
  • A pointer to a linked list of error

The fields of a ParserState instance are:

C#
public class ParserState 
{
    public readonly ParserInput Input;
    public readonly int Position;
    public readonly ParserNode Node;
    public readonly ParserError LastError;
    ...
}

Defining Grammars

A grammar is a collection of parsing rules that together describe how to parse input strings. A Parakeet grammar is a class derived from the Grammar class which provides an overidden definition of the starting rule, and an optional whitespace rule.

  • abstract Rule StartRule {get;}
  • virtual Rule WS {get;}

Most rules are defined as computed properties, but can also be functions or fields. It is up to the programmer.

Rules that are directly associated with properties in the grammar are usually either Node rules or Named rules. Named rules simply return the result of matching a child rule and are associated with the property name, and help with grammar debugging and diagnostics.

Node Rules

A Node rule is like a NamedRule in that it also has a name but if matched successfully will return a ParserState that has at least two new ParserNode instances added to the linked list of nodes.

One node points the beginning of the match, and the other node points to the end.
A ParserNode looks like this:

C#
public class ParserNode
{
    public readonly ParserRange Range;
    public readonly string Name;
    public readonly ParserNode Previous;
    ...
}

A linked list of ParserNode instances can be converted into a parse tree using the ParseTreeNode ParserNode.ToParseTree() method.

Generating Parsing Errors with OnFail Rules

There are several ways that a parsing rule might not match successfully:

  • Returning null - this represents the normal failure to match
  • Not reaching the end of the input: ParserState.AtEnd == false
  • Accumulating one or more ParserError instances.

A special rule called OnFail is used to generate ParserError instances. The OnFail rule is expected to appear as a child of a SequenceRule.

The OnFail rule indicates that if the preceding child rules succeed, then any failure in the following rule will generate a ParserError. An optional recovery rule can be provided as a parameter, allowing the parser to advance to the next location, such as just past the end of statement, and attempt to continue.

The following is a snippet from the Plato grammar which demonstrates how the error handling and recovery occurs.

C#
public Rule AdvanceOnFail => 
    OnFail(Token.RepeatUntilPast(EOS | "}"));

public Rule IfStatement =>
    Node(Keyword("if") + AdvanceOnFail + ParenthesizedExpression 
       + Statement + ElseClause.Optional());

The IfStatement rule indicates that if the keyword if is matched, then it must be followed by an expression enclosed in parenthesis, then a valid statement, and then optionally an else clause.
If a failure occurs after the keyword if, then we know there was a parsing error. The parser will consume tokens until it passes the end of statement (EOS) marker (;) or a closing brace (}). The IfStatement rule will return a valid ParserState, but it will have a new ParserError prepended to the linked list of errors.

Final Words

The best way to learn about Parakeet is to read the tests and samples grammars, and to give it a try. The source code and several example grammars are hosted on Github at https://github.com/ara3d/parakeet.

History

  • 19th March, 2024: Initial version

License

This article, along with any associated source code and files, is licensed under The MIT License