Visual FA Part 4: Generating Matchers and Lexers with the Visual FA C# Source Generator

5 Feb 2024, MIT License, 4 min read
Easily add lexers to your project with this simple drop-in NuGet package and a few annotations
Using the new C# 9+ source generation features and this NuGet package, you can add fast lexers and regex matchers to your project.

Introduction

Update: Now you can mark up whole classes as well. Instructions below.

I fibbed. In Part 3, the previous article, I claimed we'd be covering the code behind the source code generation features in Visual FA. As it happens, I added an entirely different source generation mechanism in the interim, and it's worth exploring it in its own right.

This requires C# 9, so you should be using at least .NET 6. I'll also assume you're using Visual Studio where it matters.

I'm also assuming you've been following along, and have read at least Part 1 of this series before getting here. I won't be covering what Visual FA is, and I'll only briefly touch on its facilities as a lexer and matcher generation engine.

Background

C# 9 introduced the ability to hook into the compilation process and have the compiler inject additional, dynamically created code while it builds a binary. This technology is known as Source Generators, and it is a powerful way to augment the language with additional functionality, even if it is somewhat quirky to use now and again.

I've produced a NuGet package that, once referenced by your project, allows you to mark up partial methods (and, as of the update, partial classes) with attributes that tell the compiler how to implement a matcher or lexer for you. The generated code will use the VisualFA library if it is referenced by your project; otherwise, it will inject core shared dependency code into your project as necessary*.

* This source generation feature does not really work with VisualFA.brick.cs because it can't detect the presence of that file in your project at the stage in compilation where my code gets called. You either have to reference the VisualFA.dll or you have to forgo referencing Visual FA altogether. Either option works, just not the brick file.

Using the Code

At the heart of the markup is the FARuleAttribute:

C#
[AttributeUsage(AttributeTargets.Method | AttributeTargets.Class, AllowMultiple = true, Inherited = false)]
class FARuleAttribute : System.Attribute
{
    public FARuleAttribute() { }
    public FARuleAttribute(string expression)
    { 
        if (expression == null) 
            throw new ArgumentNullException(nameof(expression));
        if (expression.Length == 0) 
            throw new ArgumentException(
                "The expression must not be empty", 
                nameof(expression));
        Expression = expression;
    }
    public string Expression { get; set; } = "";
    public string BlockEnd { get; set; } = null;
    public int Id { get; set; } = -1;
    public string Symbol { get; set; } = null;
}

Note that you won't find this attribute in the library, or in your source code. It is injected into your code after the first time you use it, as are any supporting types. Therefore, the first time you use these, you'll get red squiggles in your code. Just build, because as long as the code is copacetic, the squiggles will resolve themselves.

We'll take a look at it in code before exploring what all of it does:

C#
partial class MyClass
{
    [FARule(@"\/\*", Id = 0, Symbol = "commentBlock", BlockEnd = @"\*\/")]
    [FARule(@"\/\/[^\n]*", Id = 1, Symbol = "commentLine")]
    [FARule(@"[ \t\r\n]+", Id = 2, Symbol = "whiteSpace")]
    [FARule(@"[A-Za-z_][A-Za-z0-9_]*", Id = 3, Symbol = "identifier")]
    // static so that it can be called as MyClass.MyLexer(...)
    internal static partial FAStringRunner MyLexer(string text);
}

Here, we've defined the MyLexer() method as having four rules.

Each attribute represents a single rule in a lexer. A rule has a regular expression, typically passed as the first unnamed argument, followed by several optional named arguments: BlockEnd, Id and Symbol. Here, we've defined commentBlock, commentLine, whiteSpace, and identifier as symbols and given each rule an id, although we didn't have to: any ids you leave out will be filled in during the generation process. If you don't remember the purpose of block ends, see the first article in the series again, but basically a block end is an additional expression used to match a multicharacter ending condition.
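To make the block end idea a bit more concrete, here is a minimal hand-written sketch of what a rule with a BlockEnd does conceptually: match the start expression at the current position, then scan forward until the end expression is found. This is not the code Visual FA generates. It uses System.Text.RegularExpressions as a stand-in for Visual FA's own engine, BlockEndDemo and TryMatchBlock are hypothetical names used only for illustration, and treating an unterminated block as "no match" is my assumption rather than documented behavior.

```csharp
using System;
using System.Text.RegularExpressions;

static class BlockEndDemo
{
    // Start and end expressions copied from the commentBlock rule.
    // \G anchors the start expression at the position we search from.
    static readonly Regex Start = new Regex(@"\G\/\*");
    static readonly Regex End = new Regex(@"\*\/");

    // Tries to match a whole block beginning exactly at 'index'.
    public static bool TryMatchBlock(string input, int index, out string block)
    {
        block = "";
        Match start = Start.Match(input, index);
        if (!start.Success) return false;

        // Look for the block end only after the start match has been consumed.
        Match end = End.Match(input, start.Index + start.Length);
        if (!end.Success) return false; // assumption: unterminated block = no match

        block = input.Substring(index, end.Index + end.Length - index);
        return true;
    }

    static void Main()
    {
        if (TryMatchBlock("/* a comment */ foo", 0, out string block))
            Console.WriteLine(block); // prints: /* a comment */
    }
}
```

The point of splitting the rule in two is that the body of the block never has to be described by the start expression itself, which keeps the main expression simple.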

Now on to the method signature itself. It must be a partial method in a partial class. It must take either no arguments, a single string argument, or a single TextReader argument. It must return an FAStringRunner, an FATextReaderRunner, an FAStringDfaTableRunner or an FATextReaderDfaTableRunner. It's also possible to return FARunner, but only if your method takes an argument.

Essentially, the return type is used to suss out what kind of runner to generate, such as whether it should operate on strings or text readers, and whether it is compiled or table driven.

Anyway, once you've created your class and method, you can use it like this:

C#
var exp = "the 10 quick brown #@%$! foxes jumped over 1.5 lazy dogs";
foreach (var match in MyClass.MyLexer(exp))
{
    Console.WriteLine(match);
}
return;

Given the above definitions, that would print the following to the console:

Terminal
[SymbolId: 3, Value: "the", Position: 0 (1, 1)]
[SymbolId: 2, Value: " ", Position: 3 (1, 4)]
[SymbolId: -1, Value: "10", Position: 4 (1, 5)]
[SymbolId: 2, Value: " ", Position: 6 (1, 7)]
[SymbolId: 3, Value: "quick", Position: 7 (1, 8)]
[SymbolId: 2, Value: " ", Position: 12 (1, 13)]
[SymbolId: 3, Value: "brown", Position: 13 (1, 14)]
[SymbolId: 2, Value: " ", Position: 18 (1, 19)]
[SymbolId: -1, Value: "#@%$!", Position: 19 (1, 20)]
[SymbolId: 2, Value: " ", Position: 24 (1, 25)]
[SymbolId: 3, Value: "foxes", Position: 25 (1, 26)]
[SymbolId: 2, Value: " ", Position: 30 (1, 31)]
[SymbolId: 3, Value: "jumped", Position: 31 (1, 32)]
[SymbolId: 2, Value: " ", Position: 37 (1, 38)]
[SymbolId: 3, Value: "over", Position: 38 (1, 39)]
[SymbolId: 2, Value: " ", Position: 42 (1, 43)]
[SymbolId: -1, Value: "1.5", Position: 43 (1, 44)]
[SymbolId: 2, Value: " ", Position: 46 (1, 47)]
[SymbolId: 3, Value: "lazy", Position: 47 (1, 48)]
[SymbolId: 2, Value: " ", Position: 51 (1, 52)]
[SymbolId: 3, Value: "dogs", Position: 52 (1, 53)]
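Note the entries with SymbolId -1: they appear to be the runs of text that none of the four rules can match ("10", "#@%$!" and "1.5"). The following is a rough, self-contained sketch of that behavior, assuming longest-match rule selection and assuming unmatched characters are gathered until some rule matches again. It is a stand-in built on System.Text.RegularExpressions, not Visual FA's generated code, and LexerSketch, LongestMatch and Lex are hypothetical names. The comment block rule is folded into a single expression here purely for brevity.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class LexerSketch
{
    // The four rules from the article, anchored with \G so each one must
    // match exactly at the current position. The array index is the rule id.
    static readonly Regex[] Rules =
    {
        new Regex(@"\G\/\*[\s\S]*?\*\/"),       // 0: commentBlock (start, body, block end)
        new Regex(@"\G\/\/[^\n]*"),             // 1: commentLine
        new Regex(@"\G[ \t\r\n]+"),             // 2: whiteSpace
        new Regex(@"\G[A-Za-z_][A-Za-z0-9_]*"), // 3: identifier
    };

    // Returns the id of the rule with the longest match at 'pos', or -1.
    static int LongestMatch(string text, int pos, out int length)
    {
        int id = -1;
        length = 0;
        for (int i = 0; i < Rules.Length; i++)
        {
            Match m = Rules[i].Match(text, pos);
            if (m.Success && m.Length > length) { id = i; length = m.Length; }
        }
        return id;
    }

    public static IEnumerable<(int SymbolId, string Value, int Position)> Lex(string text)
    {
        int pos = 0;
        while (pos < text.Length)
        {
            int id = LongestMatch(text, pos, out int len);
            if (id < 0)
            {
                // No rule matches here: gather characters until a rule
                // matches again and report the whole run with id -1.
                int start = pos;
                do { pos++; }
                while (pos < text.Length && LongestMatch(text, pos, out _) < 0);
                yield return (-1, text.Substring(start, pos - start), start);
            }
            else
            {
                yield return (id, text.Substring(pos, len), pos);
                pos += len;
            }
        }
    }

    static void Main()
    {
        foreach (var t in Lex("the 10 quick"))
            Console.WriteLine($"[SymbolId: {t.SymbolId}, Value: \"{t.Value}\", Position: {t.Position}]");
    }
}
```

In practice, this means a consumer of the lexer can treat -1 as its error or "unknown text" symbol and decide whether to skip it, report it, or stop.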

The other option is to mark up a class. This has the advantage of giving you access to the final runner's type, including its symbol constants (if Symbol is used in the definitions):

C#
[FARule(@"\/\*", Id = 0, Symbol = "commentBlock", BlockEnd = @"\*\/")]
[FARule(@"\/\/[^\n]*", Id = 1, Symbol = "commentLine")]
[FARule(@"[ \t\r\n]+", Id = 2, Symbol = "whiteSpace")]
[FARule(@"[A-Za-z_][A-Za-z0-9_]*", Id = 3, Symbol = "identifier")]
partial class MyLexer : FAStringRunner {

}

You can then use it like this:

C#
var exp = "the 10 quick brown #@%$! foxes jumped over 1.5 lazy dogs";
var runner = new MyLexer();
runner.Set(exp);
foreach (var match in runner)
{
    Console.WriteLine(match);
}
// inside the loop, you can also do things like if (match.SymbolId == MyLexer.whiteSpace) ...
return;

That's about all there is to it for now. I'll be adding more features later.

History

  • 2nd February, 2024 - Initial submission
  • 5th February, 2024 - Added class markup to the generator

License

This article, along with any associated source code and files, is licensed under The MIT License