Click here to Skip to main content
65,938 articles
CodeProject is changing. Read more.
Articles / desktop / WinForms

Easily Create Your Own Parser

4.81/5 (56 votes)
9 Jul 2011CPOL7 min read 191.8K   9.6K  
Create a hand made parser in VB.NET or C# easily and fast
TokenicerSS.jpg

Introduction

One of the more difficult tasks in computer science is building parsers and compilers. There are a lot of tools available that aid in the tedious task, most notably, Flex and Yacc, both available on the Linux/UNIX platform. The program I present here in this article is called TokenIcer. It is similar to Flex, but TokenIcer provides a nice easy to use GUI that serves as an editor for your rules, as well as a test bed for testing your rules. In addition, once your parsing rules have been defined, TokenIcer can create a parser class, based on your rules, in either C# or VB.NET

Background

To be able to use TokenIcer well, you should have a pretty good understanding of how Regular Expressions work. Each rule you enter into TokenIcer will be based on a Regular Expression. Any regular Expression that the .NET Regex library can parse, will also be valid in TokenIcer.

The way a parser works, and also the way TokenIcer will work, is that you feed into the parser some kind of input string. For example, if we feed the following line into a parser:

3+2 * (6 + 1)  

We should expect our parser to provide us output like this:

{Integer}{Plus}{Integer}{Whitespace}{Asterisk}{Whitespace}{LeftParen}
{Integer}{Whitespace}{Plus}{Whitespace}{Integer}{RightParen}{Newline} 

What we do with this parser output depends on exactly what we are trying to accomplish. Maybe you are building a language compiler, or perhaps a math parser. This is what TokenIcer does. It takes input like "3+2 * (6 + 1)" and converts it into a series of enumerated values.

Using the Code

When you run TokenIcer, you will be presented with a screen with 3 text boxes and an TreeView. The first text box is where we enter our rules. Rules are simply regular expressions wrapped between quotes. Immediately following the rule is a space (or multiple spaces, if you prefer). Following the space is the identifier for that rule. As in the example above, an identifier can be something like Integer, Whitespace or whatever you want to use. Any valid VB.NET or C# identifier can be used. Go ahead and enter the following rule in the text box:

"[0-9]+" INTEGER 

This rule will correctly identify an integer number. In the middle text box, is our test bed. Anything you enter in here will be parsed according to the rules you have in the text box above it. Go ahead and enter any integer number you wish in the test box. Once you enter your number, click the "Test Grammar" button at the bottom left of the window. Once you do, you will see the third textbox, the output text box, did, indeed, correctly identify our number as an {INTEGER}. The output tree also shows INTEGER, but if you expand the tree, you will see the actual value the parser parsed. Now go ahead and enter the following two rules in the first text box, immediately following the first rule:

"[0-9]+\.+[0-9]+" FLOAT
"[ \t]+" WHITESPACE

In your test box, try testing the following:

3.2 15 4.932 

Go ahead and click "Test Grammar". You may be surprised to see a bunch of "UNDEFINED" tags and no FLOAT tags. The reason for this (gotcha #1) is because the parser will parse your input using the first rule it comes to that matches. When making your rules, you must place "higher priority" rules first. The parser came to the "3" and said "ok 3 is an integer". It didn't even look past the 3 to see if there was a decimal point. To fix this, simply put the FLOAT line above the INTEGER line. This way, the parser will try to match the FLOAT rule first and if there is no decimal point, move the INTEGER rule. Change your rules around in the rule text box so it looks like this:

"[0-9]+\.+[0-9]+" FLOAT
"[0-9]+" INTEGER
"[ \t]+" WHITESPACE 

Now if you click "Test Grammar", the output box and the output tree will look as expected. It is now properly parsing as you would expect.

To wrap up this section, I will now show you how TokenIcer can create C# or VB.NET classes that you can include in your own projects. Since we now have our three rules and they are tested without input, go ahead and click the button that says "Generate Class...". When you do that, a window pops up with a drop down list asking which language you prefer. You can select either C# or VB.NET. Also there is a checkbox for including comments. If you check this box, some simple comments will also be generated to help you understand the code. Once you've selected your language, click "Generate my Class" and another window will pop up with the generated C# or VB.NET code. You can hit <Ctrl>+A to select it all and <Ctrl>+C to copy the code into your own project. It's all that simple!

The TokenParser class exposes the following property:

  • InputString -- After instantiating a copy of TokenParser, you must set this string property to the value of the string you want parsed.

The TokenParser class exposes the following methods:

  • GetToken() -- Call this method to retrieve the next token from the input string. The GetToken() method returns a Token object. A Token object contains the TokenName (which is an enum of tokens that you defined in your rules earlier, like INTEGER, WHITESPACE, FLOAT) and it contains the TokenValue. The TokenValue will be the value retrieved from the parser. For example, an INTEGER might return "53" for example.
  • Peek() -- This method will return the next token that GetToken() will return. It allows you to look ahead in the token buffer without actually pulling anything off the queue. The return value of Peek() is a PeekToken. A PeekToken is a special object that contains a Token object and an index. By calling Peek() and passing a PeekToken as an argument to Peek(), Peek() will return the Token that is returned after the last Peek() call. In this way, you can peek ahead several tokens deep.
  • ResetParser() -- This method resets the parser. After calling this, you must set the InputString property to a string again and then you can call GetToken() or PeekToken() as you normally would to parse a new string.

Comments

TokenIcer supports two types of comments. Inline comments and full line comments. Both comments are achieved by prefacing your comment with a hash symbol (a # sign). The first type of comment, inline comments, are a way to comment your enumerations. Look at this example:

"[a-zA-Z_][a-zA-Z0-9_]*" IDENTIFIER #This is an Identifier 

When you create your C# or VB.NET class, you have the option to comment rules. If this option is turned on, then whatever comment you have inline with your rule will be added as a comment to the generated enumeration.

The second type of comment is a full line comment, like so:

# This line is completely ignored.  

Any rule line beginning with a hash symbol is completely ignored. You may use this for your own purpose anyway you see fit.

Example Parser Rules

Here is an example of a very simple BASIC parser:

"[Ll][Ee][Tt]" LET
"[Pp][Rr][Ii][Nn][Tt]" PRINT
"[Cc][Ll][Ss]" CLS
"[Rr][Ee][Mm][^\r\n]*" REM
"[Ee][Nn][Dd]" END
"[Gg][Oo][Tt][Oo]" GOTO
"[Ff][Oo][Rr]" FOR
"[Ss][Tt][Ee][Pp]" STEP
"[Nn][Ee][Xx][Tt]" NEXT
"[Tt][Oo]" TO
"\:" COLON
"=" EQUALS
"\".*?\"" STRING
"[a-zA-Z_][a-zA-Z0-9_]*"     IDENTIFIER
"[ \t]+" WHITESPACE
"[\r\n]+" NEWLINE
"[0-9]?\.+[0-9]+" FLOAT
"[0-9]+" INTEGER
"'.*" APOSTROPHE
"\(" LPAREN
"\)" RPAREN
"\*" ASTERISK
"\/" SLASH
"\+" PLUS
"\-" MINUS 

These rules can properly parse the following program:

10 CLS
20 PRINT "Hello" : PRINT " World!"
30 X = 5
40 Y = X * 2 + 3.5
50 PRINT Y
60 FOR Z = 50 TO 1 STEP -1
70 PRINT "Z = " + Z
80 NEXT Z
90 END

Feel free to expand the parser rules and try to make your own BASIC compiler!

History

  • 7/3/2011 - Version 1.0 released
  • 7/5/2011 - Version 1.1 released. TokenIcer can now save grammar and test input files!
  • 7/7/2011 - Version 1.2 released. TokenIcer now works properly when using reserved VB.NET or C# keywords as identifiers. Also, did some code refactoring to clean some stuff up.
  • 7/8/2011 - Version 1.3 released. The GetToken() engine is now about 95% faster (thank you T_uRRiCA_N for the help). Also, the rule editor text box now has line numbers for easier reading.
  • 7/9/2011 - Version 1.4 released. Before I get into the new features, I just want to apologize for having so many updates in such a short amount of time. I promise to slow it down a bit and you will not see new updates, at least for a week or so (unless there is some major bug in this release). Okay new features for this release include syntax highlighting, as well as a new option menu to disable or change the colors, as well as the ability to disable line numbering. Also, this version now supports comments, both inline and full line comments. Also added is the option to create XML-style comments so you can create documentation using SandCastle or some other program.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)