Prologue
It was August the 18th 2011. The article was
http://blogs.msdn.com/b/netmfteam/archive/2011/08/18/netmf-4-2-regular-expressions.aspx.
The Regular Expression implementation for the
.NET Micro Framework has just been publicly mentioned. The pressure was on for the community to find bugs and sure enough after some time with hungry developers using the library a bug was eventually found however not in the RegEx engine, with the StringBuilder (which was also contributed by myself)
http://forums.netduino.com/index.php?/topic/2717-string-manipulation-in-net-41-micro/.
After promptly fixing the issue in less than 24 hours immediately tested other supported expressions and even utilized my new feature of cycle timing to ensure I was not wandering around into the unmanaged void and creating an access violations on accident.
The main purpose of this article is to decide if the missing methods have any impact on the developers utilizing the library and if further development is even warranted based on this article.
The latter purpose is to ensure that the differences as well as similarities are appropriately highlighted and that the cycle timing feature is explained in way which makes it easy for existing Regex users to understand and use in the same way and last but not least to provide a few laughs at the expense of time alone..
Introduction
With respect to the developers of the Regular Expression syntax and by virtue
of the inventors of the Unix Operating System, not to mention by merit the creator of the language
which powered it, I will say that Regular Expressions are nothing more than a mythical
black box.
It was with great pride that I learned how the great theory of Automata was formed and implemented. I was filled with the odes of joy when I learned there were even two separate
possible representations which reflected what I like to envision as biased and unbiased, Deterministic Finite Automata and Nondeterministic Finite Automata. It was even a greater
feeling when I learned that you could convert between the two and that there was really no difference, save performance or instructions required to complete the determination.
It was with great sadness and .... or ‘Murderous Intent’ that I learned via the history of time through research that both schools of thought were never unified and battled endlessly,
such as the physicists did which inspired some of their reasoning. The disease of difference was forever engrained into the minds of programmers and the symptoms were that each one would
be naive and utilize DFA minimization or other equivalence principles until they woke up and realized that the problem could and should be solved by combining compiled and evolving the
features into RegExNext or such. Even after most realized this, they rather implemented various interpreters into their language implementations and utilized them in an attempted conspiracy
to force the failure or Moore’s Law prematurely and also allow the notion of high capacity hard disks to take off. While initially successful, this plot was thwarted by a flood and newer
solid state technology which actually then benefited the campaign by forcing a rise in cost as well ensuring that solid state technology had the required money for research.
Without exposing my CIA handler and the methods in which I have confirmed these various operations, I know that my story may seem highly suspicious, however,
what I can say is that I
employ you to look into Einstein and how he just so happened to want to escape to America and how he then came up with General Relativity.
I will also say that a proper use of common sense combined with good design and optimization can always outperform a
Regular Expression engine (hands down/up).
Now with that out of the way, I will move onto the meat of the article which I hope is what you
came here looking for rather than my delusional ramblings which only have a slight possibility of being related and a even smaller margin of actually being true.
Please be active and leave feedback if you care about kittens.
Article
Users typically utilize Regular Expressions as a way to exploit what seems like a short cut through some seemingly complex string handling task. But before we begin to get into the
main point of how to use the library and how it differs from the full framework implementation, I will ask that we take a second to
realize other possible utilizations for Regular Expressions.
A developer can use Regular Expressions to actually do quite a bit of work on binary data as well. Once you learn a bit about string encoding and you get into more complex Regular
Expressions (herein known as Regex(‘s)), or if you already do, then you can skip down to the main points and examples.
You can use Regex’s to match sections of logic in a code analysis as well as use them to get relevant or similar portions of an image from another image among many other variations of the
string matching technique which is still applicable since a char is a byte and vice versa and there is no loss or even a conversion which is done. There is more like a
typedef
or an alias
which takes place to provide these constructs to a developer.
Now possibly with your mind even more confused than what it was before or possibly enlightened or even argumentative, I will ask that we let this go for now and move onto the main points.
The new Timing mechanism a.k.a. RegexOptions.Timed
is used to ensure that a match operation does not take longer than the given amount of time defined in ticks.
This was easily achieved by ensuring that the given amount of ticks elapsed since the
initial execution was compared via the following code:
if (timed && (DateTime.Now.Ticks - startTicks) >= MaxTick)
throw new RegexExecutionTimeException(idxStart);
else startTicks = DateTime.Now.Ticks;
It ensures that expressions which include possible endless results such as '\w+' o0r '\b+\' are safely executed and the system is not wasting time needlessly attempting to match.
I also pondered the possibility of adding a Regex.Lazy
flag which would optionally sleep for the given ticks and then possibly increase the allocated ticks until exception.
The RegexExecutionTimeException
class takes the position in the match stack and then allows the catcher to restart the matching process either over or later by providing the
index last evaluated and keeping the stack registers and the class state in the same state as before the exception. Below you will find a meaningless
example:
int catchCount = 0;
string pattern = @"\b(?:(\w+))\s+(\1)\b+";
Regex rx = new Regex(pattern, RegexOptions.IgnoreCase |
RegexOptions.Timed);
rx.MaxTick = ushort.MaxValue * 2;
string text = "The the quick brown fox fox jumped over the lazy dog dog.";
int start = 0;
TryToMatch:
try
{
MatchCollection matches = rx.Matches(text, start);
return MFTestResults.Fail;
}
catch (RegexExecutionTimeException ex)
{
if (++catchCount > 5)
return MFTestResults.Pass;
else
{
start = ex.Index;
goto TryToMatch;
}
}
catch
{
return;
}
A user can find a better value for the MaxTick
value by performing matches on strings with known results to obtain a statistic which may be applied in an algorithm or lack thereof to ensure
stricter operation. The default value is long.MaxValue / 2
.
I thought of adding a Profile
method which was internal, however, I could not get feedback on the implementation let alone did I want to start adding even more nonsense to a seemingly happy
theory and science which did not accept change without what seemed like great loss in both dignity and pride which is of obviously the worst philosophical sin in my humble opinion(s) at the
current time of this writing.
I am not sure of other implementations which have this feature let alone humans who have also expressed such concerns or ideas such as but not limited to the ones I have listed in this writing
nor am I sure that it is a aspect relevant to this article...
You will see from the following Test example that the syntax with respect to the perspective of the C# compiler from the full framework has been kept as close as possible and no such
deviation has been applied to either it or the StringBuilder
class.
string input = "int[] values = { 1, 2, 3 };\n" +
"for (int ctr = values.GetLowerBound(1); ctr <= values.GetUpperBound(1); ctr++)\n" +
"{\n" + " Console.Write(values[ctr]);\n" + " if (ctr < values.GetUpperBound(1))\n" +
" Console.Write(\", \");\n" + "}\n" + "Console.WriteLine();\n";
string pattern = "Console.Write(Line)?";
Match match = Regex.Match(input, pattern);
int matchCount = 0;
string expected0 = null;
string expected1 = null;
while (match.Success)
{
Log.Comment("'" + match.Value + "' found in the source code at position " + match.Index + ".");
if (matchCount == 0) expected0 = match.ToString();
else if (matchCount == 2) expected1 = match.ToString();
match = match.NextMatch();
++matchCount;
}
return matchCount == 3 && expected0 == "Console.Write" &&
expected1 == "Console.WriteLine" ? MFTestResults.Pass : MFTestResults.Fail;
You will also see that MatchCollection
is present.
Regex rx = new Regex(@"\b(?<word>\w+)\s+(\k<word>)\b",
RegexOptions.Compiled | RegexOptions.IgnoreCase);
string text = "The the quick brown fox fox jumped over the lazy dog dog.";
MatchCollection matches = rx.Matches(text);
Log.Comment(matches.Count + " matches found in:\n " + text);
foreach (Match match in matches)
{
GroupCollection groups = match.Groups;
Log.Comment("'" + groups["word"].Value + "'" +
" repeated at positions " + groups[0].Index + " and " + groups[1].Index);
}
return MFTestResults.Pass;
}
catch
{
return MFTestResults.KnownFailure;
}
We will now begin to explore the differences with respect to the Jakarta library from which the code was ported and the full framework implementation.
string pattern = @"\b(?:(\w+))\s+(\1)\b";
Regex rx = new Regex(pattern, RegexOptions.IgnoreCase);
string text = "The the quick brown fox fox jumped over the lazy dog dog.";
MatchCollection matches = rx.Matches(text);
TestTestsHelper.ShowParens(ref rx);
Log.Comment(matches.Count + " matches found in:\n " + text);
foreach (Match match in matches)
{
GroupCollection groups = match.Groups;
Log.Comment("'" + groups[0].Value + "'" + " repeated at positions " +
groups[0].Index + " and " + groups[1].Index);
}
return MFTestResults.KnownFailure;
}
catch
{
return MFTestResults.Fail;
}
From the code example and comments above you will see that the changes to the engine are giving some results which may not be expected at first
with respect to either of the other implementations. This is a classic problem in Regex implementations and great care is needed to ensure that the
theory is followed as closely as possible to ensure compatiblity.
Overall this was the hardest part because of the notion of UTF-8
encoding in the micro framework and the unicode in the full framework. I was even tempted to use shorts rather than char’s for the opcodes for
this very reason however it turned out that it would be okay in the end without such drastic changes.
The bottom line is that the results
are inline with what the Full Framework will return under mostly all circumstances when using syntax that is supported.
For example, Named capture is NOT supported however it would be trivial to add to the current implementation as the members are already present
and the engine just needs small changes to make it actually work.. (Which may end up being another article in this series depending on the feedback)
the same with various other parts of the syntax which could and may very well be added depending on how the feedback and community discussion coalesce
combined with how my work load is dealt with.
The implementation also keeps a Cache of recently compiled regular expressions (just like the full
framework implementation). The cache can be cleared by setting the static member Regex.CacheSize to -1 which clears the cache and prevents newly compiled
expressions from being added. The default cache size is 25 instances at which point they will be recycled on a FILO basis from the cache.
What’s Missing or Different?
In the 4.2 implementation there are some members and methods missing from the Regex class e.g. those for operating on a Capture.
I will elaborate further and you will hopefully decide if this impacts development....
For example take the previously stated issue on UTF-8 vs Unicode characters and the following code example.
bool result = true;
int expectedCount = 6;
string pattern = @"\b(\w+?)([ ])";
string input = "Microsoft Office Professional Edition combines several office " +
"productivity products, including Word, Excel , Access , Outlook , " +
"PowerPoint , and several others. Some guidelines for creating " +
"corporate documents using these productivity tools are available " +
"from the documents created using Silverlight on the corporate " +
"intranet site.";
Regex test = new Regex(pattern);
MatchCollection matches = test.Matches(input);
foreach (Match match in matches)
{
GroupCollection groups = match.Groups;
Log.Comment(groups[2] + ": " + groups[1]);
}
Log.Comment("Found "+matches.Count+" trademarks or registered trademarks.");
if (matches.Count != expectedCount)
{
result = false;
}
if (matches[0].ToString() != "Microsoft ")
{
result = false;
}
else if (matches[1].ToString() != "Excel ")
{
result = false;
}
else if (matches[2].ToString() != "Access ")
{
result = false;
}
else if (matches[3].ToString() != "Outlook ")
{
result = false;
}
else if (matches[4].ToString() != "PowerPoint ")
{
result = false;
}
else if (matches[5].ToString() != "Silverlight ")
{
result = false;
}
return result ? MFTestResults.Pass : MFTestResults.Fail;
The weird character is supposed to be either ©, ™, ℠, or ® however with the UTF-8 encoding at use in the framework this was not a ‘showstopper’ as people would basically just have to understand this platform difference, I could have ended up doing something weird or fancy to try and solve this however I also see the framework growing and supporting various different locales so I will have to wait for feedback to determine how much of a pain in the ass this really is, I suspect since that this can also technically happen on the full framework if you are not careful and that it is not a big as deal as I am making it however I will also say that the character literals may not been the best way to match and that there might be a need for specific codes to be given as well as encoding in future expression syntax parsers, either that or at least an Encoding member of the Regex class itself to ensure that it can optionally matching against alternate encoding.
The entire process of porting the code took all of 4 hours if that. I was compiling expression strings into complete RegexProgram’s and all of the tests passed! I was very happy with myself because I did not end up using a code tool to do the port, I did it by hand and the results spoke for themselves. Because of this I was able to jump in and out of the code as if it was my own.
Some of the comments were lacking but I did my best to ensure that a proper description was given to methods and that each line which was even remotely hard to read has some type of comment ensuring future developers would understand why I was using the variables in the series I was et al;
I basically then started top down with MSDN examples which I knew had to produce consistent results across both implementations.
This is where the headaches started, the matching was as per the theory in the engine however the names of the members were totally different that in the full framework implementations. The API was also vastly different.
This took about another 2 or 3 days to get completely right with no tests failing from the ported libraries unit test code.
The code
reviewer I was working with had some unfortunate problems so this delayed the review process for my assignment about 2 whole months. After Microsoft saw I was so bored that I did a complete WebSocket implementation in less than 48 hours they emailed me and the code reviewer to find out why I was now working on WebSockets... long story short Lorenzo directly reviewed and helped me make some decisions such as to redo the StringBuilder from the full framework code and then to also implement the minimal set of methods which would enable the MSDN examples to pass using the same code.
This sounded like it would be easy at first however I quickly learned that this is where the engines were most different and the the full framework implementation utilizes various specialized matching algorithms based on the expression given and the theories behind each of the matching algorithms.
Without getting too caught up in NFA and DFA or BoyerMoore and Thompson I just made the properties named as per the full framework implementation and this was easy to do, I even promoted and demoted a few fields and properties and things still worked but some of the MSDN examples required the
Match
, MatchCollection
, Capture
, and
Group
classes.
So I began porting them directly over from the full framework to ensure the API was the same and that the results returned from the examples would match.
I was not allowed to take the method such as Pop and Crawl from the full framework without first demonstrating that I understood the code and that I did have a conforming implementation which required the full framework methods not because I wanted to be lazy but because I wanted to ensure that the results returned from my class were consistent with the full framework and I also wanted to ensure the Reflected calls were similar to what was on the Full Framework.
This notion eventually caught on when I was able to demonstrate within 1 week that there were only some very small issues and that I had everything save a few small features missing.
We will now take some time to explore what is missing or different about the implementation and even give some interesting ideas for future implementations.
The main theory of the ported code is to create a
RegexProgram
from the given expression. This program is essentially a state machine which allows it to have certain functions e.g. IsMatch which allow the program to return meaningful results such as true or false or the index of the match inter alia.
There is a Regex Compiler class which is responsible for taking the given expression and compiling it into a
RegexProgram
instance.
There is a
DebugRegexCompiler
which exposes a few exceptions which are typically used for debugging and bug testing but can be used in runtime with a dynamically created regular expressions.
using System;
using System.Text;
using System.Collections;
namespace System.Text.RegularExpressions
{
#if DEBUG
public class RegexDebugCompiler : RegexCompiler
{
static Hashtable hashOpcode = new Hashtable()
{
{OpCode.ReluctantStar, "Star"},
{OpCode.ReluctantPlus, "ReluctantPlus"},
{OpCode.ReluctantMaybe, "ReluctantMaybe"},
{OpCode.EndProgram, "EndProgram"},
{OpCode.BeginOfLine, "BeginOfLine"},
{OpCode.EndOfLine, "EndOfLine"},
{OpCode.Any, "Any"},
{OpCode.AnyOf, "AnyOf"},
{OpCode.Branch, "Branch"},
{OpCode.Atom, "Atom"},
{OpCode.Star, "Star"},
{OpCode.Plus, "Plus"},
{OpCode.Maybe, "Maybe"},
{OpCode.Nothing, "Nothing"},
{OpCode.GoTo, "GoTo"},
{OpCode.Continue, "Continue"},
{OpCode.Escape, "Escape"},
{OpCode.Open, "Open"},
{OpCode.Close, "Close"},
{OpCode.BackRef, "BackRef"},
{OpCode.PosixClass, "PosixClass"},
{OpCode.OpenCluster, "OpenCluster"},
{OpCode.CloseCluster, "CloseCluster"}
};
String OpcodeToString(char opcode)
{
String ret = (String)hashOpcode[opcode];
if (ret == null)
{
ret = "UNKNOWN_OPCODE";
}
return ret;
}
String CharToString(char c)
{
if (c < ' ' || c > 127)
{
return "\\" + (int)c;
}
return c.ToString();
}
String NodeToString(int node)
{
char opcode = Instructions[node ];
int opdata = (int)Instructions[node + Regex.offsetOpdata];
return OpcodeToString(opcode) + ", opdata = " + opdata;
}
public void DumpProgram(System.IO.TextWriter p)
{
for (int i = 0, e = Instructions.Length; i < e; )
{
char opcode = Instructions[i ];
char opdata = Instructions[i + Regex.offsetOpdata];
int next = (short)Instructions[i + Regex.offsetNext];
p.Write(i + ". " + NodeToString(i) + ", next = ");
if (next == 0)
{
p.Write("none");
}
else
{
p.Write(i + next);
}
i += Regex.nodeSize;
if (opcode == OpCode.AnyOf)
{
p.Write(", [");
for (int r = 0; r < opdata; r++)
{
char charFirst = Instructions[i++];
char charLast = Instructions[i++];
if (charFirst == charLast)
{
p.Write(CharToString(charFirst));
}
else
{
p.Write(CharToString(charFirst) + "-" + CharToString(charLast));
}
}
p.Write("]");
}
if (opcode == OpCode.Atom)
{
p.Write(", \"");
for (int len = opdata; len-- != 0; )
{
p.Write(CharToString(Instructions[i++]));
}
p.Write("\"");
}
p.WriteLine("");
}
}
public void dumpProgram()
{
}
}
#endif
}
You can even compile regular expressions into RegexProgram
instance’s and then either serialize them or store their bytecode on a SD card using reflection and then set it back once you instantiate the Regex Class after a reboot. There is a class
RegexPrecompiler
which can be used either under the emulator or on the hardware which is specially designed for this purpose.
using System;
using Microsoft.SPOT;
namespace System.Text.RegularExpressions
{
public class RegexPrecompiler
{
static public void Main(String[] arg)
{
RegexCompiler r = new RegexCompiler();
if (arg.Length <= 0 || arg.Length % 2 != 0)
{
Debug.Print("Usage: recompile <patternname> <pattern>");
return;
}
for (int i = 0, end = arg.Length; i < end; i += 2)
{
try
{
String name = arg[i];
String pattern = arg[i + 1];
String instructions = name + "Instructions";
Debug.Print("\n // Pre-compiled regular expression '" + pattern + "'\n"
+ " private static char[] " + instructions + " = \n {");
RegexProgram program = r.Compile(pattern);
int numColumns = 7;
char[] p = program.Instructions;
for (int j = 0; j < p.Length; j++)
{
if ((j % numColumns) == 0) Debug.Print("\n ");
String hex = (0).ToHexString();
while (hex.Length < 4) hex = "0" + hex;
Debug.Print("0x" + hex + ", ");
}
Debug.Print("\n };");
Debug.Print("\n private static REProgram " + name + " = new REProgram(" + instructions + ");");
}
catch (RegexpSyntaxException e)
{
Debug.Print("Syntax error in expression \"" + arg[i] + "\": " + e.ToString());
}
catch (Exception e)
{
Debug.Print("Unexpected exception: " + e.ToString());
}
}
}
}
}
All in all the assignment took about 4 months from first copy / paste to the final submittal to Microsoft. (And mind you this was with no other documentation or help (by anyone) other than that you [The Reader] would have if you did this yourself).
Fortunately everyone was pleased and apparently there were no bugs which increased my confidence in my skills and caused my reputation to start to go to my head.
In the end If I could do it again I probably would take more of an initiative into getting the syntax and engine the way I wanted it while adding methods for special localized matching in a expression.
E.g. something ruff like this:
//In less than 65535 ticks match the given expression ‘(\.?\)’, pass the results into the function present and return the result.
“\ \In{65535} (\.?\) \-> EvaluateMatch\” //Where EvaluateMatch is a MatchEvaluator or similar
//In less than 65535 ticks match the given expression ‘(\b+)’, pass the results into the function which sleeps for 100 milliseconds and returns the match.
“\ \In{100} (\b+) \-> { Sleep(_GivenTime_); return $1; }\” //_GivenTime_ would implicitly be 100 because it has been given to the In construct.
I would also change how some of the compiling is done by making it use more of the CLR string methods where possible rather than compiling my own type but this may be too much for the micro framework to perform dynamically at this time although it technically is possible.
I imagine if language implementations in general utilized this syntax then reflection and scripting would be something built into every language after there was a conforming regular expression engine.
Just imagine how it would be to perform such a task such filtering raw data on a ethernet adapter if you would be able to write a regular expression to look for data and this expression had the ability to invoke delegates or if you were writing a virus scanner or memory monitor... you could basically use the matching engine to find out when code changes and last but not least imagine how it would be to be able to pop down to assembly / MSIL / bytecode in an expression and have it create code dynamically or patch segments intelligently or to have it call other functions to do so or a combination of both.
I honestly feel that given this opportunity developers would even be able to achieve better interoperability with interpreted code as it would be given directly to the Regular Expression layer and converted into the machine language and executed allowing a much higher degree of security and standardization then various different API’s which exist and classic pointer passing. Just remember it is easy to pass a pointer to another region of memory but it is almost always invalid to pass a pointer across machine boundaries, with this new notion developers would be able to pass Regular Expressions across a machine boundary and have the same expectation of code execution as per the system they were running on; offsets and pointers become localized to the expression and they again have some meaning and it also means they can be transported across machine boundaries with little or no trouble at all just so long as the receiver is parsing them according to the same standard constructs as per all network communication.
I think this is what the original implementer's had in mind when developing ‘Regular Expressions’ however everyone got caught up in the math and then we remembers it was math with strings and forgot that strings are just bytes as well...
Closing
In closing I would also have liked to see more methods related to binary matching however to each implementation their own, a developer could easily wrap the Regex class and still create it with ease.
If I could change the full framework implementation the only changes I make would be a reduction in the amount of classes that actually perform the various types of matches and state machine conversions.
A single superclass would be created and it would analyze the expression taking into account the string given to match. It would determine the branches to use and build up a RegexProgram which contained branches which were specially optimized for the given string rather than defaulting to a single algorithm based on a few cheesy checks. This would allow all types of node searching logics to be applied dynamically in a single match without using any more overhead than is currently utilized in the implementation.
You can see the two original MSDN articles by Collin Miller here:
Here’s cool article which has nothing to do with the Micro Framework but does explain how to convert Regular Expressions to NFA -> DFA works and does have illustrations
and is written by academic sources: http://lambda.uta.edu/cse5317/notes/node9.html.
[Cites / Thanks]
Myself,
CIA,
The ghosts of Dennis Ritchie, Boyer, Moore and Thompson, Fravia and Christmas Past
and last but not least Marijuana
[Shoutz]
Anonymous,
LulzSec,
AntiSec,
Identology - MYiDONE.com, (0d fa real and isocdftpz)
Federation Studios, Marsh, Gaijin, t4w2, Gracktov, Gellpak and the others @ Eclipse
iDenGodz (BTW I hacked you),
iDEN Insider, tommy and crazysim
PLA Blue,
Zi Hackademy,
MIT EECS
[Pokes]
Area 69, - All of you especially NexVision and his lackeys
howardforums, - All of you n00b’s who wish you could have been in Area 69 (especially yoDude)
iDENGodz, - M00$3
iDEN Insider, AaronASB & Rickster