This article is primarily of interest to fellow eggheads like me who want to understand more about the mechanics of pattern matching streams using finite automata. In this case, I propose an improved algorithm for converting automata into regular expressions that can be used to recreate them. The regular expressions created using this algorithm are more readable than those produced by the vanilla state-removal algorithm.
Introduction
Licensing
The code presented here is MIT licensed, but the concepts I outline are public domain. You are welcome to use those concepts without attribution (although it is appreciated) but if you use the code or derive code from it, then the MIT license applies.
Software Prerequisites
In order to render the state graphs, GraphViz is required. You can download it from the GraphViz website.
What is This All About?
Ultimately, this is about regular expressions, which are implemented using state machines. The most common algorithm for converting a regular expression into a state machine - Thompson's construction followed by subset construction - is well known. There are also three less well-known but viable algorithms for converting one of these state machines back into a regular expression. This article is an exploration and improvement of one of these.
Why Do it at All?
There are several reasons you may want to convert a finite automata based state machine back into a regular expression:
- Debugging: It is far easier to read a machine, or a subset of a machine, as a human-readable regular expression string than as a state object graph.
- Display: Particularly when dealing with subsets of a machine, such as when building lexer code or rendering a state graph, being able to produce those subsets as regular expressions makes for a readable presentation of them.
- "Decompilation": Occasionally, you might run into binary file representations of state machines, like those produced by Gold Parser. Using this algorithm, you can turn those prebaked state machines back into an approximation of the regular expression that was used to create them. (It may not be the exact same expression, but it will match the same text.)
- Converting DFAs back to NFAs: The use case for this is narrow, because all DFAs are also NFAs (though the reverse is not true). However, it may be useful for comparing graphs or displaying graphs. This is probably the best DFA to NFA conversion algorithm out there because it is holistic - it converts DFA constructs back into the Thompson constructions that created them in the first place, which are then reflected in the NFA. Other algorithms don't do this.
- Curiosity: Even without my improvements, the FA state removal algorithm is interesting, and understanding it can help you become more familiar with the workings of regular expressions and pattern matching.
Background
We're not going to focus on matching code here. This is all about converting regular expressions to and from state machines. That being said, the engine is complete enough that it could be extended to do matching fairly easily.
We're going to represent a state in a state machine as a class. The class has an integer accept symbol id and a list of transitions to more instances of that class. Each transition in the list contains a minimum codepoint, a maximum codepoint, and a destination state. The minimum and maximum together indicate a range of matching codepoints in order to make Unicode matching realistic.
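Roughly, the shape is something like the sketch below. This is simplified and the AddTransition() signature shown is just illustrative - the actual FA class in the accompanying code has more to it - but these are the members this article relies on:
using System.Collections.Generic;
class FATransition {
    public int Min; // first codepoint of the matching range (inclusive)
    public int Max; // last codepoint of the matching range (inclusive)
    public FA To;   // destination state
}
class FA {
    public int AcceptSymbolId = -1; // -1 means the state is not accepting
    public List<FATransition> Transitions = new List<FATransition>();
    // the signature here is an assumption; the real AddTransition() may differ
    public void AddTransition(int min, int max, FA to) {
        Transitions.Add(new FATransition { Min = min, Max = max, To = to });
    }
}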
We'll be augmenting the state removal method for converting a state machine into a regular expression.
The idea with this technique, in its vanilla form, is that each transition in the state machine is marked by part of a regular expression instead of a codepoint range. You remove states one at a time, and as you remove each one, you fold its expressions into the transitions around it, until you wind up with a single transition from a root state that contains the expression representing the entire machine. See the link for the particulars.
In the vanilla algorithm, the expression is represented by a string which gets concatenated to as needed.
We'll be introducing an abstract syntax tree into the mix. An abstract syntax tree is (in this case) an expression tree. It has classes like RegexLiteralExpression, RegexSetExpression, RegexRepeatExpression and RegexConcatExpression. Instead of building out a string, we'll be building out our expression using these.
Doing this allows us to do higher level analysis on the expression after it's built out, potentially highlighting opportunities for simplifying it. This is important because while the state removal algorithm usually produces reasonable results for basic expressions, the results can easily become unreadable as the expressions get more complicated. Part of this is because the state removal method will not use higher level operators like +, preferring to render z+ as zz* instead, for example. Another issue with it is overuse of parentheses. We can fix this kind of thing by analyzing our syntax tree.
Understanding the Code
The Basics
The FA class represents a single state in a state machine. We build state machines by creating instances of these and linking them together using AddTransition().
The RegexExpression class is the base class of expression elements in the abstract syntax tree. All of the other expression classes derive from it. The rest of the RegexXXXXExpression classes are concrete classes that derive from RegexExpression.
We're going to focus on the RegexExpression and derived classes since that's where the meat of the code is. Unlike my usual regex offerings, the parsing and Thompson construction logic is in the abstract syntax tree classes, not in the FA class.
You are free to build up a regular expression using the abstract syntax tree's object model, but it's probably more expedient to call RegexExpression.Parse() to turn a regular expression into a tree for you. Once it's parsed, you can modify it using the various members, get its string representation using ToString(), or convert it to a state machine using ToFA().
Converting From a State Machine
That being said, the method we're really interested in here is FromFA(). This method takes a state machine and gives you back an abstract syntax tree that represents it. The basic idea is we're going to start with a machine that looks like the original state machine, and then rip states out of it, replacing them with subexpressions that represent them, until we have only the first and final states with a single transition joining them. That single transition contains the entire regular expression.
We need to make a special FA variant. This will be basically a stripped down FA class, but the transitions hold RegexExpression objects instead of codepoint ranges:
private class _EFATransition {
public RegexExpression Expression;
public _EFA To;
public _EFATransition(RegexExpression expression = null, _EFA to = null) {
Expression = expression;
To = to;
}
}
private sealed class _EFA {
public bool IsAccepting;
public int Accept;
public List<_EFATransition> Transitions { get; } = new List<_EFATransition>();
// collects this state and every state reachable from it into a list
public IList<_EFA> FillClosure(IList<_EFA> result = null) {
if (result == null) result = new List<_EFA>();
if (result.Contains(this))
return result;
result.Add(this);
foreach (var t in Transitions) {
t.To.FillClosure(result);
}
return result;
}
// finds every (state, transition index) pair in the closure whose transition points at efa, optionally skipping self-loops
public static IList<KeyValuePair<_EFA, int>>
GetIncomingTransitionIndices(IEnumerable<_EFA> closure,
_EFA efa,
bool includeLoops = true) {
var result = new List<KeyValuePair<_EFA, int>>();
foreach (var cfa in closure) {
var i = 0;
foreach (var t in cfa.Transitions) {
if (includeLoops || t.To != cfa) {
if (t.To == efa) {
var kvp = new KeyValuePair<_EFA, int>(cfa, i);
if (!result.Contains(kvp)) {
result.Add(kvp);
}
}
}
++i;
}
}
return result;
}
// groups this state's outgoing transitions by destination, or-ing their expressions together
public IDictionary<_EFA, RegexExpression>
FillInputTransitionsGroupedByState(IDictionary<_EFA, RegexExpression> result = null) {
if (result == null) {
result = new Dictionary<_EFA, RegexExpression>();
}
for (var i = 0; i < Transitions.Count; ++i) {
var t = Transitions[i];
RegexExpression exp;
if (!result.TryGetValue(t.To, out exp)) {
var or = new RegexOrExpression(t.Expression);
result.Add(t.To, or);
} else {
var or = exp as RegexOrExpression;
var oor = t.Expression as RegexOrExpression;
if(oor!=null) {
or.Expressions.AddRange(oor.Expressions);
} else
or.Expressions.Add(t.Expression);
}
}
return result;
}
}
We have a function that's not present on the standard FA class. GetIncomingTransitionIndices() gives us all the state/transition combinations in the machine that lead to this state. Similarly to the function of the same name in the FA class, FillInputTransitionsGroupedByState() gives us destination state/expression pairs, where each key is the destination state that its corresponding expression leads to. This is basically a clean way to get the outgoing transitions from a state.
In FromFA(), the first thing we do is reconstruct an _EFA based machine from the passed in FA based machine:
var closure = fa.FillClosure();
IList<_EFA> efas = new List<_EFA>(closure.Count + 1);
var i = 0;
while (i <= closure.Count) {
efas.Add(null);
++i;
}
i = 0;
foreach (var cfa in closure) {
efas[i] = new _EFA();
++i;
}
// a single shared accepting state for the new expression-labeled machine
var final = new _EFA();
final.IsAccepting = true;
final.Accept = 0;
efas[i] = final;
for (i = 0; i < closure.Count; ++i) {
var e = efas[i];
var c = closure[i];
if (c.AcceptSymbolId!=-1) {
e.Transitions.Add(new _EFATransition(null,final));
}
for(var j = 0;j<c.Transitions.Count;++j) {
var ct = c.Transitions[j];
if(ct.Min==-1 && ct.Max==-1) {
e.Transitions.Add(new _EFATransition(null, efas[closure.IndexOf(ct.To)]));
}
}
var rngGrps = c.FillInputTransitionRangesGroupedByState();
foreach (var rngGrp in rngGrps) {
var tto = efas[closure.IndexOf(rngGrp.Key)];
if (rngGrp.Value.Count==1) {
var r = rngGrp.Value[0];
if(r.Key==r.Value) {
var lit = new RegexLiteralExpression(r.Key);
e.Transitions.Add(new _EFATransition(lit, tto));
continue;
}
}
var sexpr = new RegexSetExpression(rngGrp.Value);
e.Transitions.Add(new _EFATransition(sexpr, tto));
}
}
This _EFA machine is now the same as the FA machine except that all the accepting states that were there previously are now non-accepting, and instead transition to the accepting final state on the null expression (epsilon). Every codepoint range has been translated into a basic equivalent expression. Consider the following _EFA machine, resulting from (foo|ba[rz])+[A-Z_a-z][A-Z_a-z0-9]*:
q0:
-> q1
-> q22
q1:
f -> q2
q2:
o -> q3
q3:
o -> q4
q4:
-> q5
q5:
-> q6
q6:
-> q7
-> q17
q7:
-> q8
-> q13
q8:
f -> q9
q9:
o -> q10
q10:
o -> q11
q11:
-> q12
q12:
-> q6
q13:
b -> q14
q14:
a -> q15
q15:
[rz] -> q16
q16:
-> q12
q17:
[A-Z_a-z] -> q18
q18:
-> q19
q19:
-> q26
-> q20
q20:
[0-9A-Z_a-z] -> q21
q21:
-> q19
q22:
b -> q23
q23:
a -> q24
q24:
[rz] -> q25
q25:
-> q5
*q26:
Now we need to start putting them together. The plan is to keep making changes to the machine until there are no more changes to make; once a full pass over it produces no changes, we're done.
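Structurally, the driver is just a fixed-point loop. Condensed, it looks something like this (done and innerDone are the flags you'll see in the snippets below):
// condensed sketch of the outer loop in FromFA(): keep applying the phases
// below until one full pass over the machine changes nothing
var done = false;
while (!done) {
    done = true;
    // phase 1: collapse runs of states that have exactly one transition in and out
    // phase 2: merge parallel transitions to the same destination into or-expressions
    // phase 3: fold self-loops into repeat (Kleene star) expressions
    // each phase sets done = false whenever it changed the machine
}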
There are several phases to this. In the first phase, we resolve the simple cases, where there is only one transition leading into and out of a state. We take runs of these simple state transitions and collapse them into one:
while (!innerDone) {
innerDone = true;
i = 0;
foreach (var e in efas) {
if (e.Transitions.Count == 1) {
var its = _EFA.GetIncomingTransitionIndices(efas, e);
if (its.Count == 1 && its[0].Key.Transitions.Count == 1) {
if (e.Transitions[0].To == its[0].Key) {
var rep = new RegexRepeatExpression();
rep.Expression = e.Transitions[0].Expression;
rep.MinOccurs = rep.MaxOccurs = 0;
e.Transitions[0].Expression = rep;
} else {
var exp = its[0].Key.Transitions[0].Expression;
var cat = exp as RegexConcatExpression;
if (cat == null) {
cat = new RegexConcatExpression();
cat.Expressions.Add(exp);
exp = cat;
its[0].Key.Transitions[0].Expression = cat;
}
cat.Expressions.Add(e.Transitions[0].Expression);
its[0].Key.Transitions[0] = new _EFATransition(exp, e.Transitions[0].To);
}
innerDone = false;
efas = efas[0].FillClosure();
break;
} else {
foreach (var it in its) {
if (efas.IndexOf(it.Key) >= efas.IndexOf(e)) {
} else {
var t = it.Key.Transitions[it.Value];
it.Key.Transitions[it.Value] =
new _EFATransition(t.Expression, e.Transitions[0].To);
var exp = t.Expression;
var cat = exp as RegexConcatExpression;
if (cat == null) {
cat = new RegexConcatExpression();
cat.Expressions.Add(exp);
exp = cat;
it.Key.Transitions[it.Value].Expression = exp;
}
cat.Expressions.Add(e.Transitions[0].Expression);
innerDone = false;
efas = efas[0].FillClosure();
break;
}
}
}
}
++i;
}
if (innerDone) {
efas = efas[0].FillClosure();
} else
done = false;
}
...
That looks complicated, but most of what it's doing is just checking to make sure there's a single transition into and out of the state, and then ripping those apart and building new transitions that concatenate several runs of them. The other thing it does is resolve simple loops to self. After that pass, we get this _EFA machine:
q0:
foo -> q1
ba[rz] -> q1
q1:
-> q2
[A-Z_a-z] -> q3
q2:
foo -> q1
ba[rz] -> q1
q3:
-> q4
[0-9A-Z_a-z] -> q3
*q4:
It's starting to get more manageable now! We've got a lot fewer states, and the beginnings of some regular expressions therein.
Next, we combine multiple transitions like the ones in q0:
innerDone = false;
while (!innerDone) {
innerDone = true;
foreach (var e in efas) {
var rgs = e.FillInputTransitionsGroupedByState();
if (rgs.Count != e.Transitions.Count) {
e.Transitions.Clear();
foreach (var rg in rgs) {
e.Transitions.Add(new _EFATransition(rg.Value, rg.Key));
}
innerDone = false;
efas = efas[0].FillClosure();
break;
}
}
}
if (innerDone) {
efas = efas[0].FillClosure();
} else
done = false;
That yields this:
q0:
foo|ba[rz] -> q1
q1:
-> q2
[A-Z_a-z] -> q3
q2:
foo|ba[rz] -> q1
q3:
-> q4
[0-9A-Z_a-z] -> q3
*q4:
Now we're cooking with gas! There's not much left to do now. Next, we fix loops like q3:
while (!innerDone) {
innerDone = true;
foreach (var e in efas) {
for (var ii = 0; ii < e.Transitions.Count; ++ii) {
var t = e.Transitions[ii];
if (t.To == e) {
var rep = new RegexRepeatExpression();
rep.Expression = t.Expression;
rep.MinOccurs = rep.MaxOccurs = 0;
for (var iii = 0; iii < e.Transitions.Count; ++iii) {
if (ii != iii) {
var tt = e.Transitions[iii];
if (tt.To != e) {
var cat = tt.Expression as RegexConcatExpression;
if (cat == null) {
cat = new RegexConcatExpression();
cat.Expressions.Add(rep);
cat.Expressions.Add(tt.Expression);
e.Transitions[iii].Expression = cat;
} else {
cat.Expressions.Insert(0, rep);
}
}
}
}
e.Transitions.RemoveAt(ii);
--ii;
innerDone = false;
efas = efas[0].FillClosure();
break;
}
}
}
}
if (innerDone) {
efas = efas[0].FillClosure();
} else
done = false;
That leaves us with this:
q0:
foo|ba[rz] -> q1
q1:
-> q2
[A-Z_a-z] -> q3
q2:
foo|ba[rz] -> q1
q3:
(?:[0-9A-Z_a-z])* -> q4
*q4:
That's all the major steps, so now all we need to do is repeat everything we just did until there's finally nothing left except one transition:
q0:
(?:foo|ba[rz])(?:(?:foo|ba[rz]))*[A-Z_a-z](?:[0-9A-Z_a-z])* -> q1
*q1:
Now hold on. We're not done yet. We've completed a basic FA to regular expression algorithm by way of the state removal method, but we haven't really improved on it yet.
We have, however, built out an abstract syntax tree for this expression, so now our goal is to examine that tree and simplify it where we can.
Reducing Our Expression
Reduce() will return a simplified version of our expression if possible. It's in this operation where our improvement to the vanilla algorithm lies. The idea is to examine the tree to see if anything is redundant or could be simplified. One example would be (foo|) being turned into (foo)?. Another might be turning aa* into a+. Still another might be turning (a|b|c|d|e|f|z|) into [a-fz]?.
The key to the implementation is the TryReduce() method, which is called repeatedly by Reduce() until it returns false, each time potentially returning a simplified expression. Each type of expression has its own reduction code.
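A minimal sketch of that driver, assuming the same TryReduce() signature used below (the real Reduce() may differ in its details):
// sketch only: ask the node to rewrite itself until TryReduce() reports
// that no further simplification was possible
public RegexExpression Reduce() {
    RegexExpression current = this;
    RegexExpression next;
    while (current != null && current.TryReduce(out next)) {
        current = next;
    }
    return current;
}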
Here's the one for the RegexConcatExpression. It basically reduces nested expressions, flattens nested concat expressions into one, removes concats with a single expression, and resolves expressions like zz* into z+:
// reduces e and adds it to this concatenation, flattening nested concatenations as it goes
private bool _AddReduced(RegexExpression e) {
if (e == null) return true;
var r = false;
while (e!=null && e.TryReduce(out e)) r = true;
if (e == null) return true;
var c = e as RegexConcatExpression;
if(null!=c) {
for(var i = 0;i<c.Expressions.Count;++i) {
var ce = c.Expressions[i];
if(ce!=null) {
_AddReduced(ce);
}
}
return true;
}
Expressions.Add(e);
return r;
}
public override bool TryReduce(out RegexExpression reduced) {
var result = false;
var cat = new RegexConcatExpression();
for(var i = 0;i<Expressions.Count;++i) {
var e = Expressions[i];
if(e==null) {
result = true;
continue;
}
if(cat._AddReduced(e)) {
result = true;
}
}
switch(cat.Expressions.Count) {
case 0:
reduced = null;
return true;
case 1:
reduced = cat.Expressions[0].Reduce();
return true;
default:
// looks for runs like zz* and folds them into a single repeat such as z+
for(var i = 1;i<cat.Expressions.Count;++i) {
var e = cat.Expressions[i].Reduce();
var rep = e as RegexRepeatExpression;
if (rep != null) {
var ee = rep.Expression;
var cc = ee as RegexConcatExpression;
if (cc != null) {
var k = 0;
for (var j = i - cc.Expressions.Count; j < i; ++j) {
if (!cc.Expressions[k].Equals(cat.Expressions[j])) {
reduced = result ? cat : this;
return result;
}
++k;
}
cat.Expressions[i] = new RegexRepeatExpression(cc,
rep.MinOccurs + 1,
rep.MaxOccurs > 0 ? rep.MaxOccurs + 1 : 0).Reduce();
cat.Expressions.RemoveRange(i - cc.Expressions.Count,
cc.Expressions.Count);
result = true;
} else {
if (cat.Expressions[i - 1].Equals(ee)) {
cat.Expressions[i] = new RegexRepeatExpression(ee,
rep.MinOccurs + 1,
rep.MaxOccurs > 0 ? rep.MaxOccurs + 1 : 0).Reduce();
cat.Expressions.RemoveAt(i - 1);
result = true;
}
}
}
}
reduced = result?cat:this;
return result;
}
}
We won't be exploring all of the reduction code, since the concept of each routine is the same - find ways to simplify an expression. RegexRepeatExpression has its own reduction code, but it amounts to the same idea.
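To give a flavor of it, one rule such a reduction can apply is collapsing a repeat of a repeat. This is a hedged, illustrative sketch and not the library's actual RegexRepeatExpression code:
public override bool TryReduce(out RegexExpression reduced) {
    // MinOccurs == 0 and MaxOccurs == 0 means a Kleene star in this model,
    // so a star wrapping another star, like (?:a*)*, is just a*
    var inner = Expression as RegexRepeatExpression;
    if (inner != null && MinOccurs == 0 && MaxOccurs == 0 &&
        inner.MinOccurs == 0 && inner.MaxOccurs == 0) {
        reduced = inner;
        return true;
    }
    reduced = this;
    return false;
}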
Where to Go From Here
Both the state removal algorithm implementation and the reduction implementations have room for improvement. In particular, the state removal algorithm will tend to yield expressions like (foobar|bazbar|fubar) instead of (foo|baz|fu)bar, which is less than ideal. The reduction algorithm will not reduce something like barbarbar to (bar){3} even though it should. The character set reduction will not resolve to character classes, and character class subtraction is not yet implemented.
That being said, even with those limitations, I believe this technique yields a substantial improvement over the existing textbook state removal algorithm.
History
- 6th January, 2022 - Initial submission