Click here to Skip to main content
65,938 articles
CodeProject is changing. Read more.
Articles / programming / regular-expression

An Improved Algorithm for Converting State Machines to Regular Expressions

4.99/5 (13 votes)
6 Jan 2022MIT9 min read 16.3K   121  
This article describes an improvement to the state removal algorithm for converting FAs to regular expressions
This article is primarily of interest to fellow eggheads like me who want to understand more about the mechanics of pattern matching streams using finite automata. In this case, I propose an improved algorithm for converting automatons into regular expressions that can be used to recreate them. The regular expressions created using this algorithm are more readable than those using the vanilla state-removal algorithm.

Introduction

Licensing

The code presented here is MIT licensed, but the concepts I outline are public domain. You are welcome to use those concepts without attribution (although it is appreciated) but if you use the code or derive code from it, then the MIT license applies.

Software Prerequisites

In order to render the state graphs, GraphViz is required. You can download it here.

What is This All About?

Ultimately, this is about regular expressions, which are implemented using state machines. The most common algorithm for converting a regular expression into a state machine - Thompson's construction followed by subset construction - is well known. There are also three less well known, but viable algorithms for converting one of these state machines back into a regular expression. This article is an exploration and improvement of one of these.

Why Do it at All?

There are several reasons you may want to convert an finite automata based state machine back into a regular expression:

  1. Debugging: It makes it far easier to look at a machine or a subset of a machine as a more human readable regular expression string than it does a state object graph.
  2. Display: Particularly when dealing with subsets of a machine, like when building lexer code, or when rendering a state graph, being able to produce those subsets as regular expressions makes for readable presentation of those machine subsets.
  3. "Decompilation": Occasionally, you might run into binary file representations of state machines, like those produced by Gold Parser. Using this algorithm, you can turn those prebaked state machines back into an approximation of the regular expression that was used to create them. (It may not be the exact same expression, but it will match the same text.)
  4. Converting DFAs back to NFAs: The use case for this is narrow, because all DFAs are also NFAs (though the reverse is not true). However, it may be useful for comparing graphs, or displaying graphs. This is probably the best DFA to NFA conversion algorithm out there because it is holistic - it converts DFA constructs back into Thompson constructions that created them in the first place, which are then reflected in the NFA. Other algorithms do not.
  5. Curiosity: Even without my improvements, the FA state removal algorithm is interesting, and understanding it can help you become more familiar with the workings of regular expressions and pattern matching.

Background

We're not going to focus on matching code here. This is all about converting regular expressions to and from state machines. That being said, the engine is complete enough that it can simply be extended to do matching pretty easily.

We're going to represent a state in a state machine as a class. The class has an integer accept symbol id and a list of transitions to more instances of that class. Each transition in the list contains a minimum codepoint, a maximum codepoint, and a destination state. The minimum and maximum together indicate a range of matching codepoints in order to make Unicode matching realistic.

We'll be augmenting the state removal method for converting a state machine into a regular expression.

The idea with this technique in its vanilla form is each transition in a state machine is marked by part of a regular expression instead of a codepoint range. You remove states one at a time, and as you remove them, you augment the existing expressions that lead to them until you wind up with a single transition from a root state that contains the expression that represents the entire machine. See the link for the particulars.

In the vanilla algorithm, the expression is represented by a string which gets concatenated to as needed.

We'll be introducing an abstract syntax tree into the mix. An abstract syntax tree is (in this case) an expression tree. It has classes like RegexLiteralExpression, RegexSetExpression, RegexRepeatExpression and RegexConcatExpression. Instead of building out a string, we'll be building out our expression using these.

Doing this allows us to do higher level analysis on the expression after it's built out, potentially highlighting opportunities for simplifying it. This is important because while the state removal algorithm usually produces reasonable results for basic expressions, they can easily become unreadable as the expressions get more complicated. Part of this is because the state removal method will not use higher level operators like +, preferring to render z+ as zz* instead, for example. Another issue with it is overuse of parenthesis. We can fix this kind of thing by analyzing our syntax tree.

Understanding the Code

The Basics

The FA class represents a single state in a state machine. We build state machines by creating instances of these and linking them together using AddTransition().

The RegexExpression class is the base class of expression elements in the abstract syntax tree. All of the other expression classes derive from it. The rest of the RegexXXXXExpression classes are concrete classes that derive from RegexExpression.

We're going to focus on the RegexExpression and derived classes since that's where the meat of the code is. Unlike my usual regex offerings, the parsing and Thompson construction logic is in the abstract syntax tree classes, not in the FA class.

You are free to build up a regular expression using the abstract syntax tree's object model, but it's probably more expedient to call RegexExpression.Parse() to turn a regular expression into a tree for you.

Once it's parsed, you can modify it using the various members, get its string representation using ToString(), or convert it to a state machine using ToFA().

Converting From a State Machine

That being said, the method we're really interested in here is FromFA(). This method takes a state machine and gives you an abstract syntax tree back that represents it. The basic idea is we're going to start with a machine that looks like the original state machine, and then rip states out of it, replacing them with subexpressions that represent them, until we have only the first and final states with a single transition joining them. That single transition contains the entire regular expression.

We need to make a special FA variant. This will be basically a stripped down FA class but the transitions hold RegexExpression objects instead of codepoint ranges:

C#
private class _EFATransition {
    public RegexExpression Expression;
    public _EFA To;
    public _EFATransition(RegexExpression expression = null, _EFA to = null) {
        Expression = expression;
        To = to;
    }
}
private sealed class _EFA {
    public bool IsAccepting;
    public int Accept;
    public List<_EFATransition> Transitions { get; } = new List<_EFATransition>();
    public IList<_EFA> FillClosure(IList<_EFA> result = null) {
        if (result == null) result = new List<_EFA>();
        if (result.Contains(this))
            return result;
        result.Add(this);
        foreach (var t in Transitions) {
            t.To.FillClosure(result);
        }

        return result;
    }
    public static IList<KeyValuePair<_EFA, int>> 
        GetIncomingTransitionIndices(IEnumerable<_EFA> closure, 
            _EFA efa, 
            bool includeLoops = true) {
        
        var result = new List<KeyValuePair<_EFA, int>>();
        foreach (var cfa in closure) {
            var i = 0;
            foreach (var t in cfa.Transitions) {
                if (includeLoops || t.To != cfa) {
                    if (t.To == efa) {
                        var kvp = new KeyValuePair<_EFA, int>(cfa, i);
                        if (!result.Contains(kvp)) {
                            result.Add(kvp);
                        }
                    }
                }
                ++i;
            }
        }
        return result;
    }
    public IDictionary<_EFA, RegexExpression> 
        FillInputTransitionsGroupedByState(IDictionary<_EFA, RegexExpression> result = null) {
        
        if (result == null) {
            result = new Dictionary<_EFA, RegexExpression>();
        }
        for (var i = 0; i < Transitions.Count; ++i) {
            var t = Transitions[i];
            RegexExpression exp;
            if (!result.TryGetValue(t.To, out exp)) {
                var or = new RegexOrExpression(t.Expression);
                result.Add(t.To, or);
            } else {
                var or = exp as RegexOrExpression;
                var oor = t.Expression as RegexOrExpression;
                if(oor!=null) {
                    or.Expressions.AddRange(oor.Expressions);
                } else
                    or.Expressions.Add(t.Expression);
            }
        }
        return result;
    }
}

We have a function that's not present on the standard FA class. GetIncomingTransitionIndices() gives us all the state/transition combinations in the machine that lead to this state. Similarly to the function of the same name in the FA class, FillInputTransitionsGroupedByState() gives us destination state/expression pairs where each state is the destination state that its corresponding expression leads to. This is basically a clean way to get the outgoing transitions from a state.

In FromFA(), the first thing we do is reconstruct an _EFA based machine from the passed in FA based machine:

C#
var closure = fa.FillClosure();
// reserve an extra for the new final state
IList<_EFA> efas = new List<_EFA>(closure.Count + 1);
var i = 0;
while (i <= closure.Count) {
    efas.Add(null);
    ++i;
}
i = 0;
foreach (var cfa in closure) {
    efas[i] = new _EFA();
    ++i;
}
var final = new _EFA();
final.IsAccepting = true;
final.Accept = 0;
efas[i] = final;
for (i = 0; i < closure.Count; ++i) {
    var e = efas[i];
    var c = closure[i];
    if (c.AcceptSymbolId!=-1) {
        e.Transitions.Add(new _EFATransition(null,final));
    }
    for(var j = 0;j<c.Transitions.Count;++j) {
        var ct = c.Transitions[j];
        if(ct.Min==-1 && ct.Max==-1) {
            e.Transitions.Add(new _EFATransition(null, efas[closure.IndexOf(ct.To)]));
        }
    }
    var rngGrps = c.FillInputTransitionRangesGroupedByState();
    foreach (var rngGrp in rngGrps) {
        var tto = efas[closure.IndexOf(rngGrp.Key)];
        if (rngGrp.Value.Count==1) {
            var r = rngGrp.Value[0];
            if(r.Key==r.Value) {
                var lit = new RegexLiteralExpression(r.Key);
                e.Transitions.Add(new _EFATransition(lit, tto));
                continue;
            }
        }
        var sexpr = new RegexSetExpression(rngGrp.Value);
        e.Transitions.Add(new _EFATransition(sexpr, tto));
    }
}

This _EFA machine is now the same as the FA machine except that all the accepting states that were there previously are now non-accepting, and instead transition to the accepting final state on the null expression (epsilon). Every codepoint range has been translated into a basic equivalent expression. Consider the following _EFA machine, resulting from (foo|ba[rz])+[A-Z_a-z][A-Z_a-z0-9]*:

q0:
         -> q1
         -> q22

q1:
        f -> q2

q2:
        o -> q3

q3:
        o -> q4

q4:
         -> q5

q5:
         -> q6

q6:
         -> q7
         -> q17

q7:
         -> q8
         -> q13

q8:
        f -> q9

q9:
        o -> q10

q10:
        o -> q11

q11:
         -> q12

q12:
         -> q6

q13:
        b -> q14

q14:
        a -> q15

q15:
        [rz] -> q16

q16:
         -> q12

q17:
        [A-Z_a-z] -> q18

q18:
         -> q19

q19:
         -> q26
         -> q20

q20:
        [0-9A-Z_a-z] -> q21

q21:
         -> q19

q22:
        b -> q23

q23:
        a -> q24

q24:
        [rz] -> q25

q25:
         -> q5

*q26:

Now we need to start putting them together. What we're going to do is keep making changes to the machine until there are no more changes that can be made. Once that's done, we do one more pass to make sure there are no more changes that can be had, but basically the idea is we change the machine until we can't anymore.

There are several phases to this. In the first phase, we resolve simple cases, where there is only one transition leading to and from a state. We take runs of these simple state transitions and turn them into one:

C#
while (!innerDone) {
    innerDone = true;
    i = 0;
    foreach (var e in efas) {
        if (e.Transitions.Count == 1) {
            var its = _EFA.GetIncomingTransitionIndices(efas, e);
            if (its.Count == 1 && its[0].Key.Transitions.Count == 1) {
                // is a loop?
                if (e.Transitions[0].To == its[0].Key) {
                    var rep = new RegexRepeatExpression();
                    rep.Expression = e.Transitions[0].Expression;
                    rep.MinOccurs = rep.MaxOccurs = 0;
                    e.Transitions[0].Expression = rep;
                } else {
                    var exp = its[0].Key.Transitions[0].Expression;
                    var cat = exp as RegexConcatExpression;
                    if (cat == null) {
                        cat = new RegexConcatExpression();
                        cat.Expressions.Add(exp);
                        exp = cat;
                        its[0].Key.Transitions[0].Expression = cat;
                    }
                    cat.Expressions.Add(e.Transitions[0].Expression);
                    its[0].Key.Transitions[0] = new _EFATransition(exp, e.Transitions[0].To);

                }
                innerDone = false;
                efas = efas[0].FillClosure();
                break;
            } else {
                foreach (var it in its) {
                    // is it a loop?
                    if (efas.IndexOf(it.Key) >= efas.IndexOf(e)) {
                        // yes
                    } else {
                        // no
                        var t = it.Key.Transitions[it.Value];
                        it.Key.Transitions[it.Value] = 
                               new _EFATransition(t.Expression, e.Transitions[0].To);

                        var exp = t.Expression;
                        var cat = exp as RegexConcatExpression;
                        if (cat == null) {
                            cat = new RegexConcatExpression();
                            cat.Expressions.Add(exp);
                            exp = cat;
                            it.Key.Transitions[it.Value].Expression = exp;
                        }
                        cat.Expressions.Add(e.Transitions[0].Expression);
                        innerDone = false;
                        efas = efas[0].FillClosure();
                        break;
                    }
                }
            }
        }
        ++i;
    }
    if (innerDone) {
        efas = efas[0].FillClosure();
    } else
        done = false;
}
...

That looks complicated but most of what it's doing is just checking to make sure there's a singular transition in and out of the state, and then it's ripping those apart and building new transitions that are concatenated off several runs of those. The other thing it does is resolve simple loops to self. After that pass, we get this _EFA machine:

q0:
        foo -> q1
        ba[rz] -> q1

q1:
         -> q2
        [A-Z_a-z] -> q3

q2:
        foo -> q1
        ba[rz] -> q1

q3:
         -> q4
        [0-9A-Z_a-z] -> q3

*q4:

It's starting to get more manageable now! We've got a lot less states, and the beginnings of some regular expressions therein.

Next, we combine multiple transitions like the ones in q0:

C#
innerDone = false;
while (!innerDone) {
    innerDone = true;
    foreach (var e in efas) {
        var rgs = e.FillInputTransitionsGroupedByState();
        if (rgs.Count != e.Transitions.Count) {
            e.Transitions.Clear();
            foreach (var rg in rgs) {
                e.Transitions.Add(new _EFATransition(rg.Value, rg.Key));
            }
            innerDone = false;
            efas = efas[0].FillClosure();
            break;
        }
    }
}
if (innerDone) {
    efas = efas[0].FillClosure();
} else
    done = false;

That yields this:

q0:
        foo|ba[rz] -> q1

q1:
         -> q2
        [A-Z_a-z] -> q3

q2:
        foo|ba[rz] -> q1

q3:
         -> q4
        [0-9A-Z_a-z] -> q3

*q4:

Now we're cooking with gas! There's not much left to do now. Next, we fix loops like q3:

C#
while (!innerDone) {
    innerDone = true;
    foreach (var e in efas) {
        for (var ii = 0; ii < e.Transitions.Count; ++ii) {
            var t = e.Transitions[ii];
            if (t.To == e) {
                // this is a loop
                var rep = new RegexRepeatExpression();
                rep.Expression = t.Expression;
                rep.MinOccurs = rep.MaxOccurs = 0;
                // prepend it to all the other transitions 
                for (var iii = 0; iii < e.Transitions.Count; ++iii) {
                    if (ii != iii) {
                        var tt = e.Transitions[iii];
                        if (tt.To != e) {
                            var cat = tt.Expression as RegexConcatExpression;
                            if (cat == null) {
                                cat = new RegexConcatExpression();
                                cat.Expressions.Add(rep);
                                cat.Expressions.Add(tt.Expression);
                                e.Transitions[iii].Expression = cat;
                            } else {
                                cat.Expressions.Insert(0, rep);
                            }
                        }
                    }
                }
                e.Transitions.RemoveAt(ii);
                --ii;
                innerDone = false;
                efas = efas[0].FillClosure();
                break;
            }

        }
    }
}
if (innerDone) {
    efas = efas[0].FillClosure();
} else
    done = false;

That leaves us with this:

q0:
        foo|ba[rz] -> q1

q1:
         -> q2
        [A-Z_a-z] -> q3

q2:
        foo|ba[rz] -> q1

q3:
        (?:[0-9A-Z_a-z])* -> q4

*q4:

That's all the major steps, so now all we need to do is repeat everything we just did until there's finally nothing left except one transition:

q0:
        (?:foo|ba[rz])(?:(?:foo|ba[rz]))*[A-Z_a-z](?:[0-9A-Z_a-z])* -> q1

*q1:

Now hold on. We're not done yet. We've completed a basic FA to regular expression algorithm by way of the state removal method, but we haven't really improved on it yet.

We have however, built out an abstract syntax tree for this expression, so now our goal is to examine that tree, and simplify it where we can.

Reducing Our Expression

Reduce() will return a simplified version of our expression if possible. It's in this operation where our improvement to the vanilla algorithm is. The idea is to examine the tree to see if anything is redundant or could be simplified. One example would be (foo|) being turned into (foo)?. Another might be turning aa* into a+. Still another might be turning (a|b|c|d|e|f|z|) into [a-fz]?.

The key to the implementation is the TryReduce() method which is called repeatedly by Reduce() until it returns false, each time potentially returning a simplified expression. Each type of expression has its own reduction code.

Here's the one for the RegexConcatExpression. It basically reduces nested expressions, flattens nested concat expressions into one, removes concats with a single expression, and resolves expressions like zz* into z+:

C#
private bool _AddReduced(RegexExpression e) {
    if (e == null) return true;
    var r = false;
    while (e!=null && e.TryReduce(out e)) r = true;
    if (e == null) return true;
    var c = e as RegexConcatExpression;
    if(null!=c) {
        for(var i = 0;i<c.Expressions.Count;++i) {
            var ce = c.Expressions[i];
            if(ce!=null) {
                _AddReduced(ce);
            }
        }
        return true;
    }
    Expressions.Add(e);
    return r;
}
public override bool TryReduce(out RegexExpression reduced) {
    var result = false;
    var cat = new RegexConcatExpression();
    for(var i = 0;i<Expressions.Count;++i) {
        var e = Expressions[i];
        if(e==null) {
            result = true;
            continue;
        }
        if(cat._AddReduced(e)) {
            result = true;
        }
    }
    switch(cat.Expressions.Count) {
    case 0:
        reduced = null;
        return true;
    case 1:
        reduced = cat.Expressions[0].Reduce();
        return true;
    default:
        // fixup things like zz* so it's z+
        for(var i = 1;i<cat.Expressions.Count;++i) {
            var e = cat.Expressions[i].Reduce();
            var rep = e as RegexRepeatExpression;
            if (rep != null) {
                var ee = rep.Expression;
                var cc = ee as RegexConcatExpression;
                if (cc != null) {
                    var k = 0;
                    for (var j = i - cc.Expressions.Count; j < i; ++j) {
                        if (!cc.Expressions[k].Equals(cat.Expressions[j])) {
                            reduced = result ? cat : this;
                            return result;
                        }
                        ++k;
                    }
                    cat.Expressions[i] = new RegexRepeatExpression(cc, 
                        rep.MinOccurs + 1, 
                        rep.MaxOccurs > 0 ? rep.MaxOccurs + 1 : 0).Reduce();
                    cat.Expressions.RemoveRange(i - cc.Expressions.Count, 
                        cc.Expressions.Count);
                    result = true;
                } else {
                    if (cat.Expressions[i - 1].Equals(ee)) {
                        cat.Expressions[i] = new RegexRepeatExpression(ee, 
                            rep.MinOccurs + 1, 
                            rep.MaxOccurs > 0 ? rep.MaxOccurs + 1 : 0).Reduce();
                        cat.Expressions.RemoveAt(i - 1);
                        result = true;
                    }
                }
            }
        }
        reduced = result?cat:this;
        return result;
    }
}

We won't be exploring all of the reduction code, since the concept of each routine is the same - find ways to simplify an expression. RegexRepeatExpression has its own reduction code, but it amounts to the same idea.

Where to Go From Here

Both the state removal algorithm implementation and the reduction implementations have room for improvement. Particularly the state removal algorithm will tend to yield expressions like (foobar|bazbar|fubar) instead of (foo|baz|fu)bar and this is less than ideal. The reduction algorithm will not reduce something like barbarbar to (bar){3} even though it should. The character set reduction will not resolve to character classes, and character class subtraction is not yet implemented.

That being said, even with those limitations, I believe this technique yields a substantial improvement over the existing textbook state removal algorithm.

History

  • 6th January, 2022 - Initial submission

License

This article, along with any associated source code and files, is licensed under The MIT License