Turning vaguely reassuring finite-state machines into regular expressions

2020-12-06 by qntm

There's a Twitter account I like called @happyautomata, which periodically posts randomly-generated finite-state machines. Followers of this account make a hobby of turning the FSMs into regular expressions. This is always possible, because (strictly regular) regular expressions and finite-state machines are exactly equivalent to one another. And sometimes this is easy enough that you can do it in your head.

But sometimes it's very intractable, and you have to work it out by hand. But how?

Here is a worked example. Here is our original state machine:

Note: @happyautomata uses Firefox emojis, which may disagree slightly with what you see rendered in the text below.

Note that I have added labels to the four states in our state machine: A, B, C and D. The initial state is A. Each state has transitions to other states. For example,

From state A, if the machine consumes a 🌹, then we jump to state B.
From state B, if the machine consumes a 🌵, then we jump to state B again.
From state D, there are no transitions.

Additionally, states with two circles are final. If, after consuming all the input, we are at state B or state D, the machine accepts the input. For example, this machine accepts the string "🌹🌼🌼", but not "🌹🌼🌳". If a transition is missing, the machine definitely doesn't accept the string. So, "🌵🌵🌵" is not accepted; we just fall out of the machine if we do that.

We can represent all of this in the form of a set of simultaneous equations.

A = 🌱A | 🌹B | 🌲C | 🌵D
B = ε | 🌵B | 🌼C
C = 🌳C | 🌼D
D = ε

These aren't equations in the usual sense.

A, B, C and D aren't numbers, they are sets of strings. For example, A means "the set of accepted strings if we start at A".
An expression such as "🌹B" is also a set of strings: it means, "the set you get when you take every string in B and put a 🌹 on the front". An expression like "C = 🌼D" means "every string c in C has the form 🌼d for some string d in D".
Pipes mean alternation; you just combine the two sets.
ε means the empty string. For example, because B and D are final states, the empty string is one of the strings they accept.

We need to solve for A. Let's do this from the bottom up. This technique is called the Brzozowski algebraic method. First, let's substitute the equation for D into all the earlier equations, eliminating it:

A = 🌱A | 🌹B | 🌲C | 🌵
B = ε | 🌵B | 🌼C
C = 🌳C | 🌼

We see for example that the string 🌵 is accepted by A.

Now let's look at the equation for C. Notice that C has a self-transition, 🌳. That means we can input 🌳 zero or more times and arrive back at C. Self-transitions are important and have to be dealt with first. We can do this using the Kleene star:

A = 🌱A | 🌹B | 🌲C | 🌵
B = ε | 🌵B | 🌼C
C = 🌳*🌼

Then, we can substitute the equation for C into all of the earlier equations as well:

A = 🌱A | 🌹B | 🌲🌳*🌼 | 🌵
B = ε | 🌵B | 🌼🌳*🌼

Notice how in the equations for both A and B, we have two "static" terms - pure regular expressions with no reference to other states. Let's group those together. Notice how we use the question mark metacharacter from regular expression syntax, which means "this, or the empty string":

A = 🌱A | 🌹B | 🌲🌳*🌼|🌵
B = 🌵B | (🌼🌳*🌼)?

Now let's look at the equation for B. This has a self-transition too. So let's resolve that now. (We couldn't resolve it before this.)

A = 🌱A | 🌹B | 🌲🌳*🌼|🌵
B = 🌵*(🌼🌳*🌼)?

Now let's substitute the equation for B into the equation for A:

A = 🌱A | 🌹🌵*(🌼🌳*🌼)?|🌲🌳*🌼|🌵

Finally, we can resolve that self-transition on A:

A = 🌱*(🌹🌵*(🌼🌳*🌼)?|🌲🌳*🌼|🌵)

And we're done! We have a single regular expression for A. This is a strictly regular regular expression. It features no advanced syntax such as backreferences or lookaround assertions (which are not strictly regular), and it is implicitly anchored at the start and end of the input string. This is the correct answer.

Well, it's one of the many possible correct answers. There are usually many ways to represent any particular finite-state machine as a regular expression. You can see above that there are other ways in which we could have chosen to simplify our expressions. We need not have started our substitution with D, though we do have to finish with A.

Furthermore, figuring out which of the many equivalent regular expressions is the smallest is also very computationally difficult.

Generally, regular expressions for finite-state machines can be very complex and difficult to read, even by regular expression standards.

Automating this process

A while back I created a Python package called greenery which, among other things, is designed to be able to turn finite-state machines into regular expressions. All you have to do is make sure you set up the state machine and all of its transitions correctly, manually, because it can't visually parse and interpret the delightful image up there:

from greenery import fsm, lego

plants = fsm.fsm(
    alphabet = {"🌱", "🌹", "🌵", "🌲", "🌳", "🌼"},
    states = {"A", "B", "C", "D"},
    initial = "A",
    finals = {"B", "D"},
    map = {
        "A": { "🌱": "A", "🌹": "B", "🌲": "C", "🌵": "D" },
        "B": { "🌵": "B", "🌼": "C" },
        "C": { "🌳": "C", "🌼": "D" }
    }
)

computed_regex = lego.from_fsm(plants)
print(computed_regex)

The output from this program is

🌱*((🌲|🌹🌵*🌼)🌳*🌼|🌵|🌹🌵*)

Compare with our hand-derived expression:

🌱*(🌹🌵*(🌼🌳*🌼)?|🌲🌳*🌼|🌵)

They look roughly equivalent, although it's not immediately obvious whether they actually are equivalent. Is there a way to be sure? Cunningly, there is! greenery also provides an API for determining whether two regular expressions are equivalent:

handmade_regex = lego.parse("🌱*(🌹🌵*(🌼🌳*🌼)?|🌲🌳*🌼|🌵)")
print(computed_regex.equivalent(handmade_regex))

This outputs:

True

So, there we go!

Bigger fish

Alright then, shall we try something nastier?

This is rather more complicated because when state B consumes a 💙 it simultaneously transitions to both states C and D.

In effect, what this does is introduce a new "compound" state where the state machine is in both C and D at once. Let's call this new state C+D. C+D's transitions are simply C and D's transitions combined... but as we follow them, we discover more and more compound states.

This kind of thing can get very complex; theoretically, for a finite-state machine with N states and multitransitions of this kind, there can be up to 2N - 1 possible "compound" states. In this case, 4 states could have meant as many as 15 compound states after that conversion.

We also have to keep in mind that if any substate of a compound state is final, then that compound state is itself final. For example, this finite-state machine accepts the two-character string "🤍💙", which leaves it in state C+D, which is hence final.

The hard part, then, is actually writing the transitions out correctly.

hearts = fsm.fsm(
    alphabet = {"🤍", "💙", "❤️"},
    states = {"A", "B", "C", "D", "C+D", "A+D", "A+C", "A+B"},
    initial = "A",
    finals = {"B", "D", "C+D", "A+D", "A+B"},
    map = {
        "A": {"🤍": "B", "💙": "C"},
        "B": {"💙": "C+D", "❤️": "B"},
        "C": {"🤍": "A", "💙": "D"},
        "D": {"💙": "A"},
        "C+D": {"🤍": "A", "💙": "A+D"},
        "A+D": {"🤍": "B", "💙": "A+C"},
        "A+C": {"🤍": "A+B", "💙": "C+D"},
        "A+B": {"🤍": "B", "💙": "C+D", "❤️": "B"},
    }
)
computed_regex = lego.from_fsm(hearts)
print(computed_regex)

This outputs:

(💙(💙{2}|🤍)|🤍(❤️|💙(💙{2}🤍?💙)*💙(💙🤍[❤️🤍]|🤍))*💙(💙{2}🤍?💙)*🤍)*(💙{2}|🤍(❤️|💙(💙{2}🤍?💙)*💙(💙🤍[❤️🤍]|🤍))*(💙(💙{2}🤍?💙)*(💙(💙🤍)?)?)?)

Absolutely ghastly!

And what about by hand? Well, this one looks very nasty to approach by hand, but here is an equally ghastly proposed solution which may or may not have been calculated by hand. Guessing by visual inspection whether these are equivalent is likely fruitless. So let's go to the Python and see...

handmade_regex = lego.parse("(💙(🤍|💙💙)|🤍(❤️|💙(💙💙🤍?💙)*💙(🤍|💙🤍(🤍|❤️)))*💙🤍)*(💙💙|🤍(❤️|💙(💙💙🤍?💙)*💙(🤍|💙🤍(🤍|❤️)))*(💙(💙💙🤍?💙)*(💙(💙🤍)?)?)?)")
print(computed_regex.equivalent(handmade_regex))

This outputs:

False

Which indicates that there is a disagreement. How to find out what the discrepancy could be? Well, first we can compute the symmetric difference of these two regular expressions as a new regular expression (this is how .equivalent works internally, in fact), and then we can iterate over this new regular expression's strings:

symmetric_difference = computed_regex.symmetric_difference(handmade_regex)
string = next(symmetric_difference.strings())
print(string)
print(computed_regex.accepts(string))
print(handmade_regex.accepts(string))

This outputs:

🤍💙💙💙💙🤍🤍
True
False

Yes, we can see here that the string "🤍💙💙💙💙🤍🤍" is accepted by our computed regular expression (and by the original state machine), but, if we look closely, not by the handmade one.

In conclusion, I find all of this quite delightful and entertaining, and I hope you do too.

Turning vaguely reassuring finite-state machines into regular expressions