5

Technically correct: Inversed regex

 3 years ago
source link: https://www.vidarholen.net/contents/blog/?tag=regex
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

Technically correct: Inversed regex

How do I write a regex that matches lines that do not contain hi?

That’s a frequently asked question if I ever saw one.

Of course, the proper answer is: you don’t. You write a regex that does match hi and then invert the matching logic, ostensibly with grep -v. But where’s the fun in that?

One interesting theorem that pops up in any book or class on formal grammars is that regular languages are closed under complement: the inverse of a regular expression is also a regular expression. This means that writing inverted regular expressions is theoretically possible, though it turns out to be quite tricky

Just try writing a regex that matches strings that does not contain “hi”, and test it against “hi”, “hhi” and “ih”, “iih” and such variations. Some solutions are coming up.

A way to cheat is using PCRE negative lookahead: ^(?!.*foo) matches all strings not containing the substring “foo”. However, lookahead assertions require a stack, and thus can’t be modelled as a finite state machine. In other words, they don’t fit the mathematical definition of a regular expression, and therefore disqualify.

There are simple, well-known algorithms for turning regular expressions into non-deterministic finite automata, and from there to deterministic FA. Less commonly used and known are algorithms for inverting a DFA and for generating familiar textual regex from it.

You can find these described in various lecture notes and slides, so I won’t recite them.

What I had a harder time finding was software that actually did this. So here is a Haskell program. It’s highly suboptimal but it does the job. When executed, it will ask for a regex and will then output a grep command that matches everything the regex does not (without -v, obviously).

The expressions it produces are quite horrific; it’s computer generated code, after all.

A regular expression for matching strings that do not match .*hi.* could be grep -E '^([^h]|h+$|h+[^hi])*$'.

This app, however, suggests grep -E '^([^h]([^h]|)*||([^h][^h]*h|h)|([^h][^h]*(h(hh*[^hi]|[^hi]))|(h(hh*[^hi]|[^hi])))((hh*[^hi]|[^h])|)*|([^h][^h]*(h([^hi][^h]*h|h))|(h([^hi][^h]*h|h)))(([^hi][^h]*h|h)|)*)$'

It still works exactly as stated though!

The app just supports a small subset of regex, just enough to convince someone that it works, and as a party trick lets you answer the original question exactly as stated.

Technically correct is the best kind of correct.


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK