The case against Regular Expressions

A regular expression is a sequence of symbols that defines a pattern. This pattern is then used to search for characters or combinations of characters within text.

In their simplest form, regular expressions provide a viable method for text matching and extraction through a concise API (in the JavaScript world, the RegExp API). Below is an example of how regular expressions can help when working with text:

// constructor form
new RegExp('https?://(www\\.)?youtube\\.com/(watch|playlist).*');

// literal form
/https?:\/\/(www\.)?youtube\.com\/(watch|playlist).*/;

Looking at the above code, you can clearly identify the intent without paying much attention to symbols and flags: it is trying to match YouTube video or playlist URLs.
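As a quick check of that intent, here is a minimal sketch that runs the pattern against a few URLs. The sample URLs and the `youtubePattern` name are made up for illustration.

```javascript
// The YouTube pattern in literal form (slashes and dots escaped).
const youtubePattern = /https?:\/\/(www\.)?youtube\.com\/(watch|playlist).*/;

// Hypothetical sample URLs, for illustration only.
const urls = [
  'https://www.youtube.com/watch?v=abc123',
  'http://youtube.com/playlist?list=xyz',
  'https://www.youtube.com/channel/some-channel',
  'https://example.com/watch?v=abc123',
];

// RegExp.prototype.test returns true when the pattern is found in the string.
const results = urls.map((url) => youtubePattern.test(url));
console.log(results); // [true, true, false, false]
```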

Let's have a look at a different and apparently simple example: a library book feed. Book records might look like this:

...
The Hunger Games, Suzanne Collins - ISBN: 0439023483 - pages 374
Gone with the Wind, Margaret Mitchell - ISBN: 0446675539  - pages 1037
...

If I were asked to extract the ISBN number from each of the feed records, a regular expression would definitely be one of the solutions I would consider:

const bookPattern = /ISBN:\s(\d+)/gm;

A junior developer looking at this code may realise that I am trying to match the ISBN number, or some text around it. With one eye on the pattern and the other on the manual, they might work out that I was trying to match:

an "ISBN" label
followed by a space
followed by some numbers
in all lines of text

Indeed, that is the intent. The results would be acceptable, but not exactly what we wanted. With the g flag, match returns each full match and ignores the capture group:

text.match(bookPattern);
// ["ISBN: 0439023483", "ISBN: 0446675539"]
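To see why the whole "ISBN: ..." text comes back, it may help to compare how `match` behaves with and without the `g` flag. The sketch below is based on the two sample records from the feed; the `sampleFeed` name is chosen for this example only.

```javascript
// The two sample records from the feed; `sampleFeed` is a name made up here.
const sampleFeed = [
  'The Hunger Games, Suzanne Collins - ISBN: 0439023483 - pages 374',
  'Gone with the Wind, Margaret Mitchell - ISBN: 0446675539 - pages 1037',
].join('\n');

// With the g flag, match() returns every full match and drops the groups.
const allMatches = sampleFeed.match(/ISBN:\s(\d+)/g);
console.log(allMatches); // ["ISBN: 0439023483", "ISBN: 0446675539"]

// Without the g flag, match() returns only the first match,
// but the captured group is available at index 1.
const firstMatch = sampleFeed.match(/ISBN:\s(\d+)/);
console.log(firstMatch[0]); // "ISBN: 0439023483"
console.log(firstMatch[1]); // "0439023483"
```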

In order to match the ISBN number only, we would not be searching for the "ISBN label, followed by a space, followed by a number" but rather the "ISBN number, preceded by the ISBN label". It does seem like a small tweak to our pattern: in practice, we are looking to introduce a positive lookbehind assertion.

Our pattern would then look like this:

const bookPattern = /(?<=ISBN:\s)(\d+)/gm;
text.match(bookPattern);
// ["0439023483", "0446675539"]

And no, it is not a small tweak to me.

A junior developer would possibly still understand that I am looking for the ISBN number. However, a number of symbols now create noise and, in more convoluted scenarios, they could divert the focus from the business problem itself.

Let's say we have a new business requirement:

"We just want 13-digit ISBNs, and only if the record specifies the number of pages." - The Business

It seems like a realistic request. Let's see how this translates into work:

const bookPattern = /(?<=ISBN:\s)(\d{13})(?=\s-\spages)/gm;
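A minimal sketch of how this pattern behaves. The records below are hypothetical, since the two sample records shown earlier carry 10-digit ISBNs; the titles, authors and 13-digit ISBN values are invented for the example.

```javascript
// Hypothetical records: titles, authors and 13-digit ISBNs are invented.
const records = [
  'Book A, Author One - ISBN: 9780000000001 - pages 374', // 13 digits, has pages
  'Book B, Author Two - ISBN: 0439023483 - pages 1037',   // 10 digits
  'Book C, Author Three - ISBN: 9780000000002',           // 13 digits, no pages
].join('\n');

const isbn13WithPages = /(?<=ISBN:\s)(\d{13})(?=\s-\spages)/gm;

// Only the first record satisfies both the lookbehind and the lookahead.
const isbn13Matches = records.match(isbn13WithPages);
console.log(isbn13Matches); // ["9780000000001"]
```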

Frankly speaking, our pattern now looks more like a merge conflict than a high-level piece of code.

Unfortunately, the data that we happen to work with is rarely well structured, and abuse of regular expressions is also quite common. Here is a solution proposed in a Stack Overflow thread for somebody looking to validate a URL:

/((([A-Za-z]{3,9}:(?:\/\/)?)(?:[-;:&=\+\$,\w]+@)?[A-Za-z0-9.-]+|(?:www.|[-;:&=\+\$,\w]+@)[A-Za-z0-9.-]+)((?:\/[\+~%\/.\w-_]*)?\??(?:[-\+=&;%@.\w_]*)#?(?:[\w]*))?)/

Trying to understand the above is time-consuming. Even though the intent could be made explicit through variables or test coverage, I would not want to find myself looking at an incomprehensible, unmaintainable piece of code.
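As one possible illustration of expressing intent through variables, the sketch below rebuilds the 13-digit ISBN pattern from named pieces. All variable names are invented for this example, and this is only one of several ways to go about it.

```javascript
// Each piece of the pattern gets a name that states its purpose.
// All names here are made up for illustration.
const isbnLabel = 'ISBN:\\s';
const thirteenDigits = '\\d{13}';
const pagesSuffix = '\\s-\\spages';

// Assemble the pieces: lookbehind for the label, lookahead for the pages field.
const namedPattern = new RegExp(
  `(?<=${isbnLabel})(${thirteenDigits})(?=${pagesSuffix})`,
  'gm'
);

console.log(namedPattern.source);
// (?<=ISBN:\s)(\d{13})(?=\s-\spages)
```

The resulting pattern is identical; the names only move the explanation out of the reader's head and into the code.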

Alternatives to using Regular Expressions

Abstracting regular expressions is quite unpopular, and there seems to be an unconditional love for regular expressions in the developer world that I struggle to understand. To me, this is a problem worth abstracting, as explained in this article.

Fortunately, I'm not alone in thinking this way, and libraries such as Verbal Expressions offer a great API for constructing regular expression patterns. I would watch out for the upcoming version 2.0, a complete rewrite with increased support for patterns.
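To give a feel for what such an abstraction might look like, here is a toy fluent builder. It is not the Verbal Expressions API: the `PatternBuilder` class and its method names are hypothetical and exist only in this sketch.

```javascript
// Toy builder for illustration only; NOT the Verbal Expressions API.
class PatternBuilder {
  constructor() {
    this.parts = [];
  }

  // Escape regex metacharacters so the text is matched literally.
  static escape(text) {
    return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  }

  // Match the given text exactly.
  literal(text) {
    this.parts.push(PatternBuilder.escape(text));
    return this;
  }

  // Match the given text zero or one time.
  maybe(text) {
    this.parts.push(`(?:${PatternBuilder.escape(text)})?`);
    return this;
  }

  // Match any one of the given alternatives.
  anyOf(...options) {
    const escaped = options.map((option) => PatternBuilder.escape(option));
    this.parts.push(`(?:${escaped.join('|')})`);
    return this;
  }

  // Produce a plain RegExp from the collected parts.
  build(flags = '') {
    return new RegExp(this.parts.join(''), flags);
  }
}

const youtubeUrl = new PatternBuilder()
  .literal('http')
  .maybe('s')
  .literal('://')
  .maybe('www.')
  .literal('youtube.com/')
  .anyOf('watch', 'playlist')
  .build();

console.log(youtubeUrl.test('https://www.youtube.com/watch?v=abc123')); // true
console.log(youtubeUrl.test('https://example.com/watch?v=abc123'));     // false
```

The call chain reads close to the plain-English description of the pattern, which is the point of this kind of abstraction.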

Sometimes breaking the problem into smaller parts, perhaps resorting to array/string manipulation (mind the performance!), can also be a viable, but often overlooked, alternative.
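As a rough sketch of that approach, the ISBN extraction from earlier could be done with plain string operations. It assumes every record separates its fields with " - " and labels the ISBN field with "ISBN:"; the `extractIsbn` helper is a name made up for this example.

```javascript
// Hypothetical helper: returns the ISBN of a record, or null when absent.
// Assumes fields are separated by ' - ' and the ISBN field starts with 'ISBN:'.
function extractIsbn(record) {
  const label = 'ISBN:';
  const field = record
    .split(' - ')
    .map((part) => part.trim())
    .find((part) => part.startsWith(label));
  return field ? field.slice(label.length).trim() : null;
}

const feed = [
  'The Hunger Games, Suzanne Collins - ISBN: 0439023483 - pages 374',
  'Gone with the Wind, Margaret Mitchell - ISBN: 0446675539  - pages 1037',
];

console.log(feed.map(extractIsbn)); // ["0439023483", "0446675539"]
```

It is longer than the regular expression, but each step can be read, named and tested on its own.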

In Conclusion

Regular expressions are powerful tools, and with power comes responsibility.

I find most regular expression patterns easy to construct but not so easy to decode, and I tend to use them only if the pattern is immediately readable. I am not surprised that a native abstraction is not available (there is no JavaScript standard library, let alone a regex package); however, the language and its developers would benefit from a higher-level API.
