
JavaScript Regular Expressions for Regular People

 5 years ago
source link: https://www.tuicool.com/articles/hit/aQfURvv

Regular expressions, also known as regex or regexp, are a difficult subject to tackle. Don’t feel ashamed if you’re not 100% comfortable writing your own regular expressions yet; it takes some getting used to. My hope is that by the end of this article, you’ll be one step closer to rocking your own expressions in JavaScript without relying so much on copypasta from Stack Overflow.

The first step to writing a regular expression is to understand how to invoke it. In JavaScript, regular expressions are a standard built-in object. Because of this, we can create a new RegExp object in a few ways:

  • The literal way, /expression/.test('string to test against')
  • The new keyword with a string argument, new RegExp('expression')
  • The new keyword with a literal, new RegExp(/expression/)

I’ll use a combination of the methods just to show that they essentially perform the same job.
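
As a quick sanity check, here is a minimal sketch of the three forms side by side. The pattern and the sample string are my own placeholders, chosen only for illustration.

```javascript
// Three ways to build the same case-insensitive pattern.
const fromLiteral = /aaron/i;
const fromString = new RegExp('aaron', 'i');
const fromLiteralArg = new RegExp(/aaron/, 'i');

const sample = 'Aaron.Arney:alligator.io';

// test() lives on the RegExp and returns a boolean.
const results = [fromLiteral, fromString, fromLiteralArg].map((re) => re.test(sample));
console.log(results); // [ true, true, true ]

// match() lives on the String and returns the match details (or null).
console.log(sample.match(fromString)[0]); // "Aaron"

// In the string form, backslashes must be doubled: '\\.' is the pattern \.
const dotFromString = new RegExp('\\.');
console.log(dotFromString.source); // \.
```

The string form is mostly useful when the pattern has to be assembled at runtime; for fixed patterns the literal is usually easier to read.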

The Goals of our Regular Expression

In my example I’m going to be working with a string that contains my first name, last name, and a domain name. In the real world, the example would need much more thought. There are scores of subtleties when it comes to dealing with names, which I won’t address here.

Let’s say I’m building a dashboard and want to display the name of the logged-in user. I have no control over the data that’s returned to me so I have to make do with what I have.

I need to convert aaron.arney:alligator.io into Aaron Arney [Alligator.io].

Regular expressions fit a lot of logic into a single condensed object. This can and will cause confusion. A good practice is to break down your expression into a form of pseudo-code. This enables us to see what needs to happen and when.

First Last [Domain]

Matching the First Name

To match a string with a regular expression, all you have to do is write the literal text as the pattern. The i at the end of the expression is a flag. The i flag in particular stands for case insensitive. That means that our expression will ignore casing on the string.

const unformattedName = 'aaron.arney:alligator.io';

const found = unformattedName.match(/aaron/i);

console.log(found);
// expected output: Array [ "aaron" ]

That works well, yet in our case it isn’t a good approach since the name of the user isn’t always going to be “Aaron.” This is where we explore programmatically matching strings.

Let’s focus on matching a first name for the time being. Break the word down into individual characters. What do you see?

The name “Aaron” consists of five alphabetic characters. Does every first name have only five characters? No, but it is reasonable to assume that first names range between 1 and 15 characters. To denote a single character in the range a-z, we use the character class [a-z].

Now, if we update our expression to use this character class…

const unformattedName = 'aaron.arney:alligator.io';

const found = unformattedName.match(/[a-z]/i);

console.log(found);
// expected output: Array [ "a" ]

Instead of extracting “aaron” from the string, it only returns “a.” This is expected: a character class on its own matches exactly one character. To repeat the match up to our limit of 15, we use curly brackets. The quantifier {1,15} tells the expression that we want to match the preceding token, our [a-z], between 1 and 15 times. Quantifiers are greedy by default, so the expression takes as many characters as it can within that range.

const unformattedName = 'aaron.arney:alligator.io';
const unformattedNameTwo = 'montgomery.bickerdicke:alligator.io';
const unformattedNameThree = 'a.lila:alligator.io';

const exp = new RegExp(/[a-z]{1,15}/, 'i');

const found = unformattedName.match(exp);
const foundTwo = unformattedNameTwo.match(exp);
const foundThree = unformattedNameThree.match(exp);

console.log(found);
// expected output: Array [ "aaron" ]

console.log(foundTwo);
// expected output: Array [ "montgomery" ]

console.log(foundThree);
// expected output: Array [ "a" ]
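
One consequence of the upper bound is worth seeing once: a name longer than 15 letters is not rejected, it is cut short. The sketch below uses a made-up 20-letter name purely to exercise that limit.

```javascript
const exp = /[a-z]{1,15}/i;

// A hypothetical 20-letter first name, used only to exercise the upper bound.
const longName = 'abcdefghijklmnopqrst.smith:alligator.io';

const found = longName.match(exp);
console.log(found[0]);        // "abcdefghijklmno"
console.log(found[0].length); // 15
```

Whether silently truncating is acceptable depends on your data, so treat the 15 as an assumption to revisit rather than a rule.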

Matching the Last Name

Extracting the last name should be as easy as copying and pasting our first expression. You’ll notice that the match still returns the same value instead of both the first and last names.

Break down the string character by character: there is a full stop separating the names. To account for this, we add the full stop to our expression.

We have to be careful here. The . can mean one of two things in an expression.

  • . - Matches any single character except line terminators
  • \. - Matches a literal full stop

Using either version in this context will generate the same result, but that won’t always be the case. Tools like ESLint will sometimes mark the escape sequence \ as unnecessary, but I say better safe than sorry!
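
To see where the two versions part ways, here is a small sketch. The test strings, including the underscore-separated name, are my own examples and not data from the dashboard.

```javascript
const anyChar = /a.c/;     // . matches any single character except a line terminator
const literalDot = /a\.c/; // \. matches only a full stop

console.log(anyChar.test('a.c'));    // true
console.log(anyChar.test('abc'));    // true
console.log(literalDot.test('a.c')); // true
console.log(literalDot.test('abc')); // false

// The same difference applied to the name pattern:
const looseName = /[a-z]{1,15}.[a-z]{1,15}/i;
const strictName = /[a-z]{1,15}\.[a-z]{1,15}/i;

console.log(looseName.test('aaron_arney'));  // true, because . accepts the underscore
console.log(strictName.test('aaron_arney')); // false
```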

const unformattedName = 'aaron.arney:alligator.io';

const exp = new RegExp(/[a-z]{1,15}\.[a-z]{1,15}/, 'i');

const found = unformattedName.match(exp);

console.log(found);
// expected output: Array [ "aaron.arney" ]

Since we want to split the string into two items and exclude the full stop from the result, we can now use capturing groups. These are denoted by parentheses () and wrap the parts of your expression that you want returned. If we wrap them around the first and last name expressions, we’ll get new results.

The syntax for using capture groups is simple: (expression). Since I only want to return my first and last name and not the full stop, I wrap those two expressions in parentheses. The first element of the result is still the full match; the captured groups follow it.

const unformattedName = 'aaron.arney:alligator.io';

const exp = new RegExp(/([a-z]{1,15})\.([a-z]{1,15})/, 'i');

const found = unformattedName.match(exp);

console.log(found);
// expected output: Array [ "aaron.arney", "aaron", "arney" ]
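
Since the groups come back by position, array destructuring is a convenient way to name them. This is a minimal sketch; firstName and lastName are variable names I picked for illustration.

```javascript
const unformattedName = 'aaron.arney:alligator.io';
const exp = /([a-z]{1,15})\.([a-z]{1,15})/i;

// Index 0 is the whole match; skip it and name the two groups.
const [, firstName, lastName] = unformattedName.match(exp);

console.log(firstName); // "aaron"
console.log(lastName);  // "arney"
```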

Matching the Domain Name

To extract “alligator.io”, we will reuse the character class from above, with some slight modification, of course.

Validating domain names and TLDs is a difficult business. We’re going to pretend the domains that we parse are always between 3 and 25 characters long, and the TLDs between 2 and 10. If we plug these in, we will get some new output:

const unformattedName = 'aaron.arney:alligator.io';

const exp = new RegExp(/([a-z]{1,15})\.([a-z]{1,15}):([a-z]{3,25}\.[a-z]{2,10})/, 'i');

const found = unformattedName.match(exp);

console.log(found);
// expected output: Array [ "aaron.arney:alligator.io", "aaron", "arney", "alligator.io" ]
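
Keep in mind that match returns null when the input doesn’t fit the pattern, so reading found[1] on bad data throws. One way to guard against that is sketched below; parseUser is a hypothetical helper of mine, not something from the dashboard code.

```javascript
const exp = /([a-z]{1,15})\.([a-z]{1,15}):([a-z]{3,25}\.[a-z]{2,10})/i;

// parseUser is a hypothetical helper, not part of the original example.
function parseUser(raw) {
  const found = raw.match(exp);
  if (found === null) {
    return null; // input did not have the first.last:domain.tld shape
  }
  const [, first, last, domain] = found;
  return { first, last, domain };
}

console.log(parseUser('aaron.arney:alligator.io'));
// { first: 'aaron', last: 'arney', domain: 'alligator.io' }

console.log(parseUser('not a valid record')); // null
```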

A Shortcut

I showed you the “long way” of going about the expression. Now, I’ll show you how you can have a less verbose expression that captures the same text. By using the + quantifier, we can tell our expression to repeat the preceding token one or more times, as many times as it can. It will continue until it hits a character it can’t match, in our case the full stop. This example also introduces the g flag, which stands for global. It tells the expression to return every match in the string instead of stopping after the first.

// With the global flag
'aaron.arney:alligator.io'.match(/[a-z]+/ig);
// expected output: Array(4) [ "aaron", "arney", "alligator", "io" ]

// Without the global flag
'aaron.arney:alligator.io'.match(/[a-z]+/i);
// expected output: Array [ "aaron" ]
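
One catch with the g flag: match then returns only the full matches and drops any capture groups. If you need both, String.prototype.matchAll (ES2020) is an option. It isn’t used elsewhere in this article, so take this as a side sketch; the pattern that captures each word’s first letter is my own example.

```javascript
const input = 'aaron.arney:alligator.io';

// With g, match() returns every full match but no capture groups.
const allWords = input.match(/([a-z])[a-z]*/ig);
console.log(allWords); // [ "aaron", "arney", "alligator", "io" ]

// matchAll() requires the g flag and yields one full result per match,
// so the captured first letter is available at index 1.
const initials = [...input.matchAll(/([a-z])[a-z]*/ig)].map((m) => m[1]);
console.log(initials); // [ "a", "a", "a", "i" ]
```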

Formatting Output

To format the string, we’ll be using the replace method on the String object. The replace method takes two arguments:

  • RegExp | String - the pattern to search for
  • String | function - the replacement text, or a function that returns it

const unformattedName = 'aaron.arney:alligator.io';

// The "long" way
const exp = new RegExp(/([a-z]{1,15})\.([a-z]{1,15}):([a-z]{3,25}\.[a-z]{2,10})/, 'i');

unformattedName.replace(exp, '$1 $2 [$3]');
// expected output: "aaron arney [alligator.io]"

// A slightly shorter way
unformattedName.replace(/([a-z]+)\.([a-z]+):([a-z]+\.[a-z]{2,10})/ig, '$1 $2 [$3]');
// expected output: "aaron arney [alligator.io]"

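In the above snippet, $1, $2, and $3 are special patterns that get interpreted by the replace method.

  • $1 - The text captured by the first group
  • $2 - The text captured by the second group
  • $n - The text captured by the nth group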
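
Here is a short sketch of those patterns in action, using a trimmed-down expression of my own. It also shows two related patterns the replace method understands: $& for the whole match and $$ for a literal dollar sign.

```javascript
const exp = /([a-z]+)\.([a-z]+)/i;

// Swap the two groups: "last, first".
const swapped = 'aaron.arney'.replace(exp, '$2, $1');
console.log(swapped); // "arney, aaron"

// $& inserts the whole match.
const wrapped = 'aaron.arney'.replace(exp, '<$&>');
console.log(wrapped); // "<aaron.arney>"

// $$ inserts a literal dollar sign, so this does NOT expand group 1.
const literalDollar = 'aaron.arney'.replace(exp, '$$1');
console.log(literalDollar); // "$1"
```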

To capitalize the words, we can use another regex. Instead of passing a replacement string like we did above, we will pass a function. The function receives the matched text, capitalizes it, and returns it.

Here, I’m introducing a few new parts: anchors, alternation, and the negated character class [^...].

  • [^abc] - Not a, b, or c
  • \b - Word boundary
  • ab|cd - Logical “OR”, matches ab or cd

// Capitalize the words
"aaron arney [alligator.io]".replace(/(^\b[a-z])|([^\.]\b[a-z])/g, (char) => char.toUpperCase());
// expected output: "Aaron Arney [Alligator.io]"

Breaking down this expression into two parts:

  • (^\b[a-z]) - Capture the first character of the string. ^ says to match the beginning of the string.
  • |([^\.]\b[a-z]) - OR, match the first letter of any other word, together with the character before it, as long as that character is not a full stop. This skips the TLD.
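
The replacer function also receives each capture group as its own argument, so the two steps can be folded into a single pass. This is one possible sketch, not the only way to do it; formatUser and capitalize are hypothetical helpers I made up for the example.

```javascript
const shape = /([a-z]+)\.([a-z]+):([a-z]+\.[a-z]{2,10})/i;

// capitalize is a hypothetical helper: upper-case the first character only.
const capitalize = (word) => word.charAt(0).toUpperCase() + word.slice(1);

// The replacer receives the whole match first, then one argument per group.
function formatUser(raw) {
  return raw.replace(shape, (match, first, last, domain) =>
    `${capitalize(first)} ${capitalize(last)} [${capitalize(domain)}]`);
}

console.log(formatUser('aaron.arney:alligator.io'));
// "Aaron Arney [Alligator.io]"

// Input that doesn't match the pattern comes back unchanged.
console.log(formatUser('nope')); // "nope"
```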

Continuing Your Exploration

This is but a small taste of the power of regular expressions. The example I worked through is improvable, but how?

  • Is the expression too verbose? Is it too simplified?
  • Does it cover edge cases?
  • Could you replace it with some clever string manipulation using native methods?
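
As a starting point for that last question, here is one possible sketch that uses only split and string slicing. It assumes the input always has the first.last:domain shape and does no validation at all; formatWithoutRegex is a hypothetical name.

```javascript
// formatWithoutRegex is a hypothetical helper; it assumes the input always
// has the shape first.last:domain and does no validation.
function formatWithoutRegex(raw) {
  const [name, domain] = raw.split(':');
  const [first, last] = name.split('.');
  const capitalize = (word) => word.charAt(0).toUpperCase() + word.slice(1);
  return `${capitalize(first)} ${capitalize(last)} [${capitalize(domain)}]`;
}

console.log(formatWithoutRegex('aaron.arney:alligator.io'));
// "Aaron Arney [Alligator.io]"
```

It is shorter, but it happily accepts input the regex would reject, which is part of the trade-off to weigh.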

This is where you take the knowledge you learned and try to answer those questions. Explore the following resources to help you in your journey and experiment!

