

Regex for Noobs – An Illustrated Guide
source link: https://www.tuicool.com/articles/2YraQvJ
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

This blog post is an illustrated guide to regex and aims to provide a gentle introduction for people who never have fiddled with regex, want to, but are kind of intimidated by the whole thing.
In other words, welcome to …
For most people without a formal CS education, regular expressions (regex) can come off as something that only the most hardcore unix programmers would dare to touch.
A good regex can seem like magic, but remember this: any technology sufficiently advanced enough is indistinguishable from magic. So, let’s peel away at the magic of regex and see what’s underneath!
If you understand regex it suddenly becomes this super fast and powerful tool … but you first need to understand it, and honestly I find it a bit intimidating for newcomers!
Let’s start with the basics. What are regular expressions (regex) and what are they used for?
Regex for noobs
At its core, a regular expression is a sequence of characters defining a search pattern .
Regex is often used in tools like grep
to find patterns in longer strings of text.
Consider a file cat.txt
cat cat2 dog
If we use the regex cat
to search for matches we find the following matches.
cat cat2
(Note for the hardcore powerusers: In this post I’m going to conflate the terms regex and the tools that use regex such as grep. This is technically wrong, I am aware of this.)
Regex works on characters, not words
One important thing that can not be emphasized enough is this: regex works on characters, not words . Concatenation is implied.
If we search using regex for the pattern cat
, we are not looking for the word “cat”, but we are looking for c
followed by a
followed by t
.
The dot and the asterisk
The most basic characters are the single characters, like a
, b
, c
, etc. Now let’s introduce two special guests.
The .
(dot) character matches any single character . For example, if we search for c.t
we would match anything ranging from cat
to c0t
or cAt
, we would match any single c
followed by any character followed by a single t
.
The *
(asterisk) character is a bit more difficult. It modifies the character preceding it and then matches zero or more characters of that. Yes, read that again, zero or more characters. For example, cat*
would match cat
, catt
, cattttt
but also ca
.
The cat ate my homework
Imagine we read in a file line by line and the first line is as follows.
The cat ate my homework.
Let’s look at how we would match the pattern cat
in this line.
We start with matching the first character of the pattern to the first character in the sentence.
If we don’t find a match we skip to the next character in the line and start from the first character of the pattern.
If we do find a match we go to the next character in both the pattern and the line and repeat this process. When we find a match for the whole pattern we return the line in which we find a match.
That’s it! That’s what regex is most often used for at its most basic level, to find a smaller search pattern in a larger string.
So far we’ve gone over what regex is and two of the special characters, the .
(dot) and the *
(asterisk), but wait, there’s more.
The regex trifecta
Zooming out a little bit, the parts of a regex can consist of three different components:
- Anchors
- Character sets
- Modifiers
These three make up the … regex trifecta!
Let’s start with the first part of the trifecta: anchors!
Anchors
Anchors specify the position of the pattern with respect to the line. These are the two most important anchors:
- The
^
(caret) fixes your pattern to the beginning of the line. For example the pattern^1
matches any line starting with a1
. - The
$
(dollar) fixes your pattern to the end of the sentence. For example,9$
matches any line ending with a9
.
Note that in both cases the pattern has to be respectively first and last in the pattern. ^1
matches a 1
at the start of a line but 1^
matches a 1
followed by a ^
. Similarly, 1$
matches lines ending with a 1
but $1
matches a dollar sign followed by a 1
anywhere on the line.
On to the second part of the trifecta: character sets!
Character sets
The second part of the trifecta: character sets. Character sets are the bread and butter of regex. A single character, say a
, is the most atomic character set (a set of one element). But we can do crazy stuff with regex like [0-9]
which matches any single digit, or if you recall what *
does we can make the pattern [0-9][0-9]*
(what this pattern matches is left as an exercise to the reader).
Some other important character sets:
-
[0-9]
matches any single digit from0...9
-
[a-z]
matches any lowercase character -
[A-Z]
matches any uppercase characer
We can also combine multiple sets:
-
[A-Za-z0-9]
matches any uppercase and lowercase letter and single digit.
Finally, modifiers.
Modifiers
I don’t want to get too much into depth here, but we already came across a modifier! The *
(asterisk) is a modifier. A modifier changes the meaning of the character preceding it. There are many other modifiers but starting with *
is a good start.
An actual example
Let’s quickly dump some text in a file
$ echo "The cat jumps long time \nThen we also have the fact that these are words.\n1234 this is a test post please ignore." >> grep.txt
This is what’s in the file now
$ cat grep.txt The cat jumps long time Then we also have the fact that these are words. 1234 this is a test post please ignore.
Let’s look for cat
$ grep "cat" grep.txt The cat jumps long time
Let’s look for any line starting with a digit ^[0-9]
$ grep "^[0-9]" grep.txt 1234 this is a test post please ignore.
That’s it! You just used regular expressions. Awesome.
Summary
In this blog post we went over:
- Basic functionality of regex
- The three main components of regex: anchors, character sets, and modifiers.
- The
.
(dot),*
(asterisk),^
(caret), and$
(dollar sign). - Some character sets
[0-9]
,[a-z]
,[A-Z]
, and combinations.
The goal of this blog post was to make regex a bit more approachable by means of an illustrated introduction.
If you peel away at the technical difficulties what you end up with is a relatively simple but super powerful tool that will prove invaluable in any data scientists’ toolbelt.
Subscribe
Recommend
-
6
Summer Games - An Overview for Noobs by a Noob May 30, 2019 by Wagner the Impaler I just returned from 5 days at Summer Games with my wife and we found the experience unbelievably outstanding. I am writing this...
-
5
-
7
1.1M noobs panic sell, but Bitcoin hodlers DGAF – Hodlalert1.1M noobs panic sell, but Bitcoin hodlers DGAF
-
8
Destiny 2 veterans come to 'The Rescue' of noobs stuck in the Dares of Eternity DLC The problem could have been avoided entirely if Bungie didn't automatically launch low-level players into high-lev...
-
7
LimeWire’s back! But as an… ‘NFT for noobs’ marketplace? ...
-
8
Review: NZXT FunctionGetting into the bespoke mechanical keyboard space has never quite felt this accessible.Photograph: NZXTRating:
-
8
WhatsApp’s new Communities feature sounds like Slack for noobs ...
-
4
Blockchain ExplainedAugust 11th 2022 new story4The rise of...
-
8
All-in-one Twitter growth framework and resource hub
-
7
Link outreach is the process of contacting other websites to ask for a backlink to your website.For example, here’s an outreach email we sent as part of a broken link building campaign:
About Joyk
Aggregate valuable and interesting links.
Joyk means Joy of geeK