Regular Expression (Regex) Tutorial for Matching a URL

 2 years ago
source link: https://gist.github.com/DawkC/cb036f082e94772d05567c7e65c01e9b

A regular expression (sometimes called a rational expression) is a sequence of characters that defines a search pattern, used mainly for pattern matching against strings, as in “find and replace”-like operations. Regular expressions are supported in most programming languages, including C++, Java, Python and JavaScript.

Summary

This tutorial describes a regex that matches a URL supplied as a string. It walks through the components of the regex: anchors, quantifiers, character classes, grouping and capturing, bracket expressions, and greedy and lazy matching.

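Regex code snippet: /^(https?:\/\/)?([\da-z.-]+)\.([a-z.]{2,6})([\/\w .-]*)\/?$/

Inside the /.../ delimiters each forward slash is escaped as \/, and the period between the domain name and the suffix is escaped as \. so that it matches a literal period. The sections below quote fragments of the regex without the slash escaping.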
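To make the pattern concrete, here is a minimal sketch that applies it with Python's `re` module. It assumes the separator between the domain name and the suffix is meant to be a literal period and that the path group carries the `*` quantifier listed in the grouping section. The names `URL_PATTERN` and `is_url` are hypothetical, introduced only for this illustration.

```python
import re

# The tutorial's pattern without the /.../ delimiters. In Python a forward
# slash needs no escaping. No flags are set, so only lowercase hosts match.
URL_PATTERN = re.compile(r"^(https?://)?([\da-z.-]+)\.([a-z.]{2,6})([/\w .-]*)/?$")


def is_url(text: str) -> bool:
    """Return True when the whole string matches the tutorial's URL pattern."""
    return URL_PATTERN.match(text) is not None


print(is_url("https://example.com/docs/"))  # True
print(is_url("example.com"))                # True: the protocol is optional
print(is_url("not a url"))                  # False: no "name.suffix" part
```

The pattern checks shape only; it does not verify that the domain exists or that the URL is valid under any standard.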

Regex Components

Anchors

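Anchors mark the start and the end of the string being matched. In this regular expression, the opening anchor is the symbol " ^ ", which matches the start of the string, and the closing anchor is the symbol " $ ", which matches the end. Together they require the entire string to be a URL, rather than merely to contain one.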
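The effect of the anchors is easiest to see by comparing a pattern with and without them. The sketch below uses a deliberately simplified pattern (`[a-z]+\.com`, not the tutorial's full regex) so that only the anchors differ; the variable names are illustrative.

```python
import re

# Same body, with and without the ^ and $ anchors.
anchored = re.compile(r"^[a-z]+\.com$")
unanchored = re.compile(r"[a-z]+\.com")

text = "visit example.com today"

# With anchors the whole string must fit the pattern, so this finds nothing.
anchored_result = anchored.search(text)

# Without anchors the pattern may match anywhere inside the string.
unanchored_result = unanchored.search(text).group()

print(anchored_result)                          # None
print(unanchored_result)                        # example.com
print(anchored.search("example.com").group())   # example.com
```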

Quantifiers

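In regular expressions, quantifiers specify how many times the preceding character, character class or group may occur. In the above regex, the quantifier " ? " (zero or one occurrence) is used three times: twice in (https?://)? and once in /?. The element in front of each " ? " is entirely optional, so the "s", the whole protocol group and the trailing slash may or may not be present in a potential URL. The other quantifiers follow the three character classes in the URL matching regex: " + " (one or more) follows [\da-z.-], {2,6} (between two and six) follows [a-z.], and " * " (zero or more) follows [/\w .-].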
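As a small illustration of how quantifiers count repetitions, the sketch below tests `?`, `+`, `*` and `{2,6}` one at a time on short fragments. The helper `matches` is a hypothetical name used only here; it relies on `re.fullmatch`, which requires the whole string to match.

```python
import re


def matches(pattern: str, text: str) -> bool:
    """Return True when the entire text matches the pattern."""
    return re.fullmatch(pattern, text) is not None


# ? : zero or one of the preceding element
print(matches(r"https?", "http"))     # True
print(matches(r"https?", "https"))    # True
print(matches(r"https?", "httpss"))   # False: only one "s" is allowed

# + : one or more, * : zero or more
print(matches(r"[a-z]+", ""))         # False: + needs at least one character
print(matches(r"[a-z]*", ""))         # True: * accepts an empty string

# {2,6} : between two and six repetitions
print(matches(r"[a-z]{2,6}", "com"))      # True
print(matches(r"[a-z]{2,6}", "c"))        # False: too short
print(matches(r"[a-z]{2,6}", "example"))  # False: seven letters
```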

Character Classes

In regular expressions, a character class matches any single character from a defined set. Character classes are defined using brackets []. A class can contain ranges of characters, individual characters, or shorthand such as \d. For example, in the current regex, [\da-z.-] follows the optional 'https://' and matches any digit (\d), any lowercase letter from a to z (a-z), a period, or a dash.
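A quick way to see what a character class accepts is to test single characters against it. The sketch below does this for two of the classes in the regex; `HOST_CLASS`, `PATH_CLASS` and `in_class` are hypothetical names chosen for this example.

```python
import re

HOST_CLASS = r"[\da-z.-]"  # digits, lowercase letters, period, dash
PATH_CLASS = r"[/\w .-]"   # slash, word characters, space, period, dash


def in_class(char_class: str, ch: str) -> bool:
    """Return True when the single character ch belongs to the class."""
    return re.fullmatch(char_class, ch) is not None


# Print membership of each sample character in both classes.
for ch in "7q.-Q_/":
    print(repr(ch), in_class(HOST_CLASS, ch), in_class(PATH_CLASS, ch))
```

The characters 7, q, period and dash belong to both classes, while Q, underscore and slash fall outside the host class but inside the path class, because \w covers letters of either case, digits and the underscore.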

Grouping and Capturing

Regular expressions use parentheses to indicate groupings. As in mathematics, parentheses make the enclosed pattern act as a single unit, so a quantifier placed after a group applies to the whole group. Each group also captures the text it matched so that it can be retrieved afterwards. In our URL matching regex, the groups correspond to the main parts of a URL: the optional protocol, the domain name, the domain suffix and the path. There are four total groupings in the URL matching regex: (https?://), ([\da-z.-]+), ([a-z.]{2,6}), and ([/\w .-]*).
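To show what the four groups capture, the sketch below matches two sample URLs and unpacks the captured text. It assumes the pattern has a literal period (`\.`) between the second and third groups; the names `URL_PATTERN`, `full` and `bare` are illustrative.

```python
import re

URL_PATTERN = re.compile(r"^(https?://)?([\da-z.-]+)\.([a-z.]{2,6})([/\w .-]*)/?$")

# groups() returns the text captured by each pair of parentheses, in order.
full = URL_PATTERN.match("https://www.example.com/docs/page.html").groups()
print(full)  # ('https://', 'www.example', 'com', '/docs/page.html')

# An optional group that did not take part in the match is reported as None.
bare = URL_PATTERN.match("example.com").groups()
print(bare)  # (None, 'example', 'com', '')
```

Note that the second group captures "www.example" rather than "example": the class [\da-z.-] includes the period, so subdomains end up in the same group as the domain name.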

Bracket Expressions

In regular expressions, bracket expressions match one character out of a set or range of characters. For example, the snippet [a-z.] tells the regex to match any single character listed within the brackets: any lowercase letter between a and z, or a period. Inside brackets a period is a literal period, not the "any character" wildcard.
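One detail worth checking by hand is that a period behaves differently inside and outside brackets. The following sketch compares the two on short test strings; the helper name `matches` is hypothetical.

```python
import re


def matches(pattern: str, text: str) -> bool:
    """Return True when the entire text matches the pattern."""
    return re.fullmatch(pattern, text) is not None


# Outside brackets, an unescaped period matches any character.
print(matches(r"a.c", "abc"))    # True

# Inside brackets, the period matches only a literal period.
print(matches(r"a[.]c", "abc"))  # False
print(matches(r"a[.]c", "a.c"))  # True

# The bracket expression from the URL regex accepts letters and periods.
print(matches(r"[a-z.]{2,6}", "co.uk"))  # True
print(matches(r"[a-z.]{2,6}", "c0m"))    # False: digits are not in the set
```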

Greedy and Lazy Match

In regular expressions, quantifiers such as " * " and " + " are greedy by default: they match as much of the string as they can while still allowing the rest of the pattern to match. For lazy matching, a " ? " symbol is added directly after the quantifier, as in *? or +?. A lazy quantifier matches as little of the string as possible. The URL matching regex contains no lazy quantifiers; its greedy expressions include ([/\w .-]*) and ([\da-z.-]+).
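The difference between greedy and lazy matching shows up when a pattern could stop at more than one place. The sketch below uses a simplified pattern (`/.*/`, not part of the tutorial's regex) on a sample path; the variable names are illustrative.

```python
import re

path = "/docs/api/v1/"

# Greedy: .* runs as far as it can, so the match ends at the last slash.
greedy = re.match(r"/.*/", path).group()

# Lazy: .*? stops as early as it can, so the match ends at the second slash.
lazy = re.match(r"/.*?/", path).group()

print(greedy)  # /docs/api/v1/
print(lazy)    # /docs/
```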

Author

I am Charles Dawkins, a budding web developer taking a Full Stack Web Development course at the University of Toronto. View my work on my GitHub profile.

