The basics of regular expressions

Regular expressions are a general way to search and extract information from any string by using special human-readable patterns. They are supported by most programming languages and are frequently used for purposes like text matching, input validation, and log processing.

Their core syntax is largely the same everywhere, and most languages that support them can process the same basic regular expression just fine. There can be subtle differences between implementations in special cases, but those are usually straightforward to adjust. All the basic functionality that we’re going to look at here is supported by virtually every regular expression library or software that uses one, be it Apache, Nginx, PHP, or JavaScript.

RegEx literals

The simplest regular expression is a single literal character, such as “t”. It matches any string that contains the character “t” anywhere. Literals can also be longer, so the regular expression “te” matches the sequence “te” anywhere in a string, and so on. Any character that is not special is treated as a simple literal, which makes basic regular expressions very easy to read and understand because they mostly resemble ordinary text.
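As a quick illustration, the sketch below uses Python's built-in `re` module to show that a literal pattern matches wherever it occurs in the text. Python is only our choice for the examples, and the helper name `contains` is hypothetical.

```python
import re

def contains(pattern: str, text: str) -> bool:
    """Return True if the pattern matches anywhere in the text."""
    return re.search(pattern, text) is not None

print(contains("t", "cat"))      # True: "t" occurs at the end
print(contains("te", "router"))  # True: "te" occurs in the middle
print(contains("te", "hat"))     # False: "hat" has no "te"
```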

Special characters

There are a few special characters in regular expressions that mark extra functionality and follow special rules. These are also called meta-characters. To use one as a literal (for example, the dot “.” is a special character, but the regular expression may need to look for an actual dot), escape it with the backslash (“\”) character. Be careful: the backslash can also have a special meaning in the host language or environment, so it may need to be double escaped by writing “\\” instead of “\”.
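Escaping is easier to see in practice. The following sketch, again assuming Python's `re` module, compares an escaped dot with an unescaped one and shows the double backslash that is needed when the pattern is not written as a raw string. The variable names and sample strings are made up for illustration.

```python
import re

# A raw string (r"...") passes the backslash through to the regex engine.
escaped_dot = r"3\.14"
unescaped_dot = "3.14"

print(bool(re.search(escaped_dot, "pi is 3.14")))    # True
print(bool(re.search(escaped_dot, "build 3514")))    # False: a literal dot is required
print(bool(re.search(unescaped_dot, "build 3514")))  # True: "." also matches "5"

# In an ordinary string literal the backslash must be doubled.
double_escaped = "3\\.14"
print(double_escaped == escaped_dot)  # True: both hold the same pattern

# re.escape() inserts the backslashes automatically.
print(re.escape("a.b"))  # a\.b
```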

Here is a list of them together with their special meaning:

. (dot): any character, wildcard
^ (caret): beginning of line
\ (backslash): escape the next character
$ (dollar sign): end of line
| (pipe): logical “OR” between two patterns
? (question mark): atom to the left 0 or 1 times
* (star): atom to the left 0 to n times
+ (plus sign): atom to the left 1 to n times (so at least once)
( (opening parenthesis): beginning of group
) (closing parenthesis): end of group
[ (opening square bracket): beginning of character class (should be paired with a “]”)
{ (opening curly brace): beginning of repetition qualifier (should be paired with “}”)

RegEx meta-characters

These can mostly be freely combined with each other. Brackets should always be paired with the matching closing bracket, or they will result in an error.
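To see what such an error looks like, here is a small sketch assuming Python's `re` module, which raises `re.error` when a pattern cannot be compiled. The helper name `is_valid` is hypothetical, and other languages report malformed patterns in their own way.

```python
import re

def is_valid(pattern: str) -> bool:
    """Return True if the pattern compiles, False if it is malformed."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

print(is_valid("(tech|tip)bits"))  # True
print(is_valid("(tech|tipbits"))   # False: the ")" is missing
print(is_valid("[a-z"))            # False: the "]" is missing
```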

Basic examples

In its most basic form, the regular expression

techtipbits

will simply match the string “techtipbits” anywhere in the text. By using the beginning-of-line or end-of-line meta-characters, we can limit the match to the start or the end of a string.

Anchors are meta-characters that match positions rather than characters. There are two anchors: “^”, which matches the beginning of the line, and “$”, which matches the end of the line.

^tech
tipbits$

The first of these patterns will match any string that starts with “tech”; the second will match one that ends in “tipbits”. In some languages (notably PHP, inside double-quoted strings) the “$” character is special, so it may need to be escaped, the pattern becoming “tipbits\$”, or the pattern can be written in single quotes, like 'tipbits$', to avoid this.
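The two anchored patterns can be tried out with a short sketch, assuming Python's `re` module. The variable names and the sample strings are ours.

```python
import re

starts_with_tech = re.compile(r"^tech")
ends_with_tipbits = re.compile(r"tipbits$")

print(bool(starts_with_tech.search("techtipbits")))   # True
print(bool(starts_with_tech.search("hightech")))      # False: "tech" is not at the start
print(bool(ends_with_tipbits.search("techtipbits")))  # True
print(bool(ends_with_tipbits.search("tipbits.com")))  # False: "tipbits" is not at the end
```

In Python the anchors refer to the start and end of the whole string by default; passing the `re.MULTILINE` flag makes them apply to each line instead.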

To match any character, use the “.” meta-character at the right location, so the expression

^.ight

will match any string that starts with “right”, “sight”, “fight”, or any other five-character sequence ending in “ight”.
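A minimal sketch of the wildcard, assuming Python's `re` module, shows that the dot stands for exactly one character, whatever it is. The word list is made up for illustration.

```python
import re

pattern = re.compile(r"^.ight")

words = ["right", "sight", "8ight", "ight", "bright"]
results = {word: bool(pattern.search(word)) for word in words}

for word, matched in results.items():
    print(word, matched)
# right True
# sight True
# 8ight True   (the dot is not limited to letters)
# ight False   (the dot needs one character to consume)
# bright False (two characters precede "ight")
```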

Matching repetitions

Three special meta-characters that match repetitions can be used to craft regular expressions to find an arbitrary number of repeated characters or patterns.

The “+” sign means that whatever is to its left is matched at least once, the “?” sign means that it is matched 0 or 1 times, and the “*” sign means that it is repeated any number of times, even 0.

The repetition qualifier can be used to define repetition in a more refined way: {2,4} means that whatever is to the left of the qualifier should occur between 2 and 4 times, {3,} means at least 3 times, and {5} means exactly 5 times. Notice that there is no whitespace inside the curly brackets.

hello+
favou?rite
ab*cd
lo{1,3}se

In the examples above, the pattern “hello+” will match the word “hello” or “helloooo”, or the same word with any number of “o”s at the end (so plus means “at least once”). Note that the “+” applies only to the “o” immediately before it, not to the whole word.

The second example will match both “favorite” and “favourite”, because of the question mark that follows the “u” letter.

The third one matches “abcd”, “acd”, “abbbbbbcd” – any number of “b”s between “a” and “c”, even zero.

The regular expression “lo{1,3}se” will match the words “lose”, “loose”, and “looose”, because the character “o” to the left of the repetition qualifier should occur between 1 and 3 times.
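The four repetition patterns can be checked with a short sketch, assuming Python's `re` module. It uses `re.fullmatch`, which requires the whole string to match, so that the repetition limits are easy to observe; the helper name `full_match` is hypothetical.

```python
import re

def full_match(pattern: str, text: str) -> bool:
    """Return True only if the entire text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

print(full_match(r"hello+", "helloooo"))      # True
print(full_match(r"hello+", "hell"))          # False: at least one "o" is required
print(full_match(r"favou?rite", "favorite"))  # True
print(full_match(r"favou?rite", "favourite")) # True
print(full_match(r"ab*cd", "acd"))            # True: zero "b"s is allowed
print(full_match(r"ab*cd", "abbbbbbcd"))      # True
print(full_match(r"lo{1,3}se", "looose"))     # True: three "o"s
print(full_match(r"lo{1,3}se", "loooose"))    # False: four "o"s is too many
```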

Character ranges

To look for a set or range of characters, use square brackets. A bracket expression still matches a single character from the source text, but any character in the set will do. The general format of character ranges is:

[ ^ range1 range2 range3 ... ] 

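where the ranges are either single characters or from-to ranges with a dash between them. The spaces in the format above are only for readability; in a real expression the items are written without separators.

The caret optionally negates the whole range, so for example “[a-z]” means any lowercase letter, while “[^a-z]” means anything BUT lowercase letters.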

For example, to match a hexadecimal digit, use the range:

[0-9a-fA-F]

Combine this with the repetition qualifier to match a longer hexadecimal number, so to match 4 hex characters, use

[0-9a-fA-F]{4}

There are a couple of shorthand meta-characters that can also be used inside ranges: “\d” matches any digit, “\w” any word character, and “\s” any whitespace character. In the example above, “0-9” can be replaced with “\d”.
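The hexadecimal range and its shorthand variant can be compared in a small sketch, assuming Python's `re` module. The variable names are ours. One Python-specific caveat: for text patterns, `\d` also matches digits from other scripts unless the `re.ASCII` flag is given, so the flag is used here to keep the two patterns equivalent.

```python
import re

hex4 = re.compile(r"[0-9a-fA-F]{4}")
hex4_short = re.compile(r"[\da-fA-F]{4}", re.ASCII)  # \d restricted to 0-9
not_lower = re.compile(r"[^a-z]")

print(bool(hex4.fullmatch("1a2F")))        # True
print(bool(hex4.fullmatch("12g4")))        # False: "g" is not a hex digit
print(bool(hex4_short.fullmatch("1a2F")))  # True
print(not_lower.findall("abc-123"))        # ['-', '1', '2', '3']
```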

Alternatives

By using the pipe character it’s possible to provide alternatives in regular expressions. This is the equivalent of an “OR” operator in a query. The expression

black|white

will match either “black” or “white”. Because alternation has the lowest precedence, in any expression more complicated than the one above it should be wrapped in parentheses. To match “black cat” or “white cat”, it should be written as:

(black|white) cat
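The effect of the parentheses can be demonstrated with a short sketch, assuming Python's `re` module and `fullmatch` so that the whole string has to match. Without the group, the pipe splits the entire pattern into “black” on one side and “white cat” on the other.

```python
import re

grouped = re.compile(r"(black|white) cat")
ungrouped = re.compile(r"black|white cat")

print(bool(grouped.fullmatch("black cat")))    # True
print(bool(grouped.fullmatch("white cat")))    # True
print(bool(ungrouped.fullmatch("black cat")))  # False: alternatives are "black" and "white cat"
print(bool(ungrouped.fullmatch("black")))      # True
```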

Summary

These are the basics of regular expressions, and they should be enough to get you started experimenting. The best way to learn is to use an online interactive tool such as https://regex101.com and test how expressions behave. Under Linux, the simple “grep” command can also be used to test regular expressions; it’s a string matching tool that’s the Swiss army knife of system administrators when it comes to text matching.
