Your Guide to Regex

1. What are Regular Expressions?

Regular expressions are sequences of characters that describe patterns in text. They are useful for a wide variety of string operations, such as searching and parsing. A regular expression is usually referred to as a regex or regexp. (I prefer regex; it is a little easier to say.) They are extremely useful for any kind of text or data manipulation.

Sample Regular Expression:

(1) matches Mastercard Credit Number Patterns
(2) matches email addresses

As you can see from the above example, regular expressions are just sequences of seemingly unordered random characters. Because they basically look like gibberish when they appear in code, they can seem confusing and intimidating. To understand regular expressions you must become comfortable with a few of the ways characters can be used. This tutorial will go over a few of the basic ideas when using regular expressions.

2.1 Literals

“I’m sure that Egg is a very nice person”

Literals are the plainest type of character: they match only the identical character or string. For example, say you wanted to remove all forms of the name “Hanson” from the paragraph below. Using the regular expression “Hanson” on the following passage would match only the exact, case-sensitive occurrences, including those inside longer words:

Certain hanson ADTs are best used with what Hanson calls Atoms, and the HanSon implementation of Atoms leaks memory (even code supplied by HANSON professionals can have significant HANSON shortcomings, though there are someHanson reasons why reclaiming memory for Atoms is particularly challenging). If your only memory leaks trace to (reasonable) uses of HansOn’s Atoms, there will be no penalty. HanSon – CHansonette
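To try this out, here is a minimal Python sketch using the standard `re` module. It runs the literal pattern over a shortened sample of the passage; the sample string and variable names are only for illustration.

```python
import re

# Shortened sample of the passage (illustrative, not the full text).
text = "hanson ADTs, what Hanson calls Atoms, HANSON code, someHanson reasons, CHansonette"

# A literal pattern matches only the exact characters, in the exact case.
literal_hits = re.findall("Hanson", text)
print(len(literal_hits))  # 3: "Hanson", plus the ones inside "someHanson" and "CHansonette"

# Removing the matches leaves the other capitalizations untouched.
cleaned = re.sub("Hanson", "", text)
print(cleaned)
```

Note that “hanson” and “HANSON” survive, while “CHansonette” is cut down to “Cette”.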

There are a few problems with using literals.

  • Capitalization: Maybe you wanted all variations of capitalization for “Hanson” to be removed. Literals would not work. You would need to search using every literal variation: “Hanson”, “HAnson”, “HANson”, and so on. That would obviously be too much typing.
  • Location: It can also be important to distinguish a string that stands on its own from one that is a substring of something else. Maybe CHansonette doesn’t trigger memory leak flashbacks like the name Hanson does. If so, it might not be necessary to free it from the passage.
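Both problems can be handled without listing every variation by hand. The sketch below is one way to do it in Python, assuming the standard `re` module; it uses the `re.IGNORECASE` flag for capitalization and the `\b` word boundary for location. The sample string is a shortened, illustrative version of the passage.

```python
import re

text = "hanson ADTs, what Hanson calls Atoms, HANSON code, someHanson reasons, CHansonette"

# Capitalization: IGNORECASE matches every casing with a single pattern.
any_case = re.findall("hanson", text, flags=re.IGNORECASE)
print(any_case)     # ['hanson', 'Hanson', 'HANSON', 'Hanson', 'Hanson']

# Location: \b (word boundary) accepts the name only as a whole word,
# so "someHanson" and "CHansonette" are skipped.
whole_words = re.findall(r"\bhanson\b", text, flags=re.IGNORECASE)
print(whole_words)  # ['hanson', 'Hanson', 'HANSON']
```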

2.2 Special Characters

As you have probably noticed, regular expressions are just strings of characters. Literals alone are rarely enough, as they are far too limiting. Much of the functionality of regular expressions comes from special characters and their ability to match a whole class or range of characters. Special characters can have many purposes, such as specifying the number of times an expression should be matched, where it can be matched, or representing an untypable character (‘\n’). Here are some common special characters, including those that were used in the Mastercard example. These will be covered in more detail throughout the tutorial.

Table 1:

Character Description
^ caret – matches at the start of a string or line
$ matches at the end of the string or line, or before a final \n
[] character class – matches any single character that fits the group specified between the brackets
- hyphen – when placed between two other characters in a character class, such as 0 and 9, matches any character in that range, inclusive
{n} where n is any natural number – matches the previous character exactly n times
\d matches any digit
\w matches any alphanumeric character or underscore (word characters)
. matches any character (except a newline, by default)
\n newline character
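As a quick illustration of the characters in Table 1, here is a small Python sketch using the standard `re` module. The patterns and sample strings are made up for this example and are not the Mastercard expression.

```python
import re

# ^ and $ anchor the pattern; \d{4} means exactly four digits.
four_digits = re.compile(r"^\d{4}$")
is_year = bool(four_digits.search("2024"))       # True
too_long = bool(four_digits.search("20245"))     # False

# [0-9] is a range inside a character class.
digits = re.findall(r"[0-9]", "a1b22")           # ['1', '2', '2']

# \w matches letters, digits and underscores.
words = re.findall(r"\w+", "snake_case, x2!")    # ['snake_case', 'x2']

# . matches any character except a newline (by default).
dots = re.findall(r"c.t", "cat cot c\nt")        # ['cat', 'cot']

print(is_year, too_long, digits, words, dots)
```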

 

2.2.1 Back Slashes and how to escape:

Backslashes either introduce special characters, as in ‘\d’ or ‘\w’ in Table 1, or escape special characters that are meant to be treated as literals. For example, suppose you were searching through data for a string representing money: a dollar sign, some digits, and possibly a decimal point followed by one or two more digits. A regular expression for this would be built from the following pieces:

\$ Literal ‘$’
\d any decimal digit
* match the previous character zero or more times
\. Literal ‘.’
{1,2} Match one to two of the character to the left
| Logical OR: match the first group or the second group
+ match the previous character one or more times
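The pieces above can be combined in more than one way. The pattern in this Python sketch is one possible assembly, offered as an assumption rather than the exact expression this tutorial had in mind; the sample sentence is also made up.

```python
import re

# Hypothetical money pattern built from the pieces listed above:
#   \$\d*\.\d{1,2}   a dollar sign, optional digits, a literal '.', 1-2 digits
#   |                OR
#   \$\d+            a dollar sign followed by one or more digits
money = re.compile(r"\$\d*\.\d{1,2}|\$\d+")

sample = "Lunch was $12.50, coffee $3 and gum $.5 but not 40 dollars"
found = money.findall(sample)
print(found)  # ['$12.50', '$3', '$.5']
```

Without the backslashes, ‘$’ would act as an end-of-line anchor and ‘.’ would match any character, so the escapes are what make them literal here.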

 

2.2.2 Anchors:

Anchors do not match any characters. Instead, they specify where a valid match can occur. The Mastercard example uses two of the most common anchors, ‘^’ and ‘$’. As the descriptions in Table 1 show, these two characters constrain the position of the match, whereas the other characters determine its contents. Here are some other anchors:

\A beginning of a string
\z very end of a string
\Z end of a string, or just before a final newline (in most flavors; Python’s \Z means the very end)
\b word boundary
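Here is a short Python sketch of how anchors change where a match may occur, assuming the standard `re` module. It leaves out ‘\z’ and ‘\Z’, since their support and meaning vary between regex flavors. The sample strings are illustrative.

```python
import re

text = "one two\nthree four"

# ^ with re.MULTILINE matches at the start of every line.
line_starts = re.findall(r"^\w+", text, flags=re.MULTILINE)    # ['one', 'three']

# \A matches only at the start of the whole string, even with MULTILINE.
string_start = re.findall(r"\A\w+", text, flags=re.MULTILINE)  # ['one']

# $ without MULTILINE matches only at the end of the string.
string_end = re.findall(r"\w+$", text)                         # ['four']

# \b keeps "cat" from matching inside "concat" or "cats".
whole_cat = re.findall(r"\bcat\b", "cat concat cats cat.")     # ['cat', 'cat']

print(line_starts, string_start, string_end, whole_cat)
```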

 

2.2.3 Character Classes:

Character classes are specified either by shorthand sequences, such as ‘\d’ for any decimal digit, or by brackets for more custom arrangements. Notice below how the caret symbol ‘^’ takes on a new meaning when enclosed in brackets. Character ranges are determined by ASCII values. Some examples are below.

[a-p] any character between lower case a and lower case p
[ap] either the character ‘a’ or the character ‘p’
[a-zA-Z] match any character between ‘a’ and ‘z’ or between ‘A’ and ‘Z’
[^a] any character that is not ‘a’
[^a-z] any character that is not between ‘a’ – ‘z’
[\x4C] hexadecimal matching: in this case the character ‘L’ would be matched, as 4C is its hexadecimal value in the ASCII table
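The classes above can be checked with a few lines of Python. This is a minimal sketch using the standard `re` module; the sample strings are invented for the example.

```python
import re

# [a-p] is a range: 'q', 'u' and 'z' fall outside it.
in_range = re.findall(r"[a-p]", "apple quiz")   # ['a', 'p', 'p', 'l', 'e', 'i']

# [ap] lists individual characters, not a range.
a_or_p = re.findall(r"[ap]", "apple")           # ['a', 'p', 'p']

# A leading ^ inside brackets negates the class.
not_lower = re.findall(r"[^a-z]", "abc-123")    # ['-', '1', '2', '3']

# \x4C is the hexadecimal escape for 'L'; lower case 'l' does not match.
hex_l = re.findall(r"[\x4C]", "Hello World, LOL")  # ['L', 'L']

print(in_range, a_or_p, not_lower, hex_l)
```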

 

2.2.4 Quantifiers:

Quantifiers appeared frequently in the previous examples. They specify how many times the preceding character or group must be matched. Some examples are below, where n and m are natural numbers.

{n} match n times
{n,} match n or more times
{n,m} match n to m times
* match zero or more times
+ match one or more times
? match one or zero times
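Each quantifier can be seen in action with a short Python sketch, again assuming the standard `re` module and made-up sample strings.

```python
import re

# {n}: exactly three digits at a time.
exactly_three = re.findall(r"\d{3}", "12 345 6789")      # ['345', '678']

# {n,}: two or more digits.
two_or_more = re.findall(r"\d{2,}", "1 22 333")          # ['22', '333']

# {n,m}: between two and three 'a' characters, for the whole string.
two_to_three = [bool(re.fullmatch(r"a{2,3}", s)) for s in ("a", "aa", "aaaa")]

# *: zero or more 'b' after 'a'.  +: one or more 'b' after 'a'.
star = re.findall(r"ab*", "a ab abbb b")                 # ['a', 'ab', 'abbb']
plus = re.findall(r"ab+", "a ab abbb b")                 # ['ab', 'abbb']

# ?: the 'u' is optional.
optional = re.findall(r"colou?r", "color colour colouur")  # ['color', 'colour']

print(exactly_three, two_or_more, two_to_three, star, plus, optional)
```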

 

3. Where to go from here?

Hopefully this tutorial provides you with a better idea of what regular expressions are and how useful they can be. This is by no means a comprehensive tutorial, but this should help you be a little more comfortable with regular expressions when you first come across them. There are plenty of other learning resources online and I have listed a few below.

https://regex101.com/ – Great interactive regex tester with descriptions for characters.
http://www.rexegg.com/ – More in depth tutorial
https://regexone.com/ – Interactive and functional tutorial through the basics and onward.
Mastering Regular Expressions by Jeffrey Friedl – Reportedly a very comprehensive book.
