Code Published on the 31st May 2023

Mathematical underpinnings of regular expressions

As developers, we use regular expressions in our day to day job to find and replace patterns in text, or to validate inputs. They are both pervasive and slightly feared for how complex and arcane they can look. Through a series of articles, we will try to demystify them through their history so you can learn to love them and appreciate their beauty.

Mathematical underpinnings of regular expressions

In this first article in the series, we will study their mathematical origins in the middle of the 20th century. While some of the vocabulary in this blog post can be alien and scary at first, please press on: we will show how simple the concepts of regular expressions are at their core.

Formal Languages

Before we can start speaking about regular expressions, we must make a detour to explain the concept of formal languages, as regular expressions have their roots in this branch at the crossroads of logic, mathematics, computer science and linguistics.

To put it simply, a formal language is a set of words made up of letters from a pre-defined alphabet as well as a set of rules that define how to form valid words from those letters.

We will use a very down-to-earth (for us programmers) example that we will keep referring back to throughout the series of articles: the language of dates in the ISO 8601 format.

To reuse the same terminology, the language of dates in the ISO 8601 format is the set of all valid dates (words) in that format from 0000-01-01 up to 9999-12-31. Those words are made up of letters from the alphabet of the ten digits and the dash character { '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '-' } using rules that define valid words: four digits for the year, following by a dash, followed by two digits from 01 to 12 for the month, following by a dash, following by two digits from 01 to 31 for the day.

1951: Regular Events

Kleene.jpg
Stephen Cole Kleene

In 1951, the mathematician Stephen Cole Kleene released a paper entitled Representation of Events in Nerve Nets and Finite Automata where he presented a notation to describe sequence of events, that he named regular events, in a biological network of neurons (not to be confused with AI’s neural networks). This simple yet powerful and concise notation, now called regular expressions, allows describing a set of formal languages called regular languages.

In formal language theory, regular expressions are made of constants, which denote sets of strings, and operations over those sets.

Given a finite alphabet Σ, the following constants are defined as regular expressions:

  • denoting the empty set,
  • ε denoting the set containing only the empty string,
  • a for each a ∈ Σ denoting the set containing only the character a.

Given regular expressions R and S, the following operations over them produce regular expressions:

  • RS denoting the concatenation of R and S, the set of strings that can be obtained by concatenating a string accepted by R and a string accepted by S,
  • R|S denoting the union of R and S, the set of strings accepted by R or S,
  • R*, called the Kleene star, denoting the set of strings that can be obtained by concatenating any finite number (including zero) of strings from the set accepted by R.

Example 1: Homophones

For example, let’s describe the language of the homophones coarse, corse, and course using a regular expression: co(a|u|ε)rse that uses concatenation and union. This regular expression states that the words must:

  • co: start with the letters c and o,
  • (a|u|ε): followed by the letter a, u, or the empty string,
  • rse: and end with the letters r, s, and e.

Notice that in the second part of this expression, the empty string constant is used to make the whole group optional. We’ll see more about this in the second article in the series.

Example 2: Numbers

Let’s look at a slightly more involved example that uses Kleene’s star. Here is a regular expression to represent integer and decimal numbers like 42, 0, or 12.34:

(0|((1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)*))(((0|1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)*)|ε)

Let me format and comment it for you:

(                          # The integer part:
  0                        #   - 0
  |                        #   or
  (
    (1|2|3|4|5|6|7|8|9)    #   - a digit between 1 and 9,
    (0|1|2|3|4|5|6|7|8|9)* #     followed by any number of digits between 0 and 9.
  )
)
(                          # The decimal part:
  (
    .                      #   - a period,
    (0|1|2|3|4|5|6|7|8|9)  #   - followed by a digit between 0 and 9,
    (0|1|2|3|4|5|6|7|8|9)* #   - followed by any number of digits between 0 and 9.
  )
  |                        #   or
  ε                        #   - the empty string (making the decimal part optional).
)

Some other readers might point out that this is very verbose whereas the \d and [1-9] forms could have reduced the size of this regex. But that syntax is not part of the formal mathematics of regular expression.

Example 3: ISO 8601 Dates

Coming back to the language of ISO 8601 formatted dates, here’s a simple (but verbose) regular expression to describe it:

((0|1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9))-(0(1|2|3|4|5|6|7|8|9)|1(0|1|2))-(0(1|2|3|4|5|6|7|8|9)|1(0|1|2|3|4|5|6|7|8|9)|2(0|1|2|3|4|5|6|7|8|9)|3(0|1))

Okay, I admit, that’s really stodgy. Again, it will be easier to read with a bit of formatting and comments:

(                          # The year:
  (0|1|2|3|4|5|6|7|8|9)    #   - millennium
  (0|1|2|3|4|5|6|7|8|9)    #   - centuary
  (0|1|2|3|4|5|6|7|8|9)    #   - decade
  (0|1|2|3|4|5|6|7|8|9)    #   - year
)
-
(                          # The month:
  0(1|2|3|4|5|6|7|8|9)     #   - January to September
  |                        #   or
  1(0|1|2)                 #   - October to December
)
-
(                          # The day:
  0(1|2|3|4|5|6|7|8|9)     #   - 1st to 9th
  |                        #   or
  1(0|1|2|3|4|5|6|7|8|9)   #   - 10th to 19th
  |                        #   or
  2(0|1|2|3|4|5|6|7|8|9)   #   - 20th to 29th
  |                        #   or
  3(0|1)                   #   - 30th or 31st
)

The astute reader will notice that this “simple” regular expression produces some invalid dates, like 2023-02-31. Of course, February does not have 31 days. A more serious regular expression would handle those cases, but I wanted to keep it simple.

Next

As we just saw through the three examples, Kleene’s regular expressions are simple but seem powerful enough to represent a wide variety of regular expression. In the next article, we will leave the realm of mathematics and delve into the early applications of regular expressions in computing. Moreover, we will demonstrate that despite introducing new syntax, those “extended” regular expressions don’t add more expressive power over Kleene’s regular expressions and that the few constants and three operations we discovered here are actually enough to represent all expressions you may have ever written.

David

Associate · Senior Software Engineer

Read his/her presentation

Do not miss any of our news

Every month, receive your share of information about us, such as the release of new articles or products.