Extracting by Regular Expression

We wish to extract a substring that matches a given pattern from a given string.

Substring extraction via a regular expression is one of the most powerful and frequently used techniques in data manipulation, especially for data cleaning.

We will look at three scenarios of substring extraction via regular expressions:

  • Entire Pattern where we cover how to extract a substring from a given string that matches a given regular expression.
  • One Capture Group where we cover how to match a given regular expression pattern to a given string and then extract part of the pattern commonly referred to as a capture group.
  • Multiple Capture Groups where we cover how to extract multiple parts of a pattern; i.e. multiple capture groups.

For each of these scenarios, we will cover two cases:

  • How to extract the First Match
  • How to extract All Matches

In addition, we cover the following three scenarios which can be applied to extend any of the above:

  • Ignore Case where we cover how to ignore the case (of both pattern and string) while matching.
  • Pattern Column where we cover how to match a column of strings against a column of patterns of the same size. This is quite often needed when we wish to locate the value of a column in another column for each row.

Entire Pattern

First Match

We wish to extract the first match of a regular expression pattern from a given string.

In this example, we wish to extract the first occurrence of a number (a sequence of digits) followed by the letters ‘kg’ with zero or more empty space characters between.

SELECT col_1,
       REGEXP_EXTRACT(col_1, '\\d+\\s*kg') AS col_2
FROM table_1

Here is how this works:

  • We use REGEXP_EXTRACT() function to extract a substring that matches a given regular expression from each value of the column col_1.
  • The REGEXP_EXTRACT() function takes the following arguments:
    • The column whose values we wish to check against; which in this case is col_1.
    • A regular expression; which in this case is '\\d+\\s*kg'.
  • In this example, the regular expression that represents the substrings that we are looking to extract is '\\d+\\s*kg' which breaks down as follows:
    • \\d+ matches one or more numeric characters.
    • \\s* matches zero or more empty space characters
    • kg matches the string kg.
  • The output table will have the same number of rows as the input table table_1 with an added column col_2 holding the extracted substrings from the corresponding value of the column col_1.

All Matches

We wish to extract all matches of a regular expression pattern from a given string.

In this example, we wish to extract all occurrences of a number (a sequence of digits) followed by the letters ‘kg’ with zero or more empty space characters between. We then wish to combine the extracted values into a single string with comma separators.

SELECT col_1,
       ARRAY_TO_STRING(REGEXP_EXTRACT_ALL(col_1, '\\d+\\s*kg'), ', ') AS col_2
FROM table_1;

Here is how this works:

  • We use the REGEXP_EXTRACT_ALL(value, regexp) function to extract all occurrences of a given regular expression from each value of the column col_1.
  • The output of REGEXP_EXTRACT_ALL() returns an array of strings where each array holds the matches extracted from the corresponding value of the column col_1.
  • In order to reduce each array of matches to a single string where the extracted matches are separated by commas, we:
    • Use ARRAY_TO_STRING() to reduce each array of matches to a single string where the extracted matches are separated by commas.

Extension: Nth Match

We wish to extract the nth match of a regular expression pattern from a given string.

SELECT col_1,
       REGEXP_EXTRACT(col_1, '\\d+\\s*kg', 1, 2) AS col_2
FROM table_1;

Here is how this works:

  • We use REGEXP_EXTRACT() to extract the nth pattern match from each element of the column col_1.
  • The function REGEXP_EXTRACT(value, regexp[, position[, occurrence]]) takes as input:
    • value: a string value or a column of string values to look into which in this case is col_1.
    • regexp: s regular expression to look for which in this case is the substring '\\d+',
    • position: an optional parameter to specify the search start position in value, otherwise it starts at 1, which is the beginning of value.
    • occurrence: an optional parameter that if specified, the search returns the position of a specific instance of value in value. If not specified, occurrence defaults to 1 and returns the position of the first occurrence. For occurrence > 1, the function searches for additional occurrences beginning with the character following the previous occurrence.
  • In order to extract the second match, we pass to REGEXP_EXTRACT() 1 as position and 2 as occurrence.

Extension: Last Match

We wish to extract the last match of a regular expression pattern from a given string.

SELECT col_1,
       ARRAY_REVERSE(REGEXP_EXTRACT_ALL(col_1, '\\d+\\s*kg'))[SAFE_ORDINAL(1)] AS col_2
FROM table_1;

Here is how this works:

  • We use REGEXP_EXTRACT_ALL() to extract all pattern matches from each element of the column col_1; as described in the primary solution above.
  • In order to extract the last match:
    • We reverse the array, which contain all matching substrings extracted for each element in col_1, and
    • we use SAFE_ORDINAL(1), to get the first element in the revered array which is the last match in this case.

One Capture Group

First Match

We wish to extract a part, specified by a capture group, of the first match of a regular expression pattern from a given string.

In this example, we wish to extract the first occurrence of a number (a sequence of digits) that is followed by the letters ‘kg’ with zero or more empty space characters between. Note we only wish to extract the number component.

SELECT col_1,
       REGEXP_EXTRACT(col_1, '(\\d+)\\s*kg') AS col_2
FROM table_1;

Here is how this works:

  • This works similarly to Entire Pattern above except we pass a different regular expression.
  • In this example the regular expression is ‘(\\d+)\\s*kg’ which works as follows:
    • (\\d+) defines a capture group that contains a sequence of digits. A capture group is a part of a regular expression that can be referred to.
    • \\s*kg denotes a pattern where zero or more empty space characters are followed by the string ‘kg’.
  • The power of capture groups in substring extraction is most pronounced when parts of the regular expression are to localize the desired substring, but we don’t want to include them in the output extraction. For instance, in this case, we wish to capture any integer that comes before the unit 'kg' but we do not wish to include 'kg' in the output.
  • The output table will have the same number of rows as the input table table_1 with an added column col_2 holding the extracted substrings (corresponding to the first capture group) from the corresponding value of the column col_1.

All Matches

We wish to extract a part, specified by a capture group, of each match of a regular expression pattern from a given string.

In this example, we wish to extract and add up all occurrences of a number (a sequence of digits) that is followed by the letters ‘kg’ (with no or more empty space characters between). Note we only wish to extract the number component.

SELECT col_1,
       REGEXP_EXTRACT_ALL(col_1, '(\\d+)\\s*kg') AS col_2
FROM table_1;

Here is how this works:

Multiple Capture Groups

We wish to extract multiple parts, specified by multiple capture groups, of the first match of a regular expression pattern from a given string.

Unfortunately in SQL we cannot use regular expressions with multiple capture groups. If we wish to extract capturing groups we do it using multiple regular expressions as separate columns.

In this example, we wish to extract two parts of the first occurrence of a string pattern; the first representing a numerical quantity and the second representing the unit from each value of the column col_1, and store each in its own column.

SELECT col_1,
       REGEXP_EXTRACT(col_1, '(\\d+)\\s*kg') AS value,
       REGEXP_EXTRACT(col_1, '\\d+\\s*(kg)') AS unit
FROM table_1;

Here is how this works:

  • We use REGEXP_EXTRACT() to extract the first substrings matching a capture group from each element in column.
  • We do this for each capture group in a separate column.
  • The output table will have the sume number of rows as the input table table_1 with two columns added valueand unit containing the values captured by the two capture groups.

Ignore Case

We wish to extract the first match of a regular expression pattern from a given string while ignoring case.

In this example, we wish to extract the first occurrence of a number (a sequence of digits) followed by the letters ‘kg’ regardless of case with zero or more empty space characters between.

SELECT col_1,
       REGEXP_EXTRACT(LOWER(col_1), '\\d+\\s*kg') AS col_2,
FROM table_1;

Here is how this works:

  • To extract the first occurrence of a given regular expression in a given string, we use REGEXP_EXTRACT() as described in Entire Pattern above.
  • To ignore case while matching, we convert the input column to lower case and use a regular expression with lower case characters. See Ignore Case under Detecting for more details.
  • We can extend the solutions presented in One Capture Group and Multiple Capture Groups the same way.

Pattern Column

We wish to match a vector of strings against a column of patterns. This is often needed when the regular expression we are matching needs to be different for each row based on the value of a particular column.

In this example, we have a table table_1 with two column col_1 and col_2, we wish to extract from each value in col_1 a substring composed of a sequence of digits followed by a set of characters representing a unit specified by the corresponding value of the column col_2.

SELECT col_1,
       col_2,
       REGEXP_EXTRACT(col_1, CONCAT('(\\d+)\\s*', col_2)) AS col_3,
FROM table_1;

Here is how this works:

  • We pass to REGEXP_EXTRACT():
    • the strings to look into which in this case is the column col_1 as the first argument.
    • and the patterns to look for which in this case is a transformation of the column col_2 and is naturally of the same size as col_1.
  • We construct the regular expression patterns by appending the value of the column col_2 to the regular expression '(\\d+)\\s*' for each row. For instance, if the value of the column col_2 is ‘kg’ then the value of the regular expression to extract will be '(\\d+)\\s*kg'.
  • The output table have the same number of rows as the input table table_1 but with an additional column col_2 containing the substrings extracted.
  • See Entire Pattern above for a more detailed description of this code. We can extend the solutions presented in One Capture Group and Multiple Capture Groups the same way.
SQL
I/O