We wish to extract a substring that matches a given pattern from a given string.
Substring extraction via a regular expression is one of the most powerful and frequently used techniques in data manipulation, especially for data cleaning.
We will look at three scenarios of substring extraction via regular expressions:
For each of these scenarios, we will cover two cases:
In addition, we cover the following three scenarios which can be applied to extend any of the above:
First Match
We wish to extract the first match of a regular expression pattern from a given string.
In this example, we wish to extract the first occurrence of a number (a sequence of digits) followed
by the letters ‘kg’
with zero or more empty space characters between.
SELECT col_1,
REGEXP_EXTRACT(col_1, '\\d+\\s*kg') AS col_2
FROM table_1
Here is how this works:
REGEXP_EXTRACT()
function to extract a substring that matches
a given regular expression from each value of the column col_1
.REGEXP_EXTRACT()
function takes the following arguments:col_1
.'\\d+\\s*kg'
.'\\d+\\s*kg'
which breaks down as follows:\\d+
matches one or more numeric characters.\\s*
matches zero or more empty space characterskg
matches the string kg
.table_1
with an added
column col_2
holding the extracted substrings from the corresponding value of the
column col_1
.All Matches
We wish to extract all matches of a regular expression pattern from a given string.
In this example, we wish to extract all occurrences of a number (a sequence of digits) followed by
the letters ‘kg’
with zero or more empty space characters between. We then wish to combine the
extracted values into a single string with comma separators.
SELECT col_1,
ARRAY_TO_STRING(REGEXP_EXTRACT_ALL(col_1, '\\d+\\s*kg'), ', ') AS col_2
FROM table_1;
Here is how this works:
REGEXP_EXTRACT_ALL(value, regexp)
function to extract all occurrences of a
given regular expression from each value of the column col_1
.REGEXP_EXTRACT_ALL()
returns an array of strings where each array holds the
matches extracted from the corresponding value of the column col_1
.ARRAY_TO_STRING()
to reduce each array of matches to a single string where the extracted
matches are separated by commas.Extension: Nth Match
We wish to extract the nth match of a regular expression pattern from a given string.
SELECT col_1,
REGEXP_EXTRACT(col_1, '\\d+\\s*kg', 1, 2) AS col_2
FROM table_1;
Here is how this works:
REGEXP_EXTRACT()
to extract the nth pattern match from each element of the
column col_1
.REGEXP_EXTRACT(value, regexp[, position[, occurrence]])
takes as input:col_1
.'\\d+'
,value
,
otherwise it starts at 1, which is the beginning of value
.value
in value
. If not specified, occurrence
defaults
to 1 and returns the position of the first occurrence. For occurrence > 1
, the function
searches for additional occurrences beginning with the character following the previous
occurrence.REGEXP_EXTRACT()
1 as position
and 2
as occurrence
.Extension: Last Match
We wish to extract the last match of a regular expression pattern from a given string.
SELECT col_1,
ARRAY_REVERSE(REGEXP_EXTRACT_ALL(col_1, '\\d+\\s*kg'))[SAFE_ORDINAL(1)] AS col_2
FROM table_1;
Here is how this works:
REGEXP_EXTRACT_ALL()
to extract all pattern matches from each element of the
column col_1
;
as described in the primary solution above.col_1
, andSAFE_ORDINAL(1)
, to get the first element in the revered array which is the last
match in this case.First Match
We wish to extract a part, specified by a capture group, of the first match of a regular expression pattern from a given string.
In this example, we wish to extract the first occurrence of a number (a sequence of digits) that is
followed by the letters ‘kg’
with zero or more empty space characters between. Note we only wish
to extract the number component.
SELECT col_1,
REGEXP_EXTRACT(col_1, '(\\d+)\\s*kg') AS col_2
FROM table_1;
Here is how this works:
‘(\\d+)\\s*kg’
which works as follows:(\\d+)
defines a capture group that contains a sequence of digits. A capture group is a part
of a regular expression that can be referred to.\\s*kg
denotes a pattern where zero or more empty space characters are followed by the
string ‘kg’
.'kg'
but we do not wish to include 'kg'
in the output.table_1
with an added
column col_2
holding the extracted substrings (corresponding to the first capture group)
from the corresponding value of the column col_1
.All Matches
We wish to extract a part, specified by a capture group, of each match of a regular expression pattern from a given string.
In this example, we wish to extract and add up all occurrences of a number (a sequence of digits)
that is followed by the letters ‘kg’
(with no or more empty space characters between). Note we
only wish to extract the number component.
SELECT col_1,
REGEXP_EXTRACT_ALL(col_1, '(\\d+)\\s*kg') AS col_2
FROM table_1;
Here is how this works:
We wish to extract multiple parts, specified by multiple capture groups, of the first match of a regular expression pattern from a given string.
Unfortunately in SQL we cannot use regular expressions with multiple capture groups. If we wish to extract capturing groups we do it using multiple regular expressions as separate columns.
In this example, we wish to extract two parts of the first occurrence of a string pattern; the first
representing a numerical quantity and the second representing the unit from each value of the
column col_1
, and store each in its own column.
SELECT col_1,
REGEXP_EXTRACT(col_1, '(\\d+)\\s*kg') AS value,
REGEXP_EXTRACT(col_1, '\\d+\\s*(kg)') AS unit
FROM table_1;
Here is how this works:
REGEXP_EXTRACT()
to extract the first substrings matching a capture group from each
element in column.table_1
with two columns
added value
and unit
containing the values captured by the two capture groups.We wish to extract the first match of a regular expression pattern from a given string while ignoring case.
In this example, we wish to extract the first occurrence of a number (a sequence of digits) followed
by the letters ‘kg’
regardless of case with zero or more empty space characters between.
SELECT col_1,
REGEXP_EXTRACT(LOWER(col_1), '\\d+\\s*kg') AS col_2,
FROM table_1;
Here is how this works:
REGEXP_EXTRACT()
as described in Entire Pattern above.We wish to match a vector of strings against a column of patterns. This is often needed when the regular expression we are matching needs to be different for each row based on the value of a particular column.
In this example, we have a table table_1
with two column col_1
and col_2
, we wish to extract
from each value in col_1
a substring composed of a sequence of digits followed by a set of
characters representing a unit specified by the corresponding value of the column col_2
.
SELECT col_1,
col_2,
REGEXP_EXTRACT(col_1, CONCAT('(\\d+)\\s*', col_2)) AS col_3,
FROM table_1;
Here is how this works:
REGEXP_EXTRACT()
:col_1
as the first argument.col_2
and
is naturally of the same size as col_1
.col_2
to the
regular expression '(\\d+)\\s*'
for each row. For instance, if the value of the column col_2
is ‘kg’
then the value of the regular expression to extract will be '(\\d+)\\s*kg'
.table_1
but with an additional
column col_2
containing the substrings extracted.