We wish to obtain the start and end locations (as integers) of a given pattern in a target string.
In this section we will cover the following four common string pattern location scenarios:
For each of these four scenarios, we will cover two cases:
In addition, we cover the following scenarios which can be applied to extend any of the above:
Locating a pattern in a string is often followed by additional steps to achieve a goal. For instance, a common next step after locating patterns in a string is to substitute them with other values. We cover that in Substituting.
Substring
We wish to obtain the location of the first match of a substring in a string.
In this example, we wish to create a new column col_2
where each row value holds the location of
the first character of the first match of the string 'gm'
in the corresponding string value of the
column col_1
.
SELECT col_1,
INSTR(col_1, 'gm') AS col_2
FROM table_1;
Here is how this works:
INSTR()
function to identify the location of the first match of the substring 'gm'
in each value of the column col_1
.INSTR(source_value, search_value[, position[, occurrence]])
takes as input:col_1
.'gm'
.source_value
,
otherwise it starts at 1, which is the beginning of source_value
. If position is negative,
the function searches backward from the end of source_value
, with -1 indicating the last
character.search_value
in source_value
. If not specified, occurrence defaults
to 1 and returns the position of the first occurrence. For occurrence > 1
, the function
includes overlapping occurrences. occurrence is of type INT64
and must be positive.INSTR()
returns 0 if search_value
is not found, or if occurrence
is greater than the number
of matches found, or if position
is greater than the length of source_value
.position
and occurrence
to INSTR()
.col_2
containing the starting positions of the first occurrence
of the pattern 'gm'
in each element of col_1
.Alternative: Using STRPOS function
SELECT col_1,
STRPOS(col_1, 'gm') AS col_2
FROM table_1;
Here is how this works:
STRPOS()
function to identify the location of the first match of the substring 'gm'
in each value of the column col_1
.STRPOS()
takes as input:col_1
.'gm'
.STRPOS(value1,value2)
is the 1-based position of the first occurrence of value2
inside value1
. Returns 0 if value2
is not found.col_2
containing the starting positions of the first occurrence
of the pattern 'gm'
in each element of col_1
.Regular Expression
We wish to obtain the location of the first match of a regular expression in a string.
In this example, we wish to create two new columns start_pos
and end_pos
where each row value
holds the
location of the first character and the last character, respectively, of the first match of a
sequence of integers in the corresponding string value of the column col_1
.
SELECT col_1,
REGEXP_INSTR(col_1, '\\d+', 1, 1, 0) AS start_pos,
REGEXP_INSTR(col_1, '\\d+', 1, 1, 1) - 1 AS end_pos
FROM table_1;
Here is how this works:
REGEXP_INSTR()
function to identify the start and end locations of the first match of
the regular expression '\\d+'
in each
value of the column col_1
.REGEXP_INSTR(source_value, regexp [, position[, occurrence, [occurrence_position]]])
takes as input:col_1
.'\\d+'
,source_value
,
otherwise it starts at 1, which is the beginning of source_value
.search_value
in source_value
. If not specified, occurrence
defaults
to 1 and returns the position of the first occurrence. For occurrence > 1
, the function
searches for the next, non-overlapping occurrence. occurrence is of type INT64
and must be
positive.occurrence
.occurrence + 1
. If the end of the occurrence is at the end
of source_value, LENGTH(source_value) + 1
is returned.REGEXP_INSTR()
returns 0 if search_value
is not found, or if occurrence
is greater than the
number
of matches found, or if position
is greater than the length of source_value
, or if the regular
expression is empty.REGEXP_INSTR(col_1, '\\d+', 1, 1, 0)
returns the start position of the first match.REGEXP_INSTR(col_1, '\\d+', 1, 1, 1)
returns the last position of the first match + 1. Thus, we
subtract 1 to get the position of the last character in the first match.start
and end
locations
because the number of characters matched may vary; in this case, the digit sequences captured
may be of different lengths.table_1
,
with two additional start_pos
and end_pos
columns containing the locations of the first and
last characters, respectively, of the first sequence of digits in the corresponding value of the
column col_1
.Substring
We wish to obtain the locations of all matches of a given substring in a given string.
In this example, we wish to locate the start position of each occurrence of the substring ‘gm’
in
each value
of
the column col_1
.
SELECT *,
ARRAY(SELECT INSTR(col_1, 'gm', 1, occ + 1)
FROM UNNEST(
REGEXP_EXTRACT_ALL(col_1, 'gm')) AS match
WITH OFFSET AS occ) AS col_2
FROM table_1;
Here is how this works:
REGEXP_EXTRACT_ALL(value, regexp)
function to get an array of all matches
of the substring 'gm'
in each value of the column col_1
. We then use this array to get the
positions of the matches using INSTR()
.REGEXP_EXTRACT_ALL'()
will work as expected as
long as it doesn't contain any special characters used in regular expressions.REGEXP_EXTRACT_ALL
function only returns non-overlapping matches. For example, using this
function to extract ana
from banana
returns only one substring, not two.REGEXP_EXTRACT_ALL
function returns an array of all substrings of value that match the
regular expression, regexp
.UNNEST()
.
The WITH OFFSET
clause is used to generate a new column that contains the 0-based index of each
element in the array. The occ
alias represents the index column in the subquery.INSTR(col_1, match, 1, occ + 1)
will return the start position of the respective occurrence.
See First Match for more details.table_1
,
with an added column col_2
holding the array of starting position for all occurrences of the
substring ‘gm’
for the corresponding value of the column col_1
.Regular Expression
We wish to obtain the locations of all matches of a regular expression in a string.
In this example, we wish to locate the start position of each occurrence of a sequence of integers
in each
value of the column col_1
.
SELECT col_1,
ARRAY(SELECT REGEXP_INSTR(col_1, '\\d+', 1, occ + 1)
FROM UNNEST(
REGEXP_EXTRACT_ALL(col_1, '\\d+')) AS match
WITH OFFSET AS occ) AS col_2
FROM table_1;
Here is how this works:
REGEXP_EXTRACT_ALL()
. Note that REGEXP_EXTRACT_ALL()
expects a regular expression by
default.REGEXP_INSTR()
instead of INSTR()
to get the start position of the first match.
For more details see Regular Expressions in First Match.table_1
,
with an added column col_2
holding the array of starting position for all occurrences of the
pattern ‘\\d+’
for the corresponding value of the column col_1
.Substring
We wish to obtain the location of the last match of a substring in a string.
In this example, we wish to create a new column col_2
where each row value holds the location of
the first character of the last match of the string 'gm'
in the corresponding string value of the
column col_1
.
SELECT col_1,
INSTR(col_1, 'gm', -1, 1) AS col_2
FROM table_1;
Here is how this works:
position
parameter to
start searching from the end of the string.col_2
containing the starting positions of the last occurrence
of the pattern 'gm'
in each element of col_1
.Regular Expression
We wish to obtain the location of the last match of a regular expression in a string.
In this example, we wish to create two new columns start_pos
and end_pos
where each row value
holds the
location of the first character and the last character, respectively, of the last match of a
sequence of integers in the corresponding string value of the column col_1
.
WITH table_2 AS
(SELECT col_1,
ARRAY(SELECT REGEXP_INSTR(col_1, '\\d+', 1, occ + 1)
FROM UNNEST(REGEXP_EXTRACT_ALL(col_1, '\\d+')) AS match
WITH OFFSET AS occ) AS start_matches,
ARRAY(SELECT REGEXP_INSTR(col_1, '\\d+', 1, occ + 1, 1) - 1
FROM UNNEST(REGEXP_EXTRACT_ALL(col_1, '\\d+')) AS match
WITH OFFSET AS occ) AS end_matches
FROM table_1)
SELECT col_1,
ARRAY_REVERSE(start_matches)[SAFE_ORDINAL(1)] AS start_pos,
ARRAY_REVERSE(end_matches)[SAFE_ORDINAL(1)] AS end_pos
FROM table_2;
Here is how this works:
REGEXP_EXTRACT_ALL()
and we use REGEXP_INSTR()
to get the start and end
positions of the matches. See All Matches and First Match for
regular expressions for more details.start_matches
column and the position of the last character in each occurrence
in end_matches
column.ARRAY_REVERSE()
to get the first element, so we
get the start and end position of the last occurrence of the pattern '\\d+'
. SAFE_ORDINAL(1)
will return the first element in the array and null if the array is empty.We wish to obtain the location of the nth match of a pattern in a string.
In this example, we wish to create two new columns start_pos
and end_pos
where each row value
holds the
location of the first character and the last character, respectively, of the nth match of a sequence
of integers in the corresponding string value of the column col_1
.
WITH table_2 AS
(SELECT col_1,
ARRAY(SELECT REGEXP_INSTR(col_1, '\\d+', 1, occ + 1)
FROM UNNEST(REGEXP_EXTRACT_ALL(col_1, '\\d+')) AS match
WITH OFFSET AS occ) AS start_matches,
ARRAY(SELECT REGEXP_INSTR(col_1, '\\d+', 1, occ + 1, 1) - 1
FROM UNNEST(REGEXP_EXTRACT_ALL(col_1, '\\d+')) AS match
WITH OFFSET AS occ) AS end_matches
FROM table_1)
SELECT col_1,
start_matches[SAFE_ORDINAL(2)] AS start_pos,
end_matches[SAFE_ORDINAL(2)] AS end_pos
FROM table_2;
Here is how this works:
REGEXP_EXTRACT_ALL()
to locate all occurrences and then extract the location of
the nth occurrence.SAFE_ORDINAL(2)
and without reversing the array.REGEXP_EXTRACT_ALL()
and REGEXP_INSTR()
. See
Substring under All Matches above.table_1
,
with two additional start_pos
and end_pos
columns containing the locations of the first and
last characters, respectively, of the 2nd sequence of digits in the corresponding value of the
column col_1
.We wish to obtain the location of the first match of a substring in a string irrespective of whether the letters are in upper or lower case; i.e. while ignoring case.
In this example, we wish to create a new column col_2
where each row value holds the location of
the first character of the first match regardless of case, of the string 'gm'
in the corresponding
string value of the column col_1
.
SELECT col_1,
INSTR(LOWER(col_1), 'gm') AS col_2
FROM table_1;
Here is how this works:
INSTR()
as
described in First Match above.col_1
to lower case using LOWER()
.
See Ignore Case under
Detecting for more details.We wish to match a column of strings against a column of patterns of the same size. This is often needed when we wish to locate the value of a column in the value of another column for each row.
Substring
In this example, we have a table table_1
with two columns col_1
and col_2
, we wish to locate
the start
of the first occurrence of the value of col_2
in col_1
.
SELECT col_1,
col_2,
INSTR(col_1, col_2) AS col_3
FROM table_1;
Here is how this works:
col_2
instead of a
substring.Regular Expression
In this example, we have a table table_1
with two columns col_1
and col_2
, we wish to locate
the start
and end
of the first occurrence of a sequence of repeated col_2
value in col_1
.
SELECT col_1,
col_2,
REGEXP_INSTR(col_1, col_2 || '+', 1, 1, 0) AS start_pos,
REGEXP_INSTR(col_1, col_2 || '+', 1, 1, 1) - 1 AS end_pos
FROM table_1;
Here is how this works:
col_2
instead of a regular expression.‘+’
to each value of the column col_2
via concat operator ||
to specify a regular expression that matches one or more occurrences of
the value of col_2
.We will cover four scenarios of locating multiple patterns:
First Match of Any Pattern
We wish to return the start and end locations of the first occurrence of the first occurring pattern of n patterns. In other words, only one location is returned and that is of whichever pattern occurs first.
SELECT col_1,
REGEXP_INSTR(col_1, '\\d+|gm', 1, 1, 0) AS start_pos,
REGEXP_INSTR(col_1, '\\d+|gm', 1, 1, 1) - 1 AS end_pos
FROM table_1;
Here is how this works:
|
of regular expressions to build a regular expression that captures all
the patterns we wish to look for or’ed together. In this case that regular expression
is '\\d+|gm'
.col_1
, REGEXP_INSTR()
will return the start
and end
locations of
whichever pattern occurs first (either '\\d+’
or ‘gm’
).table_1
with two added
columns start_pos
and end_pos
.First Match of Each Pattern
We wish to return the start and end locations of the first occurrence of each pattern of n patterns. In other words, n locations are returned one for each pattern.
SELECT col_1,
REGEXP_INSTR(col_1, '\\d+', 1, 1, 0) AS pattern_1_start_pos,
REGEXP_INSTR(col_1, '\\d+', 1, 1, 1) - 1 AS pattern_1_start_pos,
REGEXP_INSTR(col_1, 'gm', 1, 1, 0) AS pattern_2_start_pos,
REGEXP_INSTR(col_1, 'gm', 1, 1, 1) - 1 AS pattern_2_start_pos
FROM table_1;
Here is how this works:
table_1
with 2 added
columns for each pattern.All Matches in One Array
We wish to return the start and end locations of all occurrences of all patterns in the order of their occurrence as a single array.
SELECT col_1,
ARRAY(SELECT
STRUCT (
REGEXP_INSTR(col_1, '\\d+|gm', 1, occ + 1) AS start_pos ,
REGEXP_INSTR(col_1, '\\d+|gm', 1, occ + 1, 1) - 1 AS end_pos)
FROM UNNEST(
REGEXP_EXTRACT_ALL(col_1, '\\d+|gm')) AS match
WITH OFFSET AS occ) AS col_2
FROM table_1;
Here is how this works:
|
of regular expressions to build a regular expression that captures all
the patterns we wish to look for or’ed together. In this case that regular expression
is '\\d+|gm'
.col_1
, REGEXP_EXTRACT_ALL()
will return an array with all matching
substrings.REGEXP_EXTRACT_ALL()
we locate the start
and end
locations for each occurrence of each pattern and put them in a STRUCT
.table_1
with an added column col_2
where
each cell holds an array of structs with the start and end position of all
matches the corresponding
value of col_1
.All Matches in Separate Arrays
We wish to return the start and end locations of all occurrences of all patterns keeping the occurrences of each pattern in a separate array.
SELECT col_1,
ARRAY(SELECT
STRUCT (
REGEXP_INSTR(col_1, '\\d+', 1, occ + 1) AS start_pos ,
REGEXP_INSTR(col_1, '\\d+', 1, occ + 1, 1) - 1 AS end_pos)
FROM UNNEST(
REGEXP_EXTRACT_ALL(col_1, '\\d+')) AS match
WITH OFFSET AS occ) AS col_2,
ARRAY(SELECT
STRUCT (
REGEXP_INSTR(col_1, 'gm', 1, occ + 1) AS start_pos ,
REGEXP_INSTR(col_1, 'gm', 1, occ + 1, 1) - 1 AS end_pos)
FROM UNNEST(
REGEXP_EXTRACT_ALL(col_1, 'gm')) AS match
WITH OFFSET AS occ) AS col_3,
FROM table_1;
Here is how this works:
table_1
with two added
columns col_2
and col_3
where each column holds the array of matches for each pattern.