Locating

We wish to obtain the start and end locations (as integers) of a given pattern in a target string.

In this section we will cover the following four common string pattern location scenarios:

  • First Match where we cover how to obtain the start and end locations of the first occurrence of a given pattern in a given string.
  • All Matches where we cover how to obtain the start and end locations of all occurrences of a given pattern in a given string.
  • Last Match where we cover how to obtain the start and end locations of the last occurrence of a given pattern in a given string.
  • nth Match where we cover how to obtain the start and end locations of the nth occurrence of a given pattern in a given string.

For each of these four scenarios, we will cover two cases:

  • Substring where the pattern is a plain character sequence
  • Regular Expressions where the pattern is a regular expression

In addition, we cover the following scenarios which can be applied to extend any of the above:

  • Ignore Case where we cover how to ignore the case (of both pattern and string) while matching.
  • Pattern Column where we cover how to match a vector of strings against a vector of patterns of the same size. This is often needed when we wish to locate the value of a column in another column for each row.
  • Multiple Patterns where we cover how to extend any of the above scenarios to check against multiple patterns at a time.

Locating a pattern in a string is often followed by additional steps to achieve a goal. For instance, a common next step after locating patterns in a string is to substitute them with other values. We cover that in Substituting.

First Match

Substring

We wish to obtain the location of the first match of a substring in a string.

In this example, we wish to create a new column col_2 where each row value holds the location of the first character of the first match of the string 'gm' in the corresponding string value of the column col_1.

SELECT col_1,
       INSTR(col_1, 'gm') AS col_2
FROM table_1;

Here is how this works:

  • We use the INSTR() function to identify the location of the first match of the substring 'gm' in each value of the column col_1.
  • The function INSTR(source_value, search_value[, position[, occurrence]]) takes as input:
    • source_value: a string value or a column of string values to look into which in this case is col_1.
    • search_value: a pattern to look for which in this case is the substring 'gm'.
    • position: an optional parameter to specify the search start position in source_value, otherwise it starts at 1, which is the beginning of source_value. If position is negative, the function searches backward from the end of source_value, with -1 indicating the last character.
    • occurrence: an optional parameter that if specified, the search returns the position of a specific instance of search_value in source_value. If not specified, occurrence defaults to 1 and returns the position of the first occurrence. For occurrence > 1, the function includes overlapping occurrences. occurrence is of type INT64 and must be positive.
  • INSTR() returns 0 if search_value is not found, or if occurrence is greater than the number of matches found, or if position is greater than the length of source_value.
  • Since we are looking for the first occurrence, we don't pass position and occurrence to INSTR().
  • The output table has a new column col_2containing the starting positions of the first occurrence of the pattern 'gm' in each element of col_1.

Alternative: Using STRPOS function

SELECT col_1,
       STRPOS(col_1, 'gm') AS col_2
FROM table_1;

Here is how this works:

  • We use the STRPOS() function to identify the location of the first match of the substring 'gm' in each value of the column col_1.
  • The function STRPOS() takes as input:
    • A string value or a column of string values to look into which in this case is col_1.
    • A pattern to look for which in this case is the substring 'gm'.
  • The output of STRPOS(value1,value2) is the 1-based position of the first occurrence of value2 inside value1. Returns 0 if value2 is not found.
  • In SQL indexing is 1-based. For instance, a match that occurs at the very beginning of the string would have an index of 1.
  • The output table has a new column col_2containing the starting positions of the first occurrence of the pattern 'gm' in each element of col_1.

Regular Expression

We wish to obtain the location of the first match of a regular expression in a string.

In this example, we wish to create two new columns start_pos and end_pos where each row value holds the location of the first character and the last character, respectively, of the first match of a sequence of integers in the corresponding string value of the column col_1.

SELECT col_1,
       REGEXP_INSTR(col_1, '\\d+', 1, 1, 0) AS start_pos,
       REGEXP_INSTR(col_1, '\\d+', 1, 1, 1) - 1  AS end_pos
FROM table_1;

Here is how this works:

  • We use the REGEXP_INSTR() function to identify the start and end locations of the first match of the regular expression '\\d+' in each value of the column col_1.
  • The function REGEXP_INSTR(source_value, regexp [, position[, occurrence, [occurrence_position]]]) takes as input:
    • source_value: a string value or a column of string values to look into which in this case is col_1.
    • regexp: s regular expression to look for which in this case is the substring '\\d+',
    • position: an optional parameter to specify the search start position in source_value, otherwise it starts at 1, which is the beginning of source_value.
    • occurrence: an optional parameter that if specified, the search returns the position of a specific instance of search_value in source_value. If not specified, occurrence defaults to 1 and returns the position of the first occurrence. For occurrence > 1, the function searches for the next, non-overlapping occurrence. occurrence is of type INT64 and must be positive.
    • occurrence_position: an optional parameter that specifies where a position is in relation to an occurrence.
      • 0 returns the start position of occurrence.
      • 1 returns the end position of occurrence + 1. If the end of the occurrence is at the end of source_value, LENGTH(source_value) + 1 is returned.
  • REGEXP_INSTR() returns 0 if search_value is not found, or if occurrence is greater than the number of matches found, or if position is greater than the length of source_value, or if the regular expression is empty.
  • REGEXP_INSTR(col_1, '\\d+', 1, 1, 0) returns the start position of the first match.
  • REGEXP_INSTR(col_1, '\\d+', 1, 1, 1) returns the last position of the first match + 1. Thus, we subtract 1 to get the position of the last character in the first match.
  • Usually, when matching regular expressions, we need to hold on to both the start and end locations because the number of characters matched may vary; in this case, the digit sequences captured may be of different lengths.
  • The resulting table will have the same number of rows as the original table table_1, with two additional start_pos and end_pos columns containing the locations of the first and last characters, respectively, of the first sequence of digits in the corresponding value of the column col_1.

All Matches

Substring

We wish to obtain the locations of all matches of a given substring in a given string.

In this example, we wish to locate the start position of each occurrence of the substring ‘gm’ in each value of the column col_1.

SELECT *,
       ARRAY(SELECT INSTR(col_1, 'gm', 1, occ + 1)
             FROM UNNEST(
                     REGEXP_EXTRACT_ALL(col_1, 'gm')) AS match
             WITH OFFSET AS occ) AS col_2
FROM table_1;

Here is how this works:

  • We use the REGEXP_EXTRACT_ALL(value, regexp) function to get an array of all matches of the substring 'gm' in each value of the column col_1. We then use this array to get the positions of the matches using INSTR().
  • There is no function to match based on substring, but passing a substring to REGEXP_EXTRACT_ALL'() will work as expected as long as it doesn't contain any special characters used in regular expressions.
  • The REGEXP_EXTRACT_ALL function only returns non-overlapping matches. For example, using this function to extract ana from banana returns only one substring, not two.
  • The REGEXP_EXTRACT_ALL function returns an array of all substrings of value that match the regular expression, regexp.
  • To obtain the start positioning of each match, we unnest the array using UNNEST(). The WITH OFFSET clause is used to generate a new column that contains the 0-based index of each element in the array. The occ alias represents the index column in the subquery.
  • INSTR(col_1, match, 1, occ + 1) will return the start position of the respective occurrence. See First Match for more details.
  • The resulting table will have the same number of rows as the original table table_1, with an added column col_2 holding the array of starting position for all occurrences of the substring ‘gm’ for the corresponding value of the column col_1.

Regular Expression

We wish to obtain the locations of all matches of a regular expression in a string.

In this example, we wish to locate the start position of each occurrence of a sequence of integers in each value of the column col_1.

SELECT col_1,
       ARRAY(SELECT REGEXP_INSTR(col_1, '\\d+', 1, occ + 1)
             FROM UNNEST(
                     REGEXP_EXTRACT_ALL(col_1, '\\d+')) AS match
             WITH OFFSET AS occ) AS col_2
FROM table_1;

Here is how this works:

  • This works just like the Substring case above except that we pass a regular expression to REGEXP_EXTRACT_ALL(). Note that REGEXP_EXTRACT_ALL() expects a regular expression by default.
  • We also use REGEXP_INSTR() instead of INSTR() to get the start position of the first match. For more details see Regular Expressions in First Match.
  • The resulting table will have the same number of rows as the original table table_1, with an added column col_2 holding the array of starting position for all occurrences of the pattern ‘\\d+’ for the corresponding value of the column col_1.

Last Match

Substring

We wish to obtain the location of the last match of a substring in a string.

In this example, we wish to create a new column col_2 where each row value holds the location of the first character of the last match of the string 'gm' in the corresponding string value of the column col_1.

SELECT col_1,
       INSTR(col_1, 'gm', -1, 1)  AS col_2
FROM table_1;

Here is how this works:

  • This works similarly to First Match except we pass -1 as position parameter to start searching from the end of the string.
  • The output table has a new column col_2 containing the starting positions of the last occurrence of the pattern 'gm' in each element of col_1.

Regular Expression

We wish to obtain the location of the last match of a regular expression in a string.

In this example, we wish to create two new columns start_pos and end_pos where each row value holds the location of the first character and the last character, respectively, of the last match of a sequence of integers in the corresponding string value of the column col_1.

WITH table_2 AS
         (SELECT col_1,
                 ARRAY(SELECT REGEXP_INSTR(col_1, '\\d+', 1, occ + 1)
                       FROM UNNEST(REGEXP_EXTRACT_ALL(col_1, '\\d+')) AS match
                       WITH OFFSET AS occ) AS start_matches,
                 ARRAY(SELECT REGEXP_INSTR(col_1, '\\d+', 1, occ + 1, 1) - 1
                       FROM UNNEST(REGEXP_EXTRACT_ALL(col_1, '\\d+')) AS match
                       WITH OFFSET AS occ) AS end_matches
          FROM table_1)
SELECT col_1,
       ARRAY_REVERSE(start_matches)[SAFE_ORDINAL(1)] AS start_pos,
       ARRAY_REVERSE(end_matches)[SAFE_ORDINAL(1)]   AS end_pos
FROM table_2;

Here is how this works:

  • This works just like the Substring case above except that we pass a regular expression to the regex argument of REGEXP_EXTRACT_ALL() and we use REGEXP_INSTR() to get the start and end positions of the matches. See All Matches and First Match for regular expressions for more details.
  • We split it into two steps for readability. In the first CTE we get the start positions of each occurrence in start_matches column and the position of the last character in each occurrence in end_matches column.
  • In the second step, we reverse the array using ARRAY_REVERSE() to get the first element, so we get the start and end position of the last occurrence of the pattern '\\d+'. SAFE_ORDINAL(1) will return the first element in the array and null if the array is empty.

nth Match

We wish to obtain the location of the nth match of a pattern in a string.

In this example, we wish to create two new columns start_pos and end_pos where each row value holds the location of the first character and the last character, respectively, of the nth match of a sequence of integers in the corresponding string value of the column col_1.

WITH table_2 AS
         (SELECT col_1,
                 ARRAY(SELECT REGEXP_INSTR(col_1, '\\d+', 1, occ + 1)
                       FROM UNNEST(REGEXP_EXTRACT_ALL(col_1, '\\d+')) AS match
                       WITH OFFSET AS occ) AS start_matches,
                 ARRAY(SELECT REGEXP_INSTR(col_1, '\\d+', 1, occ + 1, 1) - 1
                       FROM UNNEST(REGEXP_EXTRACT_ALL(col_1, '\\d+')) AS match
                       WITH OFFSET AS occ) AS end_matches
          FROM table_1)
SELECT col_1,
       start_matches[SAFE_ORDINAL(2)] AS start_pos,
       end_matches[SAFE_ORDINAL(2)]   AS end_pos
FROM table_2;

Here is how this works:

  • At a high level, the approach we follow here to locate the nth occurrence of a pattern in a string, we use REGEXP_EXTRACT_ALL() to locate all occurrences and then extract the location of the nth occurrence.
  • This works similarly to Last Match scenario above except we get the second occurrence using SAFE_ORDINAL(2) and without reversing the array.
  • While in this example, the pattern we are locating is a regular expression, we can just as well use a plain string. We simply pass that to REGEXP_EXTRACT_ALL() and REGEXP_INSTR(). See Substring under All Matches above.
  • The resulting table will have the same number of rows as the original table table_1, with two additional start_pos and end_pos columns containing the locations of the first and last characters, respectively, of the 2nd sequence of digits in the corresponding value of the column col_1.

Ignore Case

We wish to obtain the location of the first match of a substring in a string irrespective of whether the letters are in upper or lower case; i.e. while ignoring case.

In this example, we wish to create a new column col_2 where each row value holds the location of the first character of the first match regardless of case, of the string 'gm' in the corresponding string value of the column col_1.

SELECT col_1,
       INSTR(LOWER(col_1), 'gm') AS col_2
FROM table_1;

Here is how this works:

  • To locate the first occurrence of a given substring in a given string, we use INSTR() as described in First Match above.
  • To ignore case while matching, we convert col_1 to lower case using LOWER(). See Ignore Case under Detecting for more details.

Pattern Column

We wish to match a column of strings against a column of patterns of the same size. This is often needed when we wish to locate the value of a column in the value of another column for each row.

Substring

In this example, we have a table table_1 with two columns col_1 and col_2, we wish to locate the start of the first occurrence of the value of col_2 in col_1.

SELECT col_1,
       col_2,
       INSTR(col_1, col_2) AS col_3
FROM table_1;

Here is how this works:

Regular Expression

In this example, we have a table table_1 with two columns col_1 and col_2, we wish to locate the start and end of the first occurrence of a sequence of repeated col_2 value in col_1.

 SELECT col_1,
        col_2,
        REGEXP_INSTR(col_1, col_2 || '+', 1, 1, 0)     AS start_pos,
        REGEXP_INSTR(col_1, col_2 || '+', 1, 1, 1) - 1 AS end_pos
 FROM table_1;

Here is how this works:

  • This works similarly to First Match regular expression scenario above, except we pass col_2 instead of a regular expression.
  • We construct the regular expression by adding ‘+’ to each value of the column col_2 via concat operator || to specify a regular expression that matches one or more occurrences of the value of col_2.

Multiple Patterns

We will cover four scenarios of locating multiple patterns:

  • First Match of Any Pattern where we cover how to obtain the start and end locations of the first occurrence of the first occurring pattern of n patterns.
  • First Match of Each Pattern where we cover how to obtain the start and end locations of the first occurrence of each pattern of n patterns.
  • All Matches in Separate Arrays where we cover how to obtain the start and end locations of all occurrences of all patterns keeping the occurrences of each pattern in a separate array.

First Match of Any Pattern

We wish to return the start and end locations of the first occurrence of the first occurring pattern of n patterns. In other words, only one location is returned and that is of whichever pattern occurs first.

SELECT col_1,
       REGEXP_INSTR(col_1, '\\d+|gm', 1, 1, 0) AS start_pos,
       REGEXP_INSTR(col_1, '\\d+|gm', 1, 1, 1) - 1  AS end_pos
FROM table_1;

Here is how this works:

  • We use the or operator | of regular expressions to build a regular expression that captures all the patterns we wish to look for or’ed together. In this case that regular expression is '\\d+|gm'.
  • For each value of col_1, REGEXP_INSTR() will return the start and end locations of whichever pattern occurs first (either '\\d+’ or ‘gm’).
  • The output table will be a copy of the input table table_1 with two added columns start_pos and end_pos.
  • See Regular Expression under First Match above for a more detailed description.

First Match of Each Pattern

We wish to return the start and end locations of the first occurrence of each pattern of n patterns. In other words, n locations are returned one for each pattern.

SELECT col_1,
       REGEXP_INSTR(col_1, '\\d+', 1, 1, 0)     AS pattern_1_start_pos,
       REGEXP_INSTR(col_1, '\\d+', 1, 1, 1) - 1 AS pattern_1_start_pos,
       REGEXP_INSTR(col_1, 'gm', 1, 1, 0)          AS pattern_2_start_pos,
       REGEXP_INSTR(col_1, 'gm', 1, 1, 1) - 1      AS pattern_2_start_pos
FROM table_1;

Here is how this works:

  • We need to create two separate columns for each pattern. The code works similarly to First Match above.
  • The output table will be a copy of the input table table_1 with 2 added columns for each pattern.

All Matches in One Array

We wish to return the start and end locations of all occurrences of all patterns in the order of their occurrence as a single array.

SELECT col_1,
       ARRAY(SELECT
             STRUCT (
                 REGEXP_INSTR(col_1, '\\d+|gm', 1, occ + 1) AS start_pos ,
                 REGEXP_INSTR(col_1, '\\d+|gm', 1, occ + 1, 1) - 1 AS end_pos)
             FROM UNNEST(
                     REGEXP_EXTRACT_ALL(col_1, '\\d+|gm')) AS match
             WITH OFFSET AS occ) AS col_2
FROM table_1;

Here is how this works:

  • We use the or operator | of regular expressions to build a regular expression that captures all the patterns we wish to look for or’ed together. In this case that regular expression is '\\d+|gm'.
  • For each value of col_1, REGEXP_EXTRACT_ALL() will return an array with all matching substrings.
  • For each matching substring returned by REGEXP_EXTRACT_ALL() we locate the start and end locations for each occurrence of each pattern and put them in a STRUCT.
  • The output table will be a copy of the input table table_1 with an added column col_2 where each cell holds an array of structs with the start and end position of all matches the corresponding value of col_1.

All Matches in Separate Arrays

We wish to return the start and end locations of all occurrences of all patterns keeping the occurrences of each pattern in a separate array.

SELECT col_1,
       ARRAY(SELECT
             STRUCT (
                 REGEXP_INSTR(col_1, '\\d+', 1, occ + 1) AS start_pos ,
                 REGEXP_INSTR(col_1, '\\d+', 1, occ + 1, 1) - 1 AS end_pos)
             FROM UNNEST(
                     REGEXP_EXTRACT_ALL(col_1, '\\d+')) AS match
             WITH OFFSET AS occ) AS col_2,
       ARRAY(SELECT
             STRUCT (
                 REGEXP_INSTR(col_1, 'gm', 1, occ + 1) AS start_pos ,
                 REGEXP_INSTR(col_1, 'gm', 1, occ + 1, 1) - 1 AS end_pos)
             FROM UNNEST(
                     REGEXP_EXTRACT_ALL(col_1, 'gm')) AS match
             WITH OFFSET AS occ) AS col_3,
FROM table_1;

Here is how this works:

  • We create spear arrays in separate columns for each pattern. This works similarly to All Matches in One Array scenario above.
  • The output table will be a copy of the input table table_1 with two added columns col_2 and col_3 where each column holds the array of matches for each pattern.
SQL
I/O