Substituting

We wish to locate one or more parts of a string given their start and end indices and replace those with one or more given replacement string(s).

This section is roughly organized into two parts as follows:

  • Locating Scenarios
    • One Substring: We locate one substring within a given string via its start and end indices and replace that with a given replacement string.
    • Multiple Substrings: We locate multiple substrings within a given string via their start and end indices and replace those substrings with a given replacement string.
    • Location Columns: We have two columns of location indices of the same size as the column of input strings. We use the corresponding elements in the location columns to find the sub-string to be replaced in the input column.
  • Replacement Scenarios
    • Replacement Column: We have a column of replacement strings and a column of input strings . We use the corresponding element in the replacement column to replace the located sub-string in the input column.
    • Custom Replacement: We apply custom logic to determine the replacement string often based on the matched substring.

One Substring

We wish to replace a part of a given string given its start and end location indices.

In this example, we wish to replace the part of a string between the 2nd and 4th characters ( inclusive i.e. 3 characters) from each value of the column col_1 with three hyphens '---'.

SELECT col_1,
       CONCAT(SUBSTR(col_1, 1, 2 - 1),
              '---',
              SUBSTR(col_1, 4)) AS col_2
FROM table_1;

Here is how this works:

  • We split the string into two substrings using SUBSTR():
    • SUBSTR(col_1, 1, 2 - 1): all characters before the start position.
    • SUBSTR(col_1, 4): all characters after the end position.
  • We then use CONCAT() to combine the two substrings with the replacement string.
  • The output table has the same number of rows and columns as the input table table_1 with a new column col_2 with the new string.

Alternative: Creating A Function

CREATE OR REPLACE FUNCTION
    dataset.replace_substr(value STRING,
                           start_position INT64,
                           end_position INT64,
                           replacement STRING)
    RETURNS STRING AS ((SELECT CASE
                                   WHEN end_position <= LENGTH(value) AND end_position > 0 THEN
                                       CONCAT(SUBSTR(value, 1, start_position - 1), replacement,
                                              SUBSTR(value, end_position))
                                   ELSE value END));

We can then use this function and pass any column to it as follows:

SELECT col_1,
       dataset.replace_substr(col_1, 2, 4, '---') AS col_2
FROM table_1;

Here is how this works:

  • We wrap the logic from the primary solution in a function that we can reuse.
  • The new function dataset.replace_substr() takes 4 arguments:
    • A single string we wish to modify.
    • The start (argument name start_position) and end (argument name end_position) locations of the part we wish to replace, which in this case are 2 and 4.
    • The replacement string, which in this case is '---'.
  • We can pass a column to this function to perform the replacement over all rows in the table.
  • The output table has the same number of rows and columns as the input table table_1 with a new column col_2 with the new string.

Multiple Substrings

We wish to replace multiple substrings given their start and end location indices.

In this example, we wish to replace three substrings given their start and end locations from each value of the column col_1 with the hyphen character ‘-’.

WITH RECURSIVE replaces AS (
    SELECT [1, 4, 7]  start_position ,
           [2, 5, 9]  end_position ,
            col_1 input_string,
            col_1 as new_string,
            0 as step,
            (SELECT array_length([2, 5, 9])) + 1 as max_iterations
    FROM dataset.table_1
    UNION ALL
    SELECT start_position,
           end_position,
           input_string,
           dataset.replace_substr( new_string,
                    start_position[offset(step)], 
                    end_position[offset(step)],'-'),
                    step+1,
                    max_iterations-1
    FROM replaces
    WHERE step<=max_iterations
)
SELECT new_string FROM replaces
WHERE max_iterations=1

Here is how this works:

  • When we wish to make multiple replacements it may get tedious to write multiple individual replacement commands. Fortunately, we can carry out multiple replacements by using recursive CTEs.
  • To learn more about recursive CTEs see BigQuery documentation Recursive CTEs.

Extension: Different Replacements

We wish to locate multiple substrings in a target string and replace each with a different replacement string.

In this example, for each string in the column col_1, we wish to replace three parts, situated between given start and end indices, with the digit characters ‘1’, ‘2’, and ‘3’ respectively.

WITH RECURSIVE replaces AS (
    SELECT [1, 4, 7]  start_position ,
           [2, 5, 9]  end_position ,
           ['1', '2', '3'] replacement,
            col_2 input_string,
            col_2 as new_string,
            0 as step,
            (SELECT array_length([2, 5, 9])) + 1 as max_iterations
    FROM dataset.table_1
    UNION ALL
    SELECT start_position,
           end_position,
           replacement,
           input_string,
           dataset.replace_substr( new_string,
                    start_position[offset(step)], 
                    end_position[offset(step)],
                    replacement[offset(step)]),
                    step+1,
                    max_iterations-1
    FROM replaces
    WHERE step<=max_iterations
)
SELECT new_string FROM replaces
WHERE max_iterations=1 ;

Here is how this works:

This works similarly to solution above except that we use an array of replacements of the same size as the number of substrings being replaced, which in this case is ['1', '2', '3'].

Location Columns

We have two columns of location indices of the same size as the column of input strings. We use the corresponding elements in the location vectors to find the sub-string to be replaced in the input vector.

In this example, we wish to create a new column, col_4, by modifying the values of an existing column, col_1, by replacing a substring between specific locations, defined by the corresponding values in col_2 and col_3, with a sequence of hyphens of the same length as the substring being replaced.

SELECT *,
       dataset.replace_substr(col_1, --input column
                              col_2, -- start position
                              col_3, --end position
                              REPEAT('-', col_3 - col_2) --replacement 
                              ) AS col_4
FROM table_1;

Here is how this works:

  • We use the function dataset.replace_substr() to replace a substring between given start and end characters in a parent string with a given string. See Substring above.
  • We can pass any column as an argument to dataset.replace_substr().
  • We use REPEAT() to repeat the character '-' for a number of times, equal to the size of the string we're replacing, which is calculated by subtracting the values of the start and end locations, as provided by the columns col_2 and col_3 respectively. See Repeating.

Replacement Column

We have a column of replacement strings of the same size as the column of input strings. We use the corresponding element in the replacement column to replace the located sub-string in the input column.

One Replacement

In this example, for each row of the table table_1, we wish to replace the first two characters of the value of the column col_1 with the value of the column col_2.

SELECT col_1,
       dataset.replace_substr(col_1, 1, 2, col_2)
FROM table_1;

Here is how this works:

  • We use the function dataset.replace_substr to replace a substring given its start and end locations. See One Substring above.
  • We can pass any column as an argument to dataset.replace_substr(). We pass the column col_2 as the replacement column.
  • Note: dataset.replace_substr can also take location columns as arguments as covered in Location Columns.

Custom Replacement

We wish to apply custom logic to determine the replacement string often based on the matched substring.

In this example, we wish to replace the first two characters of each value of the column col_1 holding a country abbreviation (e.g. US) with the name of the corresponding country.

SELECT col_1,
       dataset.replace_substr(col_1, 1, 2,
                              CASE
                                  WHEN SUBSTR(col_1, 1, 2) = 'US' THEN 'USA'
                                  WHEN SUBSTR(col_1, 1, 2) = 'DE' THEN 'Germany'
                                  WHEN SUBSTR(col_1, 1, 2) = 'AE' THEN 'UAE'
                                  WHEN SUBSTR(col_1, 1, 2) = 'FR' THEN 'France'
                                  ELSE SUBSTR(col_1, 1, 2) END
           )
FROM table_1;

Here is how this works:

  • We use the function dataset.replace_substr() to replace a substring given its start and end locations. See One Substring above.
  • In SUBSTR(col_1, 1, 2), we use SUBSTR() to extract the substring between character indices 1 and 2. See Extracting by Location.
  • We use a CASE expression to select the replacement based on the output of SUBSTR() function.
SQL
I/O