Replacing

We wish to find the parts of a given string that match a given pattern and replace those with a given replacement string.

This section is roughly organized into two parts as follows:

  • Finding Scenarios:
    • Substring: We wish to replace any occurrence of a substring i.e. a plain sequence of one or more characters.
    • Regular Expression: We wish to replace any occurrence of a regular expression match.
    • Pattern Column: We have a column of pattern strings (often a column of a table) of the same size as the column of input strings (often another column of a table) and we use the corresponding element in the pattern column to find the sub-string to be replaced in the input column.
    • First Match: We wish to replace only the first occurrence of a given pattern in a given string.
    • Ignore Case: We wish to replace parts of a given string that match a given pattern regardless of the case; i.e. we wish to ignore case while matching.
  • Replacement Scenarios:
    • Replacement Column: We have a column of replacement strings and a column of input strings . We use the corresponding element in the replacement column to replace the matched sub-string in the input column.
    • Multiple Replacements: We wish to replace multiple patterns each with a particular (often different) string.
    • Custom Replacement: We apply custom logic to determine the replacement string often based on the matched substring.
    • Capture Group Replacement: We wish to replace a pattern with a replacement that includes parts (represented as capture groups) of the captured pattern.

Substring

We wish to replace any occurrence of a substring i.e. a plain sequence of one or more characters.

In this example, we replace every occurrence of a hyphen '-' in each element of the column col_1 with an underscore '_'.

SELECT *,
       REPLACE(col_1, '-', '_') col_2
FROM table_1;

Here is how this works:

  • We use the function REPLACE() to replace every occurrence of the substring '-' in each element of the column col_1 with the substring '_'.
  • In this case, the substring has just one character but in general, the substrings to use for matching or as replacements can be of any length.
  • The output table has the same number of rows and columns as the input table table_1 with an additional column col_2 where each element is the corresponding element of the column col_1 but with each occurrence of ‘-’ replaced with ‘_’.

Regular Expression

We wish to replace any occurrence of a regular expression match.

In this example, we wish to replace any sequence of two or more underscores with a single underscore.

SELECT *,
       REGEXP_REPLACE(col_1, '_{2,}', '_') col_2
FROM table_1;

Here is how this works:

  • We use the function REGEXP_REPLACE() to replace any sequence of two or more underscores with a single underscore in each element of the column col_1.
  • The function REGEXP_REPLACE() expects a regular expression by default.
  • The regular expression we use to capture the target pattern, in this case, is '_{2,}' where:
    • _ is the underscore character whose duplicate occurrence we wish to detect
    • {2,} specifies that we are looking for patterns where the character is repeated 2 or more times
  • The output table has the same number of rows and columns as the input table table_1 with an additional column col_2 where each element is the corresponding element of the column col_1 but with each sequence of two or more underscores replaced with a single underscore.

Pattern Column

We have a column of pattern strings of and a column of input strings. We use the corresponding element in the pattern column to find the sub-string to be replaced in the input column.

Substring

In this example, we wish to replace all occurrences of the value of the column col_2 in the corresponding value of the column col_1 with a fixed string '-'.

SELECT *,
       REPLACE(col_1, col_2, '-') col_2
FROM table_1;

Here is how this works:

  • We use REPLACE() to replace each occurrence of a substring in a parent string. See Substring above.
  • The function REPLACE() is columnized over both the input string, which here is col_1, and the pattern, which here is col_2.

Regular Expression

In this example, we wish to replace all occurrences of a regular expression representing two or more repetitions of the value of the column col_2 (but not one) in the corresponding value of the column col_1 with a fixed string '-'.

SELECT *,
       REGEXP_REPLACE(col_1, col_2 || '{2,}', '-') col_3
FROM table_1;

Here is how this works:

  • We use REGEXP_REPLACE() to replace each occurrence of a regular expression match in a parent string. See Regular Expression above.
  • The function REGEXP_REPLACE() is columnized over both the input string, which here is col_1, and the pattern, which here is col_2.
  • In col_2 || '{2,}', we construct the desired regular expression by appending '{2,}' to the value of the column col_2. For instance, if the value of the column col_2 for a particular row is ‘A' the generated regular expression is 'A{2,}' that captures a sequence where the character ‘A’ is repeated 2 or more times.

First Match

We wish to replace only the first occurrence of a given pattern in a given string.

In this example, we wish to replace the first occurrence of ‘+‘ with ‘=’.

WITH locate AS
         (SELECT col_1,
                 INSTR(col_1, '+')               start_position,
                 INSTR(col_1, '+') + LENGTH('+') end_position
          FROM table_1)

SELECT col_1,
       CONCAT(SUBSTR(col_1, 1, start_position - 1), '=',
              SUBSTR(col_1, end_position)) AS col_2
FROM locate;

Here is how this works:

  • We use the function INSTR() to get the start and end positions of the first occurrence. See Locating - First Match for more details.
  • We use the function SUBSTR() and CONCAT() to replace the substring by concatenating all characters before the start position (SUBSTR(col_1, 1, start_position - 1)) with the replacement string = and all the characters after the end position (SUBSTR(col_1, end_position)).

Ignore Case

We wish to replace parts of a given string that match a given pattern regardless of the case; i.e. we wish to ignore case while matching.

Substring

In this example, we wish to replace any occurrence of the substring ‘old’ with ‘new’ regardless of the case the characters of ‘old’ may be in.

SELECT *,
       REPLACE(LOWER(col_1), 'old', 'new') col_2
FROM table_1;

Here is how this works:

  • We use REPLACE() to replace each occurrence of the match in the parent string. See Substring above.
  • To ignore case while matching, we wrap the input column in LOWER().

Replacement Column

We have a column of replacement strings and a column of input strings. We use the corresponding element in the replacement column to replace the matched sub-string in the input column.

In this example, we wish to replace the first occurrence of the value of col_2 in the corresponding value of col_1 with the corresponding value of col_3.

SELECT *,
       REPLACE(col_1, col_2, col_3) col_4
FROM table_1;

Here is how this works:

  • We use REPLACE() to replace each occurrence of a substring in a parent string. See Substring above.
  • The function REPLACE() is columnized over all three arguments:
    • The input string, which here are the values of the column col_1
    • The pattern, which here are the values of the column col_2
    • The replacement, which here are the values of the column col_3
  • The output of REPLACE() in this case will be a column of the same length as each of the input columns and where each element is the value of col_1 where any occurrence of the corresponding value of col_2 is replaced with the corresponding value of the column col_3.

Multiple Replacements

We wish to replace multiple patterns each with a particular (often different) string.

Single Character

In this example, we wish to replace each number with its arabic equivalent.

SELECT *,
       TRANSLATE(col_1, '123', '١٢٣')
FROM table_1;

Here is how this works:

  • When we wish to make multiple replacements it may get tedious to write multiple individual replacement commands. Fortunately, we can use TRANSLATE(source_characters,target_characters) to do multiple replacements for single characters.
  • TRANSLATE() replaces each character in source_characters with the corresponding character in target_characters

Substring

In this example, we wish to replace each occurrence of a currency symbol with the corresponding currency abbreviation e.g. ‘$’ becomes ‘USD’.

WITH RECURSIVE replaces AS (
    SELECT ['$','£','€']  find_string ,
           ['USD','GBP','EUR']  replace_string ,
            col_2 input_string,
            col_2 as new_string,
            0 as step,
            (SELECT array_length(['$','£','€'])) + 1 as max_iterations
    FROM dataset.table_1
    UNION ALL
    SELECT find_string,
           replace_string,
           input_string,
           REPLACE( new_string,
                    find_string[offset(step)], 
                    replace_string[offset(step)]),
                    step+1,
                    max_iterations-1
    FROM replaces
    WHERE step<=max_iterations
)
SELECT new_string FROM replaces
WHERE max_iterations=1

Here is how this works:

  • When we wish to make multiple replacements it may get tedious to write multiple individual replacement commands. Fortunately, we can carry out multiple replacements by using recursive CTEs.
  • To learn more about recursive CTEs see BigQuery documentation Recursive CTEs.

Extension: Replace Multiple Pattern Matches with the Same Value

In this example, we wish to replace any occurrence of ‘kilogram’ or ‘kilograms’ or ‘kgs’ with ‘kg’ regardless of case.

SELECT *,
       REGEXP_REPLACE(LOWER(col_1), '(kilogram(s{0,1})|kgs)', 'kg') col_2
FROM table_1;

Here is how this works:

  • We use the function REGEXP_REPLACE() to replace multiple subtrings with the same value.
  • We use regular expression or operator ('|') to replace kilogram,kilograms, or kgs with kg.

Custom Replacement

We wish to apply custom logic to determine the replacement string often based on the matched substring.

In this example, we wish to replace numbers (sequences of digits) in each value of the string column col_1, by other numbers based on the value of the original numbers. In particular, we wish to add one to any number that is greater than or equal 10 and subtract 1 from any number that is smaller than 10.

SELECT *,
       REGEXP_REPLACE(col_1, r'(\d+)',
                      CAST(CASE
                               WHEN CAST(REGEXP_EXTRACT(col_1, r'\d+') AS INT64) >= 10
                                   THEN CAST(REGEXP_EXTRACT(col_1, r'\d+') AS INT64) + 1
                               ELSE CAST(REGEXP_EXTRACT(col_1, r'\d+') AS INT64) - 1
                          END AS STRING)
           ) AS col_2
FROM table_1;

Here is how this works:

  • Instead of passing a replacement string, we use a CASE expression to manipulate the replacement string based on the matched value.

Capture Group Replacement

We wish to replace a pattern with a replacement that includes parts (represented as capture groups) of the captured pattern.

In this example, we wish to replace any occurrence of the substring ‘kilogram’ or ‘kilograms’ when they shows up after a number with ‘kg’ regardless of case.

SELECT *,
        REGEXP_REPLACE(LOWER(col_1), '(\\d+)\\s*(kilogram(s{0,1}))', '\\1 kg') col_2
 FROM table_1;

Here is how this works:

  • In order to construct a replacement string out of parts of the matched string, we can use capture groups. Capture groups allow us to isolate a portion of the match and refer to it later.
  • In this case the regular expression we are using is (\\d+)\\s*(kilogram(s{0,1})), where:
    • (\\d+) is a capture group that holds the numerical part of the pattern
    • \\s* denotes zero or more empty space characters
    • kilogram(s{0,1}) matches kilogram or kilograms.
  • While constructing the replacement, we can refer to any capture group by its number via \\n which in this case is \\1 to refer to the first capture group holding the numeric part.
  • The replacement expression '\\1 kg' takes the first capture group from the matched string (which is the sequence of digits) and appends to it the unit ‘kg’ with an empty space character in between.
SQL
I/O