We wish to find the parts of a given string that match a given pattern and replace those with a given replacement string.
This section is roughly organized into two parts as follows:
We wish to replace any occurrence of a substring i.e. a plain sequence of one or more characters.
In this example, we replace every occurrence of a hyphen '-'
in each element of the column col_1
with an underscore '_'
.
SELECT *,
REPLACE(col_1, '-', '_') col_2
FROM table_1;
Here is how this works:
REPLACE()
to replace every occurrence of the substring '-'
in each element
of the column col_1
with the substring '_'
.table_1
with an
additional column col_2
where each element is the corresponding element of the column col_1
but with each occurrence of ‘-’
replaced with ‘_’
.We wish to replace any occurrence of a regular expression match.
In this example, we wish to replace any sequence of two or more underscores with a single underscore.
SELECT *,
REGEXP_REPLACE(col_1, '_{2,}', '_') col_2
FROM table_1;
Here is how this works:
REGEXP_REPLACE()
to replace any sequence of two or more underscores with a
single underscore in each element of the column col_1
.REGEXP_REPLACE()
expects a regular expression by default.'_{2,}'
where:_
is the underscore character whose duplicate occurrence we wish to detect{2,}
specifies that we are looking for patterns where the character is repeated 2 or more
timestable_1
with an
additional column col_2
where each element is the corresponding element of the column col_1
but with each sequence of two or more underscores replaced with a single underscore.We have a column of pattern strings of and a column of input strings. We use the corresponding element in the pattern column to find the sub-string to be replaced in the input column.
Substring
In this example, we wish to replace all occurrences of the value of the column col_2
in the
corresponding value of the column col_1
with a fixed string '-'
.
SELECT *,
REPLACE(col_1, col_2, '-') col_2
FROM table_1;
Here is how this works:
REPLACE()
to replace each occurrence of a substring in a parent string.
See Substring above.REPLACE()
is columnized over both the input string, which here is col_1
,
and the pattern, which here is col_2
.Regular Expression
In this example, we wish to replace all occurrences of a regular expression representing two or more
repetitions of the value of the column col_2
(but not one) in the corresponding value of the
column col_1
with a fixed string '-'
.
SELECT *,
REGEXP_REPLACE(col_1, col_2 || '{2,}', '-') col_3
FROM table_1;
Here is how this works:
REGEXP_REPLACE()
to replace each occurrence of a regular expression match in a parent
string. See Regular Expression above.REGEXP_REPLACE()
is columnized over both the input string, which here is col_1
,
and the pattern, which here is col_2
.col_2 || '{2,}'
, we construct the desired regular expression by appending '{2,}'
to
the value of the column col_2
. For instance, if the value of the column col_2
for a particular
row is ‘A'
the generated regular expression is 'A{2,}'
that captures a sequence where the
character ‘A’
is repeated 2 or more times.We wish to replace only the first occurrence of a given pattern in a given string.
In this example, we wish to replace the first occurrence of ‘+‘
with ‘=’
.
WITH locate AS
(SELECT col_1,
INSTR(col_1, '+') start_position,
INSTR(col_1, '+') + LENGTH('+') end_position
FROM table_1)
SELECT col_1,
CONCAT(SUBSTR(col_1, 1, start_position - 1), '=',
SUBSTR(col_1, end_position)) AS col_2
FROM locate;
Here is how this works:
INSTR()
to get the start and end positions of the first occurrence.
See Locating - First Match
for more details.SUBSTR()
and CONCAT()
to replace the substring by concatenating all
characters before the start position (SUBSTR(col_1, 1, start_position - 1)
) with the replacement
string =
and all the characters after the end position (SUBSTR(col_1, end_position)
).We wish to replace parts of a given string that match a given pattern regardless of the case; i.e. we wish to ignore case while matching.
Substring
In this example, we wish to replace any occurrence of the substring ‘old’
with ‘new’
regardless
of the case the characters of ‘old’
may be in.
SELECT *,
REPLACE(LOWER(col_1), 'old', 'new') col_2
FROM table_1;
Here is how this works:
REPLACE()
to replace each occurrence of the match in the parent string.
See Substring above.LOWER()
.We have a column of replacement strings and a column of input strings. We use the corresponding element in the replacement column to replace the matched sub-string in the input column.
In this example, we wish to replace the first occurrence of the value of col_2
in the
corresponding value of col_1
with the corresponding value of col_3
.
SELECT *,
REPLACE(col_1, col_2, col_3) col_4
FROM table_1;
Here is how this works:
REPLACE()
to replace each occurrence of a substring in a parent string.
See Substring above.REPLACE()
is columnized over all three arguments:col_1
col_2
col_3
REPLACE()
in this case will be a column of the same length as each of the
input columns and where each element is the value of col_1
where any occurrence of the
corresponding value of col_2
is replaced with the corresponding value of the column col_3
.We wish to replace multiple patterns each with a particular (often different) string.
Single Character
In this example, we wish to replace each number with its arabic equivalent.
SELECT *,
TRANSLATE(col_1, '123', '١٢٣')
FROM table_1;
Here is how this works:
TRANSLATE(source_characters,target_characters)
to
do multiple replacements for single characters.TRANSLATE()
replaces each character in source_characters with the corresponding character in
target_charactersSubstring
In this example, we wish to replace each occurrence of a currency symbol with the corresponding
currency abbreviation e.g. ‘$’
becomes ‘USD’
.
WITH RECURSIVE replaces AS (
SELECT ['$','£','€'] find_string ,
['USD','GBP','EUR'] replace_string ,
col_2 input_string,
col_2 as new_string,
0 as step,
(SELECT array_length(['$','£','€'])) + 1 as max_iterations
FROM dataset.table_1
UNION ALL
SELECT find_string,
replace_string,
input_string,
REPLACE( new_string,
find_string[offset(step)],
replace_string[offset(step)]),
step+1,
max_iterations-1
FROM replaces
WHERE step<=max_iterations
)
SELECT new_string FROM replaces
WHERE max_iterations=1
Here is how this works:
Extension: Replace Multiple Pattern Matches with the Same Value
In this example, we wish to replace any occurrence of ‘kilogram’
or ‘kilograms’
or ‘kgs’
with ‘kg’
regardless of case.
SELECT *,
REGEXP_REPLACE(LOWER(col_1), '(kilogram(s{0,1})|kgs)', 'kg') col_2
FROM table_1;
Here is how this works:
REGEXP_REPLACE()
to replace multiple subtrings with the same value.'|'
) to replace kilogram,kilograms, or kgs
with kg
.We wish to apply custom logic to determine the replacement string often based on the matched substring.
In this example, we wish to replace numbers (sequences of digits) in each value of the string
column col_1
, by other numbers based on the value of the original numbers. In particular, we wish
to add one to any number that is greater than or equal 10 and subtract 1 from any number that is
smaller than 10.
SELECT *,
REGEXP_REPLACE(col_1, r'(\d+)',
CAST(CASE
WHEN CAST(REGEXP_EXTRACT(col_1, r'\d+') AS INT64) >= 10
THEN CAST(REGEXP_EXTRACT(col_1, r'\d+') AS INT64) + 1
ELSE CAST(REGEXP_EXTRACT(col_1, r'\d+') AS INT64) - 1
END AS STRING)
) AS col_2
FROM table_1;
Here is how this works:
We wish to replace a pattern with a replacement that includes parts (represented as capture groups) of the captured pattern.
In this example, we wish to replace any occurrence of the substring ‘kilogram’
or ‘kilograms’
when they shows up after a number with ‘kg’
regardless of case.
SELECT *,
REGEXP_REPLACE(LOWER(col_1), '(\\d+)\\s*(kilogram(s{0,1}))', '\\1 kg') col_2
FROM table_1;
Here is how this works:
(\\d+)\\s*(kilogram(s{0,1}))
, where:(\\d+)
is a capture group that holds the numerical part of the pattern\\s*
denotes zero or more empty space characterskilogram(s{0,1})
matches kilogram or kilograms.\\n
which in this case is \\1
to refer to the first capture group holding the numeric part.'\\1 kg'
takes the first capture group from the matched string (which
is the sequence of digits) and appends to it the unit ‘kg’
with an empty space character in
between.