We wish to locate one or more parts of a string given their start and end indices and replace those with one or more given replacement string(s).
This section is roughly organized into two parts as follows:
We wish to replace a part of a given string given its start and end location indices.
In this example, we wish to replace the part of a string between the 2nd and 4th characters (
inclusive i.e. 3 characters) from each value of the column col_1
with three hyphens '---'
.
SELECT col_1,
CONCAT(SUBSTR(col_1, 1, 2 - 1),
'---',
SUBSTR(col_1, 4)) AS col_2
FROM table_1;
Here is how this works:
SUBSTR()
:SUBSTR(col_1, 1, 2 - 1)
: all characters before the start position.SUBSTR(col_1, 4)
: all characters after the end position.CONCAT()
to combine the two substrings with the replacement string.table_1
with a new
column col_2
with the new string.Alternative: Creating A Function
CREATE OR REPLACE FUNCTION
dataset.replace_substr(value STRING,
start_position INT64,
end_position INT64,
replacement STRING)
RETURNS STRING AS ((SELECT CASE
WHEN end_position <= LENGTH(value) AND end_position > 0 THEN
CONCAT(SUBSTR(value, 1, start_position - 1), replacement,
SUBSTR(value, end_position))
ELSE value END));
We can then use this function and pass any column to it as follows:
SELECT col_1,
dataset.replace_substr(col_1, 2, 4, '---') AS col_2
FROM table_1;
Here is how this works:
dataset.replace_substr()
takes 4 arguments:start_position
) and end (argument name end_position
) locations of
the part we wish to replace, which in this case are 2 and 4.'---'
.table_1
with a new
column col_2
with the new string.We wish to replace multiple substrings given their start and end location indices.
In this example, we wish to replace three substrings given their start and end locations from each
value of the column col_1
with the hyphen character ‘-’
.
WITH RECURSIVE replaces AS (
SELECT [1, 4, 7] start_position ,
[2, 5, 9] end_position ,
col_1 input_string,
col_1 as new_string,
0 as step,
(SELECT array_length([2, 5, 9])) + 1 as max_iterations
FROM dataset.table_1
UNION ALL
SELECT start_position,
end_position,
input_string,
dataset.replace_substr( new_string,
start_position[offset(step)],
end_position[offset(step)],'-'),
step+1,
max_iterations-1
FROM replaces
WHERE step<=max_iterations
)
SELECT new_string FROM replaces
WHERE max_iterations=1
Here is how this works:
Extension: Different Replacements
We wish to locate multiple substrings in a target string and replace each with a different replacement string.
In this example, for each string in the column col_1
, we wish to replace three parts, situated
between given start and end indices, with the digit characters ‘1’
, ‘2’
, and ‘3’
respectively.
WITH RECURSIVE replaces AS (
SELECT [1, 4, 7] start_position ,
[2, 5, 9] end_position ,
['1', '2', '3'] replacement,
col_2 input_string,
col_2 as new_string,
0 as step,
(SELECT array_length([2, 5, 9])) + 1 as max_iterations
FROM dataset.table_1
UNION ALL
SELECT start_position,
end_position,
replacement,
input_string,
dataset.replace_substr( new_string,
start_position[offset(step)],
end_position[offset(step)],
replacement[offset(step)]),
step+1,
max_iterations-1
FROM replaces
WHERE step<=max_iterations
)
SELECT new_string FROM replaces
WHERE max_iterations=1 ;
Here is how this works:
This works similarly to solution above except that we use an array of replacements of the same size
as the number of substrings being replaced, which in this case is ['1', '2', '3']
.
We have two columns of location indices of the same size as the column of input strings. We use the corresponding elements in the location vectors to find the sub-string to be replaced in the input vector.
In this example, we wish to create a new column, col_4
, by modifying the values of an existing
column, col_1
, by replacing a substring between specific locations, defined by the corresponding
values in col_2
and col_3
, with a sequence of hyphens of the same length as the substring being
replaced.
SELECT *,
dataset.replace_substr(col_1, --input column
col_2, -- start position
col_3, --end position
REPEAT('-', col_3 - col_2) --replacement
) AS col_4
FROM table_1;
Here is how this works:
dataset.replace_substr()
to replace a substring between given start and end
characters in a parent string with a given string. See Substring above.dataset.replace_substr()
.REPEAT()
to repeat the character '-'
for a number of times, equal to the size of the
string we're replacing, which is calculated by subtracting the values of the start and end
locations, as provided by the columns col_2
and col_3
respectively.
See Repeating.We have a column of replacement strings of the same size as the column of input strings. We use the corresponding element in the replacement column to replace the located sub-string in the input column.
One Replacement
In this example, for each row of the table table_1
, we wish to replace the first two characters of
the value of the column col_1
with the value of the column col_2
.
SELECT col_1,
dataset.replace_substr(col_1, 1, 2, col_2)
FROM table_1;
Here is how this works:
dataset.replace_substr
to replace a substring given its start and end
locations. See One Substring above.dataset.replace_substr()
. We pass the column col_2
as
the replacement column.dataset.replace_substr
can also take location columns as arguments as covered
in Location Columns.We wish to apply custom logic to determine the replacement string often based on the matched substring.
In this example, we wish to replace the first two characters of each value of the column col_1
holding a country abbreviation (e.g. US) with the name of the corresponding country.
SELECT col_1,
dataset.replace_substr(col_1, 1, 2,
CASE
WHEN SUBSTR(col_1, 1, 2) = 'US' THEN 'USA'
WHEN SUBSTR(col_1, 1, 2) = 'DE' THEN 'Germany'
WHEN SUBSTR(col_1, 1, 2) = 'AE' THEN 'UAE'
WHEN SUBSTR(col_1, 1, 2) = 'FR' THEN 'France'
ELSE SUBSTR(col_1, 1, 2) END
)
FROM table_1;
Here is how this works:
dataset.replace_substr()
to replace a substring given its start and end
locations. See One Substring above.SUBSTR(col_1, 1, 2)
, we use SUBSTR()
to extract the substring between
character indices 1 and 2.
See Extracting by Location.CASE
expression to select the replacement based on the output of SUBSTR()
function.