We wish to locate one or more parts of a string given their start and end indices and replace those with one or more given replacement string(s).
This section is roughly organized into two parts as follows:
We wish to replace a part of a given string given its start and end location indices.
In this example, we wish to replace the part of a string between the 2nd and 4th characters (inclusive, i.e. 3 characters) from each value of the column col_1 with three hyphens '---'.
library(dplyr)
library(stringi)

df_2 = df %>%
  mutate(
    col_2 = stri_sub_replace(col_1, 2, 4, replacement = '---'))
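To make the result concrete, here is a minimal sketch that applies the same call to a short character vector instead of a data frame column. The vector x and its values are made up for illustration and are not part of the original example.

```r
library(stringi)

# Hypothetical input values (an assumption, not from the original data)
x <- c("abcdefgh", "12345678")

# Replace characters 2 through 4 (inclusive) with three hyphens
res <- stri_sub_replace(x, 2, 4, replacement = "---")
print(res)
# [1] "a---efgh" "1---5678"
```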
Here is how this works:
We use stri_sub_replace() from the stringi package for replacement by location. It has a neat interface that fits well within data manipulation chains. See the Alternative below for a solution that uses str_sub() from the stringr package.
stri_sub_replace() takes the following as input: the input string vector, which here is the column col_1; the start (argument name from) and end (argument name to) locations of the part we wish to replace, which in this case are 2 and 4; and the replacement string, which here is '---'.
The output of stri_sub_replace() is a vector of the same size as the first input, which here is the column col_1, with the replacements made.
The output data frame df_2 is a copy of the input data frame df with a new column col_2 that is the output of the function stri_sub_replace().
Alternative: Using str_sub()
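library(dplyr)
library(stringr)

# Wrap the assignment form of str_sub() so it can be called inside mutate()
str_locate_replace <- function(p_col, p_start, p_end, p_value) {
  str_sub(p_col, p_start, p_end) <- p_value
  return(p_col)
}

df_2 = df %>%
  mutate(col_2 = str_locate_replace(col_1, 2, 4, '---'))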
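The wrapper above relies on the assignment form of str_sub(), which overwrites part of each element of a vector. A minimal sketch of that mechanism on its own, using a made-up vector x (an assumption for illustration):

```r
library(stringr)

# Hypothetical input values (an assumption, not from the original data)
x <- c("abcdefgh", "12345678")

# Assigning to str_sub() overwrites characters 2 through 4 of every element
str_sub(x, 2, 4) <- "---"
print(x)
# [1] "a---efgh" "1---5678"
```

Because the assignment modifies its target, it cannot be written directly inside mutate(), which is presumably why the original wraps it in a helper function that returns the modified vector.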
Here is how this works:
We use the assignment form of str_sub() (from the stringr package) to replace parts of a string.
We pass to str_sub() the input string vector, which here is col_1, along with the start and end locations, and we assign to the result of str_sub() the replacement string, which in this case is '---'.
We wrap this in a custom function str_locate_replace(), inside which we modify the passed vector p_col via assigning to str_sub() and then we return the modified vector.
We wish to replace multiple substrings given their start and end location indices.
In this example, we wish to replace three substrings given their start and end locations from each value of the column col_1 with the hyphen character ‘-’.
library(dplyr)
library(stringi)

df_2 = df %>%
  mutate(
    col_2 = stri_sub_replace_all(
      col_1,
      from = c(1, 4, 7),
      to = c(2, 5, 9),
      replacement = "-")
  )
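As a quick check of what the call above produces, the sketch below applies it to two made-up 10-character strings (the vector x is an assumption for illustration). Each of the three ranges collapses to a single hyphen, so the result is shorter than the input.

```r
library(stringi)

# Hypothetical 10-character inputs (an assumption, not from the original data)
x <- c("abcdefghij", "klmnopqrst")

# Replace characters 1-2, 4-5 and 7-9 of every string with one hyphen each
res <- stri_sub_replace_all(
  x,
  from = c(1, 4, 7),
  to = c(2, 5, 9),
  replacement = "-")
print(res)
# [1] "-c-f-j" "-m-p-t"
```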
Here is how this works:
We use stri_sub_replace_all() from the stringi package for replacing multiple parts of a string given their locations.
stri_sub_replace_all() takes the following as input: the input string vector, which here is the column col_1; the start (argument name from) locations of the parts we wish to replace as a vector, which in this case are c(1, 4, 7); the end (argument name to) locations of the parts we wish to replace as a vector, which in this case are c(2, 5, 9); and the replacement string, which here is '-'.
The output of stri_sub_replace_all() is a vector of the same size as the first input, which here is the column col_1, with the replacements made.
The output data frame df_2 is a copy of the input data frame df with a new column col_2 that is the output of the function stri_sub_replace_all().
Extension: Different Replacements
We wish to locate multiple substrings in a target string and replace each with a different replacement string.
In this example, for each string in the column col_1, we wish to replace three parts, situated between given start and end indices, with the digit characters ‘1’, ‘2’, and ‘3’ respectively.
library(dplyr)
library(stringi)

df_2 = df %>%
  mutate(
    col_2 = stri_sub_replace_all(
      col_1,
      from = c(1, 4, 7),
      to = c(2, 5, 9),
      replacement = c('1', '2', '3'))
  )
Here is how this works:
This works as described in the solution above except that we pass to the replacement argument of stri_sub_replace_all() a vector of replacements of the same size as the number of substrings being replaced, which in this case is c('1', '2', '3').
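The pairing of ranges and replacements is positional: the first range receives the first replacement, and so on. A minimal sketch on two made-up 10-character strings (the vector x is an assumption for illustration):

```r
library(stringi)

# Hypothetical 10-character inputs (an assumption, not from the original data)
x <- c("abcdefghij", "klmnopqrst")

# Range 1-2 gets '1', range 4-5 gets '2', range 7-9 gets '3'
res <- stri_sub_replace_all(
  x,
  from = c(1, 4, 7),
  to = c(2, 5, 9),
  replacement = c("1", "2", "3"))
print(res)
# [1] "1c2f3j" "1m2p3t"
```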
We have two vectors of location indices (often columns of a data frame) of the same size as the vector of input strings (often another column of a data frame) and we use the corresponding elements in the location vectors to find the sub-string to be replaced in the input vector.
In this example, we wish to create a new column, col_4, by modifying the values of an existing column, col_1, by replacing a substring between specific locations, defined by the corresponding values in col_2 and col_3, with a sequence of hyphens of the same length as the substring being replaced.
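library(dplyr)
library(stringi)
library(stringr)

df_2 = df %>%
  mutate(
    col_4 = stri_sub_replace(
      col_1,
      from = col_2,
      to = col_3,
      # Both ends are inclusive, so the substring has col_3 - col_2 + 1 characters
      replacement = str_dup('-', col_3 - col_2 + 1))
  )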
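The length arithmetic is easy to get wrong by one, so here is a minimal sketch on plain vectors. The names x, start_loc and end_loc are hypothetical stand-ins for col_1, col_2 and col_3, and stri_dup() is the stringi function that str_dup() wraps.

```r
library(stringi)

# Hypothetical strings with per-element start and end locations
x         <- c("abcdefgh", "12345678")
start_loc <- c(2, 4)
end_loc   <- c(4, 7)

# Both ends are inclusive, so each substring has end - start + 1 characters
hyphens <- stri_dup("-", end_loc - start_loc + 1)

res <- stri_sub_replace(x, from = start_loc, to = end_loc, replacement = hyphens)
print(res)
# [1] "a---efgh" "123----8"
```

Each output has the same number of characters as its input, which is a simple way to confirm that the hyphen run matches the replaced substring.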
Here is how this works:
We use stri_sub_replace() to replace a substring between given start and end characters in a parent string with a given string. See Substring above.
stri_sub_replace() is vectorized over both the input string, which here is col_1, and the start and end locations. Therefore, we can pass to the arguments from and to the columns col_2 and col_3 respectively.
We use str_dup() to repeat the character '-' a number of times equal to the size of the string we're replacing. Because both the start and end locations are inclusive, that size is the end location minus the start location plus one, i.e. col_3 - col_2 + 1. See Repeating.
We obtain the locations of the substring(s) to be replaced in a dynamic manner. Typically, the locations are returned by another function as a matrix of two columns for start and end locations.
One Replacement
In this example, we wish to replace the first sequence of two or more occurrences of the character ‘X’ in each value of the column col_1 with a hyphen ‘-’.
library(dplyr)
library(stringi)
library(stringr)

df_2 = df %>%
  mutate(
    col_2 = stri_sub_replace(
      col_1,
      str_locate(col_1, 'X{2,}'),
      replacement = '-') %>%
      coalesce(col_1)
  )
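The coalesce() step in the code above exists because of how missing locations propagate. The sketch below shows that behaviour on a made-up vector x (an assumption for illustration). It also tries the omit_na argument that stringi documents for its replacement functions, offered here as a possible alternative to coalesce() rather than as the original author's method.

```r
library(stringi)
library(stringr)

# Hypothetical inputs: the second string contains no run of two or more 'X'
x <- c("aXXXb", "abc", "XXaXXX")

# Two-column matrix of start and end locations; NA where nothing matches
loc <- str_locate(x, "X{2,}")

# Without any fallback, a missing location yields a missing result
res_na <- stri_sub_replace(x, loc, replacement = "-")
print(res_na)
# [1] "a-b"   NA      "-aXXX"

# With omit_na = TRUE, strings with a missing location are left unchanged
res_keep <- stri_sub_replace(x, loc, omit_na = TRUE, replacement = "-")
print(res_keep)
# [1] "a-b"   "abc"   "-aXXX"
```

Note that only the first run in the third string is replaced, since str_locate() reports the first match only.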
Here is how this works:
Using str_locate(col_1, 'X{2,}'), we obtain the start and end locations of the first match of a sequence of two or more ‘X’ characters. The output is a matrix with two columns for the start and end locations. See Locating.
We use stri_sub_replace() to replace a substring given its start and end locations. See One Substring above.
stri_sub_replace() can accept a matrix containing the start and end locations directly to its from argument.
If there is no match, str_locate() returns NA. Consequently, stri_sub_replace() will also return NA. We use coalesce() (from dplyr) to return the original value of col_1 if the output of stri_sub_replace() is NA. See General Operations.
Multiple Replacements
In this example, we wish to replace any occurrence of a number (sequence of digits) followed by the string ‘gm’ with a hyphen ‘-’ for each element in the column col_1.
library(dplyr)
library(stringi)
library(stringr)

df_2 = df %>%
  mutate(
    col_2 = stri_sub_replace_all(
      col_1,
      str_locate_all(col_1, '\\d+gm'),
      replacement = "-")
  )
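To see the intermediate list of location matrices and the final result, here is a minimal sketch on a made-up vector x (an assumption for illustration). Both strings contain at least one match.

```r
library(stringi)
library(stringr)

# Hypothetical inputs (an assumption, not from the original data)
x <- c("200gm flour, 50gm sugar", "5gm salt")

# One two-column matrix per string, with one row per match
locs <- str_locate_all(x, "\\d+gm")
print(locs[[1]])
#      start end
# [1,]     1   5
# [2,]    14  17

# Every located range is replaced with a single hyphen
res <- stri_sub_replace_all(x, locs, replacement = "-")
print(res)
# [1] "- flour, - sugar" "- salt"
```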
Here is how this works:
Using str_locate_all(col_1, '\\d+gm'), we obtain the start and end locations of every match of a sequence of digits followed by ‘gm’. The output is a list of matrices, each of which has two columns for the start and end locations. See Locating.
We use stri_sub_replace_all() to replace multiple substrings given their start and end locations. See Multiple Substrings above.
stri_sub_replace_all() can accept a list of matrices containing the start and end locations directly to its from argument.
Extension: Replace Inverse
library(dplyr)
library(purrr)
library(stringi)
library(stringr)

df_2 = df %>%
  mutate(
    col_2 = stri_sub_replace_all(
      col_1,
      map(str_locate_all(col_1, '\\d+'), invert_match),
      replacement = "-")
  )
Here is how this works:
This works as described above, except that we invert the match locations before passing them to stri_sub_replace_all().
We use map() (from purrr) to iterate over the list of matrices returned by str_locate_all() and apply the function invert_match() (from stringr). See Locating Complement.
We have a vector of replacement strings (often a column of a data frame) of the same size as the vector of input strings (often another column of a data frame) and we use the corresponding element in the replacement vector to replace the located sub-string in the input vector.
One Replacement
In this example, for each row of the data frame df, we wish to replace the first two characters of the value of the column col_1 with the value of the column col_2.
library(dplyr)
library(stringi)

df_2 = df %>%
  mutate(
    col_3 = stri_sub_replace(
      col_1,
      from = 1,
      to = 2,
      replacement = col_2)
  )
Here is how this works:
We use stri_sub_replace() to replace a substring given its start and end locations. See One Substring above.
stri_sub_replace() is vectorized over the input string, which here is col_1, as well as the replacement string, to which we pass the column col_2.
stri_sub_replace() is also vectorized over the location columns as covered in Location Columns.
Multiple Replacements
In this example, for each row of the data frame df, we wish to replace the first two characters of the value of the column col_1 with the value of the column col_2 and the last four characters of the value of the column col_1 with the value of the column col_3. The locations used in the code (7 to 10 for the last four characters) assume that each value of col_1 is 10 characters long.
library(dplyr)
library(purrr)
library(stringi)

df_2 = df %>%
  mutate(
    col_r = stri_sub_replace_all(
      col_1,
      from = c(1, 7),
      to = c(2, 10),
      replacement = map2(col_2, col_3, c))
  )
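The shape of the replacement argument is the key detail here: a list with one character vector per input string. A minimal sketch on plain vectors follows. The names x, first_repl and last_repl are hypothetical stand-ins for col_1, col_2 and col_3, and base R lapply() stands in for map2() to keep the sketch dependent on stringi only.

```r
library(stringi)

# Hypothetical 10-character inputs and per-element replacements
x          <- c("ab3456wxyz", "cd7890stuv")
first_repl <- c("<1>", "<2>")
last_repl  <- c("[end1]", "[end2]")

# Build one replacement vector per input string: c(first_repl[i], last_repl[i])
repl <- lapply(seq_along(x), function(i) c(first_repl[i], last_repl[i]))

# Characters 1-2 get the first replacement, characters 7-10 get the second
res <- stri_sub_replace_all(x, from = c(1, 7), to = c(2, 10), replacement = repl)
print(res)
# [1] "<1>3456[end1]" "<2>7890[end2]"
```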
Here is how this works:
We use stri_sub_replace_all() to replace multiple substrings given their start and end locations. See Multiple Substrings above.
We can pass a vector of replacements to the replacement argument of the function stri_sub_replace_all() with as many elements as the number of replacements being made. See “Extension: Different Replacements” under Multiple Substrings above.
stri_sub_replace_all() is vectorized over the input string, which here is col_1, as well as the replacement, to which we pass a list of replacement vectors: the list has one element for each value of col_1, and each element is a vector holding the values of col_2 and col_3 for the corresponding row; i.e. c(col_2, col_3).
We use map2() (from the purrr package) to iterate over the values of col_2 and col_3 simultaneously and use the function c() to create a vector from each pair of values. See Working with Lists.
We wish to apply custom logic to determine the replacement string often based on the matched substring.
One Replacement
In this example, we wish to replace the first two characters of each value of the column col_1, which hold a country abbreviation (e.g. US), with the name of the corresponding country.
library(dplyr)
library(stringi)
library(stringr)

# Map a country abbreviation to its name (NA if the abbreviation is unknown)
un_abbrv <- function(x) {
  countries = c(US = "USA", DE = "Germany", AE = "UAE", FR = 'France')
  return(countries[x])
}

start_loc = 1
end_loc = 2

df_2 = df %>%
  mutate(
    col_3 = stri_sub_replace(
      col_1,
      from = start_loc,
      to = end_loc,
      replacement = un_abbrv(str_sub(col_1, start_loc, end_loc)))
  )
Here is how this works:
We use stri_sub_replace() to replace a substring given its start and end locations. See One Substring above.
We define a custom function un_abbrv() that accepts a string that is expected to be a country abbreviation and returns the corresponding country name.
In str_sub(col_1, start_loc, end_loc), we use str_sub() to extract the substring between character indices 1 and 2. See Extracting by Location.
We pass the extracted substring to un_abbrv() which returns the country name back to the replacement argument of stri_sub_replace().
Multiple Replacements
In this example, we wish to replace three parts of each value of the column col_1, given their start and end locations, where each part holds a country abbreviation (e.g. US), with the name of the corresponding country.
library(dplyr)
library(purrr)
library(stringi)
library(stringr)

# Map country abbreviations to names (NA for unknown abbreviations)
un_abbrv <- function(x) {
  countries = c(US = "USA", DE = "Germany", AE = "UAE", FR = 'France',
                EG = 'Egypt', SA = "KSA")
  return(countries[x])
}

start_locs = c(1, 10, 19)
end_locs = c(2, 11, 20)

df_2 = df %>%
  mutate(
    col_2 = stri_sub_replace_all(
      col_1,
      from = start_locs,
      to = end_locs,
      replacement = map(str_sub_all(col_1, start_locs, end_locs), un_abbrv))
  )
Here is how this works:
We define a custom function un_abbrv() that converts an abbreviation to a country name.
In str_sub_all(col_1, start_locs, end_locs), we use str_sub_all() to extract the substrings between the given start and end indices of each value of the column col_1. The output is a list of vectors; i.e. one vector of extracted strings for each value of the column col_1. See Extracting by Location.
We use map() from the purrr package to iterate over the list of vectors and convert each vector of abbreviations to a vector of country names via the custom function un_abbrv().
The output of map() is naturally of the same size as the column col_1 and can, therefore, be passed to the replacement argument of stri_sub_replace_all().
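The extract, transform and replace steps can be followed one at a time in the minimal sketch below. It uses made-up inputs and shorter locations than the example above (both are assumptions for illustration), with stri_sub_all() from stringi standing in for str_sub_all() and base R lapply() standing in for map().

```r
library(stringi)

# Lookup helper modelled on the one above; unname() drops the lookup names
un_abbrv <- function(x) {
  countries <- c(US = "USA", DE = "Germany", AE = "UAE", FR = "France")
  unname(countries[x])
}

# Hypothetical inputs with abbreviations at characters 1-2 and 7-8
x          <- c("US to DE", "FR to AE")
start_locs <- c(1, 7)
end_locs   <- c(2, 8)

# Step 1: extract the abbreviations, one character vector per input string
parts <- stri_sub_all(x, from = start_locs, to = end_locs)

# Step 2: convert each vector of abbreviations to country names
repl <- lapply(parts, un_abbrv)

# Step 3: write the names back into the same locations
res <- stri_sub_replace_all(x, from = start_locs, to = end_locs, replacement = repl)
print(res)
# [1] "USA to Germany" "France to UAE"
```

The locations always refer to the original string, so a replacement that is longer than the text it replaces does not shift the ranges that follow it.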