Substituting

We wish to locate one or more parts of a string given their start and end indices and replace those with one or more given replacement string(s).

This section is roughly organized into two parts as follows:

  • Locating Scenarios
    • One Substring: We locate one substring within a given string via its start and end indices and replace that with a given replacement string.
    • Multiple Substrings: We locate multiple substrings within a given string via their start and end indices and replace those substrings with a given replacement string.
    • Location Columns: We have two vectors of location indices (often columns of a data frame) of the same size as the vector of input strings (often another column of a data frame) and we use the corresponding elements in the location vectors to find the substring to be replaced in the input vector.
    • Dynamic Locations: We obtain the locations of the substring(s) to be replaced in a dynamic manner. Typically, the locations are returned by another function as a matrix of two columns for start and end locations. For instance, we obtain the locations of the substring(s) to be replaced by returning the locations of a pattern match (which we cover in more detail in Locating).
  • Replacement Scenarios
    • Replacement Column: We have a vector of replacement strings (often a column of a data frame) of the same size as the vector of input strings (often another column of a data frame) and we use the corresponding element in the replacement vector to replace the located substring in the input vector.
    • Custom Replacement: We apply custom logic to determine the replacement string often based on the matched substring.

One Substring

We wish to replace a part of a given string given its start and end location indices.

In this example, we wish to replace the part of each value of the column col_1 between the 2nd and 4th characters (inclusive, i.e. 3 characters) with three hyphens '---'.

library(dplyr)
library(stringi)

df_2 = df %>%
  mutate(
    col_2 = stri_sub_replace(col_1, 2, 4, replacement = '---'))

Here is how this works:

  • We recommend the use of the function stri_sub_replace() from the stringi package for replacement by location. It has a neat interface that fits well within data manipulation chains. See the Alternative below for a solution that uses str_sub() from the stringr package.
  • The function stri_sub_replace() takes the following as input:
    • A single string or a vector of strings whose values we wish to modify, which in this case is the column col_1.
    • The start (argument name from) and end (argument name to) locations of the part we wish to replace, which in this case are 2 and 4.
    • The replacement string, which in this case is '---'.
  • The output of stri_sub_replace() is a vector of the same size as the first input, which here is the column col_1, with the replacements made.
  • The output data frame df_2 is a copy of the input data frame df with a new column col_2 that is the output of the function stri_sub_replace().
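To make the behaviour described above concrete, here is a minimal sketch that calls stri_sub_replace() on a plain character vector outside of any data frame. The sample values in x are hypothetical stand-ins for col_1.

```r
library(stringi)

# Hypothetical sample values standing in for col_1
x <- c("abcdefg", "1234567")

# Replace characters 2 through 4 (inclusive) of each element
out <- stri_sub_replace(x, from = 2, to = 4, replacement = "---")
print(out)

# The replacement does not need to have the same length as the replaced part
shorter <- stri_sub_replace(x, from = 2, to = 4, replacement = "-")
print(shorter)
```

The output has one element per input element, and the length of each string changes whenever the replacement is shorter or longer than the part it replaces.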

Alternative: Using str_sub()

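library(dplyr)
library(stringr)

# Helper that wraps the assignment form of str_sub() so it can be called inside mutate()
str_locate_replace <- function(p_col, p_start, p_end, p_value) {
  str_sub(p_col, p_start, p_end) <- p_value
  return(p_col)
}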

df_2 = df %>%
  mutate(col_2 = str_locate_replace(col_1, 2, 4, '---'))

Here is how this works:

  • Although it is somewhat awkward and can’t easily be used directly within a mutate() statement, we can assign to str_sub() (from the stringr package) to replace parts of a string.
  • We pass to str_sub():
    • The string to act on, which in this case are the elements of the column col_1
    • The start and end locations which in this case are 2 and 4 respectively
  • We then assign to str_sub() the replacement string which in this case is '---'.
  • We use a custom function str_locate_replace() inside which we modify the passed vector p_col via assigning to str_sub() and then we return the modified vector.
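The assignment form of str_sub() can also be tried directly on a vector, which may help clarify why a wrapper function is needed inside mutate(). This is a small sketch with hypothetical sample values.

```r
library(stringr)

# Hypothetical sample values
x <- c("abcdefg", "1234567")

# Assigning to str_sub() modifies the vector it is applied to,
# so we work on a copy to keep x unchanged
y <- x
str_sub(y, 2, 4) <- "---"

print(y)
print(x)
```

Because the assignment modifies its target rather than returning a new value, it does not fit naturally into a mutate() expression, which is why the wrapper returns the modified copy explicitly.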

Multiple Substrings

We wish to replace multiple substrings given their start and end location indices.

In this example, we wish to replace three substrings given their start and end locations from each value of the column col_1 with the hyphen character ‘-’.

library(dplyr)
library(stringi)

df_2 = df %>% 
  mutate(
    col_2 = stri_sub_replace_all(
      col_1, 
      from = c(1, 4, 7), 
      to = c(2, 5, 9),
      replacement = "-")
  )

Here is how this works:

  • We recommend the use of the function stri_sub_replace_all() from the stringi package for replacing multiple parts of a string given their locations.
  • The function stri_sub_replace_all() takes the following as input:
    • A single string or a vector of strings whose values we wish to modify, which in this case is the column col_1.
    • The start (argument name from) locations of the parts we wish to replace as a vector, which in this case are c(1, 4, 7).
    • The end (argument name to) locations of the parts we wish to replace as a vector, which in this case are c(2, 5, 9).
    • The replacement string, which in this case is '-'.
  • Note that the index ranges must be sorted and mutually disjoint, i.e. parts must be referred to in order from left to right and no parts can overlap.
  • The output of stri_sub_replace_all() is a vector of the same size as the first input, which here is the column col_1, with the replacements made.
  • The output data frame df_2 is a copy of the input data frame df with a new column col_2 that is the output of the function stri_sub_replace_all().
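As a quick illustration of the points above, the following sketch applies stri_sub_replace_all() to a plain vector with the same three index ranges. The ten-character sample strings are hypothetical and were chosen so that every range falls inside the string.

```r
library(stringi)

# Hypothetical ten-character sample values
x <- c("abcdefghij", "0123456789")

# Replace characters 1-2, 4-5 and 7-9 of each element with a single hyphen.
# The ranges are sorted from left to right and do not overlap.
out <- stri_sub_replace_all(
  x,
  from = c(1, 4, 7),
  to = c(2, 5, 9),
  replacement = "-")

print(out)
```

The single replacement string is reused for every range, and the same set of ranges is applied to each element of x.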

Extension: Different Replacements

We wish to locate multiple substrings in a target string and replace each with a different replacement string.

In this example, for each string in the column col_1, we wish to replace three parts, situated between given start and end indices, with the digit characters ‘1’, ‘2’, and ‘3’ respectively.

library(dplyr)
library(stringi)

df_2 = df %>% 
  mutate(
    col_2 = stri_sub_replace_all(
      col_1, 
      from = c(1, 4, 7), 
      to = c(2, 5, 9),
      replacement = c('1', '2', '3'))
  )

Here is how this works:

This works as described in the solution above except that we pass to the replacement argument of stri_sub_replace_all() a vector of replacements of the same size as the number of substrings being replaced, which in this case is c('1', '2', '3').
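A small sketch on a plain vector shows how the replacements line up with the ranges: the first replacement goes to the first range, the second to the second, and so on. The sample strings are hypothetical.

```r
library(stringi)

# Hypothetical ten-character sample values
x <- c("abcdefghij", "ABCDEFGHIJ")

# One replacement per range, matched by position
out <- stri_sub_replace_all(
  x,
  from = c(1, 4, 7),
  to = c(2, 5, 9),
  replacement = c("1", "2", "3"))

print(out)
```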

Location Columns

We have two vectors of location indices (often columns of a data frame) of the same size as the vector of input strings (often another column of a data frame) and we use the corresponding elements in the location vectors to find the substring to be replaced in the input vector.

In this example, we wish to create a new column, col_4, by modifying the values of an existing column, col_1, by replacing a substring between specific locations, defined by the corresponding values in col_2 and col_3, with a sequence of hyphens of the same length as the substring being replaced.

library(dplyr)
library(stringi)
library(stringr)

df_2 = df %>%
  mutate(
    col_4 = stri_sub_replace(
      col_1, 
      from = col_2, 
      to = col_3, 
      # Both ends are inclusive, so the substring has col_3 - col_2 + 1 characters
      replacement = str_dup('-', col_3 - col_2 + 1))
    )

Here is how this works:

  • We use the function stri_sub_replace() to replace a substring between given start and end characters in a parent string with a given string. See One Substring above.
  • The function stri_sub_replace() is vectorized over both the input string, which here is col_1, and the start and end locations. Therefore, we can pass to the arguments from and to the columns col_2 and col_3 respectively.
  • We use str_dup() (from the stringr package) to repeat the character '-' a number of times equal to the length of the substring we're replacing. Because both the start and end locations are inclusive, that length is the end location minus the start location plus one, i.e. col_3 - col_2 + 1. See Repeating.
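The row-wise pairing of strings and locations can be checked on plain vectors. In this sketch the vectors col_1, col_2 and col_3 hold hypothetical values, and stri_dup() from stringi stands in for str_dup() so that only one package is needed. Since both end points are inclusive, the replaced substring has to - from + 1 characters.

```r
library(stringi)

# Hypothetical values standing in for the three columns
col_1 <- c("abcdefg", "1234567")
col_2 <- c(2, 1)  # start locations, one per string
col_3 <- c(4, 2)  # end locations, one per string

# Number of characters being replaced in each string (inclusive range)
n_chars <- col_3 - col_2 + 1

# Each string gets its own start, end and replacement
out <- stri_sub_replace(
  col_1,
  from = col_2,
  to = col_3,
  replacement = stri_dup("-", n_chars))

print(out)
```

Each output string keeps its original length because the run of hyphens is exactly as long as the part it replaces.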

Dynamic Locations

We obtain the locations of the substring(s) to be replaced in a dynamic manner. Typically, the locations are returned by another function as a matrix of two columns for start and end locations.

One Replacement

In this example, we wish to replace any sequence of two or more occurrences of the character ‘X’ with a hyphen ‘-’.

library(dplyr)
library(stringi)
library(stringr)

df_2 = df %>%
  mutate(
    col_2 = stri_sub_replace(
      col_1, 
      str_locate(col_1, 'X{2,}'), 
      replacement = '-') %>% 
      coalesce(col_1)
    )

Here is how this works:

  • While we can do replacement by regular expression, as described in Replacing, we will solve this by first locating the target substrings and then replacing them by Location to show how to deal with situations where the start and end locations are returned as a matrix.
  • In str_locate(col_1, 'X{2,}'), we obtain the start and end locations of the first match of a sequence of two or more ‘X’ characters. The output is a matrix with two columns for the start and end locations. See Locating.
  • We use stri_sub_replace() to replace a substring given its start and end locations. See One Substring above.
  • The function stri_sub_replace() can accept a matrix containing the start and end locations directly to its from argument.
  • In cases where there is no match, str_locate() returns NA. Consequently, stri_sub_replace() will also return NA. We use coalesce() (from dplyr) to return the original value of col_1 if the output of stri_sub_replace() is NA. See General Operations.
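The steps above can be traced on a plain vector. This sketch uses hypothetical sample values and, as an assumption made to keep the example free of dplyr, uses base R's ifelse() in place of coalesce() for the fallback.

```r
library(stringi)
library(stringr)

# Hypothetical sample values; the second one contains no match
x <- c("aXXXb", "noX", "XXcXX")

# Two-column matrix (start, end) of the first match in each string
locs <- str_locate(x, "X{2,}")
print(locs)

# The matrix is passed as the from argument; a row of NA yields NA
replaced <- stri_sub_replace(x, locs, replacement = "-")
print(replaced)

# Fall back to the original value where no match was found
out <- ifelse(is.na(replaced), x, replaced)
print(out)
```

Only the first match in each string is replaced, which is why the trailing 'XX' in the third value is left as it is.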

Multiple Replacements

In this example, we wish to replace any occurrence of a number (sequence of digits) followed by the string ‘gm’ with a hyphen ‘-’ for each element in the column col_1.

library(dplyr)
library(stringi)
library(stringr)

df_2 = df %>% 
  mutate(
    col_2 = stri_sub_replace_all(
      col_1, 
      str_locate_all(col_1, '\\d+gm'),
      replacement = "-")
  )

Here is how this works:

  • While we can do replacement by regular expression, as described in Replacing, we will solve this by first locating the target substrings and then replacing them by Location to show how to deal with situations where the start and end locations are returned as a matrix.
  • In str_locate_all(col_1, '\\d+gm'), we obtain the start and end locations of all matches of a sequence of digits followed by the string ‘gm’. The output is a list of matrices, one for each element of col_1, each of which has two columns for the start and end locations. See Locating.
  • We use stri_sub_replace_all() to replace multiple substrings given their start and end locations. See Multiple Substrings above.
  • The function stri_sub_replace_all() can accept a list of matrices containing the start and end locations directly to its from argument.
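The following sketch walks through the same two steps on a plain vector. The sample strings are hypothetical and each contains at least one match.

```r
library(stringi)
library(stringr)

# Hypothetical sample values, each with at least one match
x <- c("10gm sugar, 200gm flour", "5gm salt")

# A list with one two-column matrix (start, end) per string
locs <- str_locate_all(x, "\\d+gm")
print(locs[[1]])

# The list of matrices is passed as the from argument
out <- stri_sub_replace_all(x, locs, replacement = "-")
print(out)
```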

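Extension: Replace Inverse

In this example, we wish to replace every part of each value of the column col_1 that is not a number (sequence of digits) with a hyphen ‘-’.

library(dplyr)
library(purrr)
library(stringi)
library(stringr)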

df_2 = df %>% 
  mutate(
    col_2 = stri_sub_replace_all(
      col_1, 
      map(str_locate_all(col_1, '\\d+'), invert_match),
      replacement = "-")
  )

Here is how this works:

  • This works similarly to the primary solution above except that we invert the substring locations before passing them to stri_sub_replace_all().
  • To invert matches, we use map() to iterate over the list of matrices returned by str_locate_all() and apply the function invert_match(). See Locating Complement.

Replacement Column

We have a vector of replacement strings (often a column of a data frame) of the same size as the vector of input strings (often another column of a data frame) and we use the corresponding element in the replacement vector to replace the located substring in the input vector.

One Replacement

In this example, for each row of the data frame df, we wish to replace the first two characters of the value of the column col_1 with the value of the column col_2.

library(dplyr)
library(stringi)

df_2 = df %>%
  mutate(
    col_3 = stri_sub_replace(
      col_1, 
      from = 1, 
      to = 2, 
      replacement = col_2)
  )

Here is how this works:

  • We use the function stri_sub_replace() to replace a substring given its start and end locations. See One Substring above.
  • The function stri_sub_replace() is vectorized over the input string, which here is col_1 as well as the replacement string, to which we pass the column col_2.
  • Note: stri_sub_replace() is also vectorized over the location columns as covered in Location Columns.
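A minimal sketch on plain vectors shows the element-wise pairing of input strings and replacements. The values of col_1 and col_2 are hypothetical.

```r
library(stringi)

# Hypothetical values standing in for the two columns
col_1 <- c("USxxx", "DEyyy")
col_2 <- c("11", "2222")

# The i-th replacement is used for the i-th input string
out <- stri_sub_replace(col_1, from = 1, to = 2, replacement = col_2)
print(out)
```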

Multiple Replacements

In this example, for each row of the data frame df, we wish to replace the first two characters of the value of the column col_1 with the value of the column col_2 and the last four characters of the value of the column col_1 with the value of the column col_3. We assume here that each value of col_1 has exactly 10 characters, so that the last four characters occupy positions 7 to 10.

library(dplyr)
library(purrr)
library(stringi)

df_2 = df %>%
  mutate(
    col_r = stri_sub_replace_all(
      col_1, 
      from = c(1, 7), 
      to = c(2, 10), 
      replacement = map2(col_2, col_3, c))
  )

Here is how this works:

  • We use the function stri_sub_replace_all() to replace multiple substrings given their start and end locations. See Multiple Substrings above.
  • We can pass a vector of replacements to the replacement argument of the function stri_sub_replace_all() with as many elements as the number of replacements being made. See “Extension: Different Replacements” under Multiple Substrings above.
  • The function stri_sub_replace_all() is vectorized over the input string, which here is col_1 as well as the replacement, to which we pass a list of replacement vectors:
    • The list has as many elements as the first input col_1.
    • Each vector holds the values of the columns col_2 and col_3 for the corresponding row; i.e. c(col_2, col_3).
  • To construct the replacement vectors, we use map2() (from the purrr package) to iterate over the values of col_2 and col_3 simultaneously and use the function c() to create a vector from each pair of values. See Working with Lists.
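The shape of the replacement list is easier to see on plain vectors. In this sketch the values are hypothetical, and base R's Map() is used in place of map2() as an assumption to keep the example free of purrr. unname() drops the names that Map() derives from its first argument.

```r
library(stringi)

# Hypothetical ten-character inputs and their per-row replacements
col_1 <- c("AAbbbbCCCC", "DDeeeeFFFF")
col_2 <- c("1", "2")
col_3 <- c("x", "y")

# One replacement vector per input string: list(c("1", "x"), c("2", "y"))
repl <- unname(Map(c, col_2, col_3))
str(repl)

# Characters 1-2 take the first element of each vector, 7-10 the second
out <- stri_sub_replace_all(
  col_1,
  from = c(1, 7),
  to = c(2, 10),
  replacement = repl)

print(out)
```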

Custom Replacement

We wish to apply custom logic to determine the replacement string often based on the matched substring.

One Replacement

In this example, we wish to replace the first two characters of each value of the column col_1 holding a country abbreviation (e.g. US) with the name of the corresponding country.

library(dplyr)
library(stringi)
library(stringr)

un_abbrv <- function(x) {
  countries = c(US = "USA", DE = "Germany", AE = "UAE", FR = 'France')
  return(countries[x])
}

start_loc = 1
end_loc = 2

df_2 = df %>%
  mutate(
    col_3 = stri_sub_replace(
      col_1, 
      from = start_loc, 
      to = end_loc, 
      replacement = un_abbrv(str_sub(col_1, start_loc, end_loc)))
  )

Here is how this works:

  • We use the function stri_sub_replace() to replace a substring given its start and end locations. See One Substring above.
  • We create a custom function un_abbrv() that accepts a string that is expected to be a country abbreviation and returns the corresponding country name.
  • In str_sub(col_1, start_loc, end_loc), we use str_sub() to extract the substring between character indices 1 and 2. See Extracting by Location.
  • We then pass the extracted substring to the custom function un_abbrv() which returns the country name back to the replacement argument of stri_sub_replace().
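The extract, look up and replace sequence described above can be sketched on a plain vector as follows. The sample values in x are hypothetical, and unname() is added here only so that the lookup result prints without names.

```r
library(stringi)
library(stringr)

# Lookup table from abbreviation to country name
countries <- c(US = "USA", DE = "Germany", AE = "UAE", FR = "France")

# Hypothetical sample values that start with a country abbreviation
x <- c("US-001", "DE-002")

# Step 1: extract the abbreviation at positions 1 to 2
codes <- str_sub(x, 1, 2)

# Step 2: look up the full name for each abbreviation
full <- unname(countries[codes])

# Step 3: put the full name where the abbreviation was
out <- stri_sub_replace(x, from = 1, to = 2, replacement = full)
print(out)
```

An abbreviation that is missing from the lookup table yields NA in step 2, so inputs with unknown codes may need a fallback of the kind used for unmatched patterns.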

Multiple Replacements

In this example, we wish to replace three parts of each value of the column col_1 given their start and end locations where each part holds a country abbreviation (e.g. US) with the name of the corresponding country.

library(dplyr)
library(purrr)
library(stringi)
library(stringr)

un_abbrv <- function(x) {
  countries = c(US = "USA", DE = "Germany", AE = "UAE", FR = 'France', 
                EG = 'Egypt', SA = "KSA")
  return(countries[x])
}

start_locs = c(1, 10, 19)
end_locs = c(2, 11, 20)

df_2 = df %>% 
  mutate(
    col_2 = stri_sub_replace_all(
      col_1, 
      from = start_locs, 
      to = end_locs,
      replacement = map(str_sub_all(col_1, start_locs, end_locs), un_abbrv))
  )

Here is how this works:

  • We create a custom function un_abbrv() that converts an abbreviation to a country name.
  • In str_sub_all(col_1, start_locs, end_locs), we use str_sub_all() to extract the substrings between the given start and end indices of each value of the column col_1. The output is a list of vectors; i.e. one vector of extracted strings for each value of the column col_1. See Extracting by Location.
  • We then use the function map() from the purrr package to iterate over the list of vectors and convert each vector of abbreviations to a vector of country names via the custom function un_abbrv().
  • The list of vectors created by map() is naturally of the same size as the column col_1 and can, therefore, be passed to the replacement argument of stri_sub_replace_all().
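The same pipeline can be traced on plain vectors. This sketch makes two substitutions that are assumptions for the sake of a dependency-light example: stri_sub_all() from stringi stands in for str_sub_all(), and base R's lapply() stands in for map(). The sample strings are hypothetical and place the abbreviations at positions 1-2, 10-11 and 19-20.

```r
library(stringi)

# Lookup table from abbreviation to country name
countries <- c(US = "USA", DE = "Germany", AE = "UAE", FR = "France",
               EG = "Egypt", SA = "KSA")

# Hypothetical sample values with abbreviations at the given locations
x <- c("US, then DE, then FR", "EG, then SA, then AE")
start_locs <- c(1, 10, 19)
end_locs <- c(2, 11, 20)

# One vector of three abbreviations per input string
codes <- stri_sub_all(x, from = start_locs, to = end_locs)

# Convert each vector of abbreviations to a vector of country names
full <- lapply(codes, function(v) unname(countries[v]))

# The list of replacement vectors lines up with the input strings
out <- stri_sub_replace_all(
  x,
  from = start_locs,
  to = end_locs,
  replacement = full)

print(out)
```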