Replacing

We wish to find the parts of a given string that match a given pattern and replace those with a given replacement string. The pattern may be a plain sequence of characters (a substring) or a regular expression.

This section is roughly organized into two parts as follows:

  • Finding Scenarios:
    • Substring: We wish to replace any occurrence of a substring i.e. a plain sequence of one or more characters.
    • Regular Expression: We wish to replace any occurrence of a regular expression match.
    • Pattern Column: We have a vector of pattern strings (often a column of a data frame) of the same size as the vector of input strings (often another column of a data frame) and we use the corresponding element in the pattern vector to find the sub-string to be replaced in the input vector.
    • First Match: We wish to replace only the first occurrence of a given pattern in a given string.
    • Ignore Case: We wish to replace parts of a given string that match a given pattern regardless of the case; i.e. we wish to ignore case while matching.
  • Replacement Scenarios:
    • Replacement Column: We have a vector of replacement strings (often a column of a data frame) of the same size as the vector of input strings (often another column of a data frame) and we use the corresponding element in the replacement vector to replace the matched sub-string in the input vector.
    • Multiple Replacements: We wish to replace multiple patterns each with a particular (often different) string.
    • Custom Replacement: We apply custom logic to determine the replacement string often based on the matched substring.
    • Capture Group Replacement: We wish to replace a pattern with a replacement that includes parts (represented as capture groups) of the captured pattern.

Substring

We wish to replace any occurrence of a substring i.e. a plain sequence of one or more characters.

In this example, we replace every occurrence of a hyphen '-' in each element of the column col_1 with an underscore '_'.

df_2 = df %>% 
  mutate(col_2 = str_replace_all(col_1, fixed('-'), '_'))

Here is how this works:

  • We use the function str_replace_all() (from the stringr package) to replace every occurrence of the substring '-' in each element of the column col_1 with the substring '_'.
  • In this case, the substring has just one character but in general, the substrings to use for matching or as replacements can be of any length.
  • The function str_replace_all() expects a regular expression by default. Therefore, to specify that we wish to treat the pattern as a fixed string (a plain sequence of characters), we wrap the pattern in fixed().
  • The output data frame df_2 is a copy of the input data frame df with an additional column col_2 where each element is the corresponding element of the column col_1 but with each occurrence of ‘-’ replaced with ‘_’.
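
To see why wrapping the pattern in fixed() matters, here is a minimal sketch on a small character vector. The vector x is made up for illustration, and we use '.' as the pattern because it has a special meaning in regular expressions.

```r
library(stringr)

# Hypothetical input, used only for illustration
x <- c('a.b.c', 'no dots')

# Without fixed(), '.' is a regular expression that matches any character
as_regex <- str_replace_all(x, '.', '_')

# With fixed(), '.' is matched literally
as_fixed <- str_replace_all(x, fixed('.'), '_')
```

Here as_fixed holds 'a_b_c' and 'no dots', whereas in as_regex every character has been replaced with an underscore.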

Regular Expression

We wish to replace any occurrence of a regular expression match.

In this example, we wish to replace any sequence of two or more underscores with a single underscore.

df_2 = df %>% 
  mutate(col_2 = str_replace_all(col_1, '_{2,}', '_'))

Here is how this works:

  • We use the function str_replace_all() (from the stringr package) to replace any sequence of two or more underscores with a single underscore in each element of the column col_1.
  • The function str_replace_all() expects a regular expression by default.
  • The regular expression we use to capture the target pattern, in this case, is '_{2,}' where:
    • _ is the underscore character whose duplicate occurrence we wish to detect
    • {2,} specifies that we are looking for patterns where the character is repeated 2 or more times
  • The output data frame df_2 is a copy of the input data frame df with an additional column col_2 where each element is the corresponding element of the column col_1 but with each sequence of two or more underscores replaced with a single underscore.
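
As a quick check of the pattern, the following sketch applies it to a small made-up vector (x is an assumption for illustration) rather than to a data frame column.

```r
library(stringr)

# Hypothetical input, used only for illustration
x <- c('a__b', 'a_b___c', 'abc')

# Runs of two or more underscores collapse to one; single underscores are untouched
collapsed <- str_replace_all(x, '_{2,}', '_')
# collapsed is c('a_b', 'a_b_c', 'abc')
```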

Pattern Column

We have a vector of pattern strings (often a column of a data frame) of the same size as the vector of input strings (often another column of a data frame) and we use the corresponding element in the pattern vector to find the sub-string to be replaced in the input vector.

Substring

In this example, we wish to replace all occurrences of the value of the column col_2 in the corresponding value of the column col_1 with a fixed string '-'.

df_2 = df %>%
  mutate(col_3 = str_replace_all(col_1, fixed(col_2), '-'))

Here is how this works:

  • We use str_replace_all() to replace each occurrence of a substring in a parent string. See Substring above.
  • The function str_replace_all() is vectorized over both the input string, which here is col_1, and the pattern, which here is col_2.
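
The element-wise pairing can be seen in this small sketch, where the plain vectors col_1 and col_2 stand in for the data frame columns (their values are made up for illustration).

```r
library(stringr)

# Hypothetical stand-ins for the columns col_1 and col_2
col_1 <- c('a.b.c', 'x+y+z')
col_2 <- c('.', '+')

# Each element of col_2 is matched literally against the corresponding element of col_1
col_3 <- str_replace_all(col_1, fixed(col_2), '-')
# col_3 is c('a-b-c', 'x-y-z')
```

Both '.' and '+' are special characters in regular expressions, which is why wrapping col_2 in fixed() is useful here.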

Regular Expression

In this example, we wish to replace all occurrences of a regular expression representing two or more repetitions of the value of the column col_2 (but not one) in the corresponding value of the column col_1 with a fixed string '-'.

df_2 = df %>%
  mutate(col_3 = str_replace_all(col_1, str_c(col_2, '{2,}'), '-'))

Here is how this works:

  • We use str_replace_all() to replace each occurrence of a regular expression match in a parent string. See Regular Expression above.
  • The function str_replace_all() is vectorized over both the input string, which here is col_1, and the pattern, which here is col_2.
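  • In str_c(col_2, '{2,}'), we construct the desired regular expression by appending '{2,}' to the value of the column col_2. For instance, if the value of the column col_2 for a particular row is ‘A’, the generated regular expression is 'A{2,}', which matches a sequence where the character ‘A’ is repeated 2 or more times. Note that a quantifier applies only to the element immediately before it, so this works as described when the values of col_2 are single characters; for longer values, we can wrap the value in a group first, e.g. str_c('(?:', col_2, '){2,}').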
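
The sketch below compares the pattern built by simply appending '{2,}' with a variant that wraps the value in a non-capturing group. The vectors col_1 and col_2 are made-up stand-ins for the columns, and the grouped variant is our suggestion for multi-character values rather than part of the original solution.

```r
library(stringr)

# Hypothetical stand-ins for the columns col_1 and col_2
col_1 <- c('xAAAy', 'xABABy')
col_2 <- c('A', 'AB')

# Appending the quantifier directly: 'A{2,}' and 'AB{2,}'
# In 'AB{2,}' the quantifier applies to 'B' only
plain <- str_replace_all(col_1, str_c(col_2, '{2,}'), '-')

# Wrapping in a non-capturing group: '(?:A){2,}' and '(?:AB){2,}'
grouped <- str_replace_all(col_1, str_c('(?:', col_2, '){2,}'), '-')
```

Here plain leaves 'xABABy' unchanged because 'AB{2,}' looks for an 'A' followed by two or more 'B' characters, while grouped turns both elements into 'x-y'.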

First Match

We wish to replace only the first occurrence of a given pattern in a given string.

In this example, we wish to replace the first occurrence of ‘+’ with ‘=’.

df_2 = df %>%
  mutate(col_2 = str_replace(col_1, fixed('+'), '='))

Here is how this works:

  • We use the function str_replace() (from the stringr package) to replace only the first occurrence of a pattern in a given string.
  • The function str_replace() expects a regular expression by default. In this example, since we wish to treat ‘+’ as a fixed string and not as a regular expression, we wrap it in fixed().
  • The function str_replace() works in a similar way to str_replace_all() and therefore most of the solutions presented in this section can be applied with str_replace(). We chose to make replace-all the default because that is the more common operation in practice.
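
The difference between the two functions can be seen side by side in this small sketch (the input string is made up for illustration).

```r
library(stringr)

# Hypothetical input, used only for illustration
x <- '1+2+3'

# Only the first '+' is replaced
first_only <- str_replace(x, fixed('+'), '=')
# first_only is '1=2+3'

# Every '+' is replaced
all_matches <- str_replace_all(x, fixed('+'), '=')
# all_matches is '1=2=3'
```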

Ignore Case

We wish to replace parts of a given string that match a given pattern regardless of the case; i.e. we wish to ignore case while matching.

Substring

In this example, we wish to replace any occurrence of the substring ‘old’ with ‘new’ regardless of the case the characters of ‘old’ may be in.

df_2 = df %>% 
  mutate(col_2 = str_replace_all(col_1, fixed('old', ignore_case=TRUE), 'new'))

Here is how this works:

  • We use str_replace_all() to replace each occurrence of the match in the parent string. See Substring above.
  • To ignore case while matching, we wrap the pattern in fixed() and pass the parameter ignore_case=TRUE.
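
A small sketch of the effect of ignore_case=TRUE, using a made-up vector x in place of the column:

```r
library(stringr)

# Hypothetical input, used only for illustration
x <- c('Old hat', 'OLD and old')

# Case-sensitive matching only finds the lowercase 'old'
case_sensitive <- str_replace_all(x, fixed('old'), 'new')
# case_sensitive is c('Old hat', 'OLD and new')

# With ignore_case=TRUE, 'Old', 'OLD' and 'old' are all matched
case_insensitive <- str_replace_all(x, fixed('old', ignore_case=TRUE), 'new')
# case_insensitive is c('new hat', 'new and new')
```

Note that the replacement is inserted as given; it does not inherit the case of the text it replaces.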

Regular Expression

In this example, we wish to collapse repeated characters regardless of case, i.e. we wish to replace any character that is repeated two or more times in a row with a single instance of that character.

df_2 = df %>% 
  mutate(
    col_2 = str_replace_all(
      col_1, 
      regex('(.)\\1+', ignore_case=TRUE), 
      '\\1'))

Here is how this works:

  • We use str_replace_all() to replace each occurrence of the match in the parent string. See Regular Expression above.
  • To ignore case while matching, we wrap the regular expression in regex() and pass the parameter ignore_case=TRUE.
  • The regular expression that we are matching, in this case, is '(.)\\1+' which works as follows:
    • (.) is a capture group that matches a single occurrence of any character.
    • \\1+ matches one or more further occurrences of the character held by that capture group, which we refer to via \\1.
  • In this example, we use a capture group in the replacement. This is described in Capture Group Replacement below.

Replacement Column

We have a vector of replacement strings (often a column of a data frame) of the same size as the vector of input strings (often another column of a data frame) and we use the corresponding element in the replacement vector to replace the matched sub-string in the input vector.

In this example, we wish to replace every occurrence of the value of col_2 in the corresponding value of col_1 with the corresponding value of col_3.

df_2 = df %>%
  mutate(col_4 = str_replace_all(col_1, col_2, col_3))

Here is how this works:

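  • We use str_replace_all() to replace each occurrence of a pattern in a parent string. Because we do not wrap col_2 in fixed(), its values are interpreted as regular expressions; to treat them as plain substrings, we can wrap the column in fixed() as shown in Pattern Column above.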
  • The function str_replace_all() is vectorized over all three arguments:
    • The input string, which here are the values of the column col_1
    • The pattern, which here are the values of the column col_2
    • The replacement, which here are the values of the column col_3
  • The output of str_replace_all() in this case will be a vector of the same length as each of the input columns and where each element is the value of col_1 where any occurrence of the corresponding value of col_2 is replaced with the corresponding value of the column col_3.
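
The following sketch shows the three-way pairing on plain vectors that stand in for the columns (all values are made up for illustration). The patterns used here contain no special regular expression characters, so they behave like plain substrings.

```r
library(stringr)

# Hypothetical stand-ins for the columns col_1, col_2 and col_3
col_1 <- c('a-b-c', 'x_y')
col_2 <- c('-', '_')
col_3 <- c('+', ' ')

# Element i of col_2 is replaced with element i of col_3 in element i of col_1
col_4 <- str_replace_all(col_1, col_2, col_3)
# col_4 is c('a+b+c', 'x y')
```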

Multiple Replacements

We wish to replace multiple patterns each with a particular (often different) string.

In this example, we wish to replace each occurrence of a currency symbol with the corresponding currency abbreviation e.g. ‘$’ becomes ‘USD’.

df_2 = df %>%
  mutate(
    col_2 = str_replace_all(
      col_1, 
      c('\\$' = 'USD ', '£' = 'GBP ', '€' = 'EUR ')))

Here is how this works:

  • When we wish to make multiple replacements, it may get tedious to write multiple individual replacement commands. Fortunately, we can carry out multiple replacements via one call to str_replace_all() by defining the replacements as a named vector.
  • The named vector, e.g. c('\\$' = 'USD '), is structured as follows:
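    • The name is the pattern we are looking for, which in this case are the currency symbols e.g. $. Since the names are interpreted as regular expressions and $ is a special character in regular expressions, we escape it as '\\$'.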
    • The value is the replacement we wish to use instead, which in this case are the currency names e.g. USD.
  • The output data frame df_2 will be a copy of the input data frame df with an added column col_2 where each value is the corresponding value of the column col_1 with currency symbols replaced with currency names.
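
Here is a minimal sketch of the named-vector form on a made-up vector x. The name currency_map is hypothetical and introduced only for readability, and the example assumes the script is read as UTF-8.

```r
library(stringr)

# Hypothetical input, used only for illustration
x <- c('$10', '£20 or €30')

# Names are the patterns (regular expressions), values are the replacements
currency_map <- c('\\$' = 'USD ', '£' = 'GBP ', '€' = 'EUR ')

# No separate replacement argument is needed when the pattern is a named vector
converted <- str_replace_all(x, currency_map)
# converted is c('USD 10', 'GBP 20 or EUR 30')
```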

Extension: Replace Multiple Pattern Matches with the Same Value

In this example, we wish to replace any occurrence of ‘kilogram’ or ‘kilograms’ or ‘kgs’ with ‘kg’ regardless of case.

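df_2 = df %>% 
  mutate(
    col_2 = str_replace_all(
      col_1, 
      regex('kilograms|kilogram|kgs', ignore_case=TRUE), 
      'kg'))

Here is how this works:

  • We combine the alternative patterns into a single regular expression via the or operator |, so that a match of any one of them is replaced with the same string ‘kg’.
  • Alternatives are tried from left to right, so we list ‘kilograms’ before ‘kilogram’. With ‘kilogram’ first, only ‘kilogram’ would be matched within ‘kilograms’, leaving a stray ‘s’ behind.
  • To ignore case while matching, we wrap the regular expression in regex() and pass the parameter ignore_case=TRUE. See Ignore Case above.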

Custom Replacement

We wish to apply custom logic to determine the replacement string often based on the matched substring.

In this example, we wish to replace numbers (sequences of digits) in each value of the string column col_1 with other numbers based on the value of the original numbers. In particular, we wish to add 1 to any number that is greater than or equal to 10 and subtract 1 from any number that is smaller than 10.

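adjust_values <- function(m) {
  # m holds the matched digit sequences as strings
  m = as.numeric(m)
  m = ifelse(m >= 10, m + 1, m - 1)
  # Convert back to strings, since the replacement is inserted into a string
  return(as.character(m))
}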

df_2 = df %>%
  mutate(col_2 = str_replace_all(col_1, '\\d+', adjust_values))

Here is how this works:

  • Instead of passing a replacement string, we can pass a replacement function to str_replace_all() (and str_replace()).
  • The matches would be passed to the custom function and the custom function's output is then used as the replacement.
  • In this case, we create a custom function adjust_values() which accepts the matched strings, casts them to numeric values, and then increments each by 1 if it is greater than or equal to 10, or decrements it by 1 otherwise.
  • Putting it together:
    • For each value of the column col_1
    • The regular expression '\\d+' matches any sequence of digits
    • Each matched substring (sequence of digit characters) is then passed to the custom function adjust_values()
    • The custom function adjust_values() acts on the input and returns the corresponding replacement value
    • The replacement value is inserted in the input string in place of the match
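
The steps above can be tried on a small made-up vector, as in the sketch below. The version of adjust_values() shown here converts its result back to character; this is an assumption on our part that recent stringr releases expect a character vector from the replacement function, and it should be harmless in any case.

```r
library(stringr)

# Replacement function: receives matched digit sequences as strings
adjust_values <- function(m) {
  m <- as.numeric(m)
  m <- ifelse(m >= 10, m + 1, m - 1)
  # Return strings (assumption: the replacement function should return character)
  as.character(m)
}

# Hypothetical input, used only for illustration
x <- c('5 apples and 12 pears', 'no digits')

adjusted <- str_replace_all(x, '\\d+', adjust_values)
# adjusted is c('4 apples and 13 pears', 'no digits')
```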

Capture Group Replacement

We wish to replace a pattern with a replacement that includes parts (represented as capture groups) of the captured pattern.

In this example, we wish to replace any occurrence of the substring ‘kilogram’ or ‘kilograms’ with ‘kg’, regardless of case, but only when it shows up after a number.

df_2 = df %>% 
  mutate(
    col_2 = str_replace_all(
      col_1, 
      regex('(\\d+)\\s*(kilograms|kilogram)', ignore_case=TRUE), 
      '\\1 kg'))

Here is how this works:

  • In order to construct a replacement string out of parts of the matched string, we can use capture groups. Capture groups allow us to isolate a portion of the match and refer to it later.
  • In this case the regular expression we are using is (\\d+)\\s*(kilograms|kilogram), where:
    • (\\d+) is a capture group that holds the numerical part of the pattern
    • \\s* denotes zero or more whitespace characters
    • (kilograms|kilogram) denotes one of these two words, hence the or | operator. We list the longer word first because alternatives are tried from left to right; with ‘kilogram’ first, only ‘kilogram’ would be matched within ‘kilograms’, leaving a stray ‘s’ behind.
  • While constructing the replacement, we can refer to any capture group by its number via \\n which in this case is \\1 to refer to the first capture group holding the numeric part.
  • The replacement expression '\\1 kg' takes the first capture group from the matched string (which is the sequence of digits) and appends to it the unit ‘kg’ with an empty space character in between.
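
A short sketch of a capture group replacement on a made-up vector x follows. The pattern used here lists ‘kilograms’ before ‘kilogram’ so that the plural form is consumed whole; the input values are assumptions for illustration.

```r
library(stringr)

# Hypothetical input, used only for illustration
x <- c('5 kilograms', '12Kilogram', 'kilogram of rice')

# \\1 in the replacement refers to the digits captured by (\\d+)
normalized <- str_replace_all(
  x,
  regex('(\\d+)\\s*(kilograms|kilogram)', ignore_case=TRUE),
  '\\1 kg')
# normalized is c('5 kg', '12 kg', 'kilogram of rice')
```

The last element is left unchanged because ‘kilogram’ is not preceded by a number there.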