We wish to find the parts of a given string that match a given pattern and replace them with a given replacement string. The pattern may be a plain sequence of characters or a regular expression.
This section is roughly organized into two parts as follows:
We wish to replace any occurrence of a substring i.e. a plain sequence of one or more characters.
In this example, we replace every occurrence of a hyphen '-' in each element of the column col_1 with an underscore '_'.
df_2 = df %>%
  mutate(col_2 = str_replace_all(col_1, fixed('-'), '_'))
Here is how this works:
We use str_replace_all() (from the stringr package) to replace every occurrence of the substring '-' in each element of the column col_1 with the substring '_'.
str_replace_all() expects a regular expression by default. Therefore, to specify that we wish to treat the pattern as a fixed string (a plain sequence of characters), we wrap the pattern in fixed().
The output df_2 is a copy of the input data frame df with an additional column col_2 where each element is the corresponding element of the column col_1 but with each occurrence of '-' replaced with '_'.
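To see why wrapping the pattern in fixed() matters, here is a minimal sketch on a hypothetical data frame (the values in col_1 and the output column names are made up for illustration). It contrasts '-', which has no special meaning as a regular expression, with '.', which does.

```r
library(dplyr)
library(stringr)

# Hypothetical input; the values are made up for illustration.
df <- tibble(col_1 = c('a-b-c', 'x.y.z'))

df_2 <- df %>%
  mutate(
    # '-' treated as a plain substring
    hyphen_fixed = str_replace_all(col_1, fixed('-'), '_'),
    # '.' treated as a plain substring: only literal dots are replaced
    dot_fixed = str_replace_all(col_1, fixed('.'), '_'),
    # '.' treated as a regular expression: it matches any character
    dot_regex = str_replace_all(col_1, '.', '_'))

print(df_2)
```

In this sketch dot_fixed leaves 'a-b-c' untouched and turns 'x.y.z' into 'x_y_z', while dot_regex replaces every character in both strings.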
We wish to replace any occurrence of a regular expression match.
In this example, we wish to replace any sequence of two or more underscores with a single underscore.
df_2 = df %>%
  mutate(col_2 = str_replace_all(col_1, '_{2,}', '_'))
Here is how this works:
We use str_replace_all() (from the stringr package) to replace any sequence of two or more underscores with a single underscore in each element of the column col_1.
str_replace_all() expects a regular expression by default, so we pass the pattern as is.
The regular expression we use is '_{2,}', where _ is the underscore character whose repeated occurrence we wish to detect and {2,} specifies that we are looking for sequences where that character is repeated two or more times.
The output df_2 is a copy of the input data frame df with an additional column col_2 where each element is the corresponding element of the column col_1 but with each sequence of two or more underscores replaced with a single underscore.

We have a vector of pattern strings (often a column of a data frame) of the same size as the vector of input strings (often another column of a data frame) and we use the corresponding element in the pattern vector to find the substring to be replaced in the corresponding element of the input vector.
Substring
In this example, we wish to replace all occurrences of the value of the column col_2 in the corresponding value of the column col_1 with a fixed string '-'.
df_2 = df %>%
  mutate(col_3 = str_replace_all(col_1, fixed(col_2), '-'))
Here is how this works:
We use str_replace_all() to replace each occurrence of a substring in a parent string. See Substring above.
str_replace_all() is vectorized over both the input string, which here is col_1, and the pattern, which here is col_2.
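The row-wise pairing of input and pattern can be sketched on a small hypothetical data frame (the values are made up for illustration). Each row carries its own pattern, and because the patterns are wrapped in fixed(), characters such as '.', '+' and '*' are matched literally.

```r
library(dplyr)
library(stringr)

# Hypothetical input; each row has its own pattern in col_2.
df <- tibble(
  col_1 = c('a.b.c', 'x+y+z', 'p-q'),
  col_2 = c('.', '+', '*'))

df_2 <- df %>%
  # Row i of col_1 is searched for row i of col_2 only.
  mutate(col_3 = str_replace_all(col_1, fixed(col_2), '-'))

print(df_2)
```

The third row is returned unchanged because its pattern '*' does not occur in 'p-q'.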
Regular Expression
In this example, we wish to replace all occurrences of a regular expression representing two or more repetitions of the value of the column col_2 (but not one) in the corresponding value of the column col_1 with a fixed string '-'.
df_2 = df %>%
  mutate(col_3 = str_replace_all(col_1, str_c(col_2, '{2,}'), '-'))
Here is how this works:
We use str_replace_all() to replace each occurrence of a regular expression match in a parent string. See Regular Expression above.
str_replace_all() is vectorized over both the input string, which here is col_1, and the pattern, which here is built from col_2.
In str_c(col_2, '{2,}'), we construct the desired regular expression by appending '{2,}' to the value of the column col_2. For instance, if the value of the column col_2 for a particular row is 'A', the generated regular expression is 'A{2,}', which matches a sequence where the character 'A' is repeated two or more times.
Note that the quantifier {2,} applies only to the element immediately before it, so this construction behaves as described when the values of col_2 are single characters.

We wish to replace only the first occurrence of a given pattern in a given string.
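The difference between replacing only the first match and replacing every match can be sketched on a hypothetical character vector (the strings below are made up for illustration):

```r
library(stringr)

# Hypothetical input strings, made up for illustration.
x <- c('key=a=b', 'no separator')

# Only the first '=' in each string is replaced.
first_only <- str_replace(x, fixed('='), ': ')

# Every '=' in each string is replaced.
every_match <- str_replace_all(x, fixed('='), ': ')

print(first_only)
print(every_match)
```

Strings without a match are returned unchanged by both functions.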
In this example, we wish to replace the first occurrence of '+' with '='.
df_2 = df %>%
  mutate(col_2 = str_replace(col_1, fixed('+'), '='))
Here is how this works:
We use str_replace() (from the stringr package) to replace only the first occurrence of a pattern in a given string.
str_replace() expects a regular expression by default. In this example, since we wish to treat '+' as a fixed string and not as a regular expression, we wrap it in fixed().
str_replace() works in a similar way to str_replace_all(), and therefore most of the solutions presented in this section can be applied with str_replace(). We chose to make replace-all the default because that is the more common operation in practice.

We wish to replace parts of a given string that match a given pattern regardless of case, i.e. we wish to ignore case while matching.
Substring
In this example, we wish to replace any occurrence of the substring 'old' with 'new' regardless of the case the characters of 'old' may be in.
df_2 = df %>%
  mutate(col_2 = str_replace_all(col_1, fixed('old', ignore_case=TRUE), 'new'))
Here is how this works:
We use str_replace_all() to replace each occurrence of the match in the parent string. See Substring above.
To ignore case while matching, we wrap the pattern in fixed() and pass the parameter ignore_case=TRUE.
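A minimal sketch of case-insensitive substring replacement on a hypothetical character vector (the strings are made up for illustration). It also shows that the match is not limited to whole words.

```r
library(stringr)

# Hypothetical input strings, made up for illustration.
x <- c('Old town', 'an OLD, old map', 'bold')

# 'old' is matched as a plain substring in any combination of cases.
result <- str_replace_all(x, fixed('old', ignore_case = TRUE), 'new')

print(result)
```

The last element becomes 'bnew' because 'old' is matched anywhere inside a string, including inside the word 'bold'.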
Regular Expression
In this example, we wish to remove duplicate characters regardless of case, i.e. we wish to replace any character that appears two or more times in a row with a single instance of that character.
df_2 = df %>%
  mutate(
    col_2 = str_replace_all(
      col_1,
      regex('(.)\\1+', ignore_case=TRUE),
      '\\1'))
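Here is how this works:
We use str_replace_all() to replace each occurrence of the match in the parent string. See Regular Expression above.
To ignore case while matching, we wrap the pattern in regex() and pass the parameter ignore_case=TRUE.
The regular expression we use is '(.)\\1+', which works as follows: (.) is a capture group that picks a single occurrence of any character, and \\1+ picks one or more further occurrences of whatever that capture group matched, which we refer to by \\1.
The replacement '\\1' keeps a single instance of the captured character.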
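The behaviour of this pattern can be sketched on a hypothetical character vector (the strings are made up for illustration):

```r
library(stringr)

# Hypothetical input strings, made up for illustration.
x <- c('AaaBbb c', 'book', 'Hello')

# Collapse each run of a repeated character (ignoring case) into the
# first character of the run, which is what the capture group holds.
result <- str_replace_all(x, regex('(.)\\1+', ignore_case = TRUE), '\\1')

print(result)
```

Because the replacement is the captured character, the case of the first character in each run is the one that is kept, e.g. 'Aaa' becomes 'A'.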
We have a vector of replacement strings (often a column of a data frame) of the same size as the vector of input strings (often another column of a data frame) and we use the corresponding element in the replacement vector to replace the matched sub-string in the input vector.
In this example, we wish to replace each occurrence of the value of col_2 in the corresponding value of col_1 with the corresponding value of col_3.
df_2 = df %>%
  mutate(col_4 = str_replace_all(col_1, col_2, col_3))
Here is how this works:
We use str_replace_all() to replace each occurrence of a substring in a parent string. See Substring above.
str_replace_all() is vectorized over all three arguments: the input strings col_1, the patterns col_2, and the replacements col_3.
The output of str_replace_all() in this case will be a vector of the same length as each of the input columns where each element is the value of col_1 with any occurrence of the corresponding value of col_2 replaced with the corresponding value of the column col_3.
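The three-way pairing can be sketched on a small hypothetical data frame (the values are made up for illustration). Note that the values of col_2 are interpreted as regular expressions here; if they may contain special characters such as '.' or '+', wrapping them in fixed() would treat them as plain substrings instead.

```r
library(dplyr)
library(stringr)

# Hypothetical input; the values are made up for illustration.
df <- tibble(
  col_1 = c('red car, red bike', 'big house'),
  col_2 = c('red', 'big'),
  col_3 = c('blue', 'small'))

df_2 <- df %>%
  # Row i: replace every match of col_2[i] in col_1[i] with col_3[i].
  mutate(col_4 = str_replace_all(col_1, col_2, col_3))

print(df_2)
```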
We wish to replace multiple patterns each with a particular (often different) string.
In this example, we wish to replace each occurrence of a currency symbol with the corresponding currency abbreviation, e.g. '$' becomes 'USD'.
df_2 = df %>%
  mutate(
    col_2 = str_replace_all(
      col_1,
      c('\\$' = 'USD ', '£' = 'GBP ', '€' = 'EUR ')))
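Here is how this works:
We can carry out multiple replacements with a single call to str_replace_all() by defining the replacements as a named vector.
Each element of the named vector, e.g. c('\\$' = 'USD '), is structured as follows: the name is the pattern to match, which here is the regular expression \\$ that matches a literal $, and the value is the replacement, which here is 'USD ' (the abbreviation followed by a space).
The output df_2 will be a copy of the input data frame df with an added column col_2 where each value is the corresponding value of the column col_1 with currency symbols replaced with currency abbreviations.

Extension: Replace Multiple Pattern Matches with the Same Value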
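In this example, we wish to replace any occurrence of 'kilogram' or 'kilograms' or 'kgs' with 'kg' regardless of case.
df_2 = df %>%
  mutate(
    col_2 = str_replace_all(
      col_1,
      # alternatives are tried left to right, so 'kilograms' is listed
      # before 'kilogram' to avoid leaving a trailing 's' behind
      regex('kilograms|kilogram|kgs', ignore_case=TRUE),
      'kg'))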
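Here is how this works:
We use str_replace_all() with a regular expression that lists the alternatives separated by the or operator |, so that a match of any of them is replaced with 'kg'.
To ignore case while matching, we wrap the pattern in regex() and pass the parameter ignore_case=TRUE.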
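When one alternative is a prefix of another, the order in which the alternatives are listed can change the result, because they are tried from left to right. A minimal sketch comparing two orderings on a hypothetical character vector (the strings and variable names are made up for illustration):

```r
library(stringr)

# Hypothetical input strings, made up for illustration.
x <- c('2 Kilograms', '1 kilogram', '3 KGS')

# 'kilogram' is tried first, so in 'Kilograms' the trailing 's' is left over.
shorter_first <- str_replace_all(
  x, regex('kilogram|kilograms|kgs', ignore_case = TRUE), 'kg')

# 'kilograms' is tried first, so the whole word is replaced.
longer_first <- str_replace_all(
  x, regex('kilograms|kilogram|kgs', ignore_case = TRUE), 'kg')

print(shorter_first)
print(longer_first)
```

An equivalent way to write the second pattern is 'kilograms?|kgs', where the ? makes the final 's' optional.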
We wish to apply custom logic to determine the replacement string often based on the matched substring.
In this example, we wish to replace numbers (sequences of digits) in each value of the string column col_1 with other numbers based on the value of the original numbers. In particular, we wish to add 1 to any number that is greater than or equal to 10 and subtract 1 from any number that is smaller than 10.
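adjust_values <- function(m) {
  # m holds the matched digit sequences as strings
  m = as.numeric(m)
  m = ifelse(m >= 10, m + 1, m - 1)
  # a replacement function should return a character vector
  return(as.character(m))
}
df_2 = df %>%
  mutate(col_2 = str_replace_all(col_1, '\\d+', adjust_values))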
Here is how this works:
We can pass a function as the replacement argument of str_replace_all() (and str_replace()).
We define a custom function adjust_values() which accepts a string value, casts it to a numeric value, and then increments it by 1 if it is greater than or equal to 10, else decrements it by 1.
str_replace_all() then works on each value of col_1 as follows: the regular expression '\\d+' matches any sequence of digits, each match is passed to adjust_values(), and adjust_values() acts on the input and returns the corresponding replacement value.

We wish to replace a pattern with a replacement that includes parts (represented as capture groups) of the captured pattern.
In this example, we wish to replace any occurrence of the substring 'kilogram' or 'kilograms' with 'kg' when it shows up after a number, regardless of case.
df_2 = df %>%
  mutate(
    col_2 = str_replace_all(
      col_1,
      # 'kilograms' is listed first so that the longer word is tried first
      regex('(\\d+)\\s*(kilograms|kilogram)', ignore_case=TRUE),
      '\\1 kg'))
Here is how this works:
The regular expression we use has three parts: (\\d+) is a capture group that holds the numerical part of the pattern, \\s* denotes zero or more whitespace characters, and the second group lists the two words 'kilogram' and 'kilograms' as alternatives separated by the or operator |.
In the replacement string, we can refer to a capture group by \\n, where n is the number of the group, which in this case is \\1 to refer to the first capture group holding the numeric part.
The replacement '\\1 kg' takes the first capture group from the matched string (which is the sequence of digits) and appends to it the unit 'kg' with a space character in between.
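A minimal sketch of this replacement on a hypothetical character vector (the strings are made up for illustration). As a choice made for this sketch, 'kilograms' is listed before 'kilogram' so that the longer word is tried first and no trailing 's' is left behind.

```r
library(stringr)

# Hypothetical input strings, made up for illustration.
x <- c('5 kilograms of rice', '12Kilogram', 'kilograms only')

# Keep the digits (group 1) and normalise the unit to ' kg'.
result <- str_replace_all(
  x,
  regex('(\\d+)\\s*(kilograms|kilogram)', ignore_case = TRUE),
  '\\1 kg')

print(result)
```

The last element is returned unchanged because the word is not preceded by a number, so the pattern does not match.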