We wish to remove parts of a given string that match a given pattern.
We wish to remove any occurrence of a substring, i.e. a plain sequence of one or more characters.
In this example, we wish to remove the ‘$’ sign from each element of the string column col_1.
df_2 = df %>%
  mutate(col_2 = str_remove_all(col_1, fixed('$')))
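To try the snippet above on concrete data, here is a minimal sketch. The sample values and the extra column col_regex are assumptions made up for illustration; they are not part of the original example.

```r
library(dplyr)
library(stringr)

# Hypothetical sample data, made up for illustration
df = tibble(col_1 = c('$100', 'US$ 2,500', '$$', 'none'))

df_2 = df %>%
  mutate(
    # fixed() makes '$' a literal character, so every dollar sign is removed
    col_2 = str_remove_all(col_1, fixed('$')),
    # without fixed(), '$' is the end-of-string anchor and matches no visible character
    col_regex = str_remove_all(col_1, '$')
  )

df_2$col_2
# "100" "US 2,500" "" "none"
df_2$col_regex
# "$100" "US$ 2,500" "$$" "none"
```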
Here is how this works:
- We use str_remove_all() (from the stringr package) to remove all occurrences of the ‘$’ sign from each element of the string column col_1.
- str_remove_all() expects a regular expression by default. To instruct it to treat the pattern as plain text, we wrap it in fixed(). Therefore, in this case, fixed('$') is treated as the plain character ‘$’ and not as the end-of-string regular expression anchor.
- The output data frame df_2 is a copy of the input df with a new column col_2 that is a transformation of the column col_1 with the ‘$’ sign removed.

We wish to remove any occurrence of a regular expression match.
In this example, we wish to remove any numeric character (digit) from each element of the string column col_1.
df_2 = df %>%
  mutate(col_2 = str_remove_all(col_1, '\\d'))
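A minimal sketch of the same call on a small tibble may help here. The sample values are assumptions made up for illustration.

```r
library(dplyr)
library(stringr)

# Hypothetical sample data, made up for illustration
df = tibble(col_1 = c('abc123', 'r2d2', 'no digits'))

df_2 = df %>%
  # '\\d' is the R string for the regular expression \d, which matches a digit
  mutate(col_2 = str_remove_all(col_1, '\\d'))

df_2$col_2
# "abc" "rd" "no digits"
```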
Here is how this works:
- We use str_remove_all() (from the stringr package) to remove any numeric character from each element of the string column col_1.
- str_remove_all() expects a regular expression by default.
- '\\d' matches any numeric character.
- The output data frame df_2 is a copy of the input df with a new column col_2 that is a transformation of the column col_1 with numeric characters removed.

We wish to remove the first occurrence of a given pattern from a given string.
In this example, we wish to remove the first occurrence of a sequence of one or more ‘0’ characters from each element of the string column col_1.
df_2 = df %>%
  mutate(col_2 = str_remove(col_1, '0+'))
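The difference between removing the first match and removing all matches is easier to see side by side. The sketch below uses made-up sample values, and the columns col_all and col_leading are hypothetical additions for comparison only.

```r
library(dplyr)
library(stringr)

# Hypothetical sample data, made up for illustration
df = tibble(col_1 = c('000123', '100200', '123'))

df_2 = df %>%
  mutate(
    col_2 = str_remove(col_1, '0+'),          # first run of zeros only
    col_all = str_remove_all(col_1, '0+'),    # every run of zeros
    col_leading = str_remove(col_1, '^0+')    # only a run at the start of the string
  )

df_2$col_2
# "123" "1200" "123"
df_2$col_all
# "123" "12" "123"
df_2$col_leading
# "123" "100200" "123"
```

Note that '0+' is not anchored, so the first match may sit in the middle of the string; the '^0+' variant is one way to restrict the removal to leading zeros.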
Here is how this works:
- We use str_remove(), and not str_remove_all(), since we wish to remove the first match only.
- str_remove() will find and remove the first occurrence of the pattern, which here is '0+', from each element of the string input, which here is the column col_1.
- The output data frame df_2 is a copy of the input df with a new column col_2 that is a transformation of the column col_1 with the first occurrence of a sequence of ‘0’ characters removed.

We wish to remove parts of a given string that match a given pattern regardless of the case; i.e. we wish to ignore case while matching.
Substring
df_2 = df %>%
  mutate(col_2 = str_remove_all(col_1, fixed('A', ignore_case=TRUE)))
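As a rough illustration of the call above, the sketch below runs it on a small tibble. The sample values are assumed for illustration (they mirror the ones used in the regular expression variant of this section).

```r
library(dplyr)
library(stringr)

# Sample data assumed for illustration
df = tibble(col_1 = c('AABBCCA', 'AaBCA', 'ABC'))

df_2 = df %>%
  # plain-text match on 'A' that also matches 'a'
  mutate(col_2 = str_remove_all(col_1, fixed('A', ignore_case=TRUE)))

df_2$col_2
# "BBCC" "BC" "BC"
```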
Here is how this works:
- We use str_remove_all() as described for plain substrings above.
- To ignore case, we wrap the pattern in fixed() and pass the parameter ignore_case=TRUE.

Regular Expression
df = tibble(
  col_1 = c('AABBCCA', 'AaBCA', 'ABC'))
df_2 = df %>%
  mutate(col_2 = str_remove_all(col_1, regex('A{2,}', ignore_case=TRUE)))
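For reference, the sketch below repeats the snippet above and adds a case-sensitive column for contrast. The column col_cs is a hypothetical addition for comparison only.

```r
library(dplyr)
library(stringr)

df = tibble(
  col_1 = c('AABBCCA', 'AaBCA', 'ABC'))

df_2 = df %>%
  mutate(
    # two or more 'A' or 'a' in a row
    col_2 = str_remove_all(col_1, regex('A{2,}', ignore_case=TRUE)),
    # default behavior: only upper case 'A' counts
    col_cs = str_remove_all(col_1, 'A{2,}')
  )

df_2$col_2
# "BBCCA" "BCA" "ABC"
df_2$col_cs
# "BBCCA" "AaBCA" "ABC"
```

The second element shows the effect of ignoring case: 'Aa' is removed only when ignore_case=TRUE.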
Here is how this works:
- We use str_remove_all() as described for regular expressions above.
- To ignore case, we wrap the pattern in regex() and pass the parameter ignore_case=TRUE.

We wish to remove any occurrence of any of a given set of patterns from a given string.
In this example, we wish to remove any dollar sign ($), comma (,), or whitespace character from each element of the string column col_1.
df_2 = df %>%
  mutate(col_2 = str_remove_all(col_1, '[\\$,\\s]'))
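A small sketch of the call above on currency-like strings; the sample values are assumptions made up for illustration.

```r
library(dplyr)
library(stringr)

# Hypothetical sample data, made up for illustration
df = tibble(col_1 = c('$1,000', '$ 2,500 ', 'n/a'))

df_2 = df %>%
  # the character class matches a single '$', ',' or whitespace character
  mutate(col_2 = str_remove_all(col_1, '[\\$,\\s]'))

df_2$col_2
# "1000" "2500" "n/a"
```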
Here is how this works:
- We pass to str_remove_all() a regular expression or’ing the patterns that we wish to find and remove.
- The regular expression we use is '[\\$,\\s]', a character class that matches any single one of the characters listed in it, where:
  - \\$ matches the dollar sign character.
  - , matches the comma.
  - \\s matches any whitespace character.
- The output data frame df_2 is a copy of the input df with a new column col_2 that is a transformation of the column col_1 with the specified characters removed.

Extension: Multiple Character Patterns
df_2 = df %>%
  mutate(col_2 = str_remove_all(col_1, '\\(.+\\)|\\[.+\\]') %>% str_trim())
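The sketch below runs the call above on made-up sample values. The column col_lazy, which uses the lazy quantifier .+? instead of .+, is a hypothetical variant added only to show how the two behave when a string holds more than one bracketed part.

```r
library(dplyr)
library(stringr)

# Hypothetical sample data, made up for illustration
df = tibble(col_1 = c('Alpha (draft)', '[v2] Beta', 'x (a) y (b)'))

df_2 = df %>%
  mutate(
    # greedy: .+ runs from the first '(' to the last ')'
    col_2 = str_remove_all(col_1, '\\(.+\\)|\\[.+\\]') %>% str_trim(),
    # lazy: .+? stops at the nearest closing delimiter
    col_lazy = str_remove_all(col_1, '\\(.+?\\)|\\[.+?\\]') %>% str_trim()
  )

df_2$col_2
# "Alpha" "Beta" "x"
df_2$col_lazy
# "Alpha" "Beta" "x  y"
```

In the third element the greedy pattern also removes the text between the two parenthesized parts, while the lazy variant keeps it (leaving two spaces, since str_trim() only trims the ends). Which behavior is wanted depends on the data.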
Here is how this works:
- We pass to str_remove_all() a regular expression or’ing the patterns that we wish to find and remove.
- The regular expression we use is '\\(.+\\)|\\[.+\\]' where:
  - \\(.+\\) matches any sequence of characters in parentheses along with the parentheses themselves.
  - | indicates an "or" condition.
  - \\[.+\\] matches any sequence of characters in brackets along with the brackets themselves.
  - .+ will match one or more of any character.
- We use str_trim() at the end to get rid of any leading or trailing whitespace characters left behind by the removal.
- The output data frame df_2 is a copy of the input df with a new column col_2 that is a transformation of the column col_1 with the specified patterns removed.

We wish to remove from each value of a particular column the corresponding value of another column.
Substring
In this example, for each row of the data frame df, we wish to remove each occurrence of the value of the column col_2 from the corresponding value of the column col_1.
df_2 = df %>%
  mutate(col_3 = str_remove_all(col_1, fixed(col_2)))
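To make the row-wise behavior concrete, here is a minimal sketch on made-up data. The column col_regex is a hypothetical addition that drops fixed() to show why the wrapper matters when col_2 holds regular expression metacharacters.

```r
library(dplyr)
library(stringr)

# Hypothetical sample data, made up for illustration
df = tibble(
  col_1 = c('a.b.c', 'x+y+x', 'hello'),
  col_2 = c('.', 'x', 'l'))

df_2 = df %>%
  mutate(
    # each row uses its own col_2 value as a plain-text pattern
    col_3 = str_remove_all(col_1, fixed(col_2)),
    # without fixed(), '.' is a regular expression matching any character
    col_regex = str_remove_all(col_1, col_2)
  )

df_2$col_3
# "abc" "+y+" "heo"
df_2$col_regex
# "" "+y+" "heo"
```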
Here is how this works:
- We use str_remove_all() to remove all occurrences of a given pattern.
- str_remove_all() is vectorized over both the string and the pattern. Therefore, we can pass:
  - the column col_1 as the string to remove from.
  - the column col_2, wrapped in fixed() so that it is treated as plain text, as the pattern to remove.
- The output data frame df_2 is a copy of the input df with a new column col_3 where each element is the corresponding element of col_1 with any occurrence of the corresponding element of col_2 removed.

Regular Expression
In this example, for each row of the data frame df, we wish to remove each sequence of two or more repetitions of the value of the column col_2 (e.g. remove ‘AA’ and ‘AAA’ but not ‘A’) from the corresponding value of the column col_1.
df_2 = df %>%
  mutate(col_3 = str_remove_all(col_1, str_c(col_2, '{2,}')))
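The sketch below applies the call above to made-up data. The column col_grouped is a hypothetical variant, not part of the original, that wraps the value in a non-capturing group; it is shown because the quantifier {2,} binds only to the last character when col_2 holds more than one character.

```r
library(dplyr)
library(stringr)

# Hypothetical sample data, made up for illustration
df = tibble(
  col_1 = c('AABAAA', 'xyyyz', 'ababc'),
  col_2 = c('A', 'y', 'ab'))

df_2 = df %>%
  mutate(
    # per-row patterns: 'A{2,}', 'y{2,}', 'ab{2,}'
    col_3 = str_remove_all(col_1, str_c(col_2, '{2,}')),
    # per-row patterns: '(?:A){2,}', '(?:y){2,}', '(?:ab){2,}'
    col_grouped = str_remove_all(col_1, str_c('(?:', col_2, '){2,}'))
  )

df_2$col_3
# "B" "xz" "ababc"
df_2$col_grouped
# "B" "xz" "c"
```

For single-character values the two columns agree. Neither variant escapes regular expression metacharacters in col_2, so this sketch assumes the values contain none.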
Here is how this works:
- We pass to str_remove_all():
  - col_1 as the first argument.
  - a vector of regular expressions as the second argument, which is constructed from col_2 and is naturally of the same size as col_1.
- We use str_c() to append to each value of col_2 the regular expression quantifier '{2,}' so that, for each row, the pattern matches two or more repetitions of the value of col_2 for the current row. For instance, if the value of the column col_2 is ‘A’ then the regular expression to remove will be 'A{2,}'.
- The output data frame df_2 is a copy of the input df with a new column col_3 where each element is the corresponding element of col_1 with any matches of the specified regular expression constructed from the corresponding value of the column col_2 removed.