Removing

We wish to remove parts of a given string that match a given pattern.

Substring

We wish to remove any occurrence of a substring i.e. a plain sequence of one or more characters.

In this example, we wish to remove the ‘$’ sign from each element of the string column col_1.

df_2 = df %>% 
  mutate(col_2 = str_remove_all(col_1, fixed('$')))

Here is how this works:

  • We use the function str_remove_all() (from the stringr package) to remove all occurrences of the ‘$’ sign from each element of the string column col_1.
  • str_remove_all() expects a regular expression by default. To instruct it to treat the pattern as plain text, we wrap it in fixed(). Therefore in this case fixed('$') is treated as the plain character ‘$’ and not as the end of string regular expression anchor.
  • The output data frame df_2 is a copy of the input df with a new column col_2 that is a transformation of the column col_1 with the ‘$’ sign removed.
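As a quick check of the behaviour described above, here is a minimal sketch run on a small made-up data frame. The sample values are an assumption for illustration and are not part of the original example.

```r
library(dplyr)
library(stringr)

# Hypothetical sample data, invented for illustration only.
df = tibble(col_1 = c('$10', '$1,500', 'USD 7'))

# fixed('$') treats '$' as a literal character.
df_2 = df %>%
  mutate(col_2 = str_remove_all(col_1, fixed('$')))
# df_2$col_2 is '10', '1,500', 'USD 7'

# Without fixed(), '$' is the end-of-string anchor. It matches an empty
# position, so nothing is removed and the strings come back unchanged.
unchanged = str_remove_all(df$col_1, '$')
```

The second call is there only to show why fixed() matters for this pattern.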

Regular Expression

We wish to remove any occurrence of a regular expression match.

In this example, we wish to remove any numerical character from each element of the string column col_1.

df_2 = df %>% 
  mutate(col_2 = str_remove_all(col_1, '\\d'))

Here is how this works:

  • We use the function str_remove_all() (from the stringr package) to remove any numerical character from each element of the string column col_1.
  • str_remove_all() expects a regular expression by default.
  • The regular expression ‘\\d’ captures any numeric character.
  • The output data frame df_2 is a copy of the input df with a new column col_2 that is a transformation of the column col_1 with numeric characters removed.
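The same call can be tried on a small made-up data frame. This is a sketch only; the sample values are assumptions chosen to show digits in different positions.

```r
library(dplyr)
library(stringr)

# Hypothetical sample data, invented for illustration only.
df = tibble(col_1 = c('abc123', 'a1b2c3', 'no digits'))

# '\\d' matches a single digit; str_remove_all() removes every match.
df_2 = df %>%
  mutate(col_2 = str_remove_all(col_1, '\\d'))
# df_2$col_2 is 'abc', 'abc', 'no digits'
```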

First Match

We wish to remove the first occurrence of a given pattern from a given string.

In this example, we wish to remove the first run of one or more consecutive ‘0’ characters from each element of the string column col_1.

df_2 = df %>% 
  mutate(col_2 = str_remove(col_1, '0+'))

Here is how this works:

  • We use the function str_remove(), and not str_remove_all(), since we wish to remove the first match only.
  • The function str_remove() will find and remove the first occurrence of the pattern, which here is '0+' (one or more consecutive ‘0’ characters), from each element of the string input, which here is the column col_1.
  • The output data frame df_2 is a copy of the input df with a new column col_2 that is a transformation of the column col_1 with the first run of ‘0’ characters removed.
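To make the difference between str_remove() and str_remove_all() concrete, here is a minimal sketch on made-up values. The sample data and the extra column name col_all are assumptions for illustration.

```r
library(dplyr)
library(stringr)

# Hypothetical sample data, invented for illustration only.
df = tibble(col_1 = c('100200', '0012', 'abc'))

df_2 = df %>%
  mutate(
    # Removes only the first run of zeros in each string.
    col_2 = str_remove(col_1, '0+'),
    # For comparison: removes every run of zeros.
    col_all = str_remove_all(col_1, '0+'))
# df_2$col_2   is '1200', '12', 'abc'
# df_2$col_all is '12',   '12', 'abc'
```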

Ignore Case

We wish to remove parts of a given string that match a given pattern regardless of the case; i.e. we wish to ignore case while matching.

Substring

df_2 = df %>% 
  mutate(col_2 = str_remove_all(col_1, fixed('A', ignore_case=TRUE)))

Here is how this works:

  • To remove all occurrences of a substring (a plain sequence of one or more characters), we use str_remove_all() as described in Substring above.
  • To ignore case while matching, we wrap the substring in fixed() and pass the parameter ignore_case=TRUE.
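A minimal sketch of the case-insensitive substring removal, run on made-up values (the sample data is an assumption for illustration):

```r
library(dplyr)
library(stringr)

# Hypothetical sample data, invented for illustration only.
df = tibble(col_1 = c('Banana', 'AbA', 'xyz'))

# With ignore_case=TRUE, both 'A' and 'a' are removed.
df_2 = df %>%
  mutate(col_2 = str_remove_all(col_1, fixed('A', ignore_case=TRUE)))
# df_2$col_2 is 'Bnn', 'b', 'xyz'
```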

Regular Expression

df = tibble(
  col_1 = c('AABBCCA', 'AaBCA', 'ABC'))

df_2 = df %>% 
  mutate(col_2 = str_remove_all(col_1, regex('A{2,}', ignore_case=TRUE)))

Here is how this works:

  • To remove all occurrences of a regular expression match, we use str_remove_all() as described in Regular Expression above.
  • To ignore case while matching, we wrap the regular expression in regex() and pass the parameter ignore_case=TRUE.
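Using the data frame defined in the example above, the following sketch shows the result with and without ignore_case=TRUE. The comparison column col_cs is an addition for illustration only.

```r
library(dplyr)
library(stringr)

df = tibble(
  col_1 = c('AABBCCA', 'AaBCA', 'ABC'))

df_2 = df %>%
  mutate(
    # Case-insensitive: 'AA' and 'Aa' both count as two or more A's.
    col_2 = str_remove_all(col_1, regex('A{2,}', ignore_case=TRUE)),
    # Case-sensitive, for comparison: only 'AA' matches.
    col_cs = str_remove_all(col_1, 'A{2,}'))
# df_2$col_2  is 'BBCCA', 'BCA',   'ABC'
# df_2$col_cs is 'BBCCA', 'AaBCA', 'ABC'
```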

Multiple Removals

We wish to remove any occurrence of any of a given set of patterns from a given string.

In this example, we wish to remove any dollar sign ‘$’, comma ‘,’, or whitespace character from each element of the string column col_1.

df_2 = df %>% 
  mutate(col_2 = str_remove_all(col_1, '[\\$,\\s]'))

Here is how this works:

  • We pass to str_remove_all() a regular expression character class, written in square brackets, that matches any one of the characters we wish to find and remove.
  • In this case, the regular expression is '[\\$,\\s]' where:
    • \\$ captures the dollar sign character
    • , captures the comma
    • \\s captures any whitespace character (space, tab, newline, and so on)
  • The output data frame df_2 is a copy of the input df with a new column col_2 that is a transformation of the column col_1 with the specified characters removed.
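Here is a minimal sketch of the character class in action on made-up values (the sample data is an assumption for illustration):

```r
library(dplyr)
library(stringr)

# Hypothetical sample data, invented for illustration only.
df = tibble(col_1 = c('$1,000 000', ' $25 ', '7'))

# Any single '$', ',' or whitespace character is removed wherever it occurs.
df_2 = df %>%
  mutate(col_2 = str_remove_all(col_1, '[\\$,\\s]'))
# df_2$col_2 is '1000000', '25', '7'
```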

Extension: Multiple Character Patterns

df_2 = df %>% 
  mutate(col_2 = str_remove_all(col_1, '\\(.+\\)|\\[.+\\]') %>% str_trim())

Here is how this works:

  • We pass to str_remove_all() a regular expression or’ing the patterns that we wish to find and remove.
  • In this case, the regular expression is '\\(.+\\)|\\[.+\\]' where:
    • \\(.+\\) captures any sequence of characters in parentheses along with the parentheses themselves.
    • | indicates an "or" condition.
    • \\[.+\\] captures any sequence of characters in brackets along with the brackets themselves.
    • .+ will match one or more of any character. Because .+ is greedy, a match runs from the first opening delimiter to the last closing delimiter of the same kind in the string.
  • We add a call to str_trim() at the end to get rid of any leading or trailing whitespace characters left behind by the removal.
  • The output data frame df_2 is a copy of the input df with a new column col_2 that is a transformation of the column col_1 with the specified patterns removed.
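The following sketch runs the pattern on made-up values and also illustrates how greedy matching behaves when a string holds more than one parenthesized part. The sample data and the lazy variant '.+?' are assumptions added for illustration, not part of the original example.

```r
library(dplyr)
library(stringr)

# Hypothetical sample data, invented for illustration only.
df = tibble(col_1 = c('Alpha (beta)', '[draft] Report', 'Plain'))

df_2 = df %>%
  mutate(col_2 = str_remove_all(col_1, '\\(.+\\)|\\[.+\\]') %>% str_trim())
# df_2$col_2 is 'Alpha', 'Report', 'Plain'

# Greedy '.+' spans from the first '(' to the last ')'.
greedy = str_remove_all('a (b) c (d) e', '\\(.+\\)|\\[.+\\]')
# 'a  e'

# Lazy '.+?' stops at the nearest closing delimiter instead.
lazy = str_remove_all('a (b) c (d) e', '\\(.+?\\)|\\[.+?\\]')
# 'a  c  e'
```

Which variant is appropriate depends on whether the text between two bracketed parts should be kept.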

Pattern Column

We wish to remove from each value of a particular column the corresponding value of another column.

Substring

In this example, for each row of the data frame df, we wish to remove each occurrence of the value of the column col_2 from the corresponding value of the column col_1.

df_2 = df %>% 
  mutate(col_3 = str_remove_all(col_1, fixed(col_2)))

Here is how this works:

  • We use str_remove_all() to remove all occurrences of a given pattern.
  • str_remove_all() is vectorized over both the string and the pattern. Therefore, we can pass:
    • A vector of strings from which substrings will be removed, which in this case is the column col_1.
    • A vector of strings representing the patterns to be identified and removed, which in this case is the column col_2.
  • The output data frame df_2 is a copy of the input df with a new column col_3 where each element is the corresponding element of col_1 with any occurrence of the corresponding element of col_2 removed.
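A minimal sketch of the row-wise removal on made-up values (the sample data is an assumption for illustration). The second row shows why fixed() is useful here: ‘.’ is removed as a literal dot rather than treated as the regular expression for any character.

```r
library(dplyr)
library(stringr)

# Hypothetical sample data, invented for illustration only.
df = tibble(
  col_1 = c('apple pie', 'a.b.c', 'hello'),
  col_2 = c('p', '.', 'z'))

# Each element of col_2 is removed from the matching element of col_1.
df_2 = df %>%
  mutate(col_3 = str_remove_all(col_1, fixed(col_2)))
# df_2$col_3 is 'ale ie', 'abc', 'hello'
```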

Regular Expression

In this example, for each row of the data frame df, we wish to remove each sequence of repeated column col_2 values (e.g. remove ‘AA’ and ‘AAA’ but not ‘A’) from the corresponding value of the column col_1.

df_2 = df %>% 
  mutate(col_3 = str_remove_all(col_1, str_c(col_2, '{2,}')))

Here is how this works:

  • We pass to str_remove_all():
    • the strings to look into which in this case is the column col_1 as the first argument.
    • and the patterns to look for and remove which in this case is a transformation of the column col_2 and is naturally of the same size as col_1.
  • We construct the regular expression patterns by appending the quantifier '{2,}' to the value of the column col_2 for each row, which captures two or more repetitions of that value. For instance, if the value of the column col_2 is ‘A’ then the regular expression to remove will be 'A{2,}'. Note that the quantifier applies only to the last character of the value, so this works as described when col_2 holds single characters.
  • The output data frame df_2 is a copy of the input df with a new column col_3 where each element is the corresponding element of col_1 with any matches of the specified regular expression constructed from the corresponding value of the column col_2 removed.
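Here is a minimal sketch of the row-wise pattern construction on made-up values. The sample data is an assumption for illustration, and the non-capturing group at the end is a suggested variation for multi-character values rather than part of the original example.

```r
library(dplyr)
library(stringr)

# Hypothetical sample data, invented for illustration only.
df = tibble(
  col_1 = c('AABAAA', 'xyyz', 'ABAB'),
  col_2 = c('A', 'y', 'A'))

# The per-row patterns are 'A{2,}', 'y{2,}' and 'A{2,}'.
df_2 = df %>%
  mutate(col_3 = str_remove_all(col_1, str_c(col_2, '{2,}')))
# df_2$col_3 is 'B', 'xz', 'ABAB'

# For a multi-character value, wrapping it in a non-capturing group makes
# the quantifier apply to the whole value: '(?:ab){2,}'.
grouped = str_remove_all('ababX', str_c('(?:', 'ab', '){2,}'))
# 'X'
```

If the values of col_2 may contain regular expression metacharacters, they would need escaping before being used this way.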