Removing

We wish to remove parts of a given string that match a given pattern.

Substring

We wish to remove every occurrence of a substring, i.e. a plain sequence of one or more characters.

In this example, we wish to remove the ‘$’ sign from each element of the string column col_1.

df_2 = df.assign(
    col_2 = df['col_1'].str.replace('$', '', regex=False)
)

Here is how this works:

  • We use str.replace() to find all occurrences of the substring that we wish to remove and replace them with an empty string ‘’ effectively removing it. See Replacing.
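  • We set regex=False because we wish to match a plain character sequence and not a regular expression. This matters here because ‘$’ is a regular expression metacharacter (the end-of-string anchor). The default value of regex changed from True to False in pandas 2.0, so passing it explicitly keeps the behavior the same across versions.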
  • The output data frame df_2 is a copy of the input df with a new column col_2 that is a transformation of the column col_1 with the ‘$’ sign removed.
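As a quick check of the recipe above, the following sketch applies it to a small made-up data frame. The sample values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical sample data, not taken from the original text.
df = pd.DataFrame({'col_1': ['$1,200', '$35', 'free']})

df_2 = df.assign(
    # regex=False treats '$' as a literal character, not as the end-of-string anchor.
    col_2 = df['col_1'].str.replace('$', '', regex=False)
)

print(df_2['col_2'].tolist())  # ['1,200', '35', 'free']
```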

Regular Expression

We wish to remove any occurrence of a regular expression match.

In this example, we wish to remove any numerical character from each element of the string column col_1.

df_2 = df.assign(
    col_2 = df['col_1'].str.replace(r'\d', '', regex=True)
)

Here is how this works:

  • We use the function str.replace() to remove any numerical character from each element of the string column col_1. See Replacing.
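  • In pandas versions before 2.0, str.replace() treated the pattern as a regular expression by default; since 2.0 the default is regex=False. We therefore pass regex=True explicitly.
  • The regular expression ‘\d’ matches any single digit. We write it as a raw string (r'\d') so that Python does not try to interpret the backslash as an escape sequence.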
  • The output data frame df_2 is a copy of the input df with a new column col_2 that is a transformation of the column col_1 with numeric characters removed.
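To see the effect on concrete values, here is a minimal sketch using a made-up column (the sample strings are assumptions, not from the original text):

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({'col_1': ['a1b22', 'no digits', '2024-01']})

df_2 = df.assign(
    # r'\d' matches one digit at a time; every match is replaced by ''.
    col_2 = df['col_1'].str.replace(r'\d', '', regex=True)
)

print(df_2['col_2'].tolist())  # ['ab', 'no digits', '-']
```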

First Match

We wish to remove the first occurrence of a given pattern from a given string.

In this example, we wish to remove the first occurrence of a sequence of zero characters from each element of the string column col_1.

df_2 = df.assign(
    col_2 = df['col_1'].str.replace(r'0+', '', n=1, regex=True)
)

Here is how this works:

  • We use the function str.replace() while setting its argument n to n=1 to remove the first match only. See Replacing.
  • The regular expression in this example is '0+' which matches any sequence of 1 or more zero characters.
  • The output data frame df_2 is a copy of the input df with a new column col_2 that is a transformation of the column col_1 with the first occurrence of a sequence of zero characters removed.
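The difference between removing the first match and removing all matches is easiest to see side by side. The sketch below uses made-up values and a hypothetical extra column col_3 that is not part of the recipe above.

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({'col_1': ['1000200', '0500', 'abc']})

df_2 = df.assign(
    # n=1: only the first run of zeros in each string is removed.
    col_2 = df['col_1'].str.replace(r'0+', '', n=1, regex=True),
    # Default n=-1: every run of zeros is removed (shown for comparison).
    col_3 = df['col_1'].str.replace(r'0+', '', regex=True)
)

print(df_2['col_2'].tolist())  # ['1200', '500', 'abc']
print(df_2['col_3'].tolist())  # ['12', '5', 'abc']
```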

Ignore Case

We wish to remove parts of a given string that match a given pattern regardless of the case; i.e. we wish to ignore case while matching.

Substring

df_2 = df.assign(
    col_2 = df['col_1'].str.replace('A', '', regex=False, case=False)
)

Here is how this works:

  • To remove all occurrences of a substring (a plain sequence of one or more characters), we use str.replace() while setting regex=False as described in Substring above.
  • To ignore case while matching, we pass the argument case=False to str.replace().
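If we prefer not to depend on how the installed pandas version combines regex=False with case=False, one alternative (an assumption on our part, not part of the recipe above) is to escape the substring with re.escape() and run a case-insensitive regular expression match instead. A minimal sketch with made-up data:

```python
import re

import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({'col_1': ['A.b test', 'a.B', 'aXb']})

# re.escape() turns 'a.b' into a pattern that matches the literal characters only,
# so the '.' does not act as a wildcard.
literal = re.escape('a.b')

df_2 = df.assign(
    col_2 = df['col_1'].str.replace(literal, '', regex=True, case=False)
)

print(df_2['col_2'].tolist())  # [' test', '', 'aXb']
```

The last value is unchanged because ‘aXb’ does not contain the literal substring ‘a.b’ in any case.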

Regular Expression

df_2 = df.assign(
    col_2 = df['col_1'].str.replace('A{2,}', '', regex=True, case=False)
)

Here is how this works:

  • To remove all occurrences of a regular expression match, we use str.replace() as described in Regular Expression above.
  • To ignore case while matching, we pass the argument case=False to str.replace().
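The following sketch runs the case-insensitive regular expression on made-up values. It also shows the flags argument of str.replace() with re.IGNORECASE, which should give the same result as case=False; the extra column col_3 is hypothetical and only there for comparison.

```python
import re

import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({'col_1': ['aAbA', 'baaad', 'AAA']})

df_2 = df.assign(
    # 'A{2,}' with case=False matches two or more consecutive 'a' or 'A' characters.
    col_2 = df['col_1'].str.replace('A{2,}', '', regex=True, case=False),
    # Same match expressed through the regex flag instead of case=False.
    col_3 = df['col_1'].str.replace('A{2,}', '', regex=True, flags=re.IGNORECASE)
)

print(df_2['col_2'].tolist())  # ['bA', 'bd', '']
```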

Multiple Removals

We wish to remove any occurrence of any of a given set of patterns from a given string.

In this example, we wish to remove any dollar sign ($), comma (,), or whitespace character from each element of the string column col_1.

df_2 = df.assign(
    col_2 = df['col_1'].str.replace(r'[\$,\s]', '', regex=True)
)

Here is how this works:

  • We pass to str.replace() a regular expression or’ing the patterns that we wish to find and remove.
  • In this case, the regular expression is r'[\$,\s]' where:
    • \$ matches the dollar sign character
    • , matches the comma
    • \s matches any whitespace character (space, tab, newline, etc.)
    • [] The square brackets specify that we wish to match any single character that is contained within the brackets
  • The output data frame df_2 is a copy of the input df with a new column col_2 that is a transformation of the column col_1 with the specified characters removed.
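A typical use of this pattern is cleaning formatted amounts before converting them to numbers. Here is a minimal sketch with made-up values:

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({'col_1': ['$1,200 USD', ' $35', '7,000,000']})

df_2 = df.assign(
    # The character class removes '$', ',' and any whitespace in a single pass.
    col_2 = df['col_1'].str.replace(r'[\$,\s]', '', regex=True)
)

print(df_2['col_2'].tolist())  # ['1200USD', '35', '7000000']
```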

Extension: Multiple Character Patterns

df_2 = df.assign(
    col_2 = df['col_1'].str.replace(r'\(.+\)|\[.+\]', '', regex=True).str.strip()
)

Here is how this works:

  • We pass to str.replace() a regular expression or’ing the patterns that we wish to find and remove.
  • In this case, the regular expression is r'\(.+\)|\[.+\]' where:
    • \(.+\) matches any sequence of characters in parentheses along with the parentheses themselves.
    • | indicates an "or" condition.
    • \[.+\] matches any sequence of characters in brackets along with the brackets themselves.
    • .+ will match one or more of any character. Because + is greedy, a match extends to the last closing parenthesis (or bracket) it can reach, which matters when a string contains more than one pair.
  • We add a call to str.strip() at the end to get rid of any leading or trailing whitespace characters left behind by the removal.
  • The output data frame df_2 is a copy of the input df with a new column col_2 that is a transformation of the column col_1 with the specified patterns removed.
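The sketch below applies the pattern to made-up values, including one string with two parenthesized groups. For comparison it also shows a non-greedy variant (.+?), which is our own addition and not part of the recipe above; the column col_3 is hypothetical.

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({'col_1': ['Paris (France)', '[draft] Report', 'a (b) c (d)']})

df_2 = df.assign(
    # Greedy: '(b) c (d)' is removed as one match in the last row.
    col_2 = df['col_1'].str.replace(r'\(.+\)|\[.+\]', '', regex=True).str.strip(),
    # Non-greedy: '(b)' and '(d)' are removed separately, keeping ' c ' in between.
    col_3 = df['col_1'].str.replace(r'\(.+?\)|\[.+?\]', '', regex=True).str.strip()
)

print(df_2['col_2'].tolist())  # ['Paris', 'Report', 'a']
print(df_2['col_3'].tolist())  # ['Paris', 'Report', 'a  c']
```

Which variant is appropriate depends on whether the text between two bracketed groups should be kept.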

Pattern Column

We wish to remove from each value of a particular column the corresponding value of another column.

Substring

In this example, for each row of the data frame df, we wish to remove every occurrence of the value of the column col_2 from the corresponding value of the column col_1.

df_2 = df.assign(
    col_3 = df.apply(lambda x: x['col_1'].replace(x['col_2'], ''), axis=1)
)

Here is how this works:

  • The Pandas function str.replace() that we used for most scenarios above can’t accept a Series (or list) of patterns. Therefore, we use apply() with axis=1 to perform a row-wise (non-vectorized) operation. See Non-Vectorized Transformation.
  • Inside the lambda, x['col_1'] is a plain Python string, so we call the built-in string method replace(old, new) rather than the Pandas str.replace(). The built-in method always matches a literal substring (it has no regex or case argument) and replaces every occurrence by default. See Replacing.
  • The output data frame df_2 is a copy of the input df with a new column col_3 where each element is the corresponding element of col_1 with any occurrence of the corresponding element of col_2 removed.
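Here is a minimal sketch of the row-wise removal on made-up data. It also shows an equivalent list comprehension over zip(), which is our own alternative and not part of the recipe above; the name col_3_zip is hypothetical.

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({
    'col_1': ['apple pie', 'banana', 'cherry'],
    'col_2': ['p', 'an', 'z'],
})

df_2 = df.assign(
    # Row-wise: remove the col_2 value from the col_1 value of the same row.
    col_3 = df.apply(lambda x: x['col_1'].replace(x['col_2'], ''), axis=1),
    # Same result without apply(): pair up the two columns element by element.
    col_3_zip = [s.replace(p, '') for s, p in zip(df['col_1'], df['col_2'])]
)

print(df_2['col_3'].tolist())  # ['ale ie', 'ba', 'cherry']
```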

Regular Expression

In this example, for each row of the data frame df, we wish to remove every run of two or more consecutive repetitions of the col_2 value (e.g. remove ‘AA’ and ‘AAA’ but not ‘A’) from the corresponding value of the column col_1.

import re

df_2 = df.assign(
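    # re.escape() keeps the col_2 value literal; (?:...) makes {2,} apply to the whole value
    col_3 = df.apply(lambda x: re.sub('(?:' + re.escape(x['col_2']) + '){2,}', '', x['col_1']), axis=1)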
)

Here is how this works:

  • The Pandas function str.replace() that we used for most scenarios above can’t accept a Series (or list) of patterns. Therefore, we use apply() with axis=1 to perform a row-wise (non-vectorized) operation. See Non-Vectorized Transformation.
  • We are using the function re.sub() from the Python re module. Note that the order of arguments to re.sub() is different from str.replace(), i.e. re.sub() has the signature re.sub(pattern, repl, string). See Replacing.
  • The output data frame df_2 is a copy of the input df with a new column col_3 where each element is the corresponding element of col_1 with any matches of the specified regular expression constructed from the corresponding value of the column col_2 removed.
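When the pattern is built from column values, those values may contain regular expression metacharacters or more than one character. The sketch below, on made-up data, assumes the col_2 values are meant literally: it escapes each value with re.escape() and wraps it in a non-capturing group so that {2,} applies to the whole value.

```python
import re

import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({
    'col_1': ['xAAyA', 'b++b+', 'xababab y ab'],
    'col_2': ['A', '+', 'ab'],
})

def remove_repeats(row):
    # Build e.g. '(?:ab){2,}' so that only runs of two or more repetitions match.
    pattern = '(?:' + re.escape(row['col_2']) + '){2,}'
    return re.sub(pattern, '', row['col_1'])

df_2 = df.assign(col_3 = df.apply(remove_repeats, axis=1))

print(df_2['col_3'].tolist())  # ['xyA', 'bb+', 'x y ab']
```

If the col_2 values are instead meant to be regular expressions themselves, the re.escape() call would need to be dropped.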