We wish to remove parts of a given string that match a given pattern.
We wish to remove any occurrence of a substring i.e. a plain sequence of one or more characters.
In this example, we wish to remove the ‘$’ sign from each element of the string column col_1.
df_2 = df.assign(
col_2 = df['col_1'].str.replace('$', '', regex=False)
)
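To make the effect concrete, the sketch below runs the same call on a small made-up data frame (the sample values are assumptions for illustration, not data from the original). It also uses Python's re module to show why a literal match matters here: read as a regular expression, '$' is an end-of-string anchor rather than a dollar sign.

```python
import re

import pandas as pd

# Hypothetical sample data, for illustration only
df = pd.DataFrame({'col_1': ['$10', '$2,500', '7$ off']})

# Literal (non-regex) removal of the dollar sign
df_2 = df.assign(
    col_2=df['col_1'].str.replace('$', '', regex=False)
)
print(df_2['col_2'].tolist())  # ['10', '2,500', '7 off']

# As a regular expression, '$' only matches the (empty) end of the string,
# so a regex substitution with the same pattern leaves the text unchanged.
unchanged = re.sub('$', '', '$10')
print(unchanged)  # $10
```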
Here is how this works:
We use str.replace() to find all occurrences of the substring that we wish to remove and replace them with an empty string ‘’, effectively removing them. See Replacing.
We pass regex=False because we wish to match a plain character sequence and not a regular expression. Regular expression matching was the default for str.replace() before pandas 2.0 (from 2.0 onward the default is regex=False), so passing the argument explicitly gives the same result on either version.
The output df_2 is a copy of the input df with a new column col_2 that is a transformation of the column col_1 with the ‘$’ sign removed.

We wish to remove any occurrence of a regular expression match.
In this example, we wish to remove any numerical character from each element of the string column col_1.
df_2 = df.assign(
col_2 = df['col_1'].str.replace(r'\d', '', regex=True)  # raw string keeps the backslash intact
)
Here is how this works:
We use str.replace() to remove any numerical character from each element of the string column col_1. See Replacing.
In pandas versions before 2.0, str.replace() expects a regular expression by default; from 2.0 onward the default is regex=False. It is, therefore, safer to pass regex=True explicitly.
The pattern ‘\d’ matches any numeric character.
The output df_2 is a copy of the input df with a new column col_2 that is a transformation of the column col_1 with numeric characters removed.

We wish to remove the first occurrence of a given pattern from a given string.
In this example, we wish to remove the first occurrence of a sequence of one or more ‘0’ characters from each element of the string column col_1.
df_2 = df.assign(
col_2 = df['col_1'].str.replace('0+', '', n=1, regex=True)
)
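As a rough illustration of what n=1 changes, the sketch below removes runs of zeros from a made-up column twice: once stopping after the first match and once with the default n=-1, which removes every match. The sample values and the column names first_only and all_runs are assumptions chosen for this example.

```python
import pandas as pd

# Hypothetical sample data, for illustration only
df = pd.DataFrame({'col_1': ['0012003', '500', 'abc']})

df_2 = df.assign(
    # n=1: only the first run of zeros in each value is removed
    first_only=df['col_1'].str.replace('0+', '', n=1, regex=True),
    # default n=-1: every run of zeros is removed
    all_runs=df['col_1'].str.replace('0+', '', regex=True),
)
print(df_2['first_only'].tolist())  # ['12003', '5', 'abc']
print(df_2['all_runs'].tolist())    # ['123', '5', 'abc']
```

Values without a match, such as 'abc', pass through unchanged in both cases.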
Here is how this works:
We use str.replace() while setting its argument n to n=1 to remove the first match only. See Replacing.
The regular expression we use is '0+', which matches any sequence of one or more zero characters.
The output df_2 is a copy of the input df with a new column col_2 that is a transformation of the column col_1 with the first occurrence of a sequence of zero characters removed.

We wish to remove parts of a given string that match a given pattern regardless of the case; i.e. we wish to ignore case while matching.
Substring
df_2 = df.assign(
col_2 = df['col_1'].str.replace('A', '', regex=False, case=False)
)
Here is how this works:
We use str.replace() while setting regex=False as described in Substring above.
We pass case=False to str.replace() so that matching ignores case.

Regular Expression
df_2 = df.assign(
col_2 = df['col_1'].str.replace('A{2,}', '', regex=True, case=False)
)
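The sketch below applies the same case-insensitive pattern to a few made-up values and, as an alternative spelling, passes the standard re.IGNORECASE flag through the flags argument of str.replace(). The sample data and the extra column col_3 are assumptions for illustration.

```python
import re

import pandas as pd

# Hypothetical sample data, for illustration only
df = pd.DataFrame({'col_1': ['baaad', 'AAb', 'aAa', 'Ab']})

df_2 = df.assign(
    # Two or more consecutive 'A' or 'a' characters are removed
    col_2=df['col_1'].str.replace('A{2,}', '', regex=True, case=False),
    # Alternative spelling of the same request using a regex flag
    col_3=df['col_1'].str.replace('A{2,}', '', regex=True, flags=re.IGNORECASE),
)
print(df_2['col_2'].tolist())  # ['bd', 'b', '', 'Ab']
```

The single 'A' in 'Ab' is kept because the pattern requires at least two repetitions.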
Here is how this works:
We use str.replace() as described in Regular Expression above.
We pass case=False to str.replace() so that matching ignores case.

We wish to remove any occurrence of any of a given set of patterns from a given string.
In this example, we wish to remove any dollar sign $, comma character , or white space character from each element of the string column col_1.
df_2 = df.assign(
col_2 = df['col_1'].str.replace(r'[\$,\s]', '', regex=True)
)
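A small sketch of the character class on made-up price strings follows. It also checks a detail of regular expression syntax: inside square brackets the dollar sign is already literal, so the backslash before it is optional. The sample values and the extra column col_3 are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample data, for illustration only
df = pd.DataFrame({'col_1': ['$1, 200', ' $3,000 ', '45']})

df_2 = df.assign(
    # Remove every dollar sign, comma, and white space character
    col_2=df['col_1'].str.replace(r'[\$,\s]', '', regex=True),
    # Same class without escaping '$'; it is literal inside brackets
    col_3=df['col_1'].str.replace(r'[$,\s]', '', regex=True),
)
print(df_2['col_2'].tolist())  # ['1200', '3000', '45']
```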
Here is how this works:
We pass to str.replace() a regular expression that matches any of the patterns that we wish to find and remove.
The regular expression is '[\$,\s]' where:
\$ matches the dollar sign.
, matches the comma.
\s matches any white space character.
[] The square brackets specify that we wish to match any single character that is contained within the brackets.
The output df_2 is a copy of the input df with a new column col_2 that is a transformation of the column col_1 with the specified characters removed.

Extension: Multiple Character Patterns
df_2 = df.assign(
col_2 = df['col_1'].str.replace(r'\(.+\)|\[.+\]', '', regex=True).str.strip()
)
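One behavior worth checking on your own data: .+ is greedy, so when a value holds two parenthesized parts, the pattern matches from the first opening parenthesis to the last closing one and removes the text in between as well. The sketch below contrasts it with a non-greedy variant (.+?), which is an assumption added here for illustration and not part of the original recipe; the sample values and column names are likewise made up.

```python
import pandas as pd

# Hypothetical sample data, for illustration only
df = pd.DataFrame({'col_1': ['Paris (FR) and Rome (IT)', 'Lima [PE]']})

greedy = r'\(.+\)|\[.+\]'    # pattern used in the text above
lazy = r'\(.+?\)|\[.+?\]'    # illustrative non-greedy variant

df_2 = df.assign(
    col_greedy=df['col_1'].str.replace(greedy, '', regex=True).str.strip(),
    col_lazy=(
        df['col_1']
        .str.replace(lazy, '', regex=True)
        .str.replace(r'\s+', ' ', regex=True)  # collapse the gaps left behind
        .str.strip()
    ),
)
print(df_2['col_greedy'].tolist())  # ['Paris', 'Lima']
print(df_2['col_lazy'].tolist())    # ['Paris and Rome', 'Lima']
```

Which behavior is wanted depends on the data; if each value contains at most one bracketed part, the two patterns give the same result.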
Here is how this works:
We pass to str.replace() a regular expression that matches either of the patterns that we wish to find and remove.
The regular expression is '\(.+\)|\[.+\]' where:
\(.+\) matches any sequence of characters in parentheses along with the parentheses themselves.
| indicates an "or" condition.
\[.+\] matches any sequence of characters in brackets along with the brackets themselves.
.+ matches one or more of any character.
We call str.strip() at the end to get rid of any leading or trailing white space characters left behind by the removal.
The output df_2 is a copy of the input df with a new column col_2 that is a transformation of the column col_1 with the specified patterns removed.

We wish to remove from each value of a particular column the corresponding value of another column.
Substring
In this example, for each row of the data frame df, we wish to remove each occurrence of the value of the column col_2 from the corresponding value of the column col_1.
df_2 = df.assign(
col_3 = df.apply(lambda x: x['col_1'].replace(x['col_2'], ''), axis=1)
)
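The sketch below runs the row-wise removal on made-up data and, next to it, an alternative that pairs the two columns with zip() in a list comprehension. The alternative is an assumption added for illustration (it avoids building a row object for every row), as are the sample values and the column col_4.

```python
import pandas as pd

# Hypothetical sample data, for illustration only
df = pd.DataFrame({
    'col_1': ['apple pie', 'banana split'],
    'col_2': ['p', 'an'],
})

df_2 = df.assign(
    # Row-wise removal with apply(), as in the text above
    col_3=df.apply(lambda x: x['col_1'].replace(x['col_2'], ''), axis=1),
    # Illustrative alternative: iterate over the two columns in parallel
    col_4=[text.replace(sub, '') for text, sub in zip(df['col_1'], df['col_2'])],
)
print(df_2['col_3'].tolist())  # ['ale ie', 'ba split']
```

Both versions assume that neither column contains missing values; a missing value would need to be handled before calling replace().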
Here is how this works:
The pandas str.replace() that we used for most scenarios above can’t accept a Series (or list) of patterns. Therefore, we use apply() with axis=1 to perform a row wise (non-vectorized) operation. See Non-Vectorized Transformation.
Inside the lambda, we use the replace() method of Python's built-in str type. It takes the pattern and the replacement in the same order as the pandas method, but it always matches a plain substring and has no regex or case arguments. See Replacing.
The output df_2 is a copy of the input df with a new column col_3 where each element is the corresponding element of col_1 with any occurrence of the corresponding element of col_2 removed.

Regular Expression
In this example, for each row of the data frame df, we wish to remove each sequence of repeated column col_2 values (e.g. remove ‘AA’ and ‘AAA’ but not ‘A’) from the corresponding value of the column col_1.
import re
df_2 = df.assign(
col_3 = df.apply(lambda x: re.sub((x['col_2'] + '{2,}'), '', x['col_1']), axis=1)
)
Here is how this works:
The pandas str.replace() that we used for most scenarios above can’t accept a Series (or list) of patterns. Therefore, we use apply() with axis=1 to perform a row wise (non-vectorized) operation. See Non-Vectorized Transformation.
We use re.sub() from the Python re module. Note that the order of arguments to re.sub() is different from str.replace(), i.e. re.sub() has the signature re.sub(pattern, repl, string). See Replacing.
The output df_2 is a copy of the input df with a new column col_3 where each element is the corresponding element of col_1 with any matches of the specified regular expression constructed from the corresponding value of the column col_2 removed.
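
Building a pattern by concatenating a column value with '{2,}' works for single ordinary characters, but two cases may need care: a value that is itself a regex metacharacter (such as '.'), and a value longer than one character, where the quantifier would only apply to its last character. The sketch below is one possible way to handle both, using re.escape() and a non-capturing group; the helper name remove_repeats is hypothetical and not part of the original recipe.

```python
import re


def remove_repeats(text: str, token: str) -> str:
    """Remove runs of two or more consecutive copies of `token` from `text`."""
    # re.escape() makes metacharacters in `token` literal; the non-capturing
    # group makes the quantifier apply to the whole token, not its last character.
    pattern = '(?:' + re.escape(token) + '){2,}'
    return re.sub(pattern, '', text)


print(remove_repeats('xAAAyAz', 'A'))     # xyAz
print(remove_repeats('a..b.c', '.'))      # ab.c
print(remove_repeats('ababab-ab', 'ab'))  # -ab

# Without escaping, '.' means "any character", so the whole string is removed
naive = re.sub('.' + '{2,}', '', 'a..b.c')
print(repr(naive))  # ''
```

Under these assumptions, the helper could replace the body of the lambda, e.g. df.apply(lambda x: remove_repeats(x['col_1'], x['col_2']), axis=1).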