We wish to extract a substring that matches a given pattern from a given string.
Substring extraction via a regular expression is one of the most powerful and frequently used techniques in data manipulation, especially for data cleaning.
We will look at three scenarios of substring extraction via regular expressions:
For each of these scenarios, we will cover two cases:
In addition, we cover the following three scenarios which can be applied to extend any of the above:
First Match
We wish to extract the first match of a regular expression pattern from a given string.
In this example, we wish to extract the first occurrence of a number (a sequence of digits) followed
by the letters ‘kg’
with zero or more empty space characters between.
df_2 = df.assign(
col_2 = df['col_1'].str.extract('(\d+\s*kg)', expand=False))
Here is how this works:
str.extract()
method from the str
accessor set of string manipulation
methods of Pandas Series
to extract a substring that matches a given regular expression from
each value of the column col_1
and return a vector of the same size.'\d+\s*kg'
which breaks down as follows:\d+
matches one or more numeric characters.\s*
matches zero or more empty space characterskg
matches the string kg
.str.extract()
only works with capture groups, so we need to wrap the entire pattern in a capture
group. In this example (\d+\s*kg)
.expand=False
to get a Series and not a Data Frame.df_2
will be a copy of the input data frame df
with an added
column col_2
holding the extracted substrings from the corresponding value of the
column col_1
.All Matches
We wish to extract all matches of a regular expression pattern from a given string.
In this example, we wish to extract all occurrences of a number (a sequence of digits) followed by
the letters ‘kg’
with zero or more empty space characters between. We then wish to combine the
extracted values into a single string with comma separators.
df_2 = df.assign(
col_2=df['col_1'].str.findall('\d+\s*kg'),
col_3=lambda x: x['col_2'].map(', '.join))
Here is how this works:
str.findall()
method from the str
accessor set of string manipulation
methods of Pandas Series
to extract all occurrences of a given regular expression from each
value of the column col_1
.str.findall()
is a list of lists where each list holds the matches extracted from
the corresponding value of the column col_1
.', '.join)
to reduce each vector of list to a single string where the extracted
matches are separated by commas.map()
to iterate over the lists of strings extracted from each element of col_1
.
See Working with Lists.join
.map()
function to iterate over the lists.
See Working with Lists.Extension: Nth Match
We wish to extract the nth match of a regular expression pattern from a given string.
df_2 = df.assign(
col_2=df['col_1'].str.findall('\d+').str[1])
Here is how this works:
findall()
to extract all pattern matches from each element of the column col_1
;
as described in the primary solution above..str[]
or .str.get()
. No explicit mapping or looping is
required which makes for elegant code.Extension: Last Match
We wish to extract the last match of a regular expression pattern from a given string.
df_2 = df.assign(
col_2=df['col_1'].str.findall('\d+').str[-1])
Here is how this works:
findall()
to extract all pattern matches from each element of the column col_1
;
as described in the primary solution above..str[]
or .str.get()
and we pass -1
as an index to get
the last match. No explicit mapping or looping is required which makes for elegant code.First Match
We wish to extract a part, specified by a capture group, of the first match of a regular expression pattern from a given string.
In this example, we wish to extract the first occurrence of a number (a sequence of digits) that is
followed by the letters ‘kg’
with zero or more empty space characters between. Note we only wish
to extract the number component.
df_2 = df.assign(
col_2=df['col_1'].str.extract('(\d+)\s*kg'))
Here is how this works:
str.extract()
method from the str
accessor set of string manipulation
methods of Pandas Series
to extract the part of the matching substring that corresponds to the
first group defined in the regular expression for each value in the column col_1
.‘(\d+)\s*kg’
which works as follows:(\d+)
defines a capture group that contains a sequence of digits. A capture group is a part
of a regular expression that can be referred to.\s*kg
denotes a pattern where zero or more empty space characters are followed by the
string ‘kg’
.'kg'
but we do not wish to include 'kg'
in the output.df_2
will be a copy of the input data frame df
with an added
column col_2
holding the extracted substrings (corresponding to the first capture group) from
the corresponding value of the column col_1
.Extension: nth Capture Group
df_2 = df.assign(
col_2=df['col_1'].str.extract('(X+)(Y+)').iloc[:, 1])
Here is how this works:
str.extract()
returns a data frame.col_1
by
using iloc[:, 1]
to extract the second column by position.
See Basic Selecting.df_2
will be a copy of the input data frame df
with an added
column col_2
holding the extracted substrings (corresponding to the second capture group) from
the corresponding value of the column col_1
.All Matches
We wish to extract a part, specified by a capture group, of each match of a regular expression pattern from a given string.
In this example, we wish to extract and add up all occurrences of a number (a sequence of digits)
that is followed by the letters ‘kg’
(with no or more empty space characters between). Note we
only wish to extract the number component.
df_2 = df.assign(
col_2=df['col_1'] \
.str.extractall('(\d+)\s*kg') \
.groupby(level=0) \
.apply(lambda x: x.iloc[:, 0].astype(int).sum()))
Here is how this works:
str.extractall()
method from the str
accessor set of string manipulation
methods of Pandas Series
to extract all matches of a regular expression as well as all matches
of each capture group from each element of the input column col_1
.str.extractall()
takes regex with at least one capture group ()
.str.extractall()
method returns every match. The result of str.extractall()
is always a
DataFrame with a MultiIndex
of two levels on its rows. The second level of the MultiIndex
is
named match and indicates the order in the subject.MultiIndex
, we use groupby(level=0)
followed
by apply()
to loop over the first level of the row of the MultiIndex
; i.e. apply()
receives
sub-data-frames each holding the matches obtained from one element of the column col_1
.iloc[:,0]
astype(int)
.sum()
.First Match
We wish to extract multiple parts, specified by multiple capture groups, of the first match of a regular expression pattern from a given string.
In this example, we wish to extract two parts of the first occurrence of a string pattern; the first
representing a numerical quantity and the second representing the unit from each value of the
column col_1
, and store each in its own column.
df_2 = df.copy()
df_2[['value', 'unit']] = df['col_1'].str.extract(r'(\d+)(\w+)\b')
Here is how this works:
str.extract()
method from the str
accessor set of string manipulation
methods of Pandas Series
to extract the first substrings matching each capture group from each
element in the column col_1
.str.extract()
returns a DataFrame
with one column per group.str.extract()
to
two columns that we can name df[['value', 'unit']
.
See Multi-Value Transformationr'(\d+)(\w+)\b'
which works as
follows:(\d+)
is capture group that matches a sequence of one or more digits.(\w+)
represents one or more ‘word’ character; i.e. letters, digits or underscore.df_2
is a copy of the input data frame df
with two columns added value
and unit
containing the values captured by the two capture groups in the given regular
expression.Extension: Arbitrarily Many Capture Groups
df_2 = pd.concat([df,
df['col_1'].str.extract(r'(\d+)(\w+)\b')],
axis=1)
Here is how this work:
pd.concat()
to add a set of arbitrarily many columns to a data frame.
See Multi-Value Transformation
and Appending.str.extract()
will be 0, 1, 2
, etc. If
needed, you may apply a function to rename these to the desired names.
See Implicit Renaming.All Matches
We wish to extract from a given string all matches of multiple parts, specified as capture groups, of a given regular expression.
In this example, each element of the input column is a string that takes the
form "a1 * b1 + a2 * b2 … + an * bn”
where the a1, a2,..
and b1, b2,..
are each a sequence of
digits. We wish to parse and evaluate each of these expressions and store the result in a new
column col_2
.
def sum_prods(p_df):
return p_df.astype(int).prod(axis=1).sum()
df_2 = df.assign(
col_2=df['col_1'] \
.str.extractall(r'(\d+)\s*\*\s*(\d+)') \
.groupby(level=0) \
.apply(sum_prods))
Here is how this work:
str.extractall()
to extract all matches of all capture groups in the given regular
expression for each element of the column col_1
as follows;str.extractall()
takes regex with at least one capture group ()
a DataFrame
with one row
for each match, and one column for each group.str.extractall()
is always a DataFrame with a MultiIndex
on its rows. The
last level of the MultiIndex
is named match and indicates the order in the subject.MultiIndex
, we use groupby(level=0)
followed
by apply()
to loop over the first level of the row of the MultiIndex
; i.e. apply()
receives
sub-data-frames each holding the matches obtained from one element of the column col_1
.sum_prods()
astype(int)
.prod(axis=1)
.sum()
.df_2
is a copy of the input data frame df
with an added column col_2
holding the numeric value resulting from processing the corresponding string value of the
column col_1
.We wish to extract the first match of a regular expression pattern from a given string while ignoring case.
In this example, we wish to extract the first occurrence of a number (a sequence of digits) followed
by the letters ‘kg’
regardless of case with zero or more empty space characters between.
import re
df_2 = df.assign(
col_2=df['col_1'].str.extract('(\d+\s*kg)', expand=False, flags=re.IGNORECASE))
Here is how this works:
str.extract()
as described in Entire Pattern above.flags=re.IGNORECASE
to str.extract()
.
See Ignore Case
under Detecting for more details.We wish to match a column of strings against a column of patterns of the same size. This is often needed when the regular expression we are matching needs to be different for each row based on the value of a particular column.
In this example, we have a data frame df
with two column col_1
and col_2
, we wish to extract
from each value in col_1
a substring composed of a sequence of digits followed by a set of
characters representing a unit specified by the corresponding value of the column col_2
.
import re
def extract_string(string, pattern):
match = re.search(pattern, string)
if match:
return match.group()
return None
df_2 = df.assign(
col_3=df.apply(lambda x: extract_string(x['col_1'], '(\d+)\s*' + x['col_2']), axis=1))
Here is how this works:
str.extract()
is not vectorized over the pattern, so we use search()
function from re
package which takes a string literal and a regular expression as arguments.extract_string()
that takes a string and a pattern and returns the
first substring that matches the given pattern.apply()
with axis=1
to apply a lambda
in a row wise manner to each row of
the data frame df
.
See Non-Vectorized Transformation.extract_string()
:col_1
as the first argument.col_2
and
is naturally of the same size as col_1
.col_2
to the
regular expression '(\d+)\s*'
for each row. For instance, if the value of the column col_2
is ‘kg’
then the value of the regular expression to extract will be '(\d+)\s*kg'
.df_2
will be a copy of the input data frame df but with an additional
column col_3
containing the substrings extracted.