Extracting by Regular Expression

We wish to extract a substring that matches a given pattern from a given string.

Substring extraction via a regular expression is one of the most powerful and frequently used techniques in data manipulation, especially for data cleaning.

We will look at three scenarios of substring extraction via regular expressions:

  • Entire Pattern where we cover how to extract a substring from a given string that matches a given regular expression.
  • One Capture Group where we cover how to match a given regular expression pattern to a given string and then extract part of the pattern commonly referred to as a capture group.
  • Multiple Capture Groups where we cover how to extract multiple parts of a pattern; i.e. multiple capture groups.

For each of these scenarios, we will cover two cases:

  • How to extract the First Match
  • How to extract All Matches

In addition, we cover the following three scenarios which can be applied to extend any of the above:

  • Ignore Case where we cover how to ignore the case (of both pattern and string) while matching.
  • Pattern Column where we cover how to match a vector of strings against a vector of patterns of the same size. This is quite often needed when we wish to locate the value of a column in another column for each row.

Entire Pattern

First Match

We wish to extract the first match of a regular expression pattern from a given string.

In this example, we wish to extract the first occurrence of a number (a sequence of digits) followed by the letters ‘kg’ with zero or more empty space characters between.

df_2 = df.assign(
    col_2 = df['col_1'].str.extract('(\d+\s*kg)', expand=False))

Here is how this works:

  • We use the function str.extract() method from the str accessor set of string manipulation methods of Pandas Series to extract a substring that matches a given regular expression from each value of the column col_1 and return a vector of the same size.
  • In this example, the regular expression that represents the substrings that we are looking to extract is '\d+\s*kg' which breaks down as follows:
    • \d+ matches one or more numeric characters.
    • \s* matches zero or more empty space characters
    • kg matches the string kg.
  • str.extract() only works with capture groups, so we need to wrap the entire pattern in a capture group. In this example (\d+\s*kg).
  • We set expand=False to get a Series and not a Data Frame.
  • The output data frame df_2 will be a copy of the input data frame df with an added column col_2 holding the extracted substrings from the corresponding value of the column col_1.

All Matches

We wish to extract all matches of a regular expression pattern from a given string.

In this example, we wish to extract all occurrences of a number (a sequence of digits) followed by the letters ‘kg’ with zero or more empty space characters between. We then wish to combine the extracted values into a single string with comma separators.

df_2 = df.assign(
    col_2=df['col_1'].str.findall('\d+\s*kg'),
    col_3=lambda x: x['col_2'].map(', '.join))

Here is how this works:

  • We use the function str.findall() method from the str accessor set of string manipulation methods of Pandas Series to extract all occurrences of a given regular expression from each value of the column col_1.
  • The output of str.findall() is a list of lists where each list holds the matches extracted from the corresponding value of the column col_1.
  • In order to reduce each vector of matches to a single string where the extracted matches are separated by commas, we:
    • Use ', '.join) to reduce each vector of list to a single string where the extracted matches are separated by commas.
    • Use map() to iterate over the lists of strings extracted from each element of col_1. See Working with Lists.
  • In this example, we simply flatten the list of returned matching substrings into one string. To perform, other processing, simply:
    • Call the function of choice (build-in or custom) instead of join.
    • Use map() function to iterate over the lists. See Working with Lists.

Extension: Nth Match

We wish to extract the nth match of a regular expression pattern from a given string.

df_2 = df.assign(
    col_2=df['col_1'].str.findall('\d+').str[1])

Here is how this works:

  • We use findall() to extract all pattern matches from each element of the column col_1; as described in the primary solution above.
  • To extract the nth match we can use .str[] or .str.get(). No explicit mapping or looping is required which makes for elegant code.

Extension: Last Match

We wish to extract the last match of a regular expression pattern from a given string.

df_2 = df.assign(
    col_2=df['col_1'].str.findall('\d+').str[-1])

Here is how this works:

  • We use findall() to extract all pattern matches from each element of the column col_1; as described in the primary solution above.
  • To extract the nth match we can use .str[] or .str.get() and we pass -1 as an index to get the last match. No explicit mapping or looping is required which makes for elegant code.

One Capture Group

First Match

We wish to extract a part, specified by a capture group, of the first match of a regular expression pattern from a given string.

In this example, we wish to extract the first occurrence of a number (a sequence of digits) that is followed by the letters ‘kg’ with zero or more empty space characters between. Note we only wish to extract the number component.

df_2 = df.assign(
    col_2=df['col_1'].str.extract('(\d+)\s*kg'))

Here is how this works:

  • We use the function str.extract() method from the str accessor set of string manipulation methods of Pandas Series to extract the part of the matching substring that corresponds to the first group defined in the regular expression for each value in the column col_1.
  • In this example the regular expression is ‘(\d+)\s*kg’ which works as follows:
    • (\d+) defines a capture group that contains a sequence of digits. A capture group is a part of a regular expression that can be referred to.
    • \s*kg denotes a pattern where zero or more empty space characters are followed by the string ‘kg’.
  • The power of capture groups in substring extraction is most pronounced when parts of the regular expression are to localize the desired substring, but we don’t want to include them in the output extraction. For instance, in this case, we wish to capture any integer that comes before the unit 'kg' but we do not wish to include 'kg' in the output.
  • The output data frame df_2 will be a copy of the input data frame df with an added column col_2 holding the extracted substrings (corresponding to the first capture group) from the corresponding value of the column col_1.

Extension: nth Capture Group

df_2 = df.assign(
    col_2=df['col_1'].str.extract('(X+)(Y+)').iloc[:, 1])

Here is how this works:

  • If the regular expression pattern we are matching has multiple capture groups, str.extract() returns a data frame.
  • In this case, we extract the second capture group from each value of the column col_1 by using iloc[:, 1] to extract the second column by position. See Basic Selecting.
  • The output data frame df_2 will be a copy of the input data frame df with an added column col_2 holding the extracted substrings (corresponding to the second capture group) from the corresponding value of the column col_1.

All Matches

We wish to extract a part, specified by a capture group, of each match of a regular expression pattern from a given string.

In this example, we wish to extract and add up all occurrences of a number (a sequence of digits) that is followed by the letters ‘kg’ (with no or more empty space characters between). Note we only wish to extract the number component.

df_2 = df.assign(
    col_2=df['col_1'] \
        .str.extractall('(\d+)\s*kg') \
        .groupby(level=0) \
        .apply(lambda x: x.iloc[:, 0].astype(int).sum()))

Here is how this works:

  • We use the function str.extractall() method from the str accessor set of string manipulation methods of Pandas Series to extract all matches of a regular expression as well as all matches of each capture group from each element of the input column col_1.
  • str.extractall() takes regex with at least one capture group ().
  • The str.extractall() method returns every match. The result of str.extractall() is always a DataFrame with a MultiIndex of two levels on its rows. The second level of the MultiIndex is named match and indicates the order in the subject.
  • In order to iterate over a data frame with a row MultiIndex, we use groupby(level=0) followed by apply() to loop over the first level of the row of the MultiIndex; i.e. apply() receives sub-data-frames each holding the matches obtained from one element of the column col_1.
  • The lambda function receives the sub-data-frame and
    • Selects the column matching the capture group of interest which here is the first and only via iloc[:,0]
    • Converts to integer via astype(int).
    • Computes the sum via sum().

Multiple Capture Groups

First Match

We wish to extract multiple parts, specified by multiple capture groups, of the first match of a regular expression pattern from a given string.

In this example, we wish to extract two parts of the first occurrence of a string pattern; the first representing a numerical quantity and the second representing the unit from each value of the column col_1, and store each in its own column.

df_2 = df.copy()
df_2[['value', 'unit']] = df['col_1'].str.extract(r'(\d+)(\w+)\b')

Here is how this works:

  • We use the function str.extract() method from the str accessor set of string manipulation methods of Pandas Series to extract the first substrings matching each capture group from each element in the column col_1.
  • Extracting a regular expression with more than one group using str.extract() returns a DataFrame with one column per group.
  • Since we have two capture groups, we expect the output to be a data frame with two columns. If we wish to add those to the original data frame, we can simply assign the output of str.extract() to two columns that we can name df[['value', 'unit']. See Multi-Value Transformation
  • The regular expression we use in this case is r'(\d+)(\w+)\b' which works as follows:
    • (\d+) is capture group that matches a sequence of one or more digits.
    • (\w+) represents one or more ‘word’ character; i.e. letters, digits or underscore.
  • The output data frame df_2 is a copy of the input data frame df with two columns added value and unit containing the values captured by the two capture groups in the given regular expression.

Extension: Arbitrarily Many Capture Groups

df_2 = pd.concat([df,
                  df['col_1'].str.extract(r'(\d+)(\w+)\b')],
                 axis=1)

Here is how this work:

  • We use pd.concat() to add a set of arbitrarily many columns to a data frame. See Multi-Value Transformation and Appending.
  • The names of the columns of the data frame returned by str.extract() will be 0, 1, 2, etc. If needed, you may apply a function to rename these to the desired names. See Implicit Renaming.
  • This is a useful approach be when we do not know how many capture groups the regular expression may have; e.g. if the regex is passed dynamically at run time.

All Matches

We wish to extract from a given string all matches of multiple parts, specified as capture groups, of a given regular expression.

In this example, each element of the input column is a string that takes the form "a1 * b1 + a2 * b2 … + an * bn” where the a1, a2,.. and b1, b2,.. are each a sequence of digits. We wish to parse and evaluate each of these expressions and store the result in a new column col_2.

def sum_prods(p_df):
    return p_df.astype(int).prod(axis=1).sum()


df_2 = df.assign(
    col_2=df['col_1'] \
        .str.extractall(r'(\d+)\s*\*\s*(\d+)') \
        .groupby(level=0) \
        .apply(sum_prods))

Here is how this work:

  • First we use str.extractall() to extract all matches of all capture groups in the given regular expression for each element of the column col_1 as follows;
    • str.extractall() takes regex with at least one capture group () a DataFrame with one row for each match, and one column for each group.
    • The result of str.extractall() is always a DataFrame with a MultiIndex on its rows. The last level of the MultiIndex is named match and indicates the order in the subject.
  • In order to iterate over a data frame with a row MultiIndex, we use groupby(level=0) followed by apply() to loop over the first level of the row of the MultiIndex; i.e. apply() receives sub-data-frames each holding the matches obtained from one element of the column col_1.
  • The lambda function receives the sub-data-frame and applies the function sum_prods()
    • Converts all columns to integer via astype(int).
    • Computes the product of each row via prod(axis=1).
    • Computes the sum of the resulting row level products via sum().
  • The output data frame df_2 is a copy of the input data frame df with an added column col_2 holding the numeric value resulting from processing the corresponding string value of the column col_1.

Ignore Case

We wish to extract the first match of a regular expression pattern from a given string while ignoring case.

In this example, we wish to extract the first occurrence of a number (a sequence of digits) followed by the letters ‘kg’ regardless of case with zero or more empty space characters between.

import re

df_2 = df.assign(
    col_2=df['col_1'].str.extract('(\d+\s*kg)', expand=False, flags=re.IGNORECASE))

Here is how this works:

Pattern Column

We wish to match a column of strings against a column of patterns of the same size. This is often needed when the regular expression we are matching needs to be different for each row based on the value of a particular column.

In this example, we have a data frame df with two column col_1 and col_2, we wish to extract from each value in col_1 a substring composed of a sequence of digits followed by a set of characters representing a unit specified by the corresponding value of the column col_2.

import re

def extract_string(string, pattern):
    match = re.search(pattern, string)
    if match:
        return match.group()
    return None


df_2 = df.assign(
    col_3=df.apply(lambda x: extract_string(x['col_1'], '(\d+)\s*' + x['col_2']), axis=1))

Here is how this works:

  • str.extract() is not vectorized over the pattern, so we use search() function from re package which takes a string literal and a regular expression as arguments.
  • We create a new function extract_string() that takes a string and a pattern and returns the first substring that matches the given pattern.
  • We need to use apply() with axis=1 to apply a lambda in a row wise manner to each row of the data frame df. See Non-Vectorized Transformation.
  • We pass to extract_string():
    • the strings to look into which in this case is the column col_1 as the first argument.
    • and the patterns to look for which in this case is a transformation of the column col_2 and is naturally of the same size as col_1.
  • We construct the regular expression patterns by appending the value of the column col_2 to the regular expression '(\d+)\s*' for each row. For instance, if the value of the column col_2 is ‘kg’ then the value of the regular expression to extract will be '(\d+)\s*kg'.
  • The output data frame df_2 will be a copy of the input data frame df but with an additional column col_3 containing the substrings extracted.
  • See Entire Pattern above for a more detailed description of this code. We can extend the solutions presented in One Capture Group and Multiple Capture Groups the same way.
PYTHON
I/O