Filtering

We wish to filter a data frame with a column of strings to retain only the rows that match a given pattern.

There are two common scenarios which we will cover below:

  • Return Matches: We wish to return the rows that match a given pattern.
  • Return Locations: We wish to return the indices of the rows that match a given pattern.

For each of these two scenarios, we will cover two cases:

  • Substring where the pattern is a plain character sequence
  • Regular Expressions where the pattern is a regular expression

In addition, we cover the following scenarios which can be applied to extend any of the above:

  • Ignore Case: Ignoring case (of both pattern and string) while matching.
  • Pattern Column: How to return rows that match corresponding patterns provided as elements of another column.
  • Multiple Patterns: Checking if any of a set of multiple patterns is matched.

Return Matches

Given a data frame with a column of strings, we wish to return only the rows that match a given pattern.

Substring

Given a data frame with a column of strings, we wish to return only the rows that contain a given substring.

In this example, we wish to return the rows where column col_1 contains the substring ‘XY’.

df_2 = df.loc[df['col_1'].str.contains('XY', regex=False)]

Here is how this works:

  • We use the str.contains() method from the str accessor of string manipulation methods of Pandas Series to check, for each value of the column col_1, whether it contains the substring 'XY'.
  • str.contains() determines whether a pattern or regular expression is contained within each string. We pass regex=False because we are matching a plain substring in this example.
  • The output data frame df_2 will have only the rows of the input data frame df where the value of the column col_1 contains the substring 'XY', as demonstrated in the sketch below.
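
To make this concrete, here is a minimal, self-contained sketch; the data frame and its values are hypothetical and serve only to illustrate the recipe above.

import pandas as pd

df = pd.DataFrame({'col_1': ['aXYb', 'abc', 'XY', 'xyz']})

# Keep only the rows whose col_1 contains the literal substring 'XY'
df_2 = df.loc[df['col_1'].str.contains('XY', regex=False)]

# df_2 contains rows 0 ('aXYb') and 2 ('XY'); matching is case-sensitive,
# so 'xyz' is not retained.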

Regular Expression

Given a data frame with a column of strings, we wish to return only the rows where col_1 matches a given regular expression.

In this example, we wish to return the rows where column col_1 contains the pattern ‘x’ followed by a sequence of digits.

df_2 = df.loc[df['col_1'].str.contains(r'x\d+')]

Here is how this works:

  • This works similarly to the Substring case above except that we pass a regular expression to str.contains().
  • By default, str.contains() expects the pattern to be a regular expression.
  • The output data frame df_2 will have only the rows of the input data frame df where the value of the column col_1 contains the pattern ‘x’ followed by a sequence of digits, as the sketch below demonstrates.
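
As before, a minimal sketch with hypothetical data illustrates the recipe; the values are made up for illustration.

import pandas as pd

df = pd.DataFrame({'col_1': ['x123', 'abc', 'ax42b', 'x']})

# Keep rows where col_1 contains 'x' followed by one or more digits
df_2 = df.loc[df['col_1'].str.contains(r'x\d+')]

# df_2 contains rows 0 ('x123') and 2 ('ax42b'); a lone 'x' with no
# trailing digits does not match.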

Return Locations

Given a data frame with a column of strings, we wish to return the indices of the rows that contain a given substring.

In this example, for each group defined by the column col_1, we wish to return the index of the first row (the smallest row number) where the value of the column col_2 contains the substring ‘XY’.

df_2 = df\
 .groupby('col_1', as_index=False)\
 .agg(first_match=('col_2', lambda x: x.index[x.str.contains('XY')].min()))

Here is how this works:

  • We apply str.contains('XY') to the group's values of col_2 (the Series x inside the lambda); it returns True for each element that contains the substring 'XY'.
  • We then use that boolean vector to filter the group's index x.index and take the smallest matching index with min().
  • The output is a data frame df_2 that has one row for each unique value of the column col_1 and two columns, col_1 and first_match. The column first_match holds, for each group, the index of the first row where the value of col_2 contains the substring ‘XY’.
  • This can be extended to regular expressions; str.contains() treats its pattern as a regular expression by default (regex=True). A sketch with sample data follows this list.
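
The following sketch runs the recipe on a small hypothetical data frame (the column names match the code above; the values are made up).

import pandas as pd

df = pd.DataFrame({'col_1': ['a', 'a', 'b', 'b'],
                   'col_2': ['XYz', 'pXY', 'no', 'XY!']})

df_2 = df\
 .groupby('col_1', as_index=False)\
 .agg(first_match=('col_2', lambda x: x.index[x.str.contains('XY')].min()))

# Group 'a': rows 0 and 1 both match, so first_match is 0.
# Group 'b': only row 3 matches, so first_match is 3.
# A group with no matching rows would yield NaN for first_match.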

Ignore Case

In this example, for each group defined by the column col_2, we wish to return a row where the value of the column col_3 matches the pattern ‘^x[xy]x$’, ignoring the case of both the pattern and the strings.

df_2 = (df
        .sort_values(by='col_3',
                     key=lambda x: x.str.contains('^x[xy]x$', case=False), ascending=False)
        .groupby('col_2', as_index=False)
        .first())

Here is how this works:

  • The matching works as under Return Matches above: str.contains() checks each value of the column col_3 against the regular expression '^x[xy]x$'.
  • To ignore case (of both the pattern and the strings) while matching, we pass the parameter case=False. See Ignore Case under Detecting for more details.
  • The sort_values() call with ascending=False orders the rows so that matching rows come first, and groupby('col_2').first() then returns the first row of each group after that ordering. A sketch with sample data follows this list.
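
The following sketch runs the code above on hypothetical data to show the effect of case=False.

import pandas as pd

df = pd.DataFrame({'col_2': ['g1', 'g1', 'g2', 'g2'],
                   'col_3': ['XyX', 'abc', 'xyz', 'xxx']})

df_2 = (df
        .sort_values(by='col_3',
                     key=lambda x: x.str.contains('^x[xy]x$', case=False), ascending=False)
        .groupby('col_2', as_index=False)
        .first())

# Group 'g1': 'XyX' matches only because case is ignored, so it is returned.
# Group 'g2': 'xxx' matches and 'xyz' does not, so 'xxx' is returned.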

Pattern Column

We wish to match a column of strings against a column of patterns of the same size. This is often needed when we wish to check the presence of the value of a column in another column for each row.

In this example, we wish to filter the rows of the data frame df to retain only rows where the value of the column col_1 contains the value of the column col_2.

df_2 = df.loc[df.apply(lambda x: x['col_2'] in x['col_1'], axis=1)]

Here is how this works:

  • The str.contains() method accepts a single pattern, not a separate pattern for each row; therefore we need to use apply() with axis=1 to apply a lambda function in a row-wise manner to each row of the data frame df. See Non-Vectorized Transformation.
  • The lambda function passed to apply() takes a row of the data frame (represented by the variable x) and uses the Python in operator to check whether the value of col_2 is contained in the value of col_1.
  • The output data frame df_2 will have only the rows of the input data frame df where the value of the column col_1 contains the value of the column col_2, as the sketch below illustrates.
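
Here is a hypothetical, self-contained sketch of the row-wise check described above.

import pandas as pd

df = pd.DataFrame({'col_1': ['hello world', 'foo bar', 'spam'],
                   'col_2': ['world', 'baz', 'spam']})

# Row-wise check: is the value of col_2 a substring of col_1 in the same row?
df_2 = df.loc[df.apply(lambda x: x['col_2'] in x['col_1'], axis=1)]

# df_2 keeps rows 0 ('world' occurs in 'hello world') and 2 ('spam' in 'spam').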

Multiple Patterns

Given a data frame with a column of strings, we wish to return the rows that match any pattern in a given set of patterns.

In this example, we wish to filter the rows of the data frame df to retain only rows where the value of the column col_1 contains the string ‘XX’ or the string ‘YY’.

df_2 = df.loc[df['col_1'].str.contains('XX|YY', regex=True)]

Here is how this works:

  • We use the alternation (or) operator | of regular expressions to build a single regular expression that matches any of the patterns we wish to look for. In this case that regular expression is 'XX|YY'.
  • We pass that regular expression to str.contains(), which flags the rows where the column col_1 matches the regular expression, i.e. any of the patterns; df.loc[] then returns those rows. The sketch below shows how such a pattern can be built from a list of substrings.
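
One convenient way to build such a regular expression from a list of plain substrings is to join them with '|', as in this hypothetical sketch; patterns containing regular-expression metacharacters would first need to be escaped (e.g. with re.escape()).

import pandas as pd

patterns = ['XX', 'YY']

df = pd.DataFrame({'col_1': ['aXXb', 'aYYb', 'abc', 'xx']})

# Build the alternation 'XX|YY' from the list of patterns, then filter
df_2 = df.loc[df['col_1'].str.contains('|'.join(patterns), regex=True)]

# df_2 keeps rows 0 and 1; 'xx' is not retained because matching is
# case-sensitive by default.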