We wish to filter data frame with a column of strings to retain only the rows that match a given pattern.
There are two common scenarios which we will cover below:
For each of these four scenarios, we will cover two cases:
In addition, we cover the following scenarios which can be applied to extend any of the above:
Given a data frame with a column of strings, we wish to return only the rows that match a given pattern.
Substring
Given a data frame with a column of strings, we wish to return only the rows that contain a given substring.
In this example, we wish to return the rows where column col_1
contains the substring ‘XY’
.
df_2 = df.loc[df['col_1'].str.contains('XY', regex=False)]
Here is how this works:
str.contains()
method from the str
accessor set of string manipulation methods of
Pandas Series
to check for each value of the column col_1
whether that value of the
column col_1
contains the substring 'XX'
.str.contains()
determines if pattern or regex is contained within a string. We pass
regex=False
as we are using a substring in this example.df_2
will have only the rows of the input data frame df
where the value
of the column col_1
contains the substring 'XX'
.Regular Expression
Given a data frame with a column of strings, we wish to return only the rows where col_1
matches a
given regular expression.
In this example, we wish to return the rows where column col_1
contains the pattern ‘x’
followed
by a sequence of digits.
df_2 = df.loc[df['col_1'].str.contains('x\d+')]
Here is how this works:
str.contains()
.str.contains()
expects the pattern to be a regular expression.df_2
will have only the rows of the input data frame df
where the value
of the column col_1
contains the pattern ‘x’
followed by a sequence of digits.Given a data frame with a column of strings, we wish to return the indices of the rows that contain a given substring.
In this example, we wish to return the index of the first rows (the smallest row number) of the
column col_3
that contains the substring ‘XY’
for each group where the groups are defined by the
column col_2
.
df_2 = df\
.groupby('col_1', as_index=False)\
.agg(first_match=('col_2', lambda x: x.index[x.str.contains('XY')].min()))
Here is how this works:
str.contains(pattern)
returns true
if the string element of df['col_2']
matches the pattern XY
.x.index
and then we get the min index using min
.df_2
that has one row for each unique value of the column col_1
and
two columns col_1
and first_match
. The column first_match
holds the index of the first row
where the value of col_2
contains the substring ‘XY’
for each group.regex=True
.df_2 = (df
.sort_values(by='col_3',
key=lambda x: x.str.contains('^x[xy]x$', case=False), ascending=False)
.groupby('col_2', as_index=False)
.first())
Here is how this works:
case=False
.
See Ignore Case
under Detecting for more details.We wish to match a column of strings against a column of patterns of the same size. This is often needed when we wish to check the presence of the value of a column in another column for each row.
In this example, we wish to filter the rows of the data frame df
to retain only rows where the
value of the column col_1
contains the value of the column col_2
.
df_2 = df.loc[df.apply(lambda x: x['col_2'] in x['col_1'], axis=1)]
Here is how this works:
str.contains()
method is not vectorized, therefore we need to use apply()
with axis=1
to apply a lambda
in a row wise manner to each row of the data frame df
.
See Non-Vectorized Transformation.lambda
function passed to apply()
takes a row of the data frame (represented by the
variable x
) and uses the python operator in
to check if the value in col_2
is contained
in col_1
.df_2
will have only the rows of the input data frame df
where the value
of the column col_1
contains the value of the column col_2
.Given a data frame with a column of strings, we wish to return the rows that match any pattern in a given set of patterns.
In this example, we wish to filter the rows of the data frame df
to retain only rows where the
value of the column col_1
contains the string ‘XX’
or the string ‘YY’
.
df_2 = df.loc[df['col_1'].str.contains('XX|YY', regex=True)]
Here is how this works:
|
of regular expressions to build a regular expression that captures all
the patterns we wish to look for or’ed together. In this case that regular expression
is 'XX|YY'
.str.contains()
which then returns the rows of the data frame
where column col_1
matches the regular expression i.e. any of the patterns.