We wish to check whether a string matches a given pattern and to return a logical value TRUE
if
that is the case and FALSE
otherwise.
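As a rough illustration of what such a logical result looks like in pandas, the sketch below applies four string matching methods to a small Series. The Series s and its values are hypothetical and exist only for this example; each call returns one True or False value per element.

```python
import pandas as pd

# Hypothetical data, for illustration only.
s = pd.Series(['XX1', 'aXXb', 'zzXX', 'XX'])

whole = s.str.fullmatch('XX')                 # the entire string is 'XX'
anywhere = s.str.contains('XX', regex=False)  # 'XX' occurs somewhere
at_start = s.str.startswith('XX')             # 'XX' at the beginning
at_end = s.str.endswith('XX')                 # 'XX' at the end

print(pd.DataFrame({'value': s, 'whole': whole, 'anywhere': anywhere,
                    'at_start': at_start, 'at_end': at_end}))
```

Each result is a boolean Series aligned with s, which is why it can be passed directly to loc[] to filter rows.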
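In this section we will cover the following four common string pattern detection scenarios:
- Entire string: the whole string matches the pattern.
- Contains: the pattern occurs anywhere inside the string.
- Starts with: the pattern occurs at the beginning of the string.
- Ends with: the pattern occurs at the end of the string.
For each of these four scenarios, we will cover two cases:
- The pattern is a fixed string or substring.
- The pattern is a regular expression.
In addition, we cover the following scenarios which can be applied to extend any of the above:
- Ignoring case.
- Returning the complement (inverse) of a match.
- Matching against a column of patterns.
- Matching multiple patterns (OR, AND).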
String
We wish to check if a string exactly matches another string.
In this example, we wish to filter the rows of the data frame df
to retain only rows where the
value of the column col_1
exactly equals ‘XXX’
.
df_2 = df.loc[df['col_1'] == 'XXX']
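Here is how this works:
- We use the equality operator == to check whether two strings are equal.
- The comparison is vectorized: == compares each value of the column col_1 with the string 'XXX'.
- The output data frame df_2 will have only the rows of the input data frame df where the value of the column col_1 is 'XXX'.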
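A minimal sketch of the equality filter on a small data frame follows. The data frame and its values are hypothetical, chosen to show that the comparison is exact and case-sensitive.

```python
import pandas as pd

# Hypothetical data frame, for illustration only.
df = pd.DataFrame({'col_1': ['XXX', 'XXXX', 'xxx', 'XXX'],
                   'col_2': [1, 2, 3, 4]})

# Element-wise comparison: one boolean per row.
mask = df['col_1'] == 'XXX'
df_2 = df.loc[mask]

print(mask.tolist())
print(df_2)
```

Only the first and last rows are kept; 'XXXX' and 'xxx' are not equal to 'XXX'.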
Alternative: Via Function
df_2 = df.loc[df['col_1'].str.fullmatch('XXX')]
Here is how this works:
- We use the str.fullmatch() method from the str accessor set of string manipulation methods of Pandas Series to check if the provided pattern matches elements in col_1.
- str.fullmatch() determines if each string entirely matches the given pattern. The pattern is interpreted as a regular expression; since 'XXX' contains no special characters, it behaves like an exact string comparison here.
- The output data frame df_2 will have only the rows of the input data frame df where the value of the column col_1 is 'XXX'.
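One point that may be worth illustrating: because str.fullmatch() treats its pattern as a regular expression, a fixed string that contains special characters needs escaping. The sketch below uses re.escape() from the Python standard library for that; the data and the literal 'a.b' are hypothetical.

```python
import re

import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame({'col_1': ['a.b', 'axb', 'a.bc']})
literal = 'a.b'

# Unescaped: '.' is a regex wildcard, so 'axb' matches as well.
unescaped = df['col_1'].str.fullmatch(literal)

# Escaped: '.' only matches a literal dot.
escaped = df['col_1'].str.fullmatch(re.escape(literal))

print(unescaped.tolist())
print(escaped.tolist())
```

For plain alphanumeric strings such as 'XXX' the two calls give the same result.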
Regular Expression
We wish to check whether a given string matches a given regular expression and to return TRUE
if
that is the case and FALSE
otherwise.
In this example, we wish to filter the rows of the data frame df
to retain only rows where the
value of the column col_1
is of the form: ‘x’
followed by digits.
df_2 = df.loc[df['col_1'].str.fullmatch(r'x\d+')]
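To see the effect of matching the entire string, the following sketch runs the filter above on a small hypothetical data frame and contrasts it with an unanchored str.contains() call. The values are made up for this example.

```python
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame({'col_1': ['x1', 'x123', 'x12a', 'ax12', 'x']})

# The whole value must be 'x' followed by one or more digits.
full = df['col_1'].str.fullmatch(r'x\d+')

# 'x' followed by digits may occur anywhere in the value.
part = df['col_1'].str.contains(r'x\d+')

df_2 = df.loc[full]
print(df_2)
```

The values 'x12a' and 'ax12' contain a match but do not match in their entirety, so only str.contains() accepts them.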
Here is how this works:
- We use the str.fullmatch() function, which returns true if the entire string matches the given string or regular expression.
- The regular expression we use is 'x\d+', where:
  - x matches the letter x.
  - \d+ matches one or more digits.
- The output data frame df_2 will have only the rows of the input data frame df where the value of the column col_1 matches the regular expression 'x\d+'.
- Alternatively, we could use str.contains() with the regular expression '^x\d+$', where ^ and $ enforce start and end of string respectively.
Substring
We wish to check whether a given substring occurs anywhere inside a given string and to
return TRUE
if there is a match and FALSE
otherwise.
In this example, we wish to filter the rows of the data frame df
to retain only rows where the
value of the column col_1
contains the substring ‘XX’
.
df_2 = df.loc[df['col_1'].str.contains('XX', regex=False)]
Here is how this works:
- We use the str.contains() method from the str accessor set of string manipulation methods of Pandas Series to check, for each value of the column col_1, whether that value contains the substring 'XX'.
- str.contains() determines if a pattern or regex is contained within a string. We pass regex=False because we are looking for a fixed substring in this example.
- The output data frame df_2 will have only the rows of the input data frame df where the value of the column col_1 contains the substring 'XX'.
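The role of regex=False becomes visible when the substring contains characters that are special in regular expressions. The sketch below is a small illustration on hypothetical data; the substring '1+1' is an assumption chosen for this purpose.

```python
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame({'col_1': ['aXXb', 'XY', '1+1=2', '111']})

has_xx = df['col_1'].str.contains('XX', regex=False)

# Treated as the literal text '1+1'.
literal = df['col_1'].str.contains('1+1', regex=False)

# Treated as a regex: one or more '1' followed by another '1'.
as_regex = df['col_1'].str.contains('1+1')

print(has_xx.tolist(), literal.tolist(), as_regex.tolist())
```

With regex=False only '1+1=2' matches, while the regular expression reading matches '111' instead.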
Regular Expression
We wish to check whether a given regular expression has a match inside a given string and to
return TRUE
if there is such a match and FALSE
otherwise.
In this example, we wish to filter the rows of the data frame df to retain only rows where the value of the column col_1 contains any integers, represented by the regular expression '\d+'.
df_2 = df.loc[df['col_1'].str.contains(r'\d+')]
Here is how this works:
- We pass the regular expression to str.contains().
- By default, str.contains() expects the pattern to be a regular expression.
- The output data frame df_2 will have only the rows of the input data frame df where the value of the column col_1 contains any digits.
Substring
We wish to check whether a given substring occurs at the beginning of a given string and to
return TRUE
if that is the case and FALSE
otherwise.
In this example, we wish to filter the rows of the data frame df
to retain only rows where the
value of the column col_1
starts with the substring ‘XX’
.
df_2 = df.loc[df['col_1'].str.startswith('XX')]
Here is how this works:
- This works like the substring check with str.contains(), except that we use str.startswith() instead of str.contains().
- str.startswith() returns TRUE only if the given substring, in this case ‘XX’, occurs at the beginning of the string.
- The output data frame df_2 will have only the rows of the input data frame df where the value of the column col_1 starts with ‘XX’.
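A minimal sketch of the prefix check on hypothetical data follows; the values are chosen so that 'XX' appears at the start, at the end, and in a different case.

```python
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame({'col_1': ['XXa', 'aXX', 'XX', 'xXa']})

# True only where the value begins with the exact substring 'XX'.
mask = df['col_1'].str.startswith('XX')
df_2 = df.loc[mask]

print(df_2)
```

The check is case-sensitive, so 'xXa' is not retained.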
Regular Expression
We wish to check whether a given regular expression occurs at the beginning of a given string and to
return TRUE
if that is the case and FALSE
otherwise.
In this example, we wish to filter the rows of the data frame df
to retain only rows where the
value of the column col_1
starts with any integers.
df_2 = df.loc[df['col_1'].str.contains(r'^\d+')]
Here is how this works:
- str.startswith() works with fixed strings only and not regular expressions.
- We therefore use str.contains() with the anchor ^ to specify that the regular expression must occur at the start of the string.
Alternative: Using the function match()
df_2 = df.loc[df['col_1'].str.match(r'\d+')]
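As a quick check that the two approaches agree, the sketch below runs the anchored str.contains() call and the str.match() call on the same hypothetical data.

```python
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame({'col_1': ['12ab', 'ab12', '7', 'a1b']})

# Explicit start anchor with str.contains().
via_contains = df['col_1'].str.contains(r'^\d+')

# str.match() only looks for a match at the start of each value.
via_match = df['col_1'].str.match(r'\d+')

print(via_contains.tolist())
print(via_match.tolist())
```

Both masks select '12ab' and '7', the only values that begin with a digit.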
Here is how this works:
- str.match() matches the start of a string by default, so we don't need to use the start anchor '^'.
- In general, we can use the str.contains() function and enforce any start or end constraints via the regular expression anchor symbols for start ^ and end $ as needed.
Substring
We wish to check whether a given substring occurs at the end of a given string and to return TRUE
if that is the case and FALSE
otherwise.
In this example, we wish to filter the rows of the data frame df
to retain only rows where the
value of the column col_1
ends with the substring ‘XX’
.
df_2 = df.loc[df['col_1'].str.endswith('XX')]
Here is how this works:
- This works like the starts-with check above, except that we use str.endswith() instead of str.startswith().
- str.endswith() returns TRUE only if the given substring, in this case ‘XX’, occurs at the end of the string.
- The output data frame df_2 will have only the rows of the input data frame df where the value of the column col_1 ends with ‘XX’.
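The sketch below applies the suffix check to hypothetical data. It also assumes, beyond what the example above covers, that the column may contain a missing value; the na=False argument of str.endswith() is used so that missing values count as non-matches and the mask stays purely boolean.

```python
import pandas as pd

# Hypothetical data with one missing value, for illustration only.
df = pd.DataFrame({'col_1': ['aXX', 'XXa', None, 'XX']})

# na=False: treat missing values as "does not end with 'XX'".
mask = df['col_1'].str.endswith('XX', na=False)
df_2 = df.loc[mask]

print(df_2)
```

If the column has no missing values, the na argument can be left out and the result is the same.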
Regular Expression
We wish to check whether a given regular expression occurs at the end of a given string and to
return TRUE
if that is the case and FALSE
otherwise.
In this example, we wish to filter the rows of the data frame df
to retain only rows where the
value of the column col_1
ends with any integers.
df_2 = df.loc[df['col_1'].str.contains(r'\d+$')]
Here is how this works:
- str.endswith() works with fixed strings only and not regular expressions.
- We therefore use str.contains() with the anchor $ to specify that the regular expression must occur at the end of the string.
Substring
We wish to check whether a given substring occurs anywhere inside a given string while ignoring case
and to return TRUE
if there is a match and FALSE
otherwise.
In this example, we wish to filter the rows of the data frame df
to retain only rows where the
value of the column col_1
contains the substring ‘XX’
regardless of case.
df_2 = df.loc[df['col_1'].str.lower().str.contains('xx', regex=False)]
Here is how this works:
- Using str.lower(), we lower the case of all the values of the column col_1, which are the strings we are looking into.
- We then look for the lower-case substring 'xx'.
- This approach can be combined with any str function.
Alternative: Case Flag
df_2 = df.loc[df['col_1'].str.contains('XX', regex=False, case=False)]
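To compare the two case-insensitive approaches, the sketch below runs both the str.lower() version and the case=False version on the same hypothetical data.

```python
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame({'col_1': ['aXXb', 'axxb', 'Xx', 'xy']})

# Primary approach: lower-case the values, then search for 'xx'.
via_lower = df['col_1'].str.lower().str.contains('xx', regex=False)

# Alternative: let str.contains() ignore case itself.
via_flag = df['col_1'].str.contains('XX', regex=False, case=False)

print(via_lower.tolist())
print(via_flag.tolist())
```

Both masks agree on this data: every value except 'xy' contains 'XX' in some combination of upper and lower case.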
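Here is how this works:
- We use str.contains() as described under Contains above.
- We pass regex=False because str.contains() expects a regular expression by default.
- To ignore case, we set the argument case to case=False.
- Since the case argument is not available for every function of str, we recommend using the primary solution as it can be used with any string function.
Regular Expression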
We wish to check whether a given regular expression has a match inside a given string and to
return TRUE
if there is such a match and FALSE
otherwise.
In this example, we wish to filter the rows of the data frame df
to retain only rows where the
value of the column col_1
is of the form: ‘x’
followed by digits while ignoring the case.
df_2 = df.loc[df['col_1'].str.lower().str.contains(r'^x\d+')]
Here is how this works:
- We do not need to pass regex=True because it is the default.
- Using str.lower(), we lower the case of all the values of the column col_1, which are the strings we are looking into.
- We then match against the lower-case regular expression '^x\d+'.
- We could also pass case=False, but as with substrings we recommend using the str.lower() approach.
We wish to return the inverse of the outcome of the string pattern matching operations described
above; i.e. if the outcome is TRUE
, return FALSE
and vice versa.
In this example, we wish to filter the rows of the data frame df
to retain only rows where the
value of the column col_1
does not contain any integers.
df_2 = df.loc[~df['col_1'].str.contains(r'\d+')]
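A minimal sketch of the complement on hypothetical data follows. The intermediate name has_digits is introduced here only to make the inversion step visible.

```python
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame({'col_1': ['abc', 'a1c', '42', 'xyz']})

# True where the value contains at least one digit.
has_digits = df['col_1'].str.contains(r'\d+')

# ~ flips every True to False and vice versa.
df_2 = df.loc[~has_digits]

print(df_2)
```

The rows 'a1c' and '42' are dropped because they contain digits.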
Here is how this works:
- We use str.contains() as described under Contains above.
- To return TRUE when there is no match and FALSE when there is a match, we can use the complement operator ~.
- The regular expression '\d+' checks for the occurrence of a substring comprised of one or more digits.
- The output data frame df_2 will have only the rows of the input data frame df where the value of the column col_1 does not contain any digits.
We wish to match a column of strings against a column of patterns of the same size. This is often needed when we wish to check the presence of the value of a column in another column for each row.
In this example, we wish to filter the rows of the data frame df
to retain only rows where the
value of the column col_1
contains the value of the column col_2
.
df_2 = df.loc[df.apply(lambda x: x['col_2'] in x['col_1'], axis=1)]
Here is how this works:
- The str.contains() method is not vectorized over its pattern argument: it accepts a single pattern, not a column of patterns. We therefore use apply() with axis=1 to apply a lambda in a row-wise manner to each row of the data frame df. See Non-Vectorized Transformation.
- The lambda function passed to apply() takes a row of the data frame (represented by the variable x) and uses the Python operator in to check if the value in col_2 is contained in col_1.
- The output data frame df_2 will have only the rows of the input data frame df where the value of the column col_1 contains the value of the column col_2.
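The row-wise check can be tried on a small hypothetical data frame as sketched below. The second variant, a list comprehension over zip(), is not part of the solution above; it is shown only as one possible alternative that produces the same mask.

```python
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame({'col_1': ['apple pie', 'banana', 'cherry'],
                   'col_2': ['pie', 'nan', 'berry']})

# Row-wise check, as in the solution above.
row_wise = df.apply(lambda x: x['col_2'] in x['col_1'], axis=1)

# Alternative sketch: pair the two columns up with zip().
zipped = pd.Series(
    [pattern in text for text, pattern in zip(df['col_1'], df['col_2'])],
    index=df.index,
)

df_2 = df.loc[row_wise]
print(df_2)
```

Note that the in operator does a plain substring check, so 'nan' is found inside 'banana'.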
OR
We wish to check whether any of a set of patterns occurs in a given string and to return TRUE
if
that is the case and FALSE
otherwise.
In this example, we wish to filter the rows of the data frame df
to retain only rows where the
value of the column col_1
contains the string ‘XX’
or the string ‘YY’
.
df_2 = df.loc[df['col_1'].str.contains('XX', regex=False) |
df['col_1'].str.contains('YY', regex=False)]
Here is how this works:
- We call str.contains() twice:
  - Once to check whether the value of col_1 contains 'XX', and
  - Once to check whether the value of col_1 contains 'YY'.
- Each call returns a logical vector (TRUE or FALSE) that has as many elements as the size of col_1.
- We use the or operator | to combine these two logical vectors element wise into one logical vector that is TRUE if either call to str.contains() returns TRUE for that value of col_1. This final logical vector is passed to loc[].
- The output data frame df_2 will have only the rows of the input data frame df where the value of the column col_1 contains the string ‘XX’ or ‘YY’.
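The element-wise combination can be sketched on hypothetical data as follows; the names has_xx and has_yy are introduced here only to show the two intermediate logical vectors.

```python
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame({'col_1': ['aXX', 'bYY', 'XXYY', 'zz']})

has_xx = df['col_1'].str.contains('XX', regex=False)
has_yy = df['col_1'].str.contains('YY', regex=False)

# | combines the two boolean Series element by element.
mask = has_xx | has_yy
df_2 = df.loc[mask]

print(df_2)
```

Only 'zz' is dropped, since it contains neither substring.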
Alternative: Via Regular Expression
df_2 = df.loc[df['col_1'].str.contains('XX|YY', regex=True)]
Here is how this works:
- We use the or operator | of regular expressions to build a regular expression that captures all the patterns we wish to look for or’ed together. In this case that regular expression is 'XX|YY'.
- We pass that regular expression to str.contains() which then returns TRUE if the string being checked (a value of the column col_1) contains either the substring ‘XX’ or the substring ‘YY’.
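When the substrings come from a list, the combined regular expression can be built programmatically. The sketch below is one way to do that, assuming a hypothetical list named patterns; re.escape() is applied so that each entry is treated as a literal substring.

```python
import re

import pandas as pd

# Hypothetical data and pattern list, for illustration only.
df = pd.DataFrame({'col_1': ['aXX', 'bYY', 'XXYY', 'zz']})
patterns = ['XX', 'YY']

# Join the escaped substrings with the regex "or" operator.
combined = '|'.join(re.escape(p) for p in patterns)

mask = df['col_1'].str.contains(combined, regex=True)
df_2 = df.loc[mask]

print(combined)
print(df_2)
```

For these two substrings the generated expression is the same 'XX|YY' used above.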
AND
We wish to check whether each of a set of patterns occurs in a given string and to return TRUE
if
that is the case and FALSE
otherwise.
In this example, we wish to filter the rows of the data frame df
to retain only rows where the
value of the column col_1
starts with the string ‘x’
and contains a sequence of one
or more digits.
df_2 = df.loc[df['col_1'].str.contains(r'^x', regex=True) &
              df['col_1'].str.contains(r'\d+', regex=True)]
Here is how this works:
- We call str.contains() twice:
  - Once to check whether the value of col_1 matches the regular expression '^x', which checks if the string starts with the letter ‘x’.
  - Once to check whether the value of col_1 matches the regular expression '\d+', which checks if the string contains a sequence of one or more digits.
- Each call returns a logical vector (TRUE or FALSE) that has as many elements as the size of col_1.
- We use the and operator & to combine these two logical vectors element wise into one logical vector that is TRUE if and only if both calls to str.contains() return TRUE for that value of col_1. This final logical vector is passed to loc[].
- The output data frame df_2 will have only the rows of the input data frame df where the value of the column col_1 starts with 'x' and contains at least one integer.
Alternative: Via Regular Expression
df_2 = df.loc[df['col_1'].str.contains(r'^x.*\d+', regex=True)]
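Here is how this works:
- We build a single regular expression that combines both conditions: '^x.*\d+'.
- .* matches any number of characters, so the digits may appear anywhere after the leading 'x'.
- We pass that regular expression to str.contains() which then returns TRUE if the string being checked (a value of the column col_1) starts with 'x' and contains at least one integer.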