We wish to count the number of occurrences of a given pattern in a target string.
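We will cover the following common pattern occurrence counting scenarios: counting occurrences of a substring, counting matches of a regular expression, and counting words.
The scenarios above can be extended in multiple ways, the most common of which are: ignoring case, taking the pattern from another column, and counting multiple patterns.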
We wish to count the number of occurrences of a given substring in a given string.
In this example, we count the number of occurrences of the string ‘XY’ in each value of the column col_1.
import re
df_2 = df.assign(col_2=df['col_1'].str.count(re.escape('XY')))
Here is how this works:
- We use the str.count() method from the str accessor set of string manipulation methods of Pandas Series to count the number of occurrences of the substring ‘XY’ in each value of the column col_1.
- The str.count() method interprets its pattern as a regular expression, so we wrap the substring in re.escape() to escape any special characters.
- df_2 will be a copy of the input data frame df with an added column col_2 holding the number of occurrences of the string ‘XY’ in the corresponding value of the column col_1.
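To see why the escaping step matters, here is a minimal sketch on a hypothetical data frame (the sample values and the variable names unescaped and escaped are assumptions made for illustration). It counts a literal period with and without re.escape():

```python
import re
import pandas as pd

# Hypothetical sample data; the literal substring we look for is '.'
df = pd.DataFrame({'col_1': ['a.b.c', 'abc', '...']})

# Unescaped, '.' is a regex wildcard that matches every character.
unescaped = df['col_1'].str.count('.')

# Escaped, '.' only matches a literal period.
escaped = df['col_1'].str.count(re.escape('.'))

print(unescaped.tolist())  # [5, 3, 3]
print(escaped.tolist())    # [2, 0, 3]
```

For a substring such as ‘XY’ that contains no special characters, both calls would return the same counts; escaping only changes the result when the substring contains regular expression metacharacters.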
We wish to count the number of occurrences of a given regular expression in a given string.
In this example, we count the number of vowels (while ignoring case) in each value of the column col_1.
import re
df_2 = df.assign(col_2=df['col_1'].str.count('[aeiou]', flags=re.IGNORECASE))
Here is how this works:
- We pass a regular expression to str.count() instead of a substring.
- '[aeiou]' matches any one of the characters in the square brackets (which are the five vowels in the English language). It can be thought of as an or operation.
- df_2 will be a copy of the input data frame df with an added column col_2 holding the number of vowels in the corresponding value of the column col_1.
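A character class is case-sensitive unless told otherwise. The following minimal sketch, on a hypothetical data frame with made-up values, compares the plain pattern with one that also passes flags=re.IGNORECASE (the names lower_only and any_case are illustrative):

```python
import re
import pandas as pd

# Hypothetical sample data mixing upper and lower case vowels.
df = pd.DataFrame({'col_1': ['Apple', 'OCEAN', 'sky']})

# Without a flag, '[aeiou]' only matches lower case vowels.
lower_only = df['col_1'].str.count('[aeiou]')

# With re.IGNORECASE, upper case vowels are counted as well.
any_case = df['col_1'].str.count('[aeiou]', flags=re.IGNORECASE)

print(lower_only.tolist())  # [1, 0, 0]
print(any_case.tolist())    # [2, 3, 0]
```

An equivalent option, if flags are not desired, would be to list both cases in the class, as in '[aeiouAEIOU]'.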
We wish to count the number of words in a string.
In this example, we wish to count the number of words in each value of the column col_1 and to return that as a new integer column col_2.
df_2 = df.assign(
    col_2=df['col_1'].str.count(r'\w+'))
Here is how this works:
- We pass the regular expression ‘\w+’ to str.count(), where:
  - \w represents any ‘word’ character; i.e. letters, digits or underscore.
  - + specifies that we are looking for one or more ‘word’ characters.
- df_2 is a copy of the input data frame df with an additional column col_2 where each row holds the number of words in the corresponding value of the column col_1.
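Because ‘\w’ does not match apostrophes or hyphens, this definition of a word splits contractions and hyphenated terms. The sketch below uses hypothetical sample strings to show the effect; the second pattern, which also accepts apostrophes and hyphens inside a word, is an illustrative assumption rather than a standard definition:

```python
import pandas as pd

# Hypothetical sample data with a contraction, a hyphenated term,
# an underscore-joined name, and an empty string.
df = pd.DataFrame(
    {'col_1': ["don't stop", 'well-known fact', 'snake_case name', '']})

# r'\w+' treats the apostrophe and the hyphen as word boundaries.
word_counts = df['col_1'].str.count(r'\w+')

# An alternative pattern that keeps apostrophes and hyphens inside a word.
lenient_counts = df['col_1'].str.count(r"[\w'-]+")

print(word_counts.tolist())     # [3, 3, 2, 0]
print(lenient_counts.tolist())  # [2, 2, 2, 0]
```

Which pattern is appropriate depends on what should count as a word in the data at hand.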
We wish to count the number of occurrences of a given substring in a given string while ignoring case.
In this example, we count the number of occurrences of the string ‘xy’ in each value of the column col_1 while ignoring case.
import re
df_2 = df.assign(col_2=df['col_1'].str.lower().str.count(re.escape('xy')))
Here is how this works:
- We use str.lower() to convert the values in col_1 to lower case, and we pass a lower case expression.
- See Ignore Case under Detecting for more details.

Alternative: Via re.IGNORECASE
import re
df_2 = df.assign(col_2=df['col_1'].str.count(re.escape('xy'), flags=re.IGNORECASE))
Here is how this works:
- We pass flags=re.IGNORECASE as the flags argument of str.count() to perform case-insensitive matching.

We wish to count the number of occurrences of a value of one column in the value of another column for each row.
In this example, we have a data frame df with two columns, col_1 and col_2, and we wish to count the number of occurrences of the value of col_2 in col_1 for each row.
df_2 = df.assign(
    col_3=df.apply(lambda x: x['col_1'].count(x['col_2']), axis=1))
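For a concrete picture, the following minimal sketch runs the same row-wise call on a small hypothetical data frame. The sample values are assumptions, chosen to show that the value of col_2 is matched literally (the '.' in the second row is not treated as a wildcard):

```python
import pandas as pd

# Hypothetical sample data: col_2 holds the substring to look for in col_1.
df = pd.DataFrame({
    'col_1': ['abcabc', 'a.b.c', 'xyz'],
    'col_2': ['abc', '.', 'q'],
})

# Row by row, count how often the value of col_2 occurs in col_1.
df_2 = df.assign(
    col_3=df.apply(lambda x: x['col_1'].count(x['col_2']), axis=1))

print(df_2['col_3'].tolist())  # [2, 2, 0]
```

Note that this sketch assumes every value in both columns is a string; a missing value in either column would raise an error inside the lambda.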
Here is how this works:
- The Pandas str.count() method is not vectorized over the pattern, so we use Python's built-in str.count() method, which works on a single string and treats its argument as a literal substring rather than a regular expression.
- We use apply() with axis=1 to apply a lambda in a row-wise manner to each row of the data frame df. See Non-Vectorized Transformation.
- For each row, the lambda returns the number of occurrences of the value of col_2 in col_1 as an integer.
- df_2 will be a copy of the input data frame df with an added column col_3 holding the number of occurrences of the value of col_2 in col_1 for the corresponding row.

Count All
We wish to return the total number of occurrences of all patterns as a single integer value. In other words, we wish to return the sum of occurrences of a given set of patterns in a given string.
df_2 = df.assign(col_2=df['col_1'].str.count('XY|YX'))
Here is how this works:
- We use the or operator | of regular expressions to build a regular expression that captures all the patterns we wish to look for or’ed together. In this case that regular expression is 'XY|YX'.
- For each value of col_1, str.count() will return the total number of occurrences of all patterns.
- df_2 will be a copy of the input data frame df with an added column col_2 holding the sum of occurrences of the specified patterns in the corresponding value of the column col_1.
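One subtlety is that regular expression matches do not overlap, so when the patterns can share characters (as ‘XY’ and ‘YX’ do in ‘XYX’), the combined count can be lower than the sum of the individual counts. A minimal sketch on hypothetical sample values (the names combined and separate are illustrative):

```python
import pandas as pd

# Hypothetical sample data; 'XYX' and 'XYXY' contain overlapping patterns.
df = pd.DataFrame({'col_1': ['XY-YX', 'XYX', 'XYXY', 'ZZ']})

# One pass with an or'ed pattern: matches cannot overlap.
combined = df['col_1'].str.count('XY|YX')

# One pass per pattern, then summed: each pattern is counted on its own.
separate = df['col_1'].str.count('XY') + df['col_1'].str.count('YX')

print(combined.tolist())  # [2, 1, 2, 0]
print(separate.tolist())  # [2, 2, 3, 0]
```

If the patterns cannot overlap, both approaches give the same totals and the single or'ed pattern is the simpler choice.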
Count Each
We wish to return the number of occurrences of each pattern in a set of patterns as a vector of integer values (one value for each pattern).
In this example, for each value of the column col_1, we wish to compute the number of occurrences of the string ‘(’ minus the number of occurrences of the string ‘)’.
def diff(p_list):
    return p_list[0] - p_list[1]

df_2 = df.assign(
    col_2=df.apply(lambda x: diff([x['col_1'].count(pattern) for pattern in ['(', ')']]), axis=1))
Here is how this works:
- For each value of col_1 and for each pattern in ['(', ')'], Python's built-in str.count() returns the number of occurrences of the respective pattern, and the list comprehension collects these counts into a list.
- We use diff() to subtract the two values returned by the list comprehension.
- We use apply() with axis=1 to apply a lambda in a row-wise manner to each row of the data frame df. See Non-Vectorized Transformation.
- df_2 will be a copy of the input data frame df with an added column col_2 where each cell holds the difference between the number of occurrences of the two patterns in the corresponding value of the column col_1.
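As a possible alternative to the row-wise apply(), the per-pattern counts can also be computed one pattern at a time with the vectorized Pandas str.count(). This is a sketch under assumptions, not the approach described above: the sample data and the intermediate data frame counts are hypothetical, and each pattern is wrapped in re.escape() because parentheses are special characters in regular expressions.

```python
import re
import pandas as pd

# Hypothetical sample data with balanced and unbalanced parentheses.
df = pd.DataFrame({'col_1': ['(a(b))', '(()', 'none', ')(']})

patterns = ['(', ')']

# One column of counts per pattern; re.escape() makes '(' and ')' literal.
counts = pd.DataFrame(
    {p: df['col_1'].str.count(re.escape(p)) for p in patterns})

# Number of '(' minus number of ')' for each row.
df_2 = df.assign(col_2=counts['('] - counts[')'])

print(counts['('].tolist())    # [2, 2, 0, 1]
print(counts[')'].tolist())    # [2, 1, 0, 1]
print(df_2['col_2'].tolist())  # [0, 1, 0, 0]
```

Keeping the counts in their own data frame makes it straightforward to combine them in other ways later, for example summing them or comparing more than two patterns.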