Counting

We wish to count the number of occurrences of a given pattern in a target string.

We will cover the following common pattern occurrence counting scenarios:

  • Substring: how to obtain the number of occurrences of a given substring in a given string.
  • Regular Expression: how to obtain the number of occurrences of match of a given regular expression in a given string.
  • Word: the special case of how to count the number of words in a given string.

The scenarios above can be extended in multiple ways, the most common of which are:

  • Ignore Case: Ignoring case (of both pattern and string) while matching.
  • Pattern Column: Matching a column of patterns against a column of strings of the same size.
  • Multiple Patterns: Checking for multiple patterns at a time.

Substring

We wish to count the number of occurrences of a given substring in a given string.

In this example, we count the number of occurrences of the string ‘XY’ in each value of the column col_1.

import re

df_2 = df.assign(col_2=df['col_1'].str.count(re.escape('XY')))

Here is how this works:

  • We use the str.count() method from the str accessor set of string manipulation methods of Pandas Series to count the number of occurrences of the substring ‘XY’ in each value of the column col_1.
  • The str.count() function expects a regular expression by default. We wrap the substring in re.escape()to escape any special characters.
  • The output data frame df_2 will be a copy of the input data frame df with an added column col_2 holding the number of occurrences of the string ‘XY’ in the corresponding value of the column col_1.

Regular Expression

We wish to count the number of occurrences of a given regular expression in a given string.

In this example, we count the number of vowels (while ignoring case) in each value of the column col_1.

df_2 = df.assign(col_2=df['col_1'].str.count('[aeiou]'))

Here is how this works:

  • This works similarly to the substring scenario described above except that we pass a regular expression to str.count() instead of a substring.
  • The regular expression '[aeiou]' matches any of the characters in the square brackets (which are the five vowels in the English language). It can be thought of as an or operation.
  • The output data frame df_2 will be a copy of the input data frame df with an added column col_2 holding the number of vowels in the corresponding value of the column col_1.

Word

We wish to count the number of words in a string.

In this example, we wish to count the number of words in each value of the column col_1 and to return that as a new integer column col_2.

df_2 = df.assign(
    col_2=df['col_1'].str.count('\w+'))

Here is how this works:

  • This works similarly to the substring scenario described above in Regular Expression.
  • We use the regular expression ‘\w+’ where:
    • \w represents any ‘word’ character; i.e. letters, digits or underscore.
    • + specifies that we are looking for one or more ‘word’ characters.
  • The output data frame df_2 is a copy of the input data frame df with an additional column col_2 where each row holds the number of words in the corresponding value of the column col_1.

Ignore Case

We wish to count the number of occurrences of a given substring in a given string while ignoring case.

In this example, we count the number of occurrences of the string ‘xy’ in each value of the column col_1 while ignoring case.

import re

df_2 = df.assign(col_2=df['col_1'].str.lower().str.count(re.escape('xy')))

Here is how this works:

  • To ignore the case while matching, we use str.lower() to convert values in col_1 to lower case, and we pass a lower case expression. See Ignore Case under Detecting for more details.
  • We can extend the solutions presented in Regular Expression in the same way.

Alternative: Via re.IGNORECASE

import re

df_2 = df.assign(col_2=df['col_1'].str.count(re.escape('xy'), flags=re.IGNORECASE))

Here is how this works:

  • This code is similar to the code under Regular Expression above except that we pass flags=re.IGNORECASE to the second argument of str.count() to perform case-insensitive matching.
  • We can extend the solutions presented in Substring in the same way.

Pattern Column

We wish to count the number of occurrences of a value of one column in the value of another column for each row.

In this example, we have a data frame df with two column col_1 and col_2, we wish to count the number of occurrences of the value of col_2 in col_1.

df_2 = df.assign(
    col_3=df.apply(lambda x: x['col_1'].count(x['col_2']), axis=1))

Here is how this works:

  • str.count() is not vectorized over the pattern, so we use python's count() function which works on string.
  • We need to use apply() with axis=1 to apply a lambda in a row wise manner to each row of the data frame df. See Non-Vectorized Transformation.
  • For each row, we will return the number of occurrences of the value of col_2 in col_1 as an integer.
  • The output data frame df_2 will be a copy of the input data frame df_1 with an added column col_3 holding the number of occurrences of the value of col_2 in col_1 for the corresponding row.
  • We can extend the solutions presented in Regular Expression in the same way.

Multiple Patterns

Count All

We wish to return the total number of occurrences of all patterns as a single integer value. In other words, we wish to return the sum of occurrences of a given set of patterns in a given string.

df_2 = df.assign(col_2=df['col_1'].str.count('XY|YX'))

Here is how this works:

  • We use the or operator | of regular expressions to build a regular expression that captures all the patterns we wish to look for or’ed together. In this case that regular expression is 'XY|YX'.
  • For each value of col_1, str.count() will return the total number of occurrences of all patterns.
  • The output data frame df_2 will be a copy of the input data frame df with an added column col_2 holding the sum of occurrences of the specified patterns in the corresponding value of the column col_1.
  • See Regular Expression above for a more detailed description.

Count Each

We wish to return the number of occurrences of each pattern in a set of patterns as a vector of integer values (one value for each pattern).

In this example, for each value of the column col_1, we wish to compute the difference between the number of occurrences of the string ‘)’ and the string ‘(’.

def diff(p_list):
    return p_list[0] - p_list[1]

df_2 = df.assign(
    col_2=df.apply(lambda x: diff([x['col_1'].count(pattern) for pattern in ['(', ')']]), axis=1))

Here is how this works:

  • We use list comprehension to count the occurrence of each string and append it to a list.
  • For each value of col_1 and for each pattern in ['(', ')'], str.count() will return holding the number of occurrences of the respective pattern.
  • We use the function diff() to subtract the two values returned the list comprehension.
  • We need to use apply() with axis=1 to apply a lambda in a row wise manner to each row of the data frame df. See Non-Vectorized Transformation.
  • The output data frame df_2 will be a copy of the input data frame df with an added column col_2 where each cell holds the difference between the number of occurrences of the two patterns in the corresponding value of the column col_1.
PYTHON
I/O