Counting

We wish to count the number of occurrences of a given pattern in a target string.

We will cover the following common scenarios for counting pattern occurrences:

  • Substring: how to obtain the number of occurrences of a given substring in a given string.
  • Regular Expression: how to obtain the number of matches of a given regular expression in a given string.
  • Word: the special case of how to count the number of words in a given string.

The scenarios above can be extended in multiple ways, the most common of which are:

  • Ignore Case: Ignoring case (of both pattern and string) while matching.
  • Pattern Column: Matching a vector of patterns against a vector of strings of the same size.
  • Multiple Patterns: Checking for multiple patterns at a time.

Substring

We wish to count the number of occurrences of a given substring in a given string.

In this example, we count the number of occurrences of the string ‘XY’ in each value of the column col_1.

df_2 = df %>% 
  mutate(col_2 = str_count(col_1, fixed('XY')))

Here is how this works:

  • We use str_count() from the stringr package (part of the tidyverse) to count the number of occurrences of the substring ‘XY’ in each value of the column col_1.
  • The str_count() function takes the following arguments:
    • The column whose values we wish to check against; which in this case is col_1.
    • The substring whose occurrences we wish to count; in this case ‘XY’. We wrap the substring in the helper fixed() because by default str_count() assumes that the pattern passed is a regular expression. fixed() specifies that the pattern is a fixed string.
  • The output data frame df_2 will be a copy of the input data frame df with an added column col_2 holding the number of occurrences of the string ‘XY’ in the corresponding value of the column col_1.
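
As a minimal sketch of the above, assuming a small made-up data frame (the values of col_1 here are hypothetical and only serve as illustration), we can check the counts and see why fixed() matters when the substring contains a regular expression metacharacter:

```r
library(dplyr)
library(stringr)

# Hypothetical sample data (not from the original text)
df = tibble(col_1 = c('XY', 'aXYbXY', 'xy', 'X.Y', NA))

df_2 = df %>%
  mutate(col_2 = str_count(col_1, fixed('XY')))

df_2$col_2
# 1 2 0 0 NA

# Why fixed() matters: as a regular expression, '.' matches any character
str_count('a.b.c', '.')         # 5
str_count('a.b.c', fixed('.'))  # 2
```

Note that matching is case-sensitive ('xy' yields 0) and that a missing value in col_1 yields NA rather than 0.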

Regular Expression

We wish to count the number of occurrences of a given regular expression in a given string.

In this example, we count the number of lowercase vowels in each value of the column col_1. See Ignore Case below for case-insensitive matching.

df_2 = df %>% 
  mutate(col_2 = str_count(col_1, '[aeiou]'))

Here is how this works:

  • This works similarly to the substring scenario described above except that we pass a regular expression to str_count() instead of a substring.
  • The regular expression '[aeiou]' matches any one of the characters in the square brackets (which are the five lowercase vowels of the English language). It can be thought of as an or operation. Matching is case-sensitive by default, so uppercase vowels are not counted.
  • The output data frame df_2 will be a copy of the input data frame df with an added column col_2 holding the number of lowercase vowels in the corresponding value of the column col_1.
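
The following sketch, using hypothetical values for col_1, illustrates the counts we would expect. Note that '[aeiou]' as written only matches lowercase vowels:

```r
library(dplyr)
library(stringr)

# Hypothetical sample data (not from the original text)
df = tibble(col_1 = c('banana', 'APPLE', 'Orange', 'rhythm'))

df_2 = df %>%
  mutate(col_2 = str_count(col_1, '[aeiou]'))

df_2$col_2
# 3 0 2 0
# 'APPLE' yields 0 and the 'O' in 'Orange' is skipped because
# the character class lists lowercase letters only.
```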

Word

We wish to count the number of words in a string.

In this example, we wish to count the number of words in each value of the column col_1 and to return that as a new integer column col_2.

df_2 = df %>% 
  mutate(col_2 = str_count(col_1, boundary("word")))

Here is how this works:

  • In boundary("word"), we match the boundaries between words. boundary() is a stringr function that matches boundaries between characters, lines, sentences, or words depending on the argument passed to it.
  • We pass boundary("word") as the pattern to be matched to the second argument of str_count().
  • The output data frame df_2 is a copy of the input data frame df with an additional column col_2 where each row holds the number of words in the corresponding value of the column col_1.
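
A short sketch of word counting on hypothetical strings follows. It assumes the default behaviour of boundary("word"), under which punctuation and whitespace are not counted as words:

```r
library(dplyr)
library(stringr)

# Hypothetical sample data (not from the original text)
df = tibble(col_1 = c('one two three', 'hello', '  spaced   out  ', 'Hello, world!'))

df_2 = df %>%
  mutate(col_2 = str_count(col_1, boundary("word")))

df_2$col_2
# 3 1 2 2
```

Repeated or leading spaces do not inflate the count, which is one reason to prefer this over counting spaces.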

Alternative: Via Regular Expression

df_2 = df %>% 
  mutate(col_2 = str_count(col_1, '\\w+'))

Here is how this works:

  • We use the regular expression ‘\\w+’ where:
    • \w represents any ‘word’ character; i.e. letters, digits or underscore.
    • + specifies that we are looking for one or more ‘word’ characters.
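  • For plain text made of letters, digits and spaces, the output is the same as the primary solution above. The two can differ otherwise; for example, '\\w+' splits a contraction at the apostrophe and counts it as two matches, while boundary("word") counts it as one word.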
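
To compare the two approaches side by side, here is a small sketch on two hypothetical strings. For space-separated plain words both give the same count, while a contraction is treated differently by the regular expression:

```r
library(stringr)

# Hypothetical inputs (not from the original text)
x = c('one two three', "don't stop")

str_count(x, boundary("word"))  # 3 2
str_count(x, '\\w+')            # 3 3  ("don", "t" and "stop" match separately)
```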

Ignore Case

Substring

We wish to count the number of occurrences of a given substring in a given string while ignoring case.

In this example, we count the number of occurrences of the string ‘xy’ in each value of the column col_1 while ignoring case.

df_2 = df %>% 
  mutate(col_2 = str_count(col_1, fixed('xy', ignore_case=TRUE)))

Here is how this works:

  • This code is similar to the code under Substring above except that we pass to fixed() the argument ignore_case=TRUE which specifies that we wish to ignore case while matching.
  • The default is ignore_case=FALSE. Therefore, we do not need to set ignore_case when we wish to perform case-sensitive matching.
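
As an illustration, the sketch below contrasts case-insensitive and case-sensitive counting on hypothetical data. The column name col_3 is introduced here only for the comparison:

```r
library(dplyr)
library(stringr)

# Hypothetical sample data (not from the original text)
df = tibble(col_1 = c('xy', 'XY', 'xY-Xy', 'ab'))

df_2 = df %>%
  mutate(
    col_2 = str_count(col_1, fixed('xy', ignore_case=TRUE)),  # ignores case
    col_3 = str_count(col_1, fixed('xy'))                     # case-sensitive
  )

df_2$col_2
# 1 1 2 0
df_2$col_3
# 1 0 0 0
```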

Regular Expression

df_2 = df %>% 
  mutate(col_2 = str_count(col_1, regex('[aeiou]', ignore_case=TRUE)))

Here is how this works:

  • This code is similar to the code under Regular Expression above except that we pass regex('[aeiou]', ignore_case=TRUE) to the second argument of str_count() to perform case-insensitive matching.
  • The default is ignore_case=FALSE. Therefore, we do not need to set ignore_case when we wish to perform case-sensitive matching. Also, str_count() expects a regular expression by default, so when we do not need to pass arguments to regex() we can simply pass the regular expression to str_count() as we do in Regular Expression above.
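
A minimal sketch on hypothetical data is shown below. As an alternative that is not part of the solution above, the inline flag (?i) can be placed at the start of the pattern to request case-insensitive matching without calling regex():

```r
library(dplyr)
library(stringr)

# Hypothetical sample data (not from the original text)
df = tibble(col_1 = c('banana', 'APPLE', 'Orange'))

df_2 = df %>%
  mutate(col_2 = str_count(col_1, regex('[aeiou]', ignore_case=TRUE)))

df_2$col_2
# 3 2 3

# Alternative (assumption: inline flag instead of regex())
str_count(df$col_1, '(?i)[aeiou]')
# 3 2 3
```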

Pattern Column

We wish to count the number of occurrences of a value of one column in the value of another column for each row.

In this example, we have a data frame df with two columns, col_1 and col_2, and we wish to count the number of occurrences of the value of col_2 in the value of col_1 for each row.

df_2 = df %>% 
  mutate(col_3 = str_count(col_1, col_2))

Here is how this works:

  • str_count() is vectorized over both the string and the pattern and can operate in one of three modes:
    • Count the occurrences of one pattern in each element in a vector of strings. This is the mode we used in all the above scenarios.
    • Count the occurrences of n patterns in n strings, pairing them element by element. Both vectors must have the same length for this mode (or lengths that stringr can recycle to a common length). This is the mode we use in this solution since we have a vector of strings, the column col_1, and a vector of patterns of the same size, the column col_2.
    • Count the occurrences of multiple patterns in a single string. This is the mode we use in Count Each under Multiple Patterns below.
  • We pass to str_count():
    • the strings to look into which in this case is the column col_1 as the first argument
    • and the patterns to count which in this case is the column col_2 and is naturally of the same size as col_1.
  • For each row, str_count() will return the number of occurrences of the value of col_2 in col_1 as an integer.
  • The output data frame df_2 will be a copy of the input data frame df with an added column col_3 holding the number of occurrences of the value of col_2 in col_1 for the corresponding row.
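
The row-by-row pairing can be sketched as follows on hypothetical data. Since str_count() treats patterns as regular expressions by default, the values of col_2 are interpreted as regular expressions; wrapping the column in fixed() (shown here as col_4, a name introduced only for this sketch) treats them as literal substrings instead:

```r
library(dplyr)
library(stringr)

# Hypothetical sample data (not from the original text)
df = tibble(
  col_1 = c('banana', 'mississippi', 'a.b.c'),
  col_2 = c('an', 'ss', '.')
)

df_2 = df %>%
  mutate(
    col_3 = str_count(col_1, col_2),         # col_2 read as regular expressions
    col_4 = str_count(col_1, fixed(col_2))   # col_2 read as literal substrings
  )

df_2$col_3
# 2 2 5   ('.' as a regular expression matches every character of 'a.b.c')
df_2$col_4
# 2 2 2
```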

Multiple Patterns

Count All

We wish to return the total number of occurrences of all patterns as a single integer value. In other words, we wish to return the sum of occurrences of a given set of patterns in a given string.

df_2 = df %>% 
  mutate(col_2 = str_count(col_1, 'XY|YX'))

Here is how this works:

  • We use the or operator | of regular expressions to build a regular expression that captures all the patterns we wish to look for or’ed together. In this case that regular expression is 'XY|YX'.
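  • For each value of col_1, str_count() will return the total number of non-overlapping matches of any of the patterns. Because matches cannot overlap, this can be smaller than the sum of the individual counts when occurrences overlap; for example, 'XYX' yields 1.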
  • The output data frame df_2 will be a copy of the input data frame df with an added column col_2 holding the sum of occurrences of the specified patterns in the corresponding value of the column col_1.
  • See Regular Expression above for a more detailed description.
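
The sketch below, on hypothetical data, shows the combined count next to the sum of the individual counts (col_3, a column introduced only for this comparison). The two agree unless occurrences of the patterns overlap:

```r
library(dplyr)
library(stringr)

# Hypothetical sample data (not from the original text)
df = tibble(col_1 = c('XY', 'XYYX', 'XYX', 'ab'))

df_2 = df %>%
  mutate(
    col_2 = str_count(col_1, 'XY|YX'),                        # one alternation
    col_3 = str_count(col_1, 'XY') + str_count(col_1, 'YX')   # sum of separate counts
  )

df_2$col_2
# 1 2 1 0
df_2$col_3
# 1 2 2 0   (in 'XYX' the occurrences of 'XY' and 'YX' overlap)
```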

Count Each

We wish to return the number of occurrences of each pattern in a set of patterns as a vector of integer values (one value for each pattern).

In this example, for each value of the column col_1, we wish to compute the difference between the number of occurrences of the string ‘X’ and the string ‘Y’.

df_2 = df %>% 
  rowwise() %>% 
  mutate(col_2 = diff(str_count(col_1, c('X', 'Y'))))

Here is how this works:

  • We pass to str_count() a vector of patterns c('X', 'Y').
  • For each value of col_1, str_count() will return a vector of two values holding the number of occurrences of the two patterns 'X' and 'Y'.
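  • We use the function diff() to subtract the two values returned by str_count(). diff() computes the second value minus the first, so col_2 holds the number of occurrences of 'Y' minus the number of occurrences of 'X'.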
  • In this solution, we use rowwise() to execute str_count() for each value of the column col_1 individually. See Non-Vectorized Transformation.
  • The output data frame df_2 will be a copy of the input data frame df with an added column col_2 where each cell holds the difference between the number of occurrences of the two patterns in the corresponding value of the column col_1.
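
Here is a minimal sketch on hypothetical data. For this particular computation, a vectorized alternative that avoids rowwise() is to call str_count() once per pattern and subtract; this is a swapped-in technique shown for comparison, not the solution described above:

```r
library(dplyr)
library(stringr)

# Hypothetical sample data (not from the original text)
df = tibble(col_1 = c('XXY', 'XYY', 'XY'))

# Row-by-row: one vector of two counts per value of col_1
df_2 = df %>%
  rowwise() %>%
  mutate(col_2 = diff(str_count(col_1, c('X', 'Y'))))

df_2$col_2
# -1 1 0   (count of 'Y' minus count of 'X')

# Vectorized alternative (assumption: only two patterns are compared)
df_3 = df %>%
  mutate(col_2 = str_count(col_1, 'Y') - str_count(col_1, 'X'))

df_3$col_2
# -1 1 0
```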