Counting

We wish to count the number of occurrences of a given pattern in a target string.

We will cover the following common pattern occurrence counting scenarios:

Substring: how to obtain the number of occurrences of a given substring in a given string.
Regular Expression: how to obtain the number of matches of a given regular expression in a given string.
Word: the special case of how to count the number of words in a given string.

The scenarios above can be extended in multiple ways, the most common of which are:

Ignore Case: Ignoring case (of both pattern and string) while matching.
Pattern Column: Matching a vector of patterns against a vector of strings of the same size.
Multiple Patterns: Checking for multiple patterns at a time.

Substring

We wish to count the number of occurrences of a given substring in a given string.

In this example, we count the number of occurrences of the string ‘XY’ in each value of the column col_1.

df_2 = df %>% 
  mutate(col_2 = str_count(col_1, fixed('XY')))

Here is how this works:

We use str_count() from the stringr package (part of the tidyverse) to count the number of occurrences of the substring ‘XY’ in each value of the column col_1.
The str_count() function takes the following arguments:
- The column whose values we wish to check against; which in this case is col_1.
- The substring whose occurrences we wish to count; in this case ‘XY’. We wrap the substring in the helper fixed() because by default str_count() assumes that the pattern passed is a regular expression. fixed() specifies that the pattern is a fixed string.
The output data frame df_2 will be a copy of the input data frame df with an added column col_2 holding the number of occurrences of the string ‘XY’ in the corresponding value of the column col_1.
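
To illustrate why fixed() matters, here is a minimal sketch that counts a pattern containing a regular expression metacharacter. The sample values in col_1 are hypothetical and chosen only to make the difference visible.

```r
library(stringr)

# Hypothetical sample values; '.' is a metacharacter in regular expressions.
col_1 <- c("a.b.c", "abc", "...")

# As a fixed string, '.' matches only literal periods.
fixed_counts <- str_count(col_1, fixed("."))

# As a regular expression (the default), '.' matches any character,
# so the result is effectively the number of characters in each value.
regex_counts <- str_count(col_1, ".")
```

For a plain substring such as 'XY' both forms return the same counts; the difference shows up only when the substring contains characters that are special in regular expressions.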

Regular Expression

We wish to count the number of occurrences of a given regular expression in a given string.

In this example, we count the number of lowercase vowels in each value of the column col_1. Matching is case-sensitive here; see Ignore Case below for the case-insensitive version.

df_2 = df %>% 
  mutate(col_2 = str_count(col_1, '[aeiou]'))

Here is how this works:

This works similarly to the substring scenario described above except that we pass a regular expression to str_count() instead of a substring.
The regular expression '[aeiou]' matches any one of the characters in the square brackets (the five lowercase vowels of the English language). It can be thought of as an or operation.
The output data frame df_2 will be a copy of the input data frame df with an added column col_2 holding the number of lowercase vowels in the corresponding value of the column col_1.
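
As a small worked example, the sketch below applies the same call to a hypothetical three-row data frame. It also shows that the uppercase 'A' in "Apple" is not counted, because the character class lists lowercase letters only.

```r
library(dplyr)
library(stringr)

# Hypothetical input data frame.
df <- tibble(col_1 = c("banana", "Apple", "xyz"))

# Count matches of the character class [aeiou] in each value of col_1.
df_2 <- df %>%
  mutate(col_2 = str_count(col_1, "[aeiou]"))

# col_2 is 3 for "banana", 1 for "Apple" (only the 'e'), and 0 for "xyz".
```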

Word

We wish to count the number of words in a string.

In this example, we wish to count the number of words in each value of the column col_1 and to return that as a new integer column col_2.

df_2 = df %>% 
    mutate(col_2 = str_count(col_1, boundary("word")))

Here is how this works:

In boundary("word"), we match the boundaries between words. boundary() is a stringr function that matches boundaries between characters, lines, sentences, or words depending on the argument passed to it.
We pass boundary("word") as the pattern to be matched to the second argument of str_count().
The output data frame df_2 is a copy of the input data frame df with an additional column col_2 where each row holds the number of words in the corresponding value of the column col_1.
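
The following sketch shows the word count on a few hypothetical strings, including one with repeated spaces and one with punctuation, neither of which is counted as a word.

```r
library(stringr)

# Hypothetical sample values.
col_1 <- c("The quick brown fox", "These are   some words.", "Hello, world!")

# Count the words in each value; whitespace and punctuation are skipped.
word_counts <- str_count(col_1, boundary("word"))
```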

Alternative: Via Regular Expression

df_2 = df %>% 
  mutate(col_2 = str_count(col_1, '\\w+'))

Here is how this works:

We use the regular expression ‘\\w+’ where:
- \w represents any ‘word’ character; i.e. letters, digits or underscore.
- + specifies that we are looking for one or more ‘word’ characters.
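For simple text the output is the same as the primary solution above. The two can differ when a word contains a character that \w does not match, such as the apostrophe in a contraction, because the regular expression then splits that word into two matches.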
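
To make the comparison concrete, here is a minimal sketch that runs both approaches on two hypothetical strings. The second string contains a contraction, which is where the two approaches are expected to disagree.

```r
library(stringr)

# Hypothetical sample values; the second contains an apostrophe.
col_1 <- c("one two", "don't stop")

# Word boundaries treat "don't" as a single word.
boundary_counts <- str_count(col_1, boundary("word"))

# '\\w+' stops at the apostrophe, so "don't" yields two matches: "don" and "t".
regex_counts <- str_count(col_1, "\\w+")
```

Which result is preferable depends on how we wish to define a word; boundary("word") is usually closer to the everyday notion.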

Ignore Case

Substring

We wish to count the number of occurrences of a given substring in a given string while ignoring case.

In this example, we count the number of occurrences of the string ‘xy’ in each value of the column col_1 while ignoring case.

df_2 = df %>% 
  mutate(col_2 = str_count(col_1, fixed('xy', ignore_case=TRUE)))

Here is how this works:

This code is similar to the code under Substring above except that we pass to fixed() the argument ignore_case=TRUE which specifies that we wish to ignore case while matching.
The default is ignore_case=FALSE. Therefore, we do not need to set ignore_case when we wish to perform case-sensitive matching.
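
A brief sketch contrasting the two settings on hypothetical values:

```r
library(stringr)

# Hypothetical sample values mixing upper and lower case.
col_1 <- c("xyXY", "Xy-xY", "ab")

# Case-insensitive: 'xy', 'XY', 'Xy' and 'xY' all count.
insensitive_counts <- str_count(col_1, fixed("xy", ignore_case = TRUE))

# Case-sensitive (the default): only the exact string 'xy' counts.
sensitive_counts <- str_count(col_1, fixed("xy"))
```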

Regular Expression

df_2 = df %>% 
  mutate(col_2 = str_count(col_1, regex('[aeiou]', ignore_case=TRUE)))

Here is how this works:

This code is similar to the code under Regular Expression above except that we pass regex('[aeiou]', ignore_case=TRUE) to the second argument of str_count() to perform case-insensitive matching.
The default is ignore_case=FALSE. Therefore, we do not need to set ignore_case when we wish to perform case-sensitive matching. Also, str_count() expects a regular expression by default, so when we do not need to pass arguments to regex() we can simply pass the regular expression to str_count() directly, as we do in Regular Expression above.
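
The sketch below counts vowels regardless of case in a few hypothetical strings. It also shows an alternative spelling using the inline flag (?i), which the regular expression engine used by stringr accepts; this alternative is our addition and is not part of the solution above.

```r
library(stringr)

# Hypothetical sample values.
col_1 <- c("Apple", "AEIOU", "rhythm")

# Case-insensitive matching via the regex() helper.
helper_counts <- str_count(col_1, regex("[aeiou]", ignore_case = TRUE))

# The same idea via an inline flag at the start of the pattern.
inline_counts <- str_count(col_1, "(?i)[aeiou]")
```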

Pattern Column

We wish to count the number of occurrences of a value of one column in the value of another column for each row.

In this example, we have a data frame df with two columns, col_1 and col_2. For each row, we wish to count the number of occurrences of the value of col_2 in the value of col_1.

df_2 = df %>% 
  mutate(col_3 = str_count(col_1, col_2))

Here is how this works:

str_count() is vectorized over both the string and the pattern and can operate in one of three modes:
- Count the occurrences of one pattern in each element in a vector of strings. This is the mode we used in all the above scenarios.
- Count n patterns against n strings, pairing them element by element. Both vectors must have the same size for this mode (recent versions of stringr recycle only a vector of length one). This is the mode we use in this solution since we have a vector of strings, the column col_1, and a vector of patterns of the same size, the column col_2.
- Count the occurrences of multiple patterns against a single string. This is the mode we use in Multiple Patterns below.
We pass to str_count():
- the strings to look into which in this case is the column col_1 as the first argument
- and the patterns to count which in this case is the column col_2 and is naturally of the same size as col_1.
For each row, str_count() will return the number of occurrences of the value of col_2 in col_1 as an integer.
The output data frame df_2 will be a copy of the input data frame df with an added column col_3 holding the number of occurrences of the value of col_2 in col_1 for the corresponding row.
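
Here is a minimal sketch of the row-by-row pairing on a hypothetical data frame. Because str_count() treats each pattern as a regular expression by default, the sketch also shows the pattern column wrapped in fixed(), which may be the safer choice when the values of col_2 are meant as literal substrings; the column names col_3_regex and col_3_fixed are ours, for illustration only.

```r
library(dplyr)
library(stringr)

# Hypothetical input: the last pattern, '.', is a regex metacharacter.
df <- tibble(
  col_1 = c("banana", "apple", "a.b.c"),
  col_2 = c("an", "p", ".")
)

df_2 <- df %>%
  mutate(
    # Each value of col_2 is interpreted as a regular expression.
    col_3_regex = str_count(col_1, col_2),
    # Each value of col_2 is interpreted as a literal substring.
    col_3_fixed = str_count(col_1, fixed(col_2))
  )
```

The two columns agree for the first two rows and differ for the third, where '.' as a regular expression matches every character.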

Multiple Patterns

Count All

We wish to return the total number of occurrences of all patterns as a single integer value. In other words, we wish to return the sum of occurrences of a given set of patterns in a given string.

df_2 = df %>% 
  mutate(col_2 = str_count(col_1, 'XY|YX'))

Here is how this works:

We use the or operator | of regular expressions to build a regular expression that captures all the patterns we wish to look for or’ed together. In this case that regular expression is 'XY|YX'.
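For each value of col_1, str_count() will return the total number of non-overlapping matches of any of the patterns. When occurrences of different patterns overlap in the string (e.g. 'XY' and 'YX' in 'XYX'), this total is smaller than the sum of the individual counts.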
The output data frame df_2 will be a copy of the input data frame df with an added column col_2 holding the sum of occurrences of the specified patterns in the corresponding value of the column col_1.
See Regular Expression above for a more detailed description.
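
The sketch below compares the combined pattern with summing the counts of each pattern separately, using hypothetical values. The second value, 'XYX', is a case where an occurrence of 'XY' overlaps an occurrence of 'YX', so the two approaches give different totals.

```r
library(stringr)

# Hypothetical sample values.
col_1 <- c("XYYX", "XYX", "ZZ")

# One pass with an or'ed regular expression: matches do not overlap.
combined_counts <- str_count(col_1, "XY|YX")

# Counting each pattern on its own and adding the results.
separate_counts <- str_count(col_1, fixed("XY")) + str_count(col_1, fixed("YX"))
```

Which of the two is appropriate depends on whether overlapping occurrences should be counted once or twice.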

Count Each

We wish to return the number of occurrences of each pattern in a set of patterns as a vector of integer values (one value for each pattern).

In this example, for each value of the column col_1, we wish to compute the difference between the number of occurrences of the string ‘X’ and the string ‘Y’.

df_2 = df %>% 
  rowwise() %>% 
  mutate(col_2 = diff(str_count(col_1, c('X', 'Y'))))

Here is how this works:

We pass to str_count() a vector of patterns c('X', 'Y').
For each value of col_1, str_count() will return a vector of two values holding the number of occurrences of the two patterns 'X' and 'Y'.
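We use the function diff() to subtract the two values returned by str_count(). diff() computes the second value minus the first, so the result is the number of occurrences of 'Y' minus the number of occurrences of 'X'.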
In this solution, we use rowwise() to execute str_count() for each value of the column col_1 individually. See Non-Vectorized Transformation.
The output data frame df_2 will be a copy of the input data frame df with an added column col_2 where each cell holds the difference between the number of occurrences of the two patterns in the corresponding value of the column col_1.
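
As a worked example, the sketch below runs the solution on a hypothetical data frame and adds ungroup() at the end to drop the row-wise grouping, which is a common follow-up step. It also computes the same quantity with two vectorized calls, an alternative of ours that avoids rowwise() when only two patterns are involved.

```r
library(dplyr)
library(stringr)

# Hypothetical input data frame.
df <- tibble(col_1 = c("XXY", "XYY", "XY"))

# Row by row: count 'X' and 'Y', then take count('Y') - count('X').
df_2 <- df %>%
  rowwise() %>%
  mutate(col_2 = diff(str_count(col_1, c("X", "Y")))) %>%
  ungroup()

# Vectorized alternative for the two-pattern case.
vectorized <- str_count(df$col_1, "Y") - str_count(df$col_1, "X")
```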
