We wish to count the number of occurrences of a given pattern in a target string.
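We will cover the following common pattern occurrence counting scenarios: counting the occurrences of a substring, counting the matches of a regular expression, and counting words.
The scenarios above can be extended in multiple ways, the most common of which are: ignoring case, taking the pattern from another column, and counting multiple patterns at once.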
We wish to count the number of occurrences of a given substring in a given string.
In this example, we count the number of occurrences of the string ‘XY’ in each value of the column col_1.
df_2 = df %>%
  mutate(col_2 = str_count(col_1, fixed('XY')))
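To try the snippet above on concrete data, here is a minimal sketch. The data frame df and its values are invented for illustration; the last two lines show one case where wrapping the pattern in fixed() changes the result.

```r
library(dplyr)
library(stringr)

# Hypothetical input data; the values are invented for illustration
df = tibble(col_1 = c('XYXY', 'aXYb', 'xy', NA))

df_2 = df %>%
  mutate(col_2 = str_count(col_1, fixed('XY')))

df_2$col_2
# 2 1 0 NA

# As a regular expression, '.' matches any character; fixed('.') matches a literal dot
n_regex = str_count('a.b.c', '.')         # 5
n_fixed = str_count('a.b.c', fixed('.'))  # 2
```

Matching is case-sensitive here, so 'xy' yields 0, and a missing value in col_1 yields a missing count.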
Here is how this works:
We use str_count() from the stringr package (part of the tidyverse) to count the number of occurrences of the substring ‘XY’ in each value of the column col_1.
The str_count() function takes the following arguments: the string to search in, which here is the column col_1, and the pattern to look for, which here is the substring ‘XY’. We wrap the substring in the helper fixed() because, by default, str_count() assumes that the pattern passed is a regular expression. fixed() specifies that the pattern is a fixed string.
The output data frame df_2 will be a copy of the input data frame df with an added column col_2 holding the number of occurrences of the string ‘XY’ in the corresponding value of the column col_1.
We wish to count the number of occurrences of a given regular expression in a given string.
In this example, we count the number of lowercase vowels in each value of the column col_1. The pattern used here is case-sensitive, so uppercase vowels are not counted.
df_2 = df %>%
  mutate(col_2 = str_count(col_1, '[aeiou]'))
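A small sketch of the snippet above on made-up data (the values in col_1 are assumptions chosen to show that '[aeiou]' only matches lowercase vowels):

```r
library(dplyr)
library(stringr)

# Hypothetical input data; the values are invented for illustration
df = tibble(col_1 = c('banana', 'APPLE', 'Queue'))

df_2 = df %>%
  mutate(col_2 = str_count(col_1, '[aeiou]'))

df_2$col_2
# 3 0 4
```

'APPLE' yields 0 because the character class lists lowercase letters only.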
Here is how this works:
This works like the substring scenario above, except that we pass a regular expression to str_count() instead of a substring.
The regular expression '[aeiou]' matches any one of the characters in the square brackets (which are the five vowels in the English language, in lowercase). It can be thought of as an or operation.
The output data frame df_2 will be a copy of the input data frame df with an added column col_2 holding the number of lowercase vowels in the corresponding value of the column col_1.
We wish to count the number of words in a string.
In this example, we wish to count the number of words in each value of the column col_1 and to return that as a new integer column col_2.
df_2 = df %>%
  mutate(col_2 = str_count(col_1, boundary("word")))
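As a quick illustration of the snippet above, the following sketch runs it on a few invented strings (the data frame and its values are assumptions):

```r
library(dplyr)
library(stringr)

# Hypothetical input data; the values are invented for illustration
df = tibble(col_1 = c('the quick brown fox', 'hello', 'one, two'))

df_2 = df %>%
  mutate(col_2 = str_count(col_1, boundary('word')))

df_2$col_2
# 4 1 2
```

Spaces and punctuation are not counted as words, so 'one, two' yields 2.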
Here is how this works:
Using boundary("word"), we match the boundaries between words. boundary() is a stringr function that matches boundaries between characters, lines, sentences, or words depending on the argument passed to it.
We pass boundary("word") as the pattern to be matched to the second argument of str_count().
The output data frame df_2 is a copy of the input data frame df with an additional column col_2 where each row holds the number of words in the corresponding value of the column col_1.
Alternative: Via Regular Expression
df_2 = df %>%
  mutate(col_2 = str_count(col_1, '\\w+'))
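A minimal sketch of the regular expression approach on invented data. The values are assumptions picked to show how '\\w+' treats underscores, digits and apostrophes:

```r
library(dplyr)
library(stringr)

# Hypothetical input data; the values are invented for illustration
df = tibble(col_1 = c('the quick brown fox', 'snake_case_name v2', "don't stop"))

df_2 = df %>%
  mutate(col_2 = str_count(col_1, '\\w+'))

df_2$col_2
# 4 2 3
```

Underscores and digits are 'word' characters, so 'snake_case_name' counts as one word, while the apostrophe is not, so "don't" is counted as two runs ('don' and 't').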
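Here is how this works:
We use the regular expression ‘\\w+’ (the backslash is doubled because it must be escaped inside an R string) where:
\w represents any ‘word’ character; i.e. letters, digits or underscore.
+ specifies that we are looking for one or more ‘word’ characters.
Substring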
We wish to count the number of occurrences of a given substring in a given string while ignoring case.
In this example, we count the number of occurrences of the string ‘xy’ in each value of the column col_1 while ignoring case.
df_2 = df %>%
  mutate(col_2 = str_count(col_1, fixed('xy', ignore_case=TRUE)))
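The sketch below runs the snippet above on invented data and, for comparison, adds a case-sensitive count in a column named col_3 (that column name and the data values are assumptions for illustration):

```r
library(dplyr)
library(stringr)

# Hypothetical input data; the values are invented for illustration
df = tibble(col_1 = c('XY xy Xy', 'abc', 'axyb'))

df_2 = df %>%
  mutate(
    col_2 = str_count(col_1, fixed('xy', ignore_case = TRUE)),  # ignores case
    col_3 = str_count(col_1, fixed('xy'))                       # case-sensitive
  )

df_2$col_2
# 3 0 1
df_2$col_3
# 1 0 1
```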
Here is how this works:
This works like the substring scenario above, except that we pass to fixed() the argument ignore_case=TRUE, which specifies that we wish to ignore case while matching.
The default is ignore_case=FALSE. Therefore, we do not need to set ignore_case when we wish to perform case-sensitive matching.
Regular Expression
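In this example, we count the number of vowels in each value of the column col_1 while ignoring case.
df_2 = df %>%
  mutate(col_2 = str_count(col_1, regex('[aeiou]', ignore_case=TRUE)))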
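A short sketch of the case-insensitive regular expression count on invented data (the values are assumptions for illustration):

```r
library(dplyr)
library(stringr)

# Hypothetical input data; the values are invented for illustration
df = tibble(col_1 = c('banana', 'APPLE', 'Queue'))

df_2 = df %>%
  mutate(col_2 = str_count(col_1, regex('[aeiou]', ignore_case = TRUE)))

df_2$col_2
# 3 2 4
```

With ignore_case = TRUE the uppercase 'A' and 'E' in 'APPLE' are counted as well.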
Here is how this works:
We pass regex('[aeiou]', ignore_case=TRUE) to the second argument of str_count() to perform case-insensitive matching.
The default is ignore_case=FALSE. Therefore, we do not need to set ignore_case when we wish to perform case-sensitive matching. Also, str_count() expects a regular expression by default, so when we do not need to pass arguments to regex() we can simply pass the regular expression to str_count() like we do in Regular Expression above.
We wish to count the number of occurrences of a value of one column in the value of another column for each row.
In this example, we have a data frame df with two columns, col_1 and col_2, and we wish to count the number of occurrences of the value of col_2 in col_1.
df_2 = df %>%
  mutate(col_3 = str_count(col_1, col_2))
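Here is a minimal sketch of the snippet above on invented data (both columns and their values are assumptions). Note that the values of col_2 are interpreted as regular expressions, since they are passed to str_count() without a modifier.

```r
library(dplyr)
library(stringr)

# Hypothetical input data; the values are invented for illustration
df = tibble(
  col_1 = c('abcabc', 'hello world', 'aaa'),
  col_2 = c('abc', 'o', 'b')
)

# Row by row: count how often the value of col_2 occurs in the value of col_1
df_2 = df %>%
  mutate(col_3 = str_count(col_1, col_2))

df_2$col_3
# 2 2 0
```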
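Here is how this works:
str_count() is vectorized over both the string and the pattern and can operate in one of three modes: a vector of strings with a single pattern, a single string with a vector of patterns, or a vector of strings with a vector of patterns of the same size. Here we use the third mode: we have a vector of strings, the column col_1, and a vector of patterns of the same size, the column col_2.
We pass two arguments to str_count(): col_1 as the first argument, and col_2 as the second argument, which is a column of the same data frame and is naturally of the same size as col_1.
For each row, str_count() will return the number of occurrences of the value of col_2 in col_1 as an integer.
The output data frame df_2 will be a copy of the input data frame df with an added column col_3 holding the number of occurrences of the value of col_2 in col_1 for the corresponding row.
Count All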
We wish to return the total number of occurrences of all patterns as a single integer value. In other words, we wish to return the sum of occurrences of a given set of patterns in a given string.
df_2 = df %>%
  mutate(col_2 = str_count(col_1, 'XY|YX'))
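The sketch below applies the snippet above to invented data (the values are assumptions). It also illustrates one caveat: regular expression matches do not overlap, so when the patterns overlap in the text, the combined count can be smaller than the sum of the separate counts.

```r
library(dplyr)
library(stringr)

# Hypothetical input data; the values are invented for illustration
df = tibble(col_1 = c('XY..YX', 'XYX', 'ZZ'))

df_2 = df %>%
  mutate(col_2 = str_count(col_1, 'XY|YX'))

df_2$col_2
# 2 1 0

# In 'XYX' the two patterns overlap: counted separately they sum to 2,
# while the combined pattern finds only one non-overlapping match
separate_total = sum(str_count('XYX', c('XY', 'YX')))  # 2
```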
Here is how this works:
We use the or operator | of regular expressions to build a regular expression that captures all the patterns we wish to look for or’ed together. In this case that regular expression is 'XY|YX'.
For each value of col_1, str_count() will return the total number of occurrences of all patterns.
The output data frame df_2 will be a copy of the input data frame df with an added column col_2 holding the sum of occurrences of the specified patterns in the corresponding value of the column col_1.
Count Each
We wish to return the number of occurrences of each pattern in a set of patterns as a vector of integer values (one value for each pattern).
In this example, for each value of the column col_1, we wish to compute the difference between the number of occurrences of the string ‘Y’ and the string ‘X’ (the count of ‘Y’ minus the count of ‘X’).
df_2 = df %>%
  rowwise() %>%
  mutate(col_2 = diff(str_count(col_1, c('X', 'Y'))))
Here is how this works:
We pass to str_count() a vector of patterns c('X', 'Y').
For each value of col_1, str_count() will return a vector of two values holding the number of occurrences of the two patterns 'X' and 'Y'.
We use diff() to subtract the two values returned by str_count(). diff() returns the second value minus the first, i.e. the count of 'Y' minus the count of 'X'.
We use rowwise() to execute str_count() for each value of the column col_1 individually. See Non-Vectorized Transformation.
The output data frame df_2 will be a copy of the input data frame df with an added column col_2 where each cell holds the difference between the number of occurrences of the two patterns in the corresponding value of the column col_1.
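The steps above can be sketched on invented data as follows. The data values, the trailing ungroup() and the comparison column col_3 are assumptions added for illustration; col_3 computes the same difference with two vectorized str_count() calls, which is one possible alternative to rowwise() and not the approach described above.

```r
library(dplyr)
library(stringr)

# Hypothetical input data; the values are invented for illustration
df = tibble(col_1 = c('XXY', 'XYY', 'ZZ'))

df_2 = df %>%
  rowwise() %>%
  # str_count() returns c(count of 'X', count of 'Y'); diff() gives Y minus X
  mutate(col_2 = diff(str_count(col_1, c('X', 'Y')))) %>%
  ungroup() %>%
  # Alternative without rowwise(): subtract two vectorized counts
  mutate(col_3 = str_count(col_1, 'Y') - str_count(col_1, 'X'))

df_2$col_2
# -1 1 0
```

For a fixed pair of patterns, the vectorized subtraction avoids the row-by-row evaluation and gives the same values.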