We wish to count the number of occurrences of a given pattern in a target string.

We will cover the following common pattern occurrence counting scenarios:

- **Substring**: how to obtain the number of occurrences of a given substring in a given string.
- **Regular Expression**: how to obtain the number of matches of a given regular expression in a given string.
- **Word**: the special case of how to count the number of words in a given string.

The scenarios above can be extended in multiple ways, the most common of which are:

- **Ignore Case:** Ignoring case (of both pattern and string) while matching.
- **Pattern Column:** Matching a vector of patterns against a vector of strings of the same size.
- **Multiple Patterns:** Checking for multiple patterns at a time.

We wish to count the number of occurrences of a given substring in a given string.

In this example, we count the number of occurrences of the string `'XY'` in each value of the column `col_1`.

```
df_2 = df %>%
mutate(col_2 = str_count(col_1, fixed('XY')))
```

Here is how this works:

- We use `str_count()` from the `stringr` package (part of the `tidyverse`) to count the number of occurrences of the substring `'XY'` in each value of the column `col_1`.
- The `str_count()` function takes the following arguments:
    - The column whose values we wish to check against, which in this case is `col_1`.
    - The substring whose occurrences we wish to count, in this case `'XY'`. We wrap the substring in the helper `fixed()` because by default `str_count()` assumes that the pattern passed is a regular expression. `fixed()` specifies that the pattern is a fixed string.
- The output data frame `df_2` will be a copy of the input data frame `df` with an added column `col_2` holding the number of occurrences of the string `'XY'` in the corresponding value of the column `col_1`.
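To see why `fixed()` matters, here is a small sketch on a plain character vector rather than a data frame. The vector `x` and the result names are made up for illustration. It contrasts a pattern treated as a regular expression with the same pattern treated as a literal string.

```r
library(stringr)

# Illustrative input (not from the original example)
x <- c("a.b.c", "XYXY", "xy", NA)

# As a regular expression, "." matches any single character
n_regex_dot <- str_count(x, ".")          # 5 4 2 NA

# Wrapped in fixed(), "." is matched literally
n_fixed_dot <- str_count(x, fixed("."))   # 2 0 0 NA

# Matching is case-sensitive and matches do not overlap
n_fixed_xy <- str_count(x, fixed("XY"))   # 0 2 0 NA
```

Note that a missing input value yields a missing count rather than zero.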

We wish to count the number of occurrences of a given regular expression in a given string.

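In this example, we count the number of vowels in each value of the column `col_1`. Matching is case-sensitive by default, so only lowercase vowels are counted here; see the Ignore Case scenario for case-insensitive matching.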

```
df_2 = df %>%
mutate(col_2 = str_count(col_1, '[aeiou]'))
```

Here is how this works:

- This works similarly to the substring scenario described above except that we pass a regular expression to `str_count()` instead of a substring.
- The regular expression `'[aeiou]'` matches any of the characters in the square brackets (which are the five vowels in the English language). It can be thought of as an `or` operation.
- The output data frame `df_2` will be a copy of the input data frame `df` with an added column `col_2` holding the number of vowels in the corresponding value of the column `col_1`.
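As a quick check of how the character class behaves, the sketch below runs the same pattern on a small vector. The values in `x` and the result names are illustrative assumptions, not data from the original example.

```r
library(stringr)

# Illustrative input (not from the original example)
x <- c("banana", "Apple", "rhythm")

# '[aeiou]' matches lowercase vowels only, so the "A" in "Apple" is not counted
n_lower <- str_count(x, "[aeiou]")        # 3 1 0

# Listing both cases in the class counts uppercase vowels as well
n_both <- str_count(x, "[aeiouAEIOU]")    # 3 2 0
```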

We wish to count the number of words in a string.

In this example, we wish to count the number of words in each value of the column `col_1` and to return that as a new integer column `col_2`.

```
df_2 = df %>%
mutate(col_2 = str_count(col_1, boundary("word")))
```

Here is how this works:

- In `boundary("word")`, we match the boundaries between words. `boundary()` is a `stringr` function that matches boundaries between characters, lines, sentences, or words depending on the argument passed to it.
- We pass `boundary("word")` as the pattern to be matched to the second argument of `str_count()`.
- The output data frame `df_2` is a copy of the input data frame `df` with an additional column `col_2` where each row holds the number of words in the corresponding value of the column `col_1`.

*Alternative: Via Regular Expression*

```
df_2 = df %>%
mutate(col_2 = str_count(col_1, '\\w+'))
```

Here is how this works:

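- We use the regular expression `'\\w+'` where:
    - `\w` represents any 'word' character, i.e. letters, digits or underscore.
    - `+` specifies that we are looking for one or more 'word' characters.
- For text made up of plain words, the output is the same as the primary solution above. The two can differ when a word contains an apostrophe (e.g. `"don't"`), since an apostrophe is not a 'word' character.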
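The two word-counting approaches can be compared side by side. The following is a minimal sketch on an illustrative vector `x` (an assumption, not data from the original); it shows one case where the counts agree and one where they do not.

```r
library(stringr)

# Illustrative input (not from the original example)
x <- c("the quick brown fox", "don't stop")

# Word boundaries: "don't" is treated as a single word
n_boundary <- str_count(x, boundary("word"))   # 4 2

# Runs of word characters: the apostrophe splits "don't" into "don" and "t"
n_regex <- str_count(x, "\\w+")                # 4 3
```

Which result is preferable depends on what should count as a word in the data at hand.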

**Substring**

We wish to count the number of occurrences of a given substring in a given string while ignoring case.

In this example, we count the number of occurrences of the string `'xy'` in each value of the column `col_1` while ignoring case.

```
df_2 = df %>%
mutate(col_2 = str_count(col_1, fixed('xy', ignore_case=TRUE)))
```

Here is how this works:

- This code is similar to the code under Substring above except that we pass to `fixed()` the argument `ignore_case=TRUE` which specifies that we wish to ignore case while matching.
- The default is `ignore_case=FALSE`. Therefore, we do not need to set `ignore_case` when we wish to perform case-sensitive matching.

**Regular Expression**

```
df_2 = df %>%
mutate(col_2 = str_count(col_1, regex('[aeiou]', ignore_case=TRUE)))
```

Here is how this works:

- This code is similar to the code under Regular Expression above except that we pass `regex('[aeiou]', ignore_case=TRUE)` to the second argument of `str_count()` to perform case-insensitive matching.
- The default is `ignore_case=FALSE`. Therefore, we do not need to set `ignore_case` when we wish to perform case-sensitive matching. Also, `str_count()` expects a regular expression by default, so when we do not need to pass arguments to `regex()` we can simply pass the regular expression to `str_count()` like we do in Regular Expression above.
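The effect of `ignore_case=TRUE` can be seen in a short sketch for both helpers. The vectors `x` and `y` and the result names are illustrative assumptions.

```r
library(stringr)

# Illustrative inputs (not from the original example)
x <- c("XY xy Xy", "abc")
y <- c("Apple", "EIEIO")

# Case-sensitive (default) versus case-insensitive fixed-string matching
n_sensitive <- str_count(x, fixed("xy"))                        # 1 0
n_insensitive <- str_count(x, fixed("xy", ignore_case = TRUE))  # 3 0

# Case-insensitive regular expression matching
n_vowels <- str_count(y, regex("[aeiou]", ignore_case = TRUE))  # 2 5
```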

We wish to count the number of occurrences of a value of one column in the value of another column for each row.

In this example, we have a data frame `df` with two columns, `col_1` and `col_2`, and we wish to count the number of occurrences of the value of `col_2` in `col_1`.

```
df_2 = df %>%
mutate(col_3 = str_count(col_1, col_2))
```

Here is how this works:

- `str_count()` is vectorized over both the string and the pattern and can operate in one of three modes:
    - Count the occurrences of one pattern in each element of a vector of strings. This is the mode we used in all the scenarios above.
    - Check n patterns against n strings. The two vectors must be the same size for this mode (or a multiple). This is the mode we use in this solution since we have a vector of strings, the column `col_1`, and a vector of patterns of the same size, the column `col_2`.
    - Count the occurrences of multiple patterns in a single string. This is the mode we use in Multiple Patterns below.
- We pass to `str_count()`:
    - the strings to look into, which in this case is the column `col_1`, as the first argument
    - and the patterns to count, which in this case is the column `col_2` and is naturally of the same size as `col_1`.
- For each row, `str_count()` will return the number of occurrences of the value of `col_2` in `col_1` as an integer.
- The output data frame `df_2` will be a copy of the input data frame `df` with an added column `col_3` holding the number of occurrences of the value of `col_2` in `col_1` for the corresponding row.
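Here is a small, self-contained sketch of the row-by-row mode. The data frame contents are made up for illustration. Because the values of `col_2` are treated as regular expressions by default, the second part shows wrapping the pattern vector in `fixed()` when the patterns contain special characters.

```r
library(dplyr)
library(stringr)

# Illustrative data (not from the original example)
df <- tibble(
  col_1 = c("abcabc", "xxyy", "hello"),
  col_2 = c("abc", "x", "z")
)

# Each pattern in col_2 is counted in the col_1 value of the same row
df_2 <- df %>%
  mutate(col_3 = str_count(col_1, col_2))
# df_2$col_3 is 2 2 0

# Patterns with regex metacharacters can be matched literally via fixed()
n_literal <- str_count(c("a.b", "1+1"), fixed(c(".", "+")))   # 1 1
```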

**Count All**

We wish to return the total number of occurrences of all patterns as a single integer value. In other words, we wish to return the sum of occurrences of a given set of patterns in a given string.

```
df_2 = df %>%
mutate(col_2 = str_count(col_1, 'XY|YX'))
```

Here is how this works:

- We use the or operator `|` of regular expressions to build a regular expression that captures all the patterns we wish to look for or'ed together. In this case that regular expression is `'XY|YX'`.
- For each value of `col_1`, `str_count()` will return the total number of occurrences of all patterns.
- The output data frame `df_2` will be a copy of the input data frame `df` with an added column `col_2` holding the sum of occurrences of the specified patterns in the corresponding value of the column `col_1`.
- See Regular Expression above for a more detailed description.
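One detail worth checking on real data is that regular expression matches do not overlap, so the or'ed pattern can return fewer matches than counting each pattern separately and adding the results. The sketch below illustrates this on a made-up vector `x`; building the pattern with `paste()` is just one possible way to combine a vector of patterns.

```r
library(stringr)

# Illustrative input (not from the original example)
x <- c("XYYX", "XYX", "ZZ")

# Build 'XY|YX' from a vector of patterns
pattern <- paste(c("XY", "YX"), collapse = "|")

# Or'ed pattern: in "XYX" the match "XY" consumes the shared "Y"
n_combined <- str_count(x, pattern)                       # 2 1 0

# Counting each pattern separately and summing allows the overlap
n_separate <- str_count(x, "XY") + str_count(x, "YX")     # 2 2 0
```

If overlapping occurrences should each be counted, summing separate counts is the safer choice.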

**Count Each**

We wish to return the number of occurrences of each pattern in a set of patterns as a vector of integer values (one value for each pattern).

In this example, for each value of the column `col_1`, we wish to compute the difference between the number of occurrences of the string `'X'` and the string `'Y'`.

```
df_2 = df %>%
rowwise() %>%
mutate(col_2 = diff(str_count(col_1, c('X', 'Y'))))
```

Here is how this works:

- We pass to `str_count()` a vector of patterns `c('X', 'Y')`.
- For each value of `col_1`, `str_count()` will return a vector of two values holding the number of occurrences of the two patterns `'X'` and `'Y'`.
- We use the function `diff()` to subtract the two values returned by `str_count()`. `diff()` returns the second value minus the first, i.e. the count of `'Y'` minus the count of `'X'`.
- In this solution, we use `rowwise()` to execute `str_count()` for each value of the column `col_1` individually. See Non-Vectorized Transformation.
- The output data frame `df_2` will be a copy of the input data frame `df` with an added column `col_2` where each cell holds the difference between the number of occurrences of the two patterns in the corresponding value of the column `col_1`.
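The row-wise solution can be tried on a small made-up data frame, as sketched below. For this particular two-pattern case, a vectorized alternative that avoids `rowwise()` is to call `str_count()` once per pattern and subtract; this is offered as an assumption about what may be convenient, not as the original author's method.

```r
library(dplyr)
library(stringr)

# Illustrative data (not from the original example)
df <- tibble(col_1 = c("XXY", "XYYY", "ZZ"))

# Row-wise: one vector of two counts per row, then second minus first
df_rowwise <- df %>%
  rowwise() %>%
  mutate(col_2 = diff(str_count(col_1, c("X", "Y")))) %>%
  ungroup()
# df_rowwise$col_2 is -1 2 0

# Vectorized alternative: count each pattern over the whole column
df_vectorized <- df %>%
  mutate(col_2 = str_count(col_1, "Y") - str_count(col_1, "X"))
```

Both versions give the count of `'Y'` minus the count of `'X'` for each row.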
