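We wish to check whether a string matches a given pattern and to return a logical value TRUE if that is the case and FALSE otherwise.
In this section we will cover the following four common string pattern detection scenarios:
- Equals: check whether a string exactly matches a given pattern.
- Contains: check whether a pattern occurs anywhere inside a string.
- Starts with: check whether a pattern occurs at the beginning of a string.
- Ends with: check whether a pattern occurs at the end of a string.
For each of these four scenarios, we will cover two cases:
- The pattern is a plain string (or substring).
- The pattern is a regular expression.
In addition, we cover the following scenarios which can be applied to extend any of the above:
- Ignoring case while matching.
- Taking the complement (inverse) of a match.
- Matching each string against its own pattern, e.g. a pattern held in another column.
- Matching multiple patterns combined via OR or AND.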
String
We wish to check if a string exactly matches another string.
In this example, we wish to filter the rows of the data frame df to retain only rows where the value of the column col_1 exactly equals 'XXX'.
df_2 = df %>%
filter(col_1 == 'XXX')
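To try the snippet above on concrete data, here is a minimal sketch. The sample values in df are assumed purely for illustration and are not part of the original example.

```r
library(dplyr)

# Hypothetical sample data; the values are assumed for illustration.
df = tibble(col_1 = c('XXX', 'xxx', 'XXXX', 'XXX '))

# == compares each value of col_1 with 'XXX' and is case sensitive,
# so only the first row is an exact match.
df_2 = df %>%
  filter(col_1 == 'XXX')
```

Values that differ only in case, length or trailing whitespace are not treated as equal.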
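Here is how this works:
- We use the equality operator == to check whether two strings are equal.
- The comparison is applied to each value of the column col_1 (== is vectorized), checking whether that value equals 'XXX'.
- The output data frame df_2 will have only the rows of the input data frame df where the value of the column col_1 is 'XXX'.
Alternative: Via Function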
df_2 = df %>%
filter(str_equal(col_1, 'XXX'))
Here is how this works:
- An alternative to using == to compare strings is to use the str_equal() function provided by the stringr package.
Regular Expression
We wish to check whether a given string matches a given regular expression and to return TRUE if that is the case and FALSE otherwise.
In this example, we wish to filter the rows of the data frame df to retain only rows where the value of the column col_1 is of the form: 'x' followed by digits.
df_2 = df %>%
filter(str_detect(col_1, '^x\\d+$'))
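As a quick check of what a full match means here, the following sketch runs the same filter on a handful of assumed sample values (the data is hypothetical and chosen only to exercise each part of the pattern).

```r
library(dplyr)
library(stringr)

# Hypothetical sample data; the values are assumed for illustration.
df = tibble(col_1 = c('x1', 'x123', 'x12a', 'ax1', 'X1', 'x'))

# Keep only values that consist of 'x' followed by one or more digits
# and nothing else.
df_2 = df %>%
  filter(str_detect(col_1, '^x\\d+$'))
```

Only 'x1' and 'x123' are retained: 'x12a' has a trailing letter, 'ax1' has a leading letter, 'X1' differs in case, and 'x' has no digits.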
Here is how this works:
- We use the str_detect() function from the stringr package (part of the tidyverse) to check, for each value of the column col_1, whether a regular expression is a match.
- The str_detect() function takes the following arguments:
  - The string(s) to look into, which here is the column col_1.
  - The pattern to look for, which here is the regular expression '^x\\d+$'.
- The regular expression we use is '^x\\d+$', where:
  - ^ matches the start of a string.
  - x matches the letter x.
  - \\d+ matches one or more digits.
  - $ matches the end of a string.
- By including the start ^ and end $ of the string in the regular expression, we specify that it must be a full match.
- The output data frame df_2 will have only the rows of the input data frame df where the value of the column col_1 matches the regular expression '^x\\d+$'.
Substring
We wish to check whether a given substring occurs anywhere inside a given string and to return TRUE if there is a match and FALSE otherwise.
In this example, we wish to filter the rows of the data frame df to retain only rows where the value of the column col_1 contains the substring 'XX'.
df_2 = df %>%
filter(str_detect(col_1, fixed('XX')))
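The role of fixed() is easiest to see with a substring that contains a regular expression metacharacter. The sketch below uses the assumed substring 'a.b' (not part of the original example) to contrast the two interpretations.

```r
library(stringr)

# Hypothetical values; chosen so that the '.' metacharacter matters.
values = c('a.b', 'axb', 'ab')

# Interpreted as a regular expression, '.' matches any single character.
as_regex = str_detect(values, 'a.b')

# Wrapped in fixed(), '.' matches only a literal dot.
as_fixed = str_detect(values, fixed('a.b'))
```

Here as_regex is TRUE for both 'a.b' and 'axb', while as_fixed is TRUE only for 'a.b'. For a substring such as 'XX', which has no metacharacters, both forms give the same result.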
Here is how this works:
- We use the str_detect() function (from the stringr package) to check, for each value of the column col_1, whether that value contains the substring 'XX'.
- The str_detect() function takes the following arguments:
  - The string(s) to look into, which here is the column col_1.
  - The substring to look for, which here is 'XX'. We wrap the substring in the helper fixed() because by default str_detect() assumes that the pattern passed is a regular expression. fixed() specifies that the pattern is a fixed string.
- The output data frame df_2 will have only the rows of the input data frame df where the value of the column col_1 contains the substring 'XX'.
Regular Expression
We wish to check whether a given regular expression has a match inside a given string and to return TRUE if there is such a match and FALSE otherwise.
In this example, we wish to filter the rows of the data frame df to retain only rows where the value of the column col_1 contains one or more digits, captured by the regular expression '\\d+'.
df_2 = df %>%
filter(str_detect(col_1, '\\d+'))
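Here is how this works:
- We pass the regular expression '\\d+' directly to str_detect().
- No wrapper is needed because, by default, str_detect() expects the pattern to be a regular expression.
- The output data frame df_2 will have only the rows of the input data frame df where the value of the column col_1 contains any digits.
Substring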
We wish to check whether a given substring occurs at the beginning of a given string and to return TRUE if that is the case and FALSE otherwise.
In this example, we wish to filter the rows of the data frame df to retain only rows where the value of the column col_1 starts with the substring 'XX'.
df_2 = df %>%
filter(str_starts(col_1, fixed('XX')))
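To illustrate the snippet above, here is a small sketch on assumed sample values (the data is hypothetical).

```r
library(dplyr)
library(stringr)

# Hypothetical sample data; the values are assumed for illustration.
df = tibble(col_1 = c('XXabc', 'abcXX', 'xxabc', 'XX'))

# Keep only values that begin with the literal substring 'XX'.
df_2 = df %>%
  filter(str_starts(col_1, fixed('XX')))
```

'XXabc' and 'XX' are retained. 'abcXX' contains the substring but not at the start, and 'xxabc' differs in case.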
Here is how this works:
- This works similarly to the Contains scenario above, except that we use str_starts() instead of str_detect().
- str_starts() from the stringr package returns TRUE only if the given substring, in this case 'XX', occurs at the beginning of the string.
- The output data frame df_2 will have only the rows of the input data frame df where the value of the column col_1 starts with 'XX'.
Regular Expression
We wish to check whether a given regular expression occurs at the beginning of a given string and to return TRUE if that is the case and FALSE otherwise.
In this example, we wish to filter the rows of the data frame df to retain only rows where the value of the column col_1 starts with one or more digits, captured by the regular expression '\\d+'.
df_2 = df %>%
filter(str_starts(col_1, '\\d+'))
Here is how this works:
- This works similarly to the Contains scenario above, except that we use str_starts() instead of str_detect().
- str_starts() from the stringr package returns TRUE only if the given regular expression, in this case '\\d+', occurs at the beginning of the string.
- The output data frame df_2 will have only the rows of the input data frame df where the value of the column col_1 starts with one or more digits.
Alternative: Starts-With in Regex
df_2 = df %>%
filter(str_detect(col_1, '^\\d+'))
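The sketch below compares this anchored form with str_starts() on a few assumed values (hypothetical data) to show that the two approaches agree.

```r
library(stringr)

# Hypothetical values; assumed for illustration.
values = c('12ab', 'ab12', '7', 'a1b')

# Starts-with via the dedicated function.
via_starts = str_starts(values, '\\d+')

# Starts-with via the ^ anchor inside the regular expression.
via_anchor = str_detect(values, '^\\d+')
```

Both vectors are TRUE for '12ab' and '7' and FALSE for 'ab12' and 'a1b'.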
Here is how this works:
- We use the regular expression anchor ^ to specify that the regular expression must occur at the start of the string.
- When working with regular expressions, we can always use the str_detect() function and enforce any start or end constraints via the regular expression anchor symbols for start ^ and end $ as needed.
Substring
We wish to check whether a given substring occurs at the end of a given string and to return TRUE if that is the case and FALSE otherwise.
In this example, we wish to filter the rows of the data frame df to retain only rows where the value of the column col_1 ends with the substring 'XX'.
df_2 = df %>%
  filter(str_ends(col_1, fixed('XX')))
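A minimal sketch of the ends-with check on assumed sample values (hypothetical data), wrapping the substring in fixed() so that it is treated literally:

```r
library(dplyr)
library(stringr)

# Hypothetical sample data; the values are assumed for illustration.
df = tibble(col_1 = c('abcXX', 'XXabc', 'abcxx', 'XX'))

# Keep only values that end with the literal substring 'XX'.
df_2 = df %>%
  filter(str_ends(col_1, fixed('XX')))
```

'abcXX' and 'XX' are retained. 'XXabc' contains the substring but not at the end, and 'abcxx' differs in case.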
Here is how this works:
- This works similarly to the Contains scenario above, except that we use str_ends() instead of str_detect().
- str_ends() from the stringr package returns TRUE only if the given substring, in this case 'XX', occurs at the end of the string.
- The output data frame df_2 will have only the rows of the input data frame df where the value of the column col_1 ends with 'XX'.
Regular Expression
We wish to check whether a given regular expression occurs at the end of a given string and to return TRUE if that is the case and FALSE otherwise.
In this example, we wish to filter the rows of the data frame df to retain only rows where the value of the column col_1 ends with one or more digits.
df_2 = df %>%
  filter(str_ends(col_1, '\\d+'))
Here is how this works:
- This works similarly to the Contains scenario above, except that we use str_ends() instead of str_detect().
- str_ends() from the stringr package returns TRUE only if the given regular expression, in this case '\\d+', occurs at the end of the string.
- The output data frame df_2 will have only the rows of the input data frame df where the value of the column col_1 ends with one or more digits.
Alternative: Ends-With in Regex
df_2 = df %>%
  filter(str_detect(col_1, '\\d+$'))
Here is how this works:
- We use the regular expression anchor $ to specify that the regular expression must occur at the end of the string.
- When working with regular expressions, we can always use the str_detect() function and enforce any start or end constraints via the regular expression anchor symbols for start ^ and end $ as needed.
Substring
We wish to check whether a given substring occurs anywhere inside a given string while ignoring case and to return TRUE if there is a match and FALSE otherwise.
In this example, we wish to filter the rows of the data frame df to retain only rows where the value of the column col_1 contains the substring 'XX' regardless of case.
df_2 = df %>%
filter(str_detect(col_1,
fixed('XX', ignore_case=TRUE)))
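The following sketch applies the case-insensitive check to a few assumed values (hypothetical data) and compares it with lowering the case before matching.

```r
library(stringr)

# Hypothetical values; assumed for illustration.
values = c('aXXb', 'axxb', 'aXxb', 'ab')

# Case-insensitive fixed-string match.
via_ignore_case = str_detect(values, fixed('XX', ignore_case = TRUE))

# Same check by lowering the case of the strings and of the substring.
via_lower = str_detect(str_to_lower(values), fixed('xx'))
```

Both vectors are TRUE for the first three values and FALSE for 'ab'.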
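Here is how this works:
- We use str_detect() to check whether each value of col_1 contains the substring. See Contains above.
- We wrap the substring in fixed() because str_detect() expects a regular expression by default.
- To ignore case, we set the argument ignore_case of fixed() to ignore_case=TRUE.
- Just as we do with str_detect(), we can pass an ignore_case argument to the fixed() helper passed to str_starts() and str_ends().
Alternative: Lower Case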
df_2 = df %>%
filter(str_detect(str_to_lower(col_1),
fixed('xx')))
Here is how this works:
- Using str_to_lower(col_1), we lower the case of all the values of the column col_1, which are the strings we are looking into.
- We then look for the lower-case version of the substring, which here is 'xx'.
Regular Expression
We wish to check whether a given regular expression has a match inside a given string and to return TRUE if there is such a match and FALSE otherwise.
In this example, we wish to filter the rows of the data frame df to retain only rows where the value of the column col_1 is of the form: 'x' followed by digits while ignoring the case.
df_2 = df %>%
  filter(str_detect(col_1,
                    regex('^x\\d+$', ignore_case=TRUE)))
Here is how this works:
- We use str_detect() to check whether each value of col_1 matches the regular expression. See Contains above.
- str_detect() expects a regular expression by default. To customize how the regular expression is matched, we wrap it in regex() and pass the desired arguments to regex().
- To ignore case, we set the argument ignore_case of regex() to ignore_case=TRUE.
- Just as we do with str_detect(), we can pass an ignore_case argument to the regex() helper passed to str_starts() and str_ends().
We wish to return the inverse of the outcome of the string pattern matching operations described above; i.e. if the outcome is TRUE, return FALSE and vice versa.
In this example, we wish to filter the rows of the data frame df to retain only rows where the value of the column col_1 does not contain any digits.
df_2 = df %>%
filter(!str_detect(col_1, '\\d+'))
Here is how this works:
- We use str_detect() to check whether each value of col_1 matches the pattern. See Contains above.
- To return TRUE when there is no match and FALSE when there is a match, we can use the complement operator !.
- The regular expression '\\d+' checks for the occurrence of a substring comprised of one or more digits.
- The output data frame df_2 will have only the rows of the input data frame df where the value of the column col_1 does not contain any digits.
Alternative: Using Function Argument
df_2 = df %>%
filter(str_detect(col_1, '\\d+', negate=TRUE))
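The sketch below checks, on a few assumed values (hypothetical data), that the negate argument and the complement operator ! give the same result.

```r
library(stringr)

# Hypothetical values; assumed for illustration.
values = c('abc', 'a1c', '123', 'x y')

# Complement via the ! operator.
via_operator = !str_detect(values, '\\d+')

# Complement via the negate argument.
via_negate = str_detect(values, '\\d+', negate = TRUE)
```

Both vectors are TRUE for 'abc' and 'x y', which contain no digits, and FALSE for 'a1c' and '123'.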
Here is how this works:
- Instead of the complement operator !, we can use the negate argument of str_detect() (as well as its siblings str_starts() and str_ends()).
We wish to match a vector of strings against a vector of patterns of the same size. This is often needed when we wish to check the presence of the value of a column in another column for each row.
In this example, we wish to filter the rows of the data frame df to retain only rows where the value of the column col_1 contains the value of the column col_2.
df_2 = df %>%
filter(str_detect(col_1, col_2))
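To make the row-wise pairing concrete, here is a minimal sketch in which both columns hold assumed sample values (hypothetical data).

```r
library(dplyr)
library(stringr)

# Hypothetical sample data; the values are assumed for illustration.
df = tibble(
  col_1 = c('apple pie', 'banana split', 'cherry tart'),
  col_2 = c('apple', 'berry', 'tart')
)

# Each value of col_1 is checked against the pattern in the same row
# of col_2.
df_2 = df %>%
  filter(str_detect(col_1, col_2))
```

The first and third rows are retained. Note that the values of col_2 are interpreted as regular expressions by default; if they may contain metacharacters, wrapping the column as fixed(col_2), as in the Contains scenario, should treat them literally.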
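Here is how this works:
- str_detect() is vectorized over both the string and the pattern and can operate in one of three modes:
  - A vector of strings and a single pattern: each string is checked against that pattern.
  - A single string and a vector of patterns: the string is checked against each pattern.
  - A vector of strings and a vector of patterns of the same size: each string is checked against the pattern at the same position.
- In this example, we use the third mode: we have a vector of strings, the column col_1, and a vector of patterns of the same size, the column col_2.
- We pass the following to str_detect():
  - The column col_1 as the first argument, which holds the strings to look into.
  - The column col_2 as the second argument, which holds the patterns and is naturally of the same size as col_1.
- str_detect() will return TRUE for rows where the value of col_2 is contained in col_1. See Contains above.
- The output data frame df_2 will have only the rows of the input data frame df where the value of the column col_1 contains the value of the column col_2.
OR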
We wish to check whether any of a set of patterns occurs in a given string and to return TRUE if that is the case and FALSE otherwise.
In this example, we wish to filter the rows of the data frame df to retain only rows where the value of the column col_1 contains the string 'XX' or the string 'YY'.
df_2 = df %>%
filter(str_detect(col_1, fixed('XX'))
| str_detect(col_1, fixed('YY')))
Here is how this works:
- We call str_detect() twice:
  - Once to check whether each value of col_1 contains 'XX', and
  - Once to check whether each value of col_1 contains 'YY'.
- Each call returns a logical vector (TRUE or FALSE) that has as many elements as the size of col_1.
- We use the or operator | to combine these two logical vectors element wise into one logical vector that is TRUE if either call to str_detect() returns TRUE for that value of col_1. This final logical vector is passed to filter().
- The output data frame df_2 will have only the rows of the input data frame df where the value of the column col_1 contains the string 'XX' or 'YY'.
Alternative: Via Regular Expression
df_2 = df %>%
filter(str_detect(col_1, 'XX|YY'))
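When the patterns live in a vector, the alternation can be assembled programmatically. The sketch below is one way to do so, assuming the patterns contain no regular expression metacharacters; the values and the patterns vector are hypothetical.

```r
library(stringr)

# Hypothetical values and patterns; assumed for illustration.
values = c('aXXb', 'aYYb', 'ab', 'XXYY')
patterns = c('XX', 'YY')

# Join the patterns with the regex or operator: 'XX|YY'.
combined = str_c(patterns, collapse = '|')

# One call with the combined regular expression.
via_regex = str_detect(values, combined)

# The same check via two calls combined with the logical or operator.
via_or = str_detect(values, fixed('XX')) | str_detect(values, fixed('YY'))
```

Both vectors are TRUE for every value except 'ab'.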
Here is how this works:
- We use the or operator | of regular expressions to build a regular expression that captures all the patterns we wish to look for or'ed together. In this case that regular expression is 'XX|YY'.
- We pass that regular expression to str_detect() which then returns TRUE if the string being checked (a value of the column col_1) contains either the substring 'XX' or the substring 'YY'.
Alternative: Via Non-Vectorized Operation
df_2 = df %>%
filter(
map_lgl(col_1, ~any(str_detect(., c('XX', 'YY')))))
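Here is how this works:
- str_detect() is vectorized over both the string and the pattern and can operate in one of three modes: a vector of strings with a single pattern, a single string with a vector of patterns, or a vector of strings with a vector of patterns of the same size. Here we use the second mode.
- We use map_lgl() to iterate over each value of the column col_1 and pass that to str_detect() along with a vector of multiple patterns. See Non-Vectorized Transformation.
- For each value of col_1, str_detect() returns a logical vector with one element for each of the patterns in c('XX', 'YY').
- We use any() to combine those logical values into a single value that is the or'ing of all values. This is the value returned by map_lgl() to filter(). See Logical Operations.
AND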
We wish to check whether each of a set of patterns occurs in a given string and to return TRUE if that is the case and FALSE otherwise.
In this example, we wish to filter the rows of the data frame df to retain only rows where the value of the column col_1 starts with the string 'x' and contains a sequence of one or more digits.
df_2 = df %>%
filter(str_detect(col_1, '^x'),
str_detect(col_1, '\\d+'))
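The following sketch runs the snippet above on assumed sample values (hypothetical data) and compares the comma-separated form with an explicit & between the two conditions.

```r
library(dplyr)
library(stringr)

# Hypothetical sample data; the values are assumed for illustration.
df = tibble(col_1 = c('x12', 'x', '12x', 'xa7', 'ab'))

# Conditions separated by commas must all hold for a row to be kept.
df_comma = df %>%
  filter(str_detect(col_1, '^x'),
         str_detect(col_1, '\\d+'))

# The same filter written with the logical and operator.
df_and = df %>%
  filter(str_detect(col_1, '^x') & str_detect(col_1, '\\d+'))
```

Both forms retain 'x12' and 'xa7', the only values that start with 'x' and also contain a digit.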
Here is how this works:
- We call str_detect() twice:
  - Once to check whether each value of col_1 matches the regular expression '^x', which checks if the string starts with the letter 'x'.
  - Once to check whether each value of col_1 matches the regular expression '\\d+', which checks if the string contains a sequence of one or more digits.
- Each call returns a logical vector (TRUE or FALSE) that has as many elements as the size of col_1.
- We pass both logical vectors to filter() separated by a comma. filter() keeps a row only when all of its conditions hold, so a row is retained if and only if both calls to str_detect() return TRUE for that value of col_1.
- The output data frame df_2 will have only the rows of the input data frame df where the value of the column col_1 starts with 'x' and contains at least one digit.
Alternative: Via Non-Vectorized Operation
df_2 = df %>%
filter(
map_lgl(col_1, ~all(str_detect(., c('^x', '\\d+')))))
Here is how this works:
- We use map_lgl() to iterate over each value of the column col_1 and pass that to str_detect() along with a vector of multiple patterns. See Non-Vectorized Transformation.
- We use all() to return TRUE if and only if both patterns match. See Logical Operations.
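As a small illustration of the non-vectorized approach, the sketch below applies map_lgl() to a plain character vector and contrasts all() with any(). The values and the patterns vector are assumed for illustration.

```r
library(purrr)
library(stringr)

# Hypothetical values and patterns; assumed for illustration.
values = c('x12', 'x', '12x')
patterns = c('^x', '\\d+')

# For each value, str_detect() returns one logical per pattern;
# all() requires every pattern to match.
match_all = map_lgl(values, ~all(str_detect(.x, patterns)))

# any() requires at least one pattern to match.
match_any = map_lgl(values, ~any(str_detect(.x, patterns)))
```

Only 'x12' satisfies both patterns, so match_all is TRUE for it alone, whereas match_any is TRUE for all three values because each one either starts with 'x' or contains a digit.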