Detecting

We wish to check whether a string matches a given pattern and to return a logical value TRUE if that is the case and FALSE otherwise.

In this section we will cover the following four common string pattern detection scenarios:

  • Full Match where we cover how to check if a given string exactly matches a given pattern.
  • Contains where we cover how to check if a given pattern is contained in a given string.
  • Starts With where we cover how to check if a given string starts with a given pattern.
  • Ends With where we cover how to check if a given string ends with a given pattern.

For each of these four scenarios, we will cover two cases:

  • Substring where the pattern is a plain character sequence
  • Regular Expressions where the pattern is a regular expression

In addition, we cover the following scenarios which can be applied to extend any of the above:

  • Ignore Case where we cover how to ignore the case (of both pattern and string) while matching.
  • Complement where we cover how to return the inverse of the outcome of the string pattern matching scenarios described above.
  • Pattern Column where we cover how to match a vector of strings against a vector of patterns of the same size. This is often needed when we wish to check the presence of the value of a column in another column for each row.
  • Multiple Patterns where we cover how to extend any of the above scenarios to check against multiple patterns at a time.

Full Match

String

We wish to check if a string exactly matches another string.

In this example, we wish to filter the rows of the data frame df to retain only rows where the value of the column col_1 exactly equals ‘XXX’.

df_2 = df %>%
  filter(col_1 == 'XXX')

Here is how this works:

  • We use the basic equality comparison operator == to check whether two strings are equal.
  • The comparison is between:
    • Each of the values of the column col_1 (== is vectorized)
    • and the character sequence 'XXX'.
  • The output data frame df_2 will have only the rows of the input data frame df where the value of the column col_1 is 'XXX'.
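As a quick check of the behaviour described above, here is a minimal sketch that applies the same filter to a small data frame. The values in col_1 are made up for illustration; they are not part of the original example.

```r
library(dplyr)

# Hypothetical data for illustration only.
df = tibble(col_1 = c('XXX', 'xxx', 'XXXX', NA))

# == is case sensitive and compares the whole string.
# Rows where the comparison yields NA are dropped by filter().
df_2 = df %>%
  filter(col_1 == 'XXX')

df_2$col_1
```

Only the first row survives: 'xxx' differs in case, 'XXXX' is longer, and the missing value compares as NA.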

Alternative: Via Function

df_2 = df %>% 
    filter(str_equal(col_1, 'XXX'))

Here is how this works:

  • An alternative to using the basic equality operator == to compare strings is to use the str_equal() function provided by the stringr package (added in stringr 1.5.0).
  • The output of this code is the same as the primary solution above.

Regular Expression

We wish to check whether a given string matches a given regular expression and to return TRUE if that is the case and FALSE otherwise.

In this example, we wish to filter the rows of the data frame df to retain only rows where the value of the column col_1 is of the form: ‘x’ followed by digits.

df_2 = df %>% 
  filter(str_detect(col_1, '^x\\d+$'))

Here is how this works:

  • We use the str_detect() function from the stringr package (part of the tidyverse) to check, for each value of the column col_1, whether a regular expression is a match.
  • The str_detect() function takes the following arguments:
    • The column whose values we wish to check against; which in this case is col_1.
    • A regular expression; which in this case is '^x\\d+$'.
  • The regular expression in this case is '^x\\d+$', where
    • ^ matches the start of a string.
    • x matches the letter x.
    • \\d+ matches one or more digits.
    • $ matches the end of a string.
  • By including the start ^ and end $ of the string in the regular expression we specify that it must be a full match.
  • The output data frame df_2 will have only the rows of the input data frame df where the value of the column col_1 matches the regular expression '^x\\d+$'.
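To see what the anchors contribute, the sketch below runs the anchored and the unanchored version of the pattern on a few hypothetical strings (the vector values is an assumption made for illustration).

```r
library(stringr)

# Hypothetical values for illustration only.
values = c('x12', 'x1a', 'ax12', 'x')

# Anchored at both ends: the whole string must be 'x' followed by digits.
full_match = str_detect(values, '^x\\d+$')

# Without anchors, a match anywhere inside the string is enough.
partial_match = str_detect(values, 'x\\d+')

full_match     # TRUE FALSE FALSE FALSE
partial_match  # TRUE  TRUE  TRUE FALSE
```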

Contains

Substring

We wish to check whether a given substring occurs anywhere inside a given string and to return TRUE if there is a match and FALSE otherwise.

In this example, we wish to filter the rows of the data frame df to retain only rows where the value of the column col_1 contains the substring ‘XX’.

df_2 = df %>%  
  filter(str_detect(col_1, fixed('XX')))

Here is how this works:

  • We use the str_detect() function (from the stringr package) to check, for each value of the column col_1, whether it contains the substring 'XX'.
  • The str_detect() function takes the following arguments:
    • The column whose values we wish to check against; which in this case is col_1.
    • The substring whose presence we wish to check; in this case ‘XX’. We wrap the substring in the helper fixed() because by default str_detect() assumes that the pattern passed is a regular expression. fixed() specifies that the pattern is a fixed string.
  • The output data frame df_2 will have only the rows of the input data frame df where the value of the column col_1 contains the substring 'XX'.
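The difference fixed() makes only shows up when the substring contains characters that have a special meaning in regular expressions. The following sketch uses a hypothetical substring 'a.c' (not part of the example above) to illustrate this.

```r
library(stringr)

# Hypothetical values for illustration only.
values = c('a.c', 'abc', 'xyz')

# As a regular expression, '.' matches any character, so 'abc' matches too.
as_regex = str_detect(values, 'a.c')

# As a fixed string, '.' only matches a literal dot.
as_fixed = str_detect(values, fixed('a.c'))

as_regex  # TRUE  TRUE FALSE
as_fixed  # TRUE FALSE FALSE
```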

Regular Expression

We wish to check whether a given regular expression has a match inside a given string and to return TRUE if there is such a match and FALSE otherwise.

In this example, we wish to filter the rows of the data frame df to retain only rows where the value of the column col_1 contains one or more digits, which we express with the regular expression '\\d+'.

df_2 = df %>% 
  filter(str_detect(col_1, '\\d+'))

Here is how this works:

  • This works similarly to the Substring case above except that we pass a regular expression to str_detect().
  • By default, str_detect() expects the pattern to be a regular expression.
  • The output data frame df_2 will have only the rows of the input data frame df where the value of the column col_1 contains any digits.

Starts With

Substring

We wish to check whether a given substring occurs at the beginning of a given string and to return TRUE if that is the case and FALSE otherwise.

In this example, we wish to filter the rows of the data frame df to retain only rows where the value of the column col_1 starts with the substring ‘XX’.

df_2 = df %>% 
  filter(str_starts(col_1, fixed('XX')))

Here is how this works:

  • This works similarly to the code under Contains above except that we use str_starts() instead of str_detect().
  • The function str_starts() from the stringr package returns TRUE only if the given substring, in this case ‘XX’, occurs at the beginning of the string.
  • The output data frame df_2 will have only the rows of the input data frame df where the value of the column col_1 starts with ‘XX’.

Regular Expression

We wish to check whether a given regular expression occurs at the beginning of a given string and to return TRUE if that is the case and FALSE otherwise.

In this example, we wish to filter the rows of the data frame df to retain only rows where the value of the column col_1 starts with one or more digits, which we express with the regular expression '\\d+'.

df_2 = df %>% 
  filter(str_starts(col_1, '\\d+'))

Here is how this works:

  • This works similarly to the code under Contains above except that we use str_starts() instead of str_detect().
  • The function str_starts() from the stringr package returns TRUE only if the given regular expression, in this case ‘\\d+’, occurs at the beginning of the string.
  • The output data frame df_2 will have only the rows of the input data frame df where the value of the column col_1 starts with one or more digits.

Alternative: Starts-With in Regex

df_2 = df %>% 
  filter(str_detect(col_1, '^\\d+'))

Here is how this works:

  • This performs the same operation as the code above. However, we use the regular expression start anchor ^ to specify that the regular expression must occur at the start of the string.
  • Our recommendation is to use just the generic str_detect() function and enforce any start or end constraints via the regular expression anchor symbols for start ^ and end $ as needed.
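The equivalence between the two approaches can be checked on a few hypothetical strings (the vector values is made up for illustration):

```r
library(stringr)

# Hypothetical values for illustration only.
values = c('12ab', 'ab12', '7', 'a')

# Dedicated function: the pattern must match at the start of the string.
via_function = str_starts(values, '\\d+')

# Generic function with the start anchor ^ in the pattern.
via_anchor = str_detect(values, '^\\d+')

via_function  # TRUE FALSE TRUE FALSE
via_anchor    # TRUE FALSE TRUE FALSE
```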

Ends With

Substring

We wish to check whether a given substring occurs at the end of a given string and to return TRUE if that is the case and FALSE otherwise.

In this example, we wish to filter the rows of the data frame df to retain only rows where the value of the column col_1 ends with the substring ‘XX’.

df_2 = df %>% 
  filter(str_ends(col_1, fixed('XX')))

Here is how this works:

  • This works similarly to the code under Contains above except that we use str_ends() instead of str_detect().
  • The function str_ends() from the stringr package returns TRUE only if the given substring, in this case ‘XX’, occurs at the end of the string.
  • The output data frame df_2 will have only the rows of the input data frame df where the value of the column col_1 ends with ‘XX’.

Regular Expression

We wish to check whether a given regular expression occurs at the end of a given string and to return TRUE if that is the case and FALSE otherwise.

In this example, we wish to filter the rows of the data frame df to retain only rows where the value of the column col_1 ends with one or more digits.

df_2 = df %>% 
  filter(str_ends(col_1, '\\d+'))

Here is how this works:

  • This works similarly to the code under Contains above except that we use str_ends() instead of str_detect().
  • The function str_ends() from the stringr package returns TRUE only if the given regular expression, in this case ‘\\d+’, occurs at the end of the string.
  • The output data frame df_2 will have only the rows of the input data frame df where the value of the column col_1 ends with one or more digits.

Alternative: Ends-With in Regex

df_2 = df %>% 
  filter(str_detect(col_1, '\\d+$'))

Here is how this works:

  • This performs the same operation as the code above. However, we use the regular expression end anchor $ to specify that the regular expression must occur at the end of the string.
  • Our recommendation is to use just the generic str_detect() function and enforce any start or end constraints via the regular expression anchor symbols for start ^ and end $ as needed.
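As with Starts With, a small sketch on hypothetical strings (values is an assumption for illustration) shows that the dedicated function and the anchored pattern agree, and that a plain substring can be checked at the end of a string via fixed():

```r
library(stringr)

# Hypothetical values for illustration only.
values = c('ab12', '12ab', '7', 'a')

# Dedicated function versus the end anchor $ in the pattern.
via_function = str_ends(values, '\\d+')
via_anchor = str_detect(values, '\\d+$')

# Plain substring at the end of the string.
ends_with_ab = str_ends(values, fixed('ab'))

via_function  # TRUE FALSE  TRUE FALSE
via_anchor    # TRUE FALSE  TRUE FALSE
ends_with_ab  # FALSE TRUE FALSE FALSE
```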

Ignore Case

Substring

We wish to check whether a given substring occurs anywhere inside a given string while ignoring case and to return TRUE if there is a match and FALSE otherwise.

In this example, we wish to filter the rows of the data frame df to retain only rows where the value of the column col_1 contains the substring ‘XX’ regardless of case.

df_2 = df %>% 
  filter(str_detect(col_1, 
                    fixed('XX', ignore_case=TRUE)))

Here is how this works:

  • To check whether a string contains a given pattern, we use str_detect(). See Contains above.
  • When we wish to pass a plain string as a pattern for detection, we wrap it in the helper fixed() because str_detect() expects a regular expression by default.
  • In order to ignore case, we set the ignore_case argument of fixed() to TRUE.
  • The same works for str_starts() and str_ends(): we pass them a pattern wrapped in fixed() with ignore_case=TRUE.
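The effect of ignore_case can be seen by comparing the case-sensitive and case-insensitive calls side by side. The strings below are hypothetical and only serve as an illustration.

```r
library(stringr)

# Hypothetical values for illustration only.
values = c('aXXb', 'axxb', 'aXxb', 'ab')

# Default: the case must match exactly.
case_sensitive = str_detect(values, fixed('XX'))

# With ignore_case = TRUE, 'xx' and 'Xx' match as well.
case_insensitive = str_detect(values, fixed('XX', ignore_case = TRUE))

# The same helper can be passed to str_starts() (and str_ends()).
starts_insensitive = str_starts(values, fixed('AX', ignore_case = TRUE))

case_sensitive      # TRUE FALSE FALSE FALSE
case_insensitive    # TRUE  TRUE  TRUE FALSE
starts_insensitive  # TRUE  TRUE  TRUE FALSE
```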

Alternative: Lower Case

df_2 = df %>% 
  filter(str_detect(str_to_lower(col_1), 
                    fixed('xx')))

Here is how this works:

  • One way to ignore case is to set the case of the string we are looking into and the pattern we are looking for to the same case, say lower case.
  • In str_to_lower(col_1), we lower the case of all the values of the column col_1 which are the strings we are looking into.
  • We then pass the pattern in lower case as well, which in this case is 'xx'.

Regular Expression

We wish to check whether a given regular expression has a match inside a given string and to return TRUE if there is such a match and FALSE otherwise.

In this example, we wish to filter the rows of the data frame df to retain only rows where the value of the column col_1 starts with ‘x’ followed by one or more digits, while ignoring case.

df_2 = df %>% 
  filter(str_detect(col_1, 
                    regex('^x\\d+', ignore_case=TRUE)))

Here is how this works:

  • To check whether a string contains a given pattern, we use str_detect(). See Contains above.
  • When we wish to pass a regular expression as a pattern for detection, we can pass it without any wrapper because str_detect() expects a regular expression by default.
  • However, when we wish to pass arguments specifying how some aspect of matching is to be carried out, we wrap the pattern in the helper regex() and we pass the desired arguments to regex().
  • In order to ignore case, we set the ignore_case argument of regex() to TRUE.
  • The same works for str_starts() and str_ends(): we pass them a pattern wrapped in regex() with ignore_case=TRUE.
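A minimal sketch on hypothetical strings shows the effect of regex(..., ignore_case = TRUE). The second call uses the inline flag (?i), which is not part of the example above; it is shown here only as an equivalent spelling supported by the regular expression engine that stringr uses.

```r
library(stringr)

# Hypothetical values for illustration only.
values = c('x12', 'X12', 'y12')

# Case-insensitive matching via the regex() helper.
via_helper = str_detect(values, regex('^x\\d+', ignore_case = TRUE))

# Equivalent spelling with the inline case-insensitivity flag.
via_inline_flag = str_detect(values, '(?i)^x\\d+')

via_helper       # TRUE TRUE FALSE
via_inline_flag  # TRUE TRUE FALSE
```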

Complement

We wish to return the inverse of the outcome of the string pattern matching operations described above; i.e. if the outcome is TRUE, return FALSE and vice versa.

In this example, we wish to filter the rows of the data frame df to retain only rows where the value of the column col_1 does not contain any integers.

df_2 = df %>% 
  filter(!str_detect(col_1, '\\d+'))

Here is how this works:

  • To check whether a string contains a given pattern, we use str_detect(). See Contains above.
  • To check for the complement, i.e. to return TRUE when there is no match and FALSE when there is a match, we can use the complement operator !.
  • The regular expression '\\d+' checks for the occurrence of a substring comprised of one or more digits.
  • The output data frame df_2 will have only the rows of the input data frame df where the value of the column col_1 does not contain any digits.

Alternative: Using Function Argument

df_2 = df %>% 
  filter(str_detect(col_1, '\\d+', negate=TRUE))

Here is how this works:

  • As an alternative to using the complement operator !, we can use the negate argument of str_detect() (as well as its siblings str_starts() and str_ends()).
  • The output is the same as the primary solution above.
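The two ways of taking the complement can be compared on a small hypothetical vector. A missing value is included to show that it stays missing either way; filter() drops such rows.

```r
library(stringr)

# Hypothetical values for illustration only.
values = c('a1', 'ab', NA)

# Complement via the ! operator.
via_operator = !str_detect(values, '\\d+')

# Complement via the negate argument.
via_argument = str_detect(values, '\\d+', negate = TRUE)

via_operator  # FALSE TRUE NA
via_argument  # FALSE TRUE NA
```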

Pattern Column

We wish to match a vector of strings against a vector of patterns of the same size. This is often needed when we wish to check the presence of the value of a column in another column for each row.

In this example, we wish to filter the rows of the data frame df to retain only rows where the value of the column col_1 contains the value of the column col_2.

df_2 = df %>% 
  filter(str_detect(col_1, col_2))

Here is how this works:

  • str_detect() is vectorized over both the string and the pattern and can operate in one of three modes:
    • Check for one pattern in each element in a vector of strings. This is the mode we used in all the above scenarios.
    • Check n patterns against n strings. The two vectors must have the same size in this mode. This is the mode we use in this solution since we have a vector of strings, the column col_1, and a vector of patterns of the same size, the column col_2.
    • Check multiple patterns against a single string. This is the mode we use in Multiple Patterns below.
  • We pass to str_detect():
    • the strings to look into which in this case is the column col_1 as the first argument
    • and the patterns to look for which in this case is the column col_2 and is naturally of the same size as col_1.
  • str_detect() will return TRUE for rows where the value of col_2 is contained in col_1. See Contains above.
  • The output data frame df_2 will have only the rows of the input data frame df where the value of the column col_1 contains the value of the column col_2.
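One point worth keeping in mind is that the values of col_2 are interpreted as regular expressions by default. The sketch below uses a hypothetical data frame and wraps col_2 in fixed(), on the assumption that the column holds plain text rather than patterns.

```r
library(dplyr)
library(stringr)

# Hypothetical data for illustration only.
df = tibble(
  col_1 = c('apple pie', 'banana split', 'cherry tart'),
  col_2 = c('apple', 'cake', 'tart'))

# Row by row: does col_1 contain the value of col_2?
# fixed() treats each value of col_2 as a literal substring.
df_2 = df %>%
  filter(str_detect(col_1, fixed(col_2)))

df_2$col_1
```

In this sketch the first and third rows are retained, since 'apple pie' contains 'apple' and 'cherry tart' contains 'tart'.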

Multiple Patterns

OR

We wish to check whether any of a set of patterns occurs in a given string and to return TRUE if that is the case and FALSE otherwise.

In this example, we wish to filter the rows of the data frame df to retain only rows where the value of the column col_1 contains the string ‘XX’ or the string ‘YY’.

df_2 = df %>% 
  filter(str_detect(col_1, fixed('XX')) 
         | str_detect(col_1, fixed('YY')))

Here is how this works:

  • We call the function str_detect() twice:
    • once to check whether a value of the column col_1 contains 'XX' and
    • another to check whether a value of the column col_1 contains 'YY'.
  • Each of these calls returns a vector of logical values (TRUE or FALSE) that has as many elements as the size of col_1.
  • We use the or operator | to combine these two logical vectors element wise into one logical vector that is TRUE if either call to str_detect() returns TRUE for that value of col_1. This final logical vector is passed to filter().
  • The output data frame df_2 will have only the rows of the input data frame df where the value of the column col_1 contains the string ‘XX’ or ‘YY’.

Alternative: Via Regular Expression

df_2 = df %>% 
  filter(str_detect(col_1, 'XX|YY'))

Here is how this works:

  • An alternative approach that scales well when we have multiple patterns is to use the or operator | of regular expressions to build a regular expression that captures all the patterns we wish to look for or’ed together. In this case that regular expression is 'XX|YY'.
  • We pass that regular expression to str_detect() which then returns TRUE if the string being checked (a value of the column col_1) contains either the substring ‘XX’ or the substring ‘YY’.
  • The output is the same as in the primary solution above.
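When the patterns are held in a vector, the or'ed regular expression can be built programmatically. This is a minimal sketch; the names patterns, pattern and values are hypothetical, and it assumes the patterns contain no characters with a special meaning in regular expressions.

```r
library(stringr)

# Hypothetical patterns and values for illustration only.
patterns = c('XX', 'YY')
values = c('aXXb', 'aYYb', 'aZZb')

# Join the patterns with the regular expression or operator |.
pattern = str_c(patterns, collapse = '|')

matches = str_detect(values, pattern)

pattern  # "XX|YY"
matches  # TRUE TRUE FALSE
```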

Alternative: Via Non-Vectorized Operation

df_2 = df %>% 
  filter(
    map_lgl(col_1, ~any(str_detect(., c('XX', 'YY')))))

Here is how this works:

  • str_detect() is vectorized over both the string and the pattern and can operate in one of three modes:
    • Check for one pattern in each element in a vector of strings. This is the mode we used in all the above scenarios except Pattern Column.
    • Check n patterns against n strings. The two vectors must have the same size in this mode. This is the mode we used in Pattern Column above.
    • Check multiple patterns against a single string. This is the mode we use in this solution.
  • In this solution, we use map_lgl() to iterate over each value of the column col_1 and pass that to str_detect() along with a vector of multiple patterns. See Non-Vectorized Transformation.
  • The output of each iteration is as many logical values as there are patterns. In this case that is two since we have a vector of two patterns c('XX', 'YY').
  • We use any() to combine those logical values into a single value that is the or’ing of all values. This is the value returned by map_lgl() to filter(). See Logical Operations.
  • The output is the same as in the primary solution above.
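To make the intermediate steps visible, the sketch below first checks a single string against both patterns and then repeats that for every string with map_lgl(). The vector values is hypothetical.

```r
library(purrr)
library(stringr)

# Hypothetical values for illustration only.
values = c('aXXb', 'aYYb', 'aZZb')
patterns = c('XX', 'YY')

# One string against two patterns gives one logical value per pattern.
single_check = str_detect(values[1], patterns)

# Repeat for each string and reduce each result to one value with any().
any_match = map_lgl(values, ~any(str_detect(., patterns)))

single_check  # TRUE FALSE
any_match     # TRUE  TRUE FALSE
```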

AND

We wish to check whether each of a set of patterns occurs in a given string and to return TRUE if that is the case and FALSE otherwise.

In this example, we wish to filter the rows of the data frame df to retain only rows where the value of the column col_1 starts with the string ‘x’ and contains a sequence of one or more digits.

df_2 = df %>% 
  filter(str_detect(col_1, '^x'), 
         str_detect(col_1, '\\d+'))

Here is how this works:

  • We call the function str_detect() twice:
    • once to check whether a value of the column col_1 matches the regular expression '^x' which checks if the string starts with the letter ‘x’.
    • another to check whether a value of the column col_1 matches the regular expression '\\d+' which checks if the string contains a sequence of one or more digits.
  • Each of these calls returns a vector of logical values (TRUE or FALSE) that has as many elements as the size of col_1.
  • We pass the two conditions to filter() as separate arguments, which filter() combines with a logical and. This is equivalent to joining the two logical vectors element wise with the and operator &, so a row is retained if and only if both calls to str_detect() return TRUE for that value of col_1.
  • The output data frame df_2 will have only the rows of the input data frame df where the value of the column col_1 starts with 'x' and contains at least one digit.
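The following sketch checks, on a hypothetical data frame, that passing the two conditions as separate arguments to filter() gives the same rows as combining them explicitly with &.

```r
library(dplyr)
library(stringr)

# Hypothetical data for illustration only.
df = tibble(col_1 = c('x1', 'xa', 'a1', 'x', 'ax9'))

# Separate arguments to filter() are combined with a logical and.
with_comma = df %>%
  filter(str_detect(col_1, '^x'),
         str_detect(col_1, '\\d+'))

# The same conditions joined explicitly with &.
with_and = df %>%
  filter(str_detect(col_1, '^x') & str_detect(col_1, '\\d+'))

with_comma$col_1
with_and$col_1
```

Only 'x1' both starts with 'x' and contains a digit, so both versions return that single row.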

Alternative: Via Non-Vectorized Operation

df_2 = df %>% 
  filter(
    map_lgl(col_1, ~all(str_detect(., c('^x', '\\d+')))))

Here is how this works:

  • In this solution, we use map_lgl() to iterate over each value of the column col_1 and pass that to str_detect() along with a vector of multiple patterns. See Non-Vectorized Transformation.
  • We use all() to return TRUE if and only if both patterns match. See Logical Operations.
  • The output is the same as in the primary solution above.
  • For more details on how this code works, see “Alternative: Via Non-Vectorized Operation” under OR above.