Filtering

We wish to filter a vector of strings to retain only the elements that match a given pattern.

There are two common scenarios which we will cover below:

  • Return Matches: We wish to return the vector’s string elements that match a given pattern.
  • Return Locations: We wish to return the indices of the vector’s string elements that match a given pattern.

For each of these two scenarios, we will cover two cases:

  • Substring where the pattern is a plain character sequence
  • Regular Expressions where the pattern is a regular expression

In addition, we cover the following scenarios which can be applied to extend any of the above:

  • Ignore Case: Ignoring case (of both pattern and string) while matching.
  • Complement: How to obtain the elements of a vector that do not match the given pattern.
  • Pattern Column: How to return elements of a vector that match corresponding patterns provided as elements of another vector.
  • Multiple Patterns: Checking if any of a set of multiple patterns is matched.

Return Matches

Given a vector of strings, we wish to return only the elements that match a given pattern.

Substring

Given a vector of strings, we wish to return only the elements that contain a given substring.

In this example, we wish to return the unique values of the column col_1 that contain the substring ‘xy’.


df %>% 
  pull(col_1) %>% 
  str_unique() %>% 
  str_subset(fixed('xy'))

Here is how this works:

  • In pull(col_1), we extract the column col_1 as a vector. See Selecting Single Column.
  • We then use the str_unique() function to return the unique values of the column col_1 i.e. remove duplicated values. See Uniqueness.
  • We then use the str_subset() function to filter the unique elements of the column col_1 that contain the substring ‘xy’.
  • We wrap the substring in the helper fixed() because by default str_subset() assumes that the pattern passed is a regular expression. fixed() specifies that the pattern is a fixed string. In this case though, since our simple pattern would behave the same whether we treat it as a string or a regular expression, we can drop fixed().
  • Note that str_subset() is a convenience wrapper for calling str_detect(), which we cover in Detecting, and then filtering i.e. x[str_detect(x, pattern)].
  • The output will be a vector containing the unique elements of the column col_1 that contain the substring ‘xy’.
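
As a minimal sketch of the points above, the following uses a small made-up vector x (an assumption standing in for the values of col_1) to show the equivalence with str_detect() and a case where fixed() changes the result, namely a pattern that contains the regular expression metacharacter '.'.

```r
library(stringr)

# Hypothetical values standing in for df %>% pull(col_1)
x <- c("axy", "bxy", "axy", "abc", "x.y", "xzy")

# Deduplicate, then keep the elements that contain the literal substring "xy"
u <- str_unique(x)
matches <- str_subset(u, fixed("xy"))        # "axy" "bxy"

# The long form that str_subset() abbreviates: detect, then subset
matches_long <- u[str_detect(u, fixed("xy"))]

# fixed() matters once the pattern contains a metacharacter:
# as a regular expression, "x.y" means x, any one character, y
literal_dot <- str_subset(x, fixed("x.y"))   # "x.y"
regex_dot <- str_subset(x, "x.y")            # "x.y" "xzy"
```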

Regular Expression

Given a vector of strings, we wish to return only the elements that match a given regular expression.

In this example, we wish to return the unique values of the column col_1 that contain the pattern ‘x’ followed by a sequence of digits.

df %>% 
  pull(col_1) %>% 
  str_unique() %>% 
  str_subset('x\\d+')

Here is how this works:

  • This works similarly to the Substring scenario above except that we pass to str_subset() a regular expression.
  • The regular expression is 'x\\d+' where ‘x’ matches the character x and ‘\\d+’ matches a sequence of one or more digits.
  • str_subset() expects a regular expression by default.
  • The output will be a vector containing the unique elements of the column col_1 that contain the pattern ‘x’ followed by a sequence of digits.
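
To make the behavior concrete, here is a small sketch on a made-up vector x (an assumption, not data from this page). It also shows that the pattern may match anywhere inside the string unless we add anchors.

```r
library(stringr)

# Hypothetical values standing in for df %>% pull(col_1)
x <- c("x1", "ax22b", "x", "xy3", "x1")

# 'x' followed by one or more digits, anywhere in the string
anywhere <- str_subset(str_unique(x), "x\\d+")    # "x1" "ax22b"

# Anchoring with ^ and $ requires the whole string to match
whole <- str_subset(str_unique(x), "^x\\d+$")     # "x1"
```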

Return Locations

Given a vector of strings, we wish to return the indices of the elements that match a given pattern.

Substring

Given a vector of strings, we wish to return the indices of the elements that contain a given substring.

In this example, we wish to return the index of the first element (the smallest row number) of the column col_2 that contains the substring ‘XY’ for each group where the groups are defined by the column col_1.

df_2 = df %>%
  group_by(col_1) %>%
  summarize(
    first_match = min(str_which(col_2, 'XY'))
  )

Here is how this works:

  • We apply group_by(col_1) on the data frame df so that the subsequent call to summarize() acts on each group separately. See Basic Aggregation.
  • We create an aggregate column first_match that comprises the minimum of the indices of all matching strings for each group. We obtain that as follows:
    • We use str_which(col_2, 'XY') to return the indices of matching elements of the values of the column col_2 for the current group.
    • We then use min() to obtain the smallest value.
  • Note that str_which() is a convenience wrapper for calling str_detect(), which we cover in Detecting, and then calling which() i.e. which(str_detect(x, pattern)).
  • The output is a data frame df_2 that has one row for each unique value of the column col_1 and two columns col_1 and first_match. The column first_match holds, for each group, the position within that group of the first row where the value of col_2 contains the substring ‘XY’.
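
The following sketch, on a made-up vector col_2 (an assumption standing in for the values of one group), illustrates the equivalence with which() and what happens when nothing matches. In that case str_which() returns an empty integer vector, and min() of an empty vector returns Inf with a warning, so we may prefer to guard against it. The NA fallback shown is one possible choice, not something this page prescribes.

```r
library(stringr)

# Hypothetical values of col_2 for a single group
col_2 <- c("AB", "XY", "aXYb", "CD")

idx <- str_which(col_2, "XY")                # 2 3
idx_long <- which(str_detect(col_2, "XY"))   # the long form, same result
first_match <- min(idx)                      # 2

# With no matches, str_which() returns integer(0)
no_match <- str_which(c("AB", "CD"), "XY")

# Guarded minimum: fall back to NA instead of calling min() on an empty vector
first_or_na <- if (length(no_match) > 0) min(no_match) else NA_integer_
```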

Regular Expression

Given a vector of strings, we wish to return the indices of the elements that match a given regular expression.

In this example, we wish to return the index of the first element (the smallest row number within the group) of the column col_2 that starts with ‘X’ and ends with ‘Y’ (with a middle character that is either ‘X’ or ‘Y’) for each group where the groups are defined by the column col_1.

df = tibble(
  col_1 = c('a', 'a', 'a', 'b', 'b', 'b'),
  col_2 = c('XXX', 'XXY', 'XYX', 'XXX', 'YYY', 'XYY'))

df_2 = df %>%
  group_by(col_1) %>%
  summarize(
    first_match = min(str_which(col_2, '^X[XY]Y$'))
  )

Here is how this works:

  • This works similarly to the Substring scenario above except that we pass to str_which() a regular expression.
  • The regular expression is '^X[XY]Y$' where:
    • X and Y match the corresponding characters
    • [XY] matches one character that is either X or Y.
    • ^ and $ are the string start and string end anchors respectively.
  • str_which() expects a regular expression by default.
  • The output is a data frame df_2 that has one row for each unique value of the column col_1 and two columns col_1 and first_match. The column first_match holds, for each group, the position within that group of the first row where the value of col_2 matches the given pattern.
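
Because str_which() runs on each group separately, the indices it returns count from the start of the group, not from the start of the data frame. The sketch below reuses the example data to show this. The second pipeline is our own variation, with hypothetical column names row and first_row, for when the row number in the full data frame is wanted instead.

```r
library(dplyr)
library(stringr)

df <- tibble(
  col_1 = c('a', 'a', 'a', 'b', 'b', 'b'),
  col_2 = c('XXX', 'XXY', 'XYX', 'XXX', 'YYY', 'XYY'))

# Position within each group: 2 for group 'a', 3 for group 'b'
within_group <- df %>%
  group_by(col_1) %>%
  summarize(first_match = min(str_which(col_2, '^X[XY]Y$')))

# Row number in the full data frame: 2 for group 'a', 6 for group 'b'
overall <- df %>%
  mutate(row = row_number()) %>%
  group_by(col_1) %>%
  summarize(first_row = min(row[str_detect(col_2, '^X[XY]Y$')]))
```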

Ignore Case

Substring

df %>% 
  pull(col_1) %>% 
  str_unique() %>% 
  str_subset(fixed('XY', ignore_case=TRUE))

Here is how this works:

  • This works similarly to the code under Return Matches above.
  • To ignore case while matching, we wrap the substring in fixed() and pass the parameter ignore_case=TRUE. See Ignore Case under Detecting for more details.

Regular Expression

df %>% 
  pull(col_1) %>% 
  str_unique() %>% 
  str_subset(regex('\\d+gm', ignore_case=TRUE))

Here is how this works:

  • This works similarly to the code under Return Matches above.
  • To ignore case while matching, we wrap the regular expression in regex() and pass the parameter ignore_case=TRUE. See Ignore Case under Detecting for more details.
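
A short sketch of both cases on a made-up vector x (an assumption standing in for the values of col_1):

```r
library(stringr)

# Hypothetical values standing in for df %>% pull(col_1)
x <- c("XY1", "xy2", "Xy3", "ab", "10GM", "5gm", "gm")

# Substring: 'XY' in any combination of upper and lower case
sub_ci <- str_subset(x, fixed("XY", ignore_case = TRUE))        # "XY1" "xy2" "Xy3"

# Regular expression: digits followed by 'gm' in any case
rx_ci <- str_subset(x, regex("\\d+gm", ignore_case = TRUE))     # "10GM" "5gm"

# Without ignore_case only the lower-case form matches
rx_cs <- str_subset(x, "\\d+gm")                                # "5gm"
```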

Complement

Given a vector of strings, we wish to return only the elements that do not match a given pattern.

df %>% 
  pull(col_1) %>% 
  str_unique() %>% 
  str_subset('XY', negate=TRUE)

Here is how this works:

  • This works similarly to the code under Return Matches above.
  • The same approach of passing negate=TRUE can be applied to str_which().
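
As an illustration on a made-up vector x (an assumption), negate=TRUE returns the non-matching elements from str_subset() and their indices from str_which():

```r
library(stringr)

# Hypothetical values standing in for df %>% pull(col_1)
x <- c("aXY", "ab", "XYz", "cd")

# Elements that do NOT contain 'XY'
non_matches <- str_subset(x, "XY", negate = TRUE)     # "ab" "cd"

# Indices of the elements that do NOT contain 'XY'
non_match_idx <- str_which(x, "XY", negate = TRUE)    # 2 4
```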

Pattern Column

We wish to match a vector of strings against a vector of patterns of the same size. This is often needed when we wish to locate the value of a column in the value of another column for each row.

In this example, for each group defined by the column col_1, we wish to return the elements of the column col_2 that contain the corresponding value of the column col_3 (ignoring case), concatenated into a single comma-separated string.

df = tibble(
  col_1 = c('a', 'a', 'a', 'b', 'b', 'b'),
  col_2 = c('XXX', 'YYY', 'XYX', 'YXY', 'YXX', 'XYY'),
  col_3 = c('xx', 'xx', 'xy', 'xy', 'yy', 'yy'))

df_2 = df %>%
  group_by(col_1) %>%
  summarize(
    summary = col_2 %>% 
      str_subset(fixed(col_3, ignore_case = TRUE)) %>% 
      str_flatten(collapse = ', '))

Here is how this works:

  • We apply group_by(col_1) on the data frame df so that the subsequent call to summarize() acts on each group separately. See Basic Aggregation.
  • str_subset() is vectorized over both the string and the pattern and can operate in one of two modes:
    • Check for one pattern in each element in a vector of strings. This is the mode we used in all the above scenarios.
    • Check n patterns against n strings. Both vectors must be the same size for this mode (recent versions of stringr recycle only vectors of length one). This is the mode we use in this solution since we have a vector of strings, the column col_2, and a vector of patterns of the same size, the column col_3.
  • We pass to str_subset():
    • First Argument: The column col_2 via the pipe %>% which is the vector we wish to filter
    • Second Argument: The column col_3 which contains the patterns to match against the corresponding elements of col_2 and is naturally of the same size as col_2. We wrap it in fixed() with ignore_case = TRUE so that each pattern is treated as a plain substring and matched regardless of case, since the patterns in col_3 are lower case while the strings in col_2 are upper case.
  • We then use str_flatten() to collapse the values returned by str_subset() for each group with comma separators. See Collapsing.
  • The output is a data frame df_2 that has one row for each unique value of the column col_1 and two columns col_1 and summary. The column summary holds values of column col_2 that contain the corresponding values of the column col_3 separated by commas, for each group.
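
To see the pairwise mode in isolation, the sketch below takes the rows of group 'a' from the example as plain vectors (the names strings and patterns are ours) and walks through the steps one at a time.

```r
library(stringr)

# The values of col_2 and col_3 for group 'a' in the example above
strings  <- c("XXX", "YYY", "XYX")
patterns <- c("xx", "xx", "xy")

# Element i of strings is checked against element i of patterns only
pairwise <- str_detect(strings, fixed(patterns, ignore_case = TRUE))  # TRUE FALSE TRUE

# str_subset() keeps the strings whose own pattern matched
kept <- str_subset(strings, fixed(patterns, ignore_case = TRUE))      # "XXX" "XYX"

# Collapse the survivors into one comma-separated string
summary_a <- str_flatten(kept, collapse = ", ")                        # "XXX, XYX"
```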

Multiple Patterns

Given a vector of strings, we wish to return the elements that match any pattern in a given set of patterns.

In this example, we wish to retain elements of the column col_1 that contain a sequence of digits followed by the string ‘gm’ or a sequence of digits followed by the string ‘g’.

df %>% 
  pull(col_1) %>% 
  str_subset('\\d+gm|\\d+g')

Here is how this works:

  • We use the or operator | of regular expressions to build a regular expression that captures all the patterns we wish to look for or’ed together. In this case that regular expression is '\\d+gm|\\d+g'.
  • We pass that regular expression to str_subset() which then returns the elements of the column col_1 that match the regular expression i.e. any of the patterns.
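
When the patterns are held in a vector rather than typed out by hand, one possible approach (our own sketch, not a method this page prescribes) is to build the combined regular expression by joining them with '|'. The vectors patterns and x below are made up for illustration.

```r
library(stringr)

# Hypothetical set of patterns and values standing in for col_1
patterns <- c("\\d+gm", "\\d+g")
x <- c("10gm", "5g", "gm", "kg", "3 g")

# Join the patterns with the regular expression 'or' operator
combined <- str_flatten(patterns, collapse = "|")   # "\\d+gm|\\d+g"

# Keep the elements that match any of the patterns
any_match <- str_subset(x, combined)                # "10gm" "5g"
```

Joining this way assumes each pattern is a valid regular expression on its own. Patterns that contain their own top-level '|' or rely on anchors may need to be wrapped in parentheses first.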