We wish to filter a vector of strings to retain only the elements that match a given pattern.
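There are two common scenarios, which we cover below: returning the elements that match a given pattern, and returning the indices of the elements that match a given pattern.
For each of these two scenarios, we cover two cases: matching a substring and matching a regular expression.
In addition, we cover the following extensions, which can be applied to any of the above: ignoring case, returning the elements that do not match, matching each element against its own pattern, and matching any of several patterns.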
Given a vector of strings, we wish to return only the elements that match a given pattern.
Substring
Given a vector of strings, we wish to return only the elements that contain a given substring.
In this example, we wish to return the unique values of the column col_1 that contain the substring ‘xy’.
library(dplyr)
library(stringr)

df %>%
  pull(col_1) %>%
  str_unique() %>%
  str_subset(fixed('xy'))
Here is how this works:
- Using pull(col_1), we extract the column col_1 as a vector. See Selecting Single Column.
- We use the str_unique() function to return the unique values of the column col_1, i.e. remove duplicated values. See Uniqueness.
- We use the str_subset() function to filter the unique elements of the column col_1 that contain the substring ‘xy’.
- We wrap the pattern in fixed() because by default str_subset() assumes that the pattern passed is a regular expression. fixed() specifies that the pattern is a fixed string. In this case though, since our simple pattern would behave the same whether we treat it as a string or a regular expression, we can drop fixed().
- str_subset() is a convenience for using str_detect(), which we cover in Detecting, and then filtering, i.e. x[str_detect(x, pattern)].
- The output is a vector of the unique values of the column col_1 that contain the substring ‘xy’.
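The relationship between str_subset(), str_detect() and fixed() described above can be sketched on a small made-up vector. The vector x and the variable names below are illustrative assumptions, not part of the original example. The sketch also shows one practical difference: str_subset() drops missing values, while indexing with str_detect() keeps an NA placeholder.

```r
library(stringr)

# Illustrative data (an assumption for this sketch).
x <- c('axy', 'b.c', 'xyz', NA, 'abc')

# str_subset() keeps the matching elements and drops missing values.
subset_result <- str_subset(x, fixed('xy'))

# Indexing with str_detect() keeps an NA where the input was missing.
detect_result <- x[str_detect(x, fixed('xy'))]

# fixed() matters once the pattern holds regex metacharacters:
# as a regex, '.' matches any character; as a fixed string, only a literal dot.
regex_dot <- str_subset(x, '.')
fixed_dot <- str_subset(x, fixed('.'))
```

Here subset_result holds 'axy' and 'xyz', detect_result additionally ends with NA, regex_dot returns every non-missing element, and fixed_dot returns only 'b.c'.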
Regular Expression
Given a vector of strings, we wish to return only the elements that match a given regular expression.
In this example, we wish to return the unique values of the column col_1 that contain the pattern ‘x’ followed by a sequence of digits.
df %>%
  pull(col_1) %>%
  str_unique() %>%
  str_subset('x\\d+')
Here is how this works:
- We pass to str_subset() a regular expression.
- The regular expression is 'x\\d+', where ‘x’ matches the character x and ‘\\d+’ matches a sequence of one or more digits.
- No wrapper function is needed because str_subset() expects a regular expression by default.
- The output is a vector of the unique values of the column col_1 that contain the pattern ‘x’ followed by a sequence of digits.
Given a vector of strings, we wish to return the indices of the elements that match a given pattern.
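Before the grouped examples, the idea of returning indices rather than elements can be sketched on a plain vector. The vector x and the variable names are assumptions made for illustration only.

```r
library(stringr)

# Illustrative data (an assumption for this sketch).
x <- c('apple', 'banana', 'cherry', 'date')

# Positions (1-based) of the elements that contain the letter 'a'.
idx <- str_which(x, 'a')

# The same positions obtained via which() and str_detect().
idx_alt <- which(str_detect(x, 'a'))

# Indexing with the positions recovers the matching elements.
matched <- x[idx]
```

In this sketch idx and idx_alt both hold the positions 1, 2 and 4, and matched holds 'apple', 'banana' and 'date'.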
Substring
Given a vector of strings, we wish to return the indices of the elements that contain a given substring.
In this example, we wish to return the index of the first element (the smallest row number) of the column col_2 that contains the substring ‘XY’ for each group, where the groups are defined by the column col_1.
df_2 = df %>%
  group_by(col_1) %>%
  summarize(
    first_match = min(str_which(col_2, 'XY'))
  )
Here is how this works:
- We call group_by(col_1) on the data frame df so that the subsequent call to summarize() acts on each group separately. See Basic Aggregation.
- In summarize(), we create a column first_match that comprises the minimum of the indices of all matching strings for each group. We obtain that as follows:
  - We use str_which(col_2, 'XY') to return the indices of the matching elements among the values of the column col_2 for the current group.
  - We use min() to obtain the smallest value.
- str_which() is a convenience for using str_detect(), which we cover in Detecting, and then using which(), i.e. which(str_detect(x, pattern)).
- The output is a new data frame df_2 that has one row for each unique value of the column col_1 and two columns, col_1 and first_match. The column first_match holds the index of the first row where the value of col_2 contains the substring ‘XY’ for each group.
Regular Expression
Given a vector of strings, we wish to return the indices of the elements that match a given regular expression.
In this example, we wish to return the index of the first element (the smallest row number) of the column col_2 that starts with ‘X’ and ends with ‘Y’ (with a middle character that is either ‘X’ or ‘Y’) for each group, where the groups are defined by the column col_1.
library(dplyr)
library(stringr)

df = tibble(
  col_1 = c('a', 'a', 'a', 'b', 'b', 'b'),
  col_2 = c('XXX', 'XXY', 'XYX', 'XXX', 'YYY', 'XYY'))

df_2 = df %>%
  group_by(col_1) %>%
  summarize(
    first_match = min(str_which(col_2, '^X[XY]Y$'))
  )
Here is how this works:
- We pass to str_which() a regular expression.
- The regular expression is '^X[XY]Y$', where:
  - X and Y match the corresponding characters.
  - [XY] matches one character that is either X or Y.
  - ^ and $ are the string start and string end anchors respectively.
- No wrapper function is needed because str_which() expects a regular expression by default.
- The output is a new data frame df_2 that has one row for each unique value of the column col_1 and two columns, col_1 and first_match. The column first_match holds the index of the first row where the value of col_2 matches the given pattern for each group.
Substring
df %>%
  pull(col_1) %>%
  str_unique() %>%
  str_subset(fixed('XY', ignore_case=TRUE))
Here is how this works:
- To ignore case, we wrap the substring in fixed() and pass the parameter ignore_case=TRUE. See Ignore Case under Detecting for more details.
Regular Expression
df %>%
  pull(col_1) %>%
  str_unique() %>%
  str_subset(regex('\\d+gm', ignore_case=TRUE))
Here is how this works:
- To ignore case, we wrap the regular expression in regex() and pass the parameter ignore_case=TRUE. See Ignore Case under Detecting for more details.
Given a vector of strings, we wish to return only the elements that do not match a given pattern.
df %>%
  pull(col_1) %>%
  str_unique() %>%
  str_subset('XY', negate=TRUE)
Here is how this works:
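- We pass negate=TRUE to str_subset() so that it returns the elements that do not match the pattern.
- negate=TRUE can also be applied to str_which().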
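The use of negate=TRUE with both functions can be sketched on a small made-up vector. The vector x and the variable names are assumptions for illustration.

```r
library(stringr)

# Illustrative data (an assumption for this sketch).
x <- c('aXYb', 'abc', 'XY', 'xyz')

# Elements that do not contain 'XY' (matching is case sensitive by default).
non_matching <- str_subset(x, 'XY', negate = TRUE)

# Positions of those same elements.
non_matching_idx <- str_which(x, 'XY', negate = TRUE)
```

Here non_matching holds 'abc' and 'xyz' (the lowercase 'xyz' does not contain 'XY'), and non_matching_idx holds their positions, 2 and 4.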
We wish to match a vector of strings against a vector of patterns of the same size. This is often needed when we wish to locate the value of a column in the value of another column for each row.
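As a minimal sketch of what matching against a vector of patterns means, consider two short made-up vectors of the same length. The names strings and patterns are assumptions for illustration. Each string is tested only against the pattern in the same position.

```r
library(stringr)

# Illustrative data (an assumption for this sketch).
strings  <- c('XXX', 'YYY', 'XYX')
patterns <- c('XX', 'XX', 'YX')

# Element i of `strings` is tested against element i of `patterns` only.
pairwise <- str_detect(strings, patterns)

# Keep the strings whose own pattern matched.
kept <- str_subset(strings, patterns)
```

In this sketch pairwise is TRUE, FALSE, TRUE, so kept holds 'XXX' and 'XYX'. The second string is dropped even though the third pattern, 'YX', is never tried against it.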
In this example, we wish to return the elements of the column col_2 that contain the corresponding value of the column col_3, concatenated together for each group, where the groups are defined by the value of the column col_1.
library(dplyr)
library(stringr)

df = tibble(
  col_1 = c('a', 'a', 'a', 'b', 'b', 'b'),
  col_2 = c('XXX', 'YYY', 'XYX', 'YXY', 'YXX', 'XYY'),
  col_3 = c('xx', 'xx', 'xy', 'xy', 'yy', 'yy'))

df_2 = df %>%
  group_by(col_1) %>%
  summarize(
    summary = col_2 %>%
      str_subset(fixed(col_3, ignore_case = TRUE)) %>%
      str_flatten(collapse = ', '))
Here is how this works:
- We call group_by(col_1) on the data frame df so that the subsequent call to summarize() acts on each group separately. See Basic Aggregation.
- str_subset() is vectorized over both the string and the pattern and can operate in one of two modes: matching every string against a single pattern, or matching each string against the corresponding pattern in a vector of patterns of the same size. Here we use the second mode with a vector of strings, the column col_2, and a vector of patterns of the same size, the column col_3.
- We pass the following to str_subset():
  - col_2 via the pipe %>%, which is the vector we wish to filter.
  - col_3, wrapped in fixed() with ignore_case = TRUE, which contains the patterns to match against the corresponding elements of col_2 and is naturally of the same size as col_2.
- We use str_flatten() to collapse the values returned by str_subset() for each group with comma separators. See Collapsing.
- The output is a new data frame df_2 that has one row for each unique value of the column col_1 and two columns, col_1 and summary. The column summary holds the values of the column col_2 that contain the corresponding values of the column col_3, separated by commas, for each group.
Given a vector of strings, we wish to return the elements that match any pattern in a given set of patterns.
In this example, we wish to retain the elements of the column col_1 that contain a sequence of digits followed by the string ‘gm’ or a sequence of digits followed by the string ‘g’.
df %>%
  pull(col_1) %>%
  str_subset('\\d+gm|\\d+g')
Here is how this works:
- We use the or operator | of regular expressions to build a regular expression that captures all the patterns we wish to look for, or’ed together. In this case, that regular expression is '\\d+gm|\\d+g'.
- We pass that regular expression to str_subset(), which then returns the elements of the column col_1 that match the regular expression, i.e. any of the patterns.
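When the patterns are held in a vector rather than typed out by hand, the same or’ed regular expression can be assembled programmatically. The following is a minimal sketch under that assumption. The vectors x and patterns are made up for illustration, and it assumes each pattern is itself a valid regular expression.

```r
library(stringr)

# Illustrative data and pattern set (assumptions for this sketch).
x <- c('250gm', '3g', '12kg', 'gm', '7 g')
patterns <- c('\\d+gm', '\\d+g')

# Join the patterns with the regex "or" operator.
combined <- str_flatten(patterns, collapse = '|')

# Keep the elements that match any of the patterns.
matched <- str_subset(x, combined)
```

Here combined is the same expression used above, and matched holds '250gm' and '3g'. The other elements have no digit immediately followed by ‘g’.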