Oftentimes, we wish to select columns whose names fit a particular pattern. This is helpful in situations where we would otherwise need to type in many similar column names explicitly or in situations where we don’t know the exact column names in advance (but we know the pattern they would follow).
Column names are strings. We can therefore apply string matching operations to them. We will cover the string matching operations most commonly used in a column selection context on this page. Please see String Operations for a detailed coverage of string matching.
We wish to select columns whose names start with a given prefix.
In this example, we select all columns whose name starts with the prefix ‘col’
.
df_2 = df %>% select(starts_with('col'))
Here is how this works:
starts_with()
is a helper function meant to be used inside select().
starts_with()
takes a string prefix (here ‘col’
) and returns the names of the columns that start with the given prefix (not logical TRUE
or FALSE
values).select()
then extracts the columns whose names are returned by starts_with()
from the original Data Frame.We wish to select columns whose names end with a given suffix.
In this example, we select all columns whose name ends with the suffix ‘_id’
.
df %>% select(ends_with('_id'))
Here is how this works:
This works similarly to the scenario described above, except that we use ends_with()
instead of starts_with()
.
We wish to select columns whose names contain a given substring.
In this example, we select all columns whose name contains the substring ‘col’
.
df_2 = df %>% select(contains('col'))
Here is how this works:
This works similarly to the scenario described above, except that we use contains()
instead of starts_with()
.
We wish to select columns whose names match a given regular expression.
In this example we wish to select all columns whose names end with the digit 3 matched by the regular expression '3$'
.
df_2 = df %>% select(matches('3$'))
Here is how this works:
matches()
is a helper function meant to be used inside select().
matches()
takes a regular expression (here ‘3$’
) and returns the names of the columns that match the given regular expression (not logical TRUE
or FALSE
values).select()
then extracts the columns whose names are returned by matches()
from the original Data Frame.We wish to select columns whose names include any one of a set of substrings.
In this example, we wish to select columns whose names contain any of the words ‘lat’
, or ‘long’
, or ‘location’
.
df_2 = df %>% select(matches('lat|long|location'))
Here is how this works:
contains()
or matches()
multiple times (once for each of the tokens then combining the results (taking the union), a relatively simpler way is to use matches()
once and leverage the or
’ing capability of regular expressions.‘lat’
, or ‘long’
, or ‘location’
is 'lat|long|location’
.matches()
and select()
as described in the RegEx scenario above.