Selecting by Name Pattern

Oftentimes, we wish to select columns whose names fit a particular pattern. This is helpful in situations where we would otherwise need to type in many similar column names explicitly or in situations where we don’t know the exact column names in advance (but we know the pattern they would follow).

Column names are strings. We can therefore apply string matching operations to them. We will cover the string matching operations most commonly used in a column selection context on this page. Please see String Operations for a detailed coverage of string matching.

Starts With

We wish to select columns whose names start with a given prefix.

In this example, we select all columns whose name starts with the prefix ‘col’.

df_2 = df %>% select(starts_with('col'))

Here is how this works:

  • The starts_with() is a helper function meant to be used inside select().
  • starts_with() takes a string prefix (here ‘col’) and returns the names of the columns that start with the given prefix (not logical TRUE or FALSE values).
  • select() then extracts the columns whose names are returned by starts_with() from the original Data Frame.

Ends With

We wish to select columns whose names end with a given suffix.

In this example, we select all columns whose name ends with the suffix ‘_id’.

df %>% select(ends_with('_id'))

Here is how this works:

This works similarly to the scenario described above, except that we use ends_with() instead of starts_with().

Contains

We wish to select columns whose names contain a given substring.

In this example, we select all columns whose name contains the substring ‘col’.

df_2 = df %>% select(contains('col'))

Here is how this works:

This works similarly to the scenario described above, except that we use contains() instead of starts_with().

Matches RegEx

We wish to select columns whose names match a given regular expression.

In this example we wish to select all columns whose names end with the digit 3 matched by the regular expression '3$'.

df_2 = df %>% select(matches('3$'))

Here is how this works:

  • The matches() is a helper function meant to be used inside select().
  • matches() takes a regular expression (here ‘3$’) and returns the names of the columns that match the given regular expression (not logical TRUE or FALSE values).
  • select() then extracts the columns whose names are returned by matches() from the original Data Frame.

Multiple Tokens

We wish to select columns whose names include any one of a set of substrings.

In this example, we wish to select columns whose names contain any of the words ‘lat’, or ‘long’, or ‘location’ .

df_2 = df %>% select(matches('lat|long|location'))

Here is how this works:

  • Instead of executing contains() or matches() multiple times (once for each of the tokens then combining the results (taking the union), a relatively simpler way is to use matches() once and leverage the or’ing capability of regular expressions.
  • A regular expression that matches ‘lat’, or ‘long’, or ‘location’ is 'lat|long|location’.
  • Now that we have the regular expression we need, we use matches() and select() as described in the RegEx scenario above.
R
I/O