Dynamic Column Specification

We wish to pass the names of the columns to be selected dynamically.

We will cover the following:

  1. As Environment Variable: The names of the columns to be selected are stored in a variable in the environment. This is useful, for instance, when structuring a script so that the column specification sits in a configuration section at the beginning of the script, separate from the logic.
  2. As Function Argument: Column selection happens inside a function and the names of the columns to be selected are passed to the function as an argument. This is useful when we wish to isolate our data manipulation code into a reusable function.
  3. Flexible Matching: We see how to ignore elements in the passed vector of column names that do not match any actual column names.

As Environment Variable

We wish to specify the names of the columns to be selected as a variable in the environment.

In this example, we specify the names of the columns we wish to select as a variable cols_to_select and then use that variable for column selection.

library(dplyr)

cols_to_select = c('col_1','col_2','col_3')
df_2 = df %>% select(all_of(cols_to_select))

Here is how this works:

  • The column names to be selected are specified as a vector stored in the variable cols_to_select.
  • In order to pass the vector cols_to_select to select(), we wrap it in the all_of() function.
  • all_of() selects the columns in the data frame df whose names match the elements of the cols_to_select vector.
  • all_of() is strict, i.e. every string in cols_to_select must match a column name; otherwise an error is thrown. To ignore elements that do not match any of the data frame’s column names, see “Flexible Matching” below.
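
The strict behavior described above can be seen on a small data frame. This is a minimal sketch: the toy data frame, its values, and the names bad_cols and result are made up for illustration and are not part of the original example.

```r
library(dplyr)

# Toy data frame (hypothetical values, for illustration only)
df <- data.frame(
  col_1 = 1:3,
  col_2 = c("a", "b", "c"),
  col_3 = c(TRUE, FALSE, TRUE),
  col_4 = 4:6
)

# Every name matches a column, so the selection succeeds
cols_to_select <- c("col_1", "col_3")
df_2 <- df %>% select(all_of(cols_to_select))
names(df_2)

# One name matches no column, so all_of() raises an error;
# tryCatch() turns that error into a marker string here
bad_cols <- c("col_1", "not_a_column")
result <- tryCatch(
  df %>% select(all_of(bad_cols)),
  error = function(e) "error"
)
result
```

Catching the error with tryCatch() is only done here to make the failure visible without stopping the script; in a real pipeline the error would normally be left to surface.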

As Function Argument

We wish to pass the names of the columns to be selected as an argument to a function. The actual column selection happens inside the function.

In this example, column selection happens inside the function pipeline() which takes the names of the columns to be selected as an argument cols_to_select.

pipeline <- function(df, cols_to_select) {
  df %>%
    select(all_of(cols_to_select))
}

df_2 = df %>%
  pipeline(c('col_1','col_2','col_3'))

Here is how this works:

  • The function pipeline() has two arguments: the data frame df and the names of the columns to be selected cols_to_select.
  • We pass the data frame to the first argument of the function pipeline() via the pipe %>%.
  • We pass the names of the columns to be selected as a vector to the cols_to_select argument.
  • Inside the function, we use all_of() with select() as described in the “As Environment Variable” scenario above.
  • all_of() is strict, i.e. every string in cols_to_select must match a column name; otherwise an error is thrown. To ignore elements that do not match any of the data frame’s column names, see “Flexible Matching” below.
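
To make the function scenario concrete, here is a small sketch that runs pipeline() on a toy data frame. The data frame sales, its columns, and the vector key_cols are hypothetical names introduced only for this illustration.

```r
library(dplyr)

# Same function as in the example above
pipeline <- function(df, cols_to_select) {
  df %>%
    select(all_of(cols_to_select))
}

# Toy data frame (hypothetical, for illustration only)
sales <- data.frame(
  id = 1:2,
  region = c("north", "south"),
  amount = c(10.5, 20)
)

# The column names can be kept in a variable and passed along
key_cols <- c("id", "amount")

# Calling via the pipe and calling directly give the same result
out_piped <- sales %>% pipeline(key_cols)
out_direct <- pipeline(sales, key_cols)

# Columns come back in the order given in the vector
out_reordered <- pipeline(sales, c("amount", "id"))
names(out_reordered)
```

Because the column names are an ordinary character vector, the same function can be reused with a different set of columns on each call.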

Flexible Matching

We wish to select any columns whose names appear in a vector of strings, where the vector may contain strings that do not match any column name. We wish to select the columns with matching names and ignore the non-matching strings without throwing an error.

cols_to_select = c('col_1','col_2','col_3')
df_2 = df %>% select(any_of(cols_to_select))

Here is how this works:

  • This works similarly to the scenarios above except that we use any_of() instead of all_of().
  • any_of() is forgiving in that it ignores any strings in the vector passed to it (here cols_to_select) that do not match any column names.
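
The forgiving behavior is easiest to see when the data frame really lacks one of the requested columns. The sketch below assumes a toy data frame with only col_1 and col_2; the data and the name df_none are made up for illustration.

```r
library(dplyr)

# Toy data frame with no col_3 (hypothetical, for illustration only)
df <- data.frame(col_1 = 1:3, col_2 = 4:6)

# col_3 does not exist in df; any_of() skips it silently
cols_to_select <- c("col_1", "col_2", "col_3")
df_2 <- df %>% select(any_of(cols_to_select))
names(df_2)

# If none of the names match, the result is a data frame with no columns
df_none <- df %>% select(any_of(c("x", "y")))
ncol(df_none)
```

Since any_of() never complains about missing names, a misspelled column name is dropped silently as well, so all_of() is the safer choice when every column is expected to exist.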