Implicit Column Selection

Often we wish to select columns not by spelling out their names or positions, but by criteria that the desired columns satisfy. The three most common scenarios are:

  1. Name Pattern: where we cover how to select columns whose names satisfy a given pattern, e.g. select columns whose names contain the string ‘_id’.
  2. Data Type: where we cover how to select columns of one or more data types, e.g. select columns with a numeric data type.
  3. Data Criteria: where we cover how to select columns whose data satisfies a certain condition, e.g. the percentage of missing values is below 10%.
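The three scenarios above can be sketched in a few lines. This is a minimal illustration, assuming the dplyr package is installed; the data frame `df` and its column names are hypothetical and chosen only to make each case visible.

```r
library(dplyr)

# Hypothetical data frame used only for illustration
df <- tibble(
  user_id  = 1:4,
  order_id = c(10L, 11L, 12L, 13L),
  amount   = c(9.5, NA, 3.2, 7.0),
  note     = c("a", NA, NA, NA)
)

# 1. Name pattern: columns whose names contain the string "_id"
by_name <- df %>% select(contains("_id"))

# 2. Data type: columns with a numeric data type (integer or double)
by_type <- df %>% select(where(is.numeric))

# 3. Data criteria: columns whose share of missing values is below 10%
by_data <- df %>% select(where(function(x) mean(is.na(x)) < 0.1))
```

Here `amount` (25% missing) and `note` (75% missing) fail the third criterion, so `by_data` keeps only the two identifier columns.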

In Multiple Conditions, we cover how to combine multiple conditions in different ways to realize more complex column selection logic.
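As a rough preview of combining conditions, selection helpers can be joined with the boolean operators `&`, `|` and `!` inside select(). The sketch below assumes dplyr 1.0 or later; the data frame and its columns are hypothetical.

```r
library(dplyr)

# Hypothetical data frame used only for illustration
df <- tibble(
  user_id = 1:3,
  site_id = c("a", "b", "c"),
  amount  = c(1.5, 2.5, 3.5)
)

# AND: numeric columns whose names also end in "_id"
both <- df %>% select(where(is.numeric) & ends_with("_id"))

# OR: columns that are numeric or whose names end in "_id"
either <- df %>% select(where(is.numeric) | ends_with("_id"))

# NOT: every column that is not numeric
negated <- df %>% select(!where(is.numeric))
```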

Implicit column selection works in two steps:

  1. Identification: We write logic (as a named or anonymous function) that acts on each column of a data frame to check whether certain conditions are satisfied, and that yields one of three types of output:
    • A boolean vector with as many elements as there are columns in the data frame and where the boolean element is TRUE if the corresponding column is to be selected.
    • The names of the columns to be selected.
    • The positions of the columns to be selected.
  2. Extraction: The output from step 1 is then passed to the selection operator select() to extract the selected columns from the original data frame.
    • If the output of step 1 is column names or positions, we can pass it to select() directly.
    • If the output of step 1 is a boolean result per column (that is, a predicate function returning TRUE or FALSE for each column), the function needs to be wrapped inside where(), i.e. select(where(bool_func)).
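The two steps above can be sketched as follows. This is an illustrative example rather than a definitive recipe: the data frame is hypothetical, and wrapping the stored names and positions in all_of() is a choice made here to keep the selection unambiguous when the vector lives outside the data frame.

```r
library(dplyr)

# Hypothetical data frame used only for illustration
df <- tibble(
  id    = 1:3,
  name  = c("x", "y", "z"),
  score = c(0.1, 0.2, 0.3)
)

# Step 1 (identification): apply a check to every column
is_num    <- sapply(df, is.numeric)   # logical vector, one element per column
col_names <- names(df)[is_num]        # names of the columns to select
col_pos   <- unname(which(is_num))    # positions of the columns to select

# Step 2 (extraction): pass the result of step 1 to select()
by_names <- df %>% select(all_of(col_names))  # from column names
by_pos   <- df %>% select(all_of(col_pos))    # from column positions
by_where <- df %>% select(where(is.numeric))  # from a predicate, via where()
```

All three calls return the same two columns, `id` and `score`; they differ only in how the identification result is expressed.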