Selecting by Multiple Conditions

We wish to select columns that meet a logical combination of multiple conditions.

In Selecting by Name Pattern, Selecting by Data Type, and Selecting by Data Criteria, we covered the three most common implicit column selection patterns. In this section we cover how to combine multiple logical conditions to fulfill more elaborate column selection scenarios.

We will look at four approaches for selecting columns by multiple conditions that together cover a wide range of possible situations.

  1. Multiple Arguments: If we wish to select columns that match one of multiple conditions, we can pass those conditions as multiple arguments to select().
  2. Set Operations: This is the natural approach in the tidyverse. If the operations involved each returns a vector of column names, we can combine them together via set operations (intersection, union, difference). For example, say we have three sets of column names A, B, and C and we wish to select columns whose names are in set A or set B but not in set C.
  3. Sequential Application: To obtain the AND of multiple conditions, we can apply multiple select() operations in sequence.
  4. Boolean Logic: In the above approaches, each condition is evaluated over the entire data frame and then the sets of columns resulting from each are combined via set operations. In this approach, the logic combination happens for each column individually by applying boolean logic at the column level.

Multiple Arguments

If we wish to select columns that match one of multiple conditions, we can pass those conditions as multiple arguments to select().

In this example, we wish to select any column whose name starts with the string ‘is_’, or whose data type is logical, or whose values take two unique values or less.

df_2 = df %>% select(
  starts_with('is_'), 
  where(is.logical),
  where(~ n_distinct(.) < 3)
  )

Here is how this works:

  • select() accepts multiple comma separated conditions. It evaluates each condition then takes the union of the results.
  • We pass to select() three column selection expressions:
    • starts_with('is_') returns any column whose name starts with ‘is_’
    • where(is.logical) returns any column of a logical data type
    • where(~ n_distinct(.) < 3) returns any column whose values take less than 3 unique values
  • The returned columns are arranged following the conditions that generated them. If a particular column satisfies multiple conditions, it is arranged with the first condition that generated it.

Set Operations

We wish to select columns that match a certain logical combination of conditions.

In this example, we wish to select columns that match any one or more of three criteria which are: (1) Has the name timestamp or latlong. (2) Has a name that ends with _id. (3) Is of a numeric data type and has 10% or more missing value. This is a plausible scenario in a data cleaning context where we wish to extract columns of a given type and to keep columns that allow us to create meaningful groups for inspection.

df_2 = df %>% 
  select((where(is.numeric) 
         & where(~mean(is.na(.x)) > 0.1) 
         & ! starts_with("old_"))
         | ends_with('_id')
                 | c('timestamp', 'latlong'))

Here is how this works:

  • While the &, | and ! operators used in the code above may look like logical operators, they really are set operations between sets of column identifiers (names).
    • & denotes an intersection operation.
    • | denotes a union operation and
    • & ! denotes a difference operation (in a and not in b is a difference b).
  • where() converts boolean vectors into a set of column names to which set operations can then be applied.
  • We can replace the last two | operators with commas since select() automatically takes the union of all conditions passed to it.
  • Combining conditions via set operations is the “natural” approach in the tidyverse.

Sequential Application

We wish to select columns that match multiple column selection conditions i.e. all selected columns must match all conditions. One approach is to execute a series of selection operations. This approach is useful if some of the selection operations are isolated in standalone functions that accept a data frame and return another data frame that has a subset of the columns of the original data frame.

In this example, we wish to select columns that are of a numeric data type (integer or float) and whose name starts with the string ‘col_’.

df_2 = df %>%
  select(where(is.numeric)) %>%
  select(starts_with('col_1_'))

Here is how this works:

  • We obtain the same output as above by applying the individual conditions to be ANDed as a sequence of select() operations.
  • The first select() statement extracts columns of the data frame df that are of a numeric data type.
  • The second select() statement acts on the output of the first select statement and of the columns of a numeric data type extracts those whose name starts with the prefix ‘col_1_’ .

Logical Operations

We wish to apply a complex logical expression of multiple conditions to each column individually and return any columns for which the expression evaluates to TRUE.

In this example, we wish to select columns that are likely to hold boolean (True or False) data even if those columns are not assigned a boolean data type. To do so we decided to select string columns whose values take less than 5 possible values or that contain words that are likely to denote a logical type (e.g. Yes or No). We wish to exclude any column where more than 10% of the values are missing and finally we wish to include any columns that have the logical data type. This is a plausible scenario in a data cleaning context.

df_2 = df %>% 
  select(
    where(~ (is.character(.x) 
          && ((n_distinct(.x) > 5 
                        | any(c('Yes', 'No', 'True', 'False') %in% .x)))
          & ! (mean(is.na(.x)) > 0.1))
          | is.logical(.x)))

Here is how this works:

  • Each selection condition is applied to one column at a time retuning a scalar boolean value that is TRUE if the column matches the condition or FALSE otherwise.
  • To combine the logical conditions, we apply boolean logic operations (&, |, !) to those boolean scalar values (scalar transformation).
  • We used && so that if the left hand side condition is.character(.x) evaluates to FALSE, the right hand side (remainder of the logic) is not evaluated
  • The entire logic is constructed as a function that is called by the where() helper of select().
  • The boolean logic that we compose needs to be: (1) Applied to each column individually. (2) Encapsulated into one function (or anonymous function), and (3) The conditions apply to column content and data type but not column name (where() does not pass the column name to the function).
R
I/O