Selecting by Data Criteria

We wish to select columns whose data meets certain criteria.

Oftentimes we wish to select columns whose data satisfies certain conditions e.g. the percentage of missing values is below 10%.

Essentially, we apply a predicate operation (i.e. an operation that returns TRUE or FALSE) to each column of the data frame and select the Columns for which the function evaluates to TRUE. There are two common scenarios:

  1. Data Type Agnostic: The predicate function we are applying is valid for all data types e.g. checking for missing values.
  2. Data Type Dependent: The predicate function we are applying is only valid for particular data types. In this case, we need to first select columns of the appropriate data type before we run the predicate function on each of those columns e.g. checking that the mean is greater than a certain value is only valid for numeric and logical data types.

For a detailed coverage of data type specific operations, see the respective data type operations section.

Data Type Agnostic

We wish to select columns whose data satisfies a certain condition where the condition is applicable to all data types.

In this example, we wish to return columns where the percentage of missing values to the total number of values is less than 10%.

df_2 = df %>% select(where(~ mean(is.na(.)) < 0.1))

Here is how this works:

  • select() has a where() helper which allows us to pass a function that is applied to one column at a time and should return TRUE or FALSE.
  • We use an anonymous function ~ to express the logic ~ mean(is.na(.)) < 0.1 which acts on each column (referred to by the dot .) and checks whether the proportion of NA values mean(is.na(.)) is less than 0.1 (10%).
  • where() maps TRUE values returned by the anonymous function (~) to their corresponding column names which are then extracted from the data frame df by select().

Alternatively

We can use a function instead of an anonymous function

col_select_fun <- function(col) {
  col %>% is.na() %>% mean() < 0.1
}

df_2 = df %>% select(where(col_select_fun))

Here is how this works:

  • We isolate the logic into a function (here called col_select_fun()) that we call from where() inside select().
  • An anonymous function ~ is a good fit when the logic is simple, e.g. the simple one liner above. In case more elaborate logic is involved, it is better for feasibility, reliability, and readability to isolate the logic into a separate function.

Data Type Dependent

We wish to select columns whose data satisfies a certain condition where the condition is applicable only to columns of particular data types. In this scenario, we need to first select the columns of the appropriate data type(s) and then apply the function to those.

In this example, we wish to return numeric columns where the mean is less than 1.

df_2 = df %>% 
  select(where(~ is.numeric(.x) && mean(.x) > 1))

Here is how this works:

  • select() has a where() helper which allows us to pass a function that is applied to one column at a time and should return TRUE or FALSE.
  • We first apply is.numeric() to each column. The output is TRUE for columns of a numeric data type and FALSE otherwise.
  • We use the logical and operator && which evaluates the left hand side first and only examines the right hand side if the left hand side is TRUE. Therefore, only for columns for which is.numeric() evaluates to TRUE does the mean(.x) > 1 operation get computed.
  • We use an anonymous function ~ to express the logic which acts on each column (referred to by the dot .) and first checks if a columns is of a numeric data type and if so (&&), checks whether the mean is less than 1.
  • where() maps TRUE values returned by the anonymous function (~) to their corresponding column names which are then extracted from the data frame df by select().

Alternatively

We can use a function instead of an anonymous function

col_select_fun <- function(col) {
  if (!is.numeric(col)) return(FALSE)
  mean(col, na.rm = TRUE) > 1
}

df %>% select(where(col_select_fun))

Here is how this works:

  • We isolate the logic into a function (here called col_select_fun()) that we call from where() inside select().
  • An anonymous function ~ is a good fit when the logic is simple, e.g. the simple one liner above. In case more elaborate logic is involved, it is better for feasibility, reliability, and readability to isolate the logic into a separate function.
R
I/O