We wish to select columns that meet a logical combination of multiple conditions.
In Selecting by Name Pattern, Selecting by Data Type, and Selecting by Data Criteria, we covered the three most common implicit column selection patterns. In this section we cover how to combine multiple logical conditions to fulfill more elaborate column selection scenarios.
We will look at four approaches for selecting columns by multiple conditions that together cover a wide range of possible situations.
select()
.tidyverse
. If the operations involved each returns a vector of column names, we can combine them together via set operations (intersection, union, difference). For example, say we have three sets of column names A, B, and C and we wish to select columns whose names are in set A or set B but not in set C. select()
operations in sequence.If we wish to select columns that match one of multiple conditions, we can pass those conditions as multiple arguments to select()
.
In this example, we wish to select any column whose name starts with the string ‘is_’
, or whose data type is logical, or whose values take two unique values or less.
df_2 = df %>% select(
starts_with('is_'),
where(is.logical),
where(~ n_distinct(.) < 3)
)
Here is how this works:
select()
accepts multiple comma separated conditions. It evaluates each condition then takes the union of the results.select()
three column selection expressions:starts_with('is_')
returns any column whose name starts with ‘is_’
where(is.logical)
returns any column of a logical data typewhere(~ n_distinct(.) < 3)
returns any column whose values take less than 3 unique valuesWe wish to select columns that match a certain logical combination of conditions.
In this example, we wish to select columns that match any one or more of three criteria which are: (1) Has the name timestamp
or latlong
. (2) Has a name that ends with _id
. (3) Is of a numeric data type and has 10% or more missing value. This is a plausible scenario in a data cleaning context where we wish to extract columns of a given type and to keep columns that allow us to create meaningful groups for inspection.
df_2 = df %>%
select((where(is.numeric)
& where(~mean(is.na(.x)) > 0.1)
& ! starts_with("old_"))
| ends_with('_id')
| c('timestamp', 'latlong'))
Here is how this works:
&
, |
and !
operators used in the code above may look like logical operators, they really are set operations between sets of column identifiers (names).&
denotes an intersection operation.|
denotes a union operation and& !
denotes a difference operation (in a
and not in b
is a
difference b
).where()
converts boolean vectors into a set of column names to which set operations can then be applied.|
operators with commas since select()
automatically takes the union of all conditions passed to it.tidyverse
.We wish to select columns that match multiple column selection conditions i.e. all selected columns must match all conditions. One approach is to execute a series of selection operations. This approach is useful if some of the selection operations are isolated in standalone functions that accept a data frame and return another data frame that has a subset of the columns of the original data frame.
In this example, we wish to select columns that are of a numeric data type (integer or float) and whose name starts with the string ‘col_’
.
df_2 = df %>%
select(where(is.numeric)) %>%
select(starts_with('col_1_'))
Here is how this works:
select()
operations.select()
statement extracts columns of the data frame df
that are of a numeric data type.select()
statement acts on the output of the first select statement and of the columns of a numeric data type extracts those whose name starts with the prefix ‘col_1_’
.We wish to apply a complex logical expression of multiple conditions to each column individually and return any columns for which the expression evaluates to TRUE
.
In this example, we wish to select columns that are likely to hold boolean (True
or False
) data even if those columns are not assigned a boolean data type. To do so we decided to select string columns whose values take less than 5 possible values or that contain words that are likely to denote a logical type (e.g. Yes or No). We wish to exclude any column where more than 10% of the values are missing and finally we wish to include any columns that have the logical data type. This is a plausible scenario in a data cleaning context.
df_2 = df %>%
select(
where(~ (is.character(.x)
&& ((n_distinct(.x) > 5
| any(c('Yes', 'No', 'True', 'False') %in% .x)))
& ! (mean(is.na(.x)) > 0.1))
| is.logical(.x)))
Here is how this works:
TRUE
if the column matches the condition or FALSE
otherwise.&
, |
, !
) to those boolean scalar values (scalar transformation).&&
so that if the left hand side condition is.character(.x) evaluates to FALSE, the right hand side (remainder of the logic) is not evaluatedwhere()
helper of select()
.where()
does not pass the column name to the function).