We wish to select columns whose data meets certain criteria.
Oftentimes we wish to select columns whose data satisfies certain conditions e.g. the percentage of missing values is below 10%.
Essentially, we apply a predicate operation (i.e. an operation that returns TRUE
or FALSE
) to each column of the data frame and select the Columns for which the function evaluates to TRUE
. There are two common scenarios:
For a detailed coverage of data type specific operations, see the respective data type operations section.
We wish to select columns whose data satisfies a certain condition where the condition is applicable to all data types.
In this example, we wish to return columns where the percentage of missing values to the total number of values is less than 10%.
df_2 = df %>% select(where(~ mean(is.na(.)) < 0.1))
Here is how this works:
select()
has a where()
helper which allows us to pass a function that is applied to one column at a time and should return TRUE
or FALSE
.~
to express the logic ~ mean(is.na(.)) < 0.1
which acts on each column (referred to by the dot .
) and checks whether the proportion of NA
values mean(is.na(.))
is less than 0.1 (10%).where()
maps TRUE
values returned by the anonymous function (~
) to their corresponding column names which are then extracted from the data frame df
by select()
.Alternatively
We can use a function instead of an anonymous function
col_select_fun <- function(col) {
col %>% is.na() %>% mean() < 0.1
}
df_2 = df %>% select(where(col_select_fun))
Here is how this works:
col_select_fun()
) that we call from where()
inside select()
.~
is a good fit when the logic is simple, e.g. the simple one liner above. In case more elaborate logic is involved, it is better for feasibility, reliability, and readability to isolate the logic into a separate function.We wish to select columns whose data satisfies a certain condition where the condition is applicable only to columns of particular data types. In this scenario, we need to first select the columns of the appropriate data type(s) and then apply the function to those.
In this example, we wish to return numeric columns where the mean is less than 1.
df_2 = df %>%
select(where(~ is.numeric(.x) && mean(.x) > 1))
Here is how this works:
select()
has a where()
helper which allows us to pass a function that is applied to one column at a time and should return TRUE
or FALSE
.is.numeric()
to each column. The output is TRUE
for columns of a numeric data type and FALSE
otherwise.&&
which evaluates the left hand side first and only examines the right hand side if the left hand side is TRUE
. Therefore, only for columns for which is.numeric()
evaluates to TRUE
does the mean(.x) > 1
operation get computed.~
to express the logic which acts on each column (referred to by the dot .
) and first checks if a columns is of a numeric data type and if so (&&
), checks whether the mean is less than 1.where()
maps TRUE
values returned by the anonymous function (~
) to their corresponding column names which are then extracted from the data frame df
by select()
.Alternatively
We can use a function instead of an anonymous function
col_select_fun <- function(col) {
if (!is.numeric(col)) return(FALSE)
mean(col, na.rm = TRUE) > 1
}
df %>% select(where(col_select_fun))
Here is how this works:
col_select_fun()
) that we call from where()
inside select()
.~
is a good fit when the logic is simple, e.g. the simple one liner above. In case more elaborate logic is involved, it is better for feasibility, reliability, and readability to isolate the logic into a separate function.