We wish to select columns whose data meets certain criteria.

Oftentimes we wish to select columns whose data satisfies certain conditions e.g. the percentage of missing values is below 10%.

Essentially, we apply a predicate operation (i.e. an operation that returns `TRUE`

or `FALSE`

) to each column of the data frame and select the Columns for which the function evaluates to `TRUE`

. There are two common scenarios:

**Data Type Agnostic**: The predicate function we are applying is valid for all data types e.g. checking for missing values.**Data Type Dependent**: The predicate function we are applying is only valid for particular data types. In this case, we need to first select columns of the appropriate data type before we run the predicate function on each of those columns e.g. checking that the mean is greater than a certain value is only valid for numeric and logical data types.

For a detailed coverage of data type specific operations, see the respective data type operations section.

We wish to select columns whose data satisfies a certain condition where the condition is applicable to all data types.

In this example, we wish to return columns where the percentage of missing values to the total number of values is less than 10%.

```
df_2 = df %>% select(where(~ mean(is.na(.)) < 0.1))
```

Here is how this works:

`select()`

has a`where()`

helper which allows us to pass a function that is applied to one column at a time and should return`TRUE`

or`FALSE`

.- We use an anonymous function
`~`

to express the logic`~ mean(is.na(.)) < 0.1`

which acts on each column (referred to by the dot`.`

) and checks whether the proportion of`NA`

values`mean(is.na(.))`

is less than 0.1 (10%). `where()`

maps`TRUE`

values returned by the anonymous function (`~`

) to their corresponding column names which are then extracted from the data frame`df`

by`select()`

.

*Alternatively*

We can use a function instead of an anonymous function

```
col_select_fun <- function(col) {
col %>% is.na() %>% mean() < 0.1
}
df_2 = df %>% select(where(col_select_fun))
```

Here is how this works:

- We isolate the logic into a function (here called
`col_select_fun()`

) that we call from`where()`

inside`select()`

. - An anonymous function
`~`

is a good fit when the logic is simple, e.g. the simple one liner above. In case more elaborate logic is involved, it is better for feasibility, reliability, and readability to isolate the logic into a separate function.

We wish to select columns whose data satisfies a certain condition where the condition is applicable only to columns of particular data types. In this scenario, we need to first select the columns of the appropriate data type(s) and then apply the function to those.

In this example, we wish to return numeric columns where the mean is less than 1.

```
df_2 = df %>%
select(where(~ is.numeric(.x) && mean(.x) > 1))
```

Here is how this works:

`select()`

has a`where()`

helper which allows us to pass a function that is applied to one column at a time and should return`TRUE`

or`FALSE`

.- We first apply
`is.numeric()`

to each column. The output is`TRUE`

for columns of a numeric data type and`FALSE`

otherwise. - We use the logical and operator
`&&`

which evaluates the left hand side first and only examines the right hand side if the left hand side is`TRUE`

. Therefore, only for columns for which`is.numeric()`

evaluates to`TRUE`

does the`mean(.x) > 1`

operation get computed. - We use an anonymous function
`~`

to express the logic which acts on each column (referred to by the dot`.`

) and first checks if a columns is of a numeric data type and if so (`&&`

), checks whether the mean is less than 1. `where()`

maps`TRUE`

values returned by the anonymous function (`~`

) to their corresponding column names which are then extracted from the data frame`df`

by`select()`

.

*Alternatively*

We can use a function instead of an anonymous function

```
col_select_fun <- function(col) {
if (!is.numeric(col)) return(FALSE)
mean(col, na.rm = TRUE) > 1
}
df %>% select(where(col_select_fun))
```

Here is how this works:

- We isolate the logic into a function (here called
`col_select_fun()`

) that we call from`where()`

inside`select()`

. - An anonymous function
`~`

is a good fit when the logic is simple, e.g. the simple one liner above. In case more elaborate logic is involved, it is better for feasibility, reliability, and readability to isolate the logic into a separate function.

R