We wish to specify dynamically the columns to which filtering logic would be applied.
We will cover two scenarios:
tidyverse
chain to a function. One Column to One Argument
We wish to pass one column to a function as a data variable which would apply a logical expression to that column to filter rows of a data frame.
In this example, we wish to create a custom row filtering function which takes a data frame and one column referred to via a data variable. The function returns the rows of the data frame where the value of the column var
is greater than zero.
m_filter <- function(.df, var) {
.df %>%
filter({{ var }} > 0)
}
df_2 = df %>%
m_filter(col_1)
Here is how this works:
m_filter()
to which we pass the data frame to be filtered df
via the pipe operator %>%
and the column to use for filtering col_1
.{{ }}
operator to "tunnel" the data variable col_1
through the function argument var
.N Columns to N Arguments
We wish to pass a fixed number of columns as data variables to a function; each to a separate argument of the function. The function would apply one or more logical expression to those columns to filter rows of a data frame.
In this example, we wish to create a custom row filtering function which takes a data frame and two columns referred to via data variables. The function returns the rows of the data frame where the value of the first column var_1
is not NA
and the value of the second column var_2
is greater than zero.
m_filter <- function(.df, var_1, var_2) {
.df %>%
filter(!is.na({{ var_1 }}),
{{ var_2 }} > 0)
}
df_2 = df %>%
m_filter(col_1, col_2)
Here is how this works:
{{ }}
operator to "tunnel" each of the data variables through their respective function argument (col_1
through var_1
and col_2
through var_2
).N Columns to One Argument
We wish to pass an arbitrary number of columns (i.e. the number of columns is not known in advance) to a function wherein filtering the rows of the data frame takes place.
In this example, we wish to create a custom row filtering function which takes a data frame and any number of columns referred to via data variables. The function returns the rows of the data frame where any of the columns is NA
.
m_filter <-function(.data, vars){
.data %>%
filter(if_any({{ vars }}, is.na))
}
df_2 = df %>%
m_filter(c(col_1, col_2, col_3))
Here is how this works:
{{ }}
operator to "tunnel" a vector or list of data variables through a function argument to if_any()
.if_any()
applies the given function (here is.na()
) to each of the selected columns and returns TRUE
if the function returns TRUE
for any of the columns (See Implicit Filtering).The names of the column that we wish to use for filtering a data frame are specified as strings (which can be environment variables or function arguments).
One Column
We wish to specify the name of one column as a string variable and then use the string variable to refer to the corresponding column in a filtering expression.
In this example, we wish to return the rows of the data frame df
where the column whose name is stored as a string in the variable col
has a value greater than zero.
col = 'col_1'
df_2 = df %>%
filter(.data[[col]] > 0)
Here is how this works:
tidyverse
, the recommended approach to select columns whose name is expressed as a string variable is via the .data
construct..data
is not a data frame. Rather, it’s a special construct, referred to as a pronoun in tidyverse
lingo, that allows us to access the columns of the current data frame (the data variables) via string variables like we do in .data[[col]]
..data[[col]]
returns the column col_1
so filter(.data[[col]] > 0)
is equivalent to filter(col_1 > 0)
where col = ‘col_1’
.Multiple Columns: Explicit Application
We wish to specify the names of multiple columns as a string vector. We wish to use that string vector to refer to the corresponding columns in a filtering expression.
cols = c('col_1', 'col_2', 'col_3')
df_2 = df %>%
filter(!is.na(.data[[cols[[1]]]]),
.data[[cols[[2]]]] < .data[[cols[[3]]]])
Here is how this works:
.data
pronoun to select columns whose names are expressed as a string variable as described in the “One Column” scenario above..data[[cols[[1]]]]
returns the column col_1
since the first entry in the vector cols is ‘col_1’
.Multiple Columns: Implicit Application
We wish to specify the names of multiple columns as a string vector. We wish to apply the same predicate function to each of the columns and select the rows where each returns TRUE
.
In this example, we wish to return rows of the data frame df
where all the columns whose names are in the string vector cols
have a value of NA.
cols = c('col_1', 'col_3', 'col_5', 'col_x')
df_2 = df %>%
filter(if_all(any_of(cols), is.na))
Here is how this works:
if_all()
. In this alternative solution we wish to apply the same predicate function is.na()
to each of the selected columns and select the rows where each returns TRUE
.if_all(any_of(cols), is.na)
, if_all()
takes a selection of columns and a function is.na()
and applies the function to each of the selected columns then returns TRUE
if the function returns TRUE
for all the columns. See Implicit Filtering for a detailed description.any_of()
returns any columns of the current data frame whose names are in the given string vector. any_of()
and its sibling all_of()
are the recommended approach in the tidyverse
to select multiple columns via a vector of string column names. See Dynamic Selection.