Dynamic Column Specification

We wish to specify dynamically the columns to which filtering logic would be applied.

We will cover two scenarios:

  1. As Data Variable where we cover how to pass column names as data variables from a tidyverse chain to a function.
  2. As String Variable where we cover how to parse column names expressed as string variables and use the corresponding columns in filtering expressions.

As Data Variable

One Column to One Argument

We wish to pass one column to a function as a data variable which would apply a logical expression to that column to filter rows of a data frame.

In this example, we wish to create a custom row filtering function which takes a data frame and one column referred to via a data variable. The function returns the rows of the data frame where the value of the column var is greater than zero.

m_filter <- function(.df, var) {  
  .df %>%    
    filter({{ var }} > 0)
}

df_2 = df %>% 
    m_filter(col_1)

Here is how this works:

  • We have a custom function m_filter() to which we pass the data frame to be filtered df via the pipe operator %>% and the column to use for filtering col_1.
  • We use the {{ }} operator to "tunnel" the data variable col_1 through the function argument var.

N Columns to N Arguments

We wish to pass a fixed number of columns as data variables to a function; each to a separate argument of the function. The function would apply one or more logical expression to those columns to filter rows of a data frame.

In this example, we wish to create a custom row filtering function which takes a data frame and two columns referred to via data variables. The function returns the rows of the data frame where the value of the first column var_1 is not NA and the value of the second column var_2 is greater than zero.

m_filter <- function(.df, var_1, var_2) {  
  .df %>%    
    filter(!is.na({{ var_1 }}), 
           {{ var_2 }} > 0)
}

df_2 = df %>% 
  m_filter(col_1, col_2)

Here is how this works:

  • This works similarly to the “One Column” scenario except that we pass N columns where N here is 2.
  • We use the {{ }} operator to "tunnel" each of the data variables through their respective function argument (col_1 through var_1 and col_2 through var_2).

N Columns to One Argument

We wish to pass an arbitrary number of columns (i.e. the number of columns is not known in advance) to a function wherein filtering the rows of the data frame takes place.

In this example, we wish to create a custom row filtering function which takes a data frame and any number of columns referred to via data variables. The function returns the rows of the data frame where any of the columns is NA.

m_filter <-function(.data, vars){
  .data %>%    
    filter(if_any({{ vars }}, is.na))
}

df_2 = df %>%
  m_filter(c(col_1, col_2, col_3))

Here is how this works:

  • We use the {{ }} operator to "tunnel" a vector or list of data variables through a function argument to if_any().
  • if_any() applies the given function (here is.na()) to each of the selected columns and returns TRUE if the function returns TRUE for any of the columns (See Implicit Filtering).

As String Variable

The names of the column that we wish to use for filtering a data frame are specified as strings (which can be environment variables or function arguments).

One Column

We wish to specify the name of one column as a string variable and then use the string variable to refer to the corresponding column in a filtering expression.

In this example, we wish to return the rows of the data frame df where the column whose name is stored as a string in the variable col has a value greater than zero.

col = 'col_1'

df_2 = df %>%
  filter(.data[[col]] > 0)

Here is how this works:

  • In the tidyverse, the recommended approach to select columns whose name is expressed as a string variable is via the .data construct.
  • While it may seem like one, .data is not a data frame. Rather, it’s a special construct, referred to as a pronoun in tidyverse lingo, that allows us to access the columns of the current data frame (the data variables) via string variables like we do in .data[[col]].
  • .data[[col]] returns the column col_1 so filter(.data[[col]] > 0) is equivalent to filter(col_1 > 0) where col = ‘col_1’.

Multiple Columns: Explicit Application

We wish to specify the names of multiple columns as a string vector. We wish to use that string vector to refer to the corresponding columns in a filtering expression.

cols = c('col_1', 'col_2', 'col_3')

df_2 = df %>%
  filter(!is.na(.data[[cols[[1]]]]),
         .data[[cols[[2]]]] < .data[[cols[[3]]]])

Here is how this works:

  • We use the .data pronoun to select columns whose names are expressed as a string variable as described in the “One Column” scenario above.
  • We refer to each column by indexing the corresponding column name in the cols vector. For instance .data[[cols[[1]]]] returns the column col_1 since the first entry in the vector cols is ‘col_1’.

Multiple Columns: Implicit Application

We wish to specify the names of multiple columns as a string vector. We wish to apply the same predicate function to each of the columns and select the rows where each returns TRUE.

In this example, we wish to return rows of the data frame df where all the columns whose names are in the string vector cols have a value of NA.

cols = c('col_1', 'col_3', 'col_5', 'col_x')

df_2 = df %>%
  filter(if_all(any_of(cols), is.na))

Here is how this works:

  • If we wish to apply the same filtering logic to all columns, a more succinct approach may be to use if_all(). In this alternative solution we wish to apply the same predicate function is.na() to each of the selected columns and select the rows where each returns TRUE.
  • In if_all(any_of(cols), is.na), if_all() takes a selection of columns and a function is.na() and applies the function to each of the selected columns then returns TRUE if the function returns TRUE for all the columns. See Implicit Filtering for a detailed description.
  • any_of() returns any columns of the current data frame whose names are in the given string vector. any_of() and its sibling all_of() are the recommended approach in the tidyverse to select multiple columns via a vector of string column names. See Dynamic Selection.
R
I/O