Function Specification

We wish to specify one or more logical expression(s) or predicate function(s) (functions that return TRUE or FALSE) to apply to each of the selected columns in an implicit filtering context.

In this section, we cover the following function specification scenarios:

  • A named function, which may be a built-in function or a custom function.
  • An anonymous function defined via ~.
  • Multiple functions, each of which is applied separately to each of the selected columns.
  • A non-vectorized function, i.e. one that acts on one row at a time.

This section is complemented by

  • Column Selection, where we cover how to select the columns to which the filtering logic is applied.
  • Relationship Specification, where we cover how to combine the results of applying the specified function(s) to the specified column(s) in either an AND manner or an OR manner.

Named Function

We wish to filter rows of a data frame by applying a named predicate function to each of a selected set of columns and then taking a logical combination of the results.

In this example, we wish to filter the rows of the data frame df for which the value of any column whose name contains the string ‘cvr’ is missing (NA).

df_2 = df %>% 
    filter(if_any(contains('cvr'), is.na))

Here is how this works:

  • We pass the data frame df to the function filter().
  • Inside filter() we use if_any() to specify our implicit filtering logic as follows
    • The first argument to if_any() is a selection of columns. In this example, we use contains('cvr') to select any column whose name contains the substring ‘cvr’.
    • The second argument to if_any() is the logical expression that we wish to apply to every column selected in the first argument. In this case the logical expression we wish to apply is the named function is.na() which returns TRUE if a value is NA.
  • The named function is.na (the second argument to if_any()) is applied to each column selected via contains('cvr') (the first argument to if_any()) and the results are combined via an OR operation i.e. a row is retained (included in the output) if its value is NA for any of the columns.
  • The counterpart of if_any() is if_all(), which requires that the logical expression evaluates to TRUE for all selected columns. See Relationship Specification.
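
To make the steps above concrete, here is a minimal sketch on a small made-up data frame. The column names, the values, and the custom predicate is_low() are assumptions chosen for illustration; only filter(), if_any(), if_all(), contains() and is.na() come from the text above.

```r
library(dplyr)

# Hypothetical data: two columns whose names contain 'cvr', one that does not.
df <- tibble(
  id    = 1:5,
  cvr_1 = c(0.10, NA,   0.30, 0.05, NA),
  cvr_2 = c(0.20, 0.40, NA,   0.02, NA),
  spend = c(NA,   10,   20,   30,   40)
)

# Built-in named predicate: keep rows with an NA in ANY 'cvr' column.
# The NA in 'spend' (row 1) is ignored because 'spend' is not selected.
df_any_na <- df %>%
  filter(if_any(contains('cvr'), is.na))

# Same predicate with if_all(): keep rows where ALL 'cvr' columns are NA.
df_all_na <- df %>%
  filter(if_all(contains('cvr'), is.na))

# Custom named predicate (hypothetical helper): TRUE where a value is below 0.1.
is_low <- function(x) x < 0.1

df_low <- df %>%
  filter(if_any(contains('cvr'), is_low))
```

With this data, df_any_na holds rows 2, 3 and 5, df_all_na holds only row 5, and df_low holds only row 4. Note that the function is passed by name, without parentheses.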

Anonymous Function

We wish to filter rows of a data frame by applying an anonymous predicate function to each of a selected set of columns and then taking a logical combination of the results.

In this example, we wish to filter the rows of the data frame df for which the value of any column whose name contains the string ‘cvr’ is less than 0.1.

df_2 = df %>% 
    filter(if_any(contains('cvr'), ~ .x < 0.1))

Here is how this works:

  • We use if_any() inside filter() to carry out implicit filtering as described in the “Named Function” scenario above.
  • We pass the anonymous function ~ .x < 0.1 as the second argument to if_any().
  • The anonymous function ~ .x < 0.1 (the second argument to if_any()) is applied to each column selected via contains('cvr') (the first argument to if_any()) and the results are combined via an OR operation i.e. a row is retained (included in the output) if its value is less than 0.1 for any of the columns.
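
The sketch below illustrates this on a small made-up data frame (the data is an assumption for illustration). It also shows that the formula shorthand ~ .x < 0.1 can be written as a regular anonymous function, and how missing values interact with the OR combination.

```r
library(dplyr)

# Hypothetical data with some missing values.
df <- tibble(
  id    = 1:4,
  cvr_1 = c(0.05, 0.50, NA,   NA),
  cvr_2 = c(0.50, 0.50, 0.05, 0.50)
)

# Formula shorthand: .x stands for the current column.
df_formula <- df %>%
  filter(if_any(contains('cvr'), ~ .x < 0.1))

# Equivalent regular anonymous function.
df_function <- df %>%
  filter(if_any(contains('cvr'), function(x) x < 0.1))
```

Both calls return rows 1 and 3. Row 3 is kept because NA | TRUE is TRUE, while row 4 is dropped because NA | FALSE is NA and filter() drops rows whose condition evaluates to NA.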

Multiple Functions

We wish to filter rows of a data frame by applying multiple predicate functions to each of a selected set of columns and then taking a logical combination of the results.

In this example, we wish to filter the rows of the data frame df for which the value of any column whose name contains the string ‘cvr’ is missing (NA) or infinite (Inf).

df_2 = df %>% 
  filter(if_any(contains('cvr'), 
                list(is.na, is.infinite)))

Here is how this works:

  • We use if_any() inside filter() to carry out implicit filtering as described in the “Named Function” scenario above.
  • We pass the functions we wish to apply as a list to the second argument of if_any() which in this case is list(is.na, is.infinite).
  • Each function in list(is.na, is.infinite) (the second argument to if_any()) is applied to each column selected via contains('cvr') (the first argument to if_any()) and all of the results are combined via an OR operation, i.e. a row is retained (included in the output) if its value is NA or infinite (Inf or -Inf) for any of the columns.
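
As a rough check of this behavior, the sketch below runs the list form on a small made-up data frame (the data is an assumption for illustration) and compares it with a single anonymous function that combines both predicates explicitly.

```r
library(dplyr)

# Hypothetical data containing NA, Inf and -Inf.
df <- tibble(
  id    = 1:4,
  cvr_1 = c(0.1, NA,  Inf, 0.3),
  cvr_2 = c(0.2, 0.4, 0.5, -Inf)
)

# List form: every function is applied to every selected column.
df_list <- df %>%
  filter(if_any(contains('cvr'),
                list(is.na, is.infinite)))

# Single anonymous function that ORs the two predicates itself.
df_single <- df %>%
  filter(if_any(contains('cvr'), ~ is.na(.x) | is.infinite(.x)))
```

Both forms should return rows 2, 3 and 4. The single-function form can be handy when the predicates need to be combined differently within a column, for example with & instead of |.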

Non Vectorized Function

We wish to filter rows of a data frame by a logical expression that involves applying a non-vectorized function (i.e. one that acts on one row at a time) to a set of selected columns.

In this example, we wish to filter the rows of the data frame df for which the mean of the values of the columns, whose names contain the string ‘cvr’, is less than 0.1.

df_2 = df %>% 
  rowwise() %>% 
  filter(mean(c_across(contains('cvr'))) < 0.1)

Here is how this works:

  • rowwise() switches the mode of execution of the operations that follow from column-wise to row-wise, which allows us to apply a non-vectorized function one row at a time.
  • Because of rowwise(), the expression inside filter() will be applied one row at a time (instead of the usual execution on entire columns).
  • c_across() works with rowwise() to make it possible to select the columns on which to perform row-wise operations.
  • All column selection techniques covered in Column Selection can be used inside c_across() (just like we did inside of if_all() or if_any()).
  • In c_across(contains('cvr')), we use c_across() to select all columns whose name contains the string ‘cvr’.
  • The values of the selected columns are passed one row at a time to mean(). The output of computing the mean() is compared to 0.1 and the row is retained (included in the output) if the mean value is less than 0.1.
  • See Non Vectorized Transformation and Implicit Non Vectorized Transformation for deeper coverage of non-vectorized operations. All the scenarios covered there can also be applied to filtering.
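
The row-wise steps above can be sketched on a small made-up data frame as follows. The data is an assumption for illustration, and the trailing ungroup() is an optional addition that is not part of the original solution.

```r
library(dplyr)

# Hypothetical data: 'id' is not selected because its name lacks 'cvr'.
df <- tibble(
  id    = 1:3,
  cvr_1 = c(0.05, 0.20, 0.01),
  cvr_2 = c(0.10, 0.30, 0.02)
)

df_2 <- df %>%
  rowwise() %>%
  # Row means are 0.075, 0.25 and 0.015, so rows 1 and 3 pass the test.
  filter(mean(c_across(contains('cvr'))) < 0.1) %>%
  # Optional: drop the row-wise mode so later steps run column-wise again.
  ungroup()
```

Without ungroup(), the output of filter() is still a row-wise data frame, so any operation that follows would also be evaluated one row at a time.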