Column Selection

We wish to identify the columns to each of which we will apply filtering logic.

We will cover the following scenarios:

  • All Columns where we cover how to apply a logical expression to all columns of a data frame and return the rows for which any of the columns satisfy the expression.
  • Explicit Selection where we cover how to apply a logical expression to each of a set of explicitly selected columns of a data frame (e.g. by spelling out the names of the columns of interest) and return the rows for which any of the columns satisfy the expression.
  • Implicit Selection where we cover how to apply a logical expression to each of a set of implicitly selected columns of a data frame (e.g. by selecting columns whose names contain a certain substring) and return the rows for which any of the columns satisfy the expression.
  • Exclude Columns where we cover how to apply a logical expression to each column of a data frame but a set of excluded columns and return the rows for which any of the columns satisfy the expression.

This section is complemented by:

  • Function Specification where we cover how to specify one or more logical expressions or functions to apply to the selected set of columns.
  • Relationship Specification where we cover how to combine the results of applying the specified function(s) to the specified column(s) in either an AND manner or an OR manner.

All Columns

We wish to apply a logical expression to every column and to return any row for which any column satisfies that logical expression.

In this example, we wish to return any row in the data frame df for which any column has a missing value NA.

library(dplyr)  # provides filter(), if_any(), everything() and %>%

df_2 = df %>% 
  filter(if_any(everything(), is.na))

Here is how this works:

  • We pass the data frame df to the function filter().
  • Inside filter() we use if_any() to specify our implicit filtering logic as follows:
    • The first argument to if_any() is a selection of columns. In this example, we use everything() to select all columns.
    • The second argument to if_any() is the logical expression that we wish to apply to each column selected in the first argument. In this case the logical expression we wish to apply is the function is.na() which returns TRUE if a value is NA (see Missing Values).
  • The logical expression is.na (the second argument to if_any()) is applied to each column selected via everything() (the first argument to if_any()).
  • The resulting logical TRUE or FALSE values (one for each column) are combined via an OR operation (because we used if_any()), i.e. a row is retained (included in the output) if its value is NA in any of the columns.
  • The counterpart of if_any() is if_all(), which requires that the logical expression evaluates to TRUE for all columns. See Relationship Specification.
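
As a minimal sketch of the steps above, the following runs the same filter on a small made-up data frame and contrasts if_any() with if_all(). The data and the names df_any and df_all are assumptions introduced purely for illustration.

```r
library(dplyr)

# Hypothetical toy data (not from the original text).
df = data.frame(
  col_1 = c(1, NA, 3, NA),
  col_2 = c("a", "b", NA, NA),
  col_3 = c(TRUE, FALSE, TRUE, NA)
)

# OR across columns: keep rows where at least one column is NA.
df_any = df %>%
  filter(if_any(everything(), is.na))

# AND across columns: keep rows where every column is NA.
df_all = df %>%
  filter(if_all(everything(), is.na))

print(df_any)
print(df_all)
```

With this data, the first row has no missing values and is dropped by both filters, while only the last row (missing in every column) survives if_all().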

Explicit Selection

We wish to apply a logical expression to a set of explicitly specified columns and to return any row for which any of those columns satisfies the logical expression.

In this example, we wish to return any row in the data frame df for which any of the columns col_1, col_2 or col_4 has a missing value NA.

df_2 = df %>% 
  filter(if_any(c(col_1, col_2, col_4), is.na))

Here is how this works:

  • We use if_any() inside filter() to carry out implicit filtering as described in the “All Columns” scenario above.
  • In c(col_1, col_2, col_4) we identify the columns we wish to select by name. See Basic Selection for detailed coverage of explicit column selection scenarios.
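
The following sketch applies explicit selection to a made-up data frame. The data is an assumption for illustration. It also shows all_of(), a tidyselect helper that may be convenient when the column names are held in a character vector (here the hypothetical variable cols).

```r
library(dplyr)

# Hypothetical toy data; col_3 is deliberately full of NAs
# to show that unselected columns are ignored.
df = data.frame(
  col_1 = c(1, NA, 3, 4),
  col_2 = c("a", "b", "c", "d"),
  col_3 = c(NA, NA, NA, 1),
  col_4 = c(TRUE, FALSE, NA, TRUE)
)

# Columns spelled out by name.
df_2 = df %>%
  filter(if_any(c(col_1, col_2, col_4), is.na))

# Same selection, with the names stored in a character vector.
cols = c("col_1", "col_2", "col_4")
df_3 = df %>%
  filter(if_any(all_of(cols), is.na))

print(df_2)
```

Only the second and third rows are returned here, because the missing values in col_3 play no part in the filter.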

Implicit Selection

We wish to apply a logical expression to a set of implicitly specified columns and to return any row for which any of those columns satisfies that logical expression. Implicit column selection is when we do not spell out the column names or positions explicitly but rather identify the columns via a property of their name or their data.

In this example, we wish to return any row in the data frame df for which any column whose name starts with the substring ‘cvr_’ has a missing value NA.

df_2 = df %>% filter(if_any(starts_with('cvr_'), is.na))

Here is how this works:

  • We use if_any() inside filter() to carry out implicit filtering as described in the “All Columns” scenario above.
  • We use starts_with('cvr_') to select all columns whose name starts with the substring ‘cvr_’. See Implicit Selection for coverage of the most common scenarios of implicit column selection, including by name pattern, data type, and criteria satisfied by the column’s data.
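
To make the name-based selection concrete, here is a small sketch on made-up data. The columns id, cvr_a, cvr_b and other are assumptions chosen for illustration.

```r
library(dplyr)

# Hypothetical toy data: two 'cvr_' columns plus two other columns
# that also contain NAs but should not affect the result.
df = data.frame(
  id    = c(1, 2, 3, NA),
  cvr_a = c(0.1, NA, 0.3, 0.4),
  cvr_b = c(0.5, 0.6, NA, 0.8),
  other = c(NA, "x", "y", "z")
)

# Keep rows where any column whose name starts with 'cvr_' is NA.
df_2 = df %>%
  filter(if_any(starts_with('cvr_'), is.na))

print(df_2)
```

The rows with id 2 and 3 are kept. The first and last rows are dropped even though they contain NA, because those missing values sit in columns that starts_with('cvr_') does not select.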

Exclude Columns

We wish to apply a logical expression to all but a set of excluded columns and to return any row for which any of the remaining columns satisfies the logical expression.

In this example, we wish to return any row in the data frame df for which any column other than col_1 and col_2 has a missing value NA.

df_2 = df %>% 
  filter(if_any(!c(col_1, col_2), is.na))

Here is how this works:

  • We use if_any() inside filter() to carry out implicit filtering as described in the “All Columns” scenario above.
  • In !c(col_1, col_2) we identify the columns we wish to exclude by name. See Exclude Columns for coverage of column exclusion scenarios, all of which can be used for implicit filtering.
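
A short sketch of column exclusion on made-up data follows. The data frame is an assumption for illustration. The second call uses the minus form -c(...), which tidyselect also accepts and which should select the same columns here.

```r
library(dplyr)

# Hypothetical toy data: rows 1 and 2 have NAs only in the
# excluded columns, rows 3 and 4 have NAs in the remaining ones.
df = data.frame(
  col_1 = c(NA, 1, 2, 3),
  col_2 = c("a", NA, "c", "d"),
  col_3 = c(1, 2, NA, 4),
  col_4 = c("x", "y", "z", NA)
)

# Check every column except col_1 and col_2.
df_2 = df %>%
  filter(if_any(!c(col_1, col_2), is.na))

# Equivalent exclusion written with a minus sign.
df_3 = df %>%
  filter(if_any(-c(col_1, col_2), is.na))

print(df_2)
```

Only the last two rows are returned, since the missing values in col_1 and col_2 are not examined.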