Custom Filtering

We wish to apply filtering logic is more involved than a simple comparison.

We will cover the following scenarios:

  • Aggregation where we cover how to filter a data frame via a logical expression that involves an aggregation of one or more columns.
  • Transformation where we cover how to filter a data frame via a logical expression that involves a transformation of one or more columns.
  • Non-Vectorized Transformation where we cover how to filter a data frame via a logical expression that involves a non-vectorized transformation of one or more columns.
  • Custom Function where we cover isolating filtering logic into a function that we call from filter().
  • Missing Valuers where we cover how to override the default behavior where filter() drops any rows for which the filtering expression evaluates to NA.

It is worth noting that: To reduce mistakes and improve readability, it is usually preferable to split complex filtering operations in two steps. A data transformation step to create a new column and then a simple filtering step to compare against the value of that column.

Sometimes though, we do not wish to create additional columns and in those situations this would be the appropriate approach. It still helps, to construct the transformation as a column first (maybe on a subset of the data) to verify that it works as expected before executing it as part of the filtering operation.

Aggregation

We wish to filter a data frame, i.e. return a subset of rows, via a logical expression that involves an aggregation.

In this example, we wish to return all rows for which the value of a numerical column is larger than the value of the mean for that column.

df %>%
    filter(col_1 > mean(col_1))

Here is how this works:

  • We can compute an aggregation of one or more columns as part of a logical filtering expression passed to filter().
  • The mean() of the column col_1 is computed.
  • Each individual value of the column col_1 is then compared with the value of the mean.
  • Rows for which the expression col_1 > mean(col_1) evaluates to TRUE (i.e. the value of col_1 is larger than the mean value of col_1) are returned.

Transformation

We wish to filter a data frame, i.e. return a subset of rows, via a logical expression that involves a transformation of one or more of the data frame’s columns.

In this example, we wish to return all rows for which the standardized value of a numerical column is larger 1. We compute the standard value (or z-score) by subtracting the mean and dividing by the standard deviation of the variable (the column).

df %>%
    filter((col_1 - mean(col_1)) / sd(col_1) > 1)

Here is how this works:

  • We can compute a transformation of one or more columns as part of a logical filtering expression passed to filter().
  • The standard value (z-score) is computed by subtracting the mean (computed via mean(col_1)) and dividing by the standard deviation (computed via sd(col_1)).
  • The resulting z-scores are compared with 1.
  • Rows for which the comparison evaluates to TRUE (i.e. the value of the z-score is greater than one) are returned.

Non-Vectorized Transformation

In some situations, the filtering logic we wish to carry out can not be applied in a vectorized manner column wise, rather it needs to be applied in a non-vectorized manner to each row individually.

In this example, we wish to filter rows where the mean of the values of the columns col_1 and col_2 is greater than 0.

df_2 = df %>%
  rowwise() %>%
  filter(mean(c(col_1, col_2)) > 0)

Here is how this works:

  • rowwise() switches the mode of execution of the operations that follow from column wise operation to row wise operation which allows us to apply a non-vectorized function one row at a time.
  • Because of rowwise(), the expression inside filter() will be applied one row at a time (instead of the usual execution on entire columns).
  • In mean(c(col_1, col_2)) > 0, the mean of the values of col_1 and col_2 for the current row is computed and then compared with 0. If the result is True, the row is retained, else it is not included in the output.
  • Note that any operations carried out after filter() will also be carried out in a non-vectorized manner. To switch back to regular vectorized operation, add ungroup() to the chain.
  • See Non Vectorized Transformation for a deeper coverage of non vectorized operations. All the scenarios covered there can also be applied for filtering.

Custom Function

If the filtering logic gets too complicated or if we wish to execute the same logic multiple times (more than twice), it is often a good idea to isolate the logic into a function which would then be called for filtering.

In this example, we wish to return all rows for which the value of a particular column is deemed an outlier to inspect them. We use the rudimentary definition that a point is an outlier if it is more than 1.5x the Inter-quartile range above the third quartile or below the first quartile. See Descriptive Statistics for a description of these and other descriptive statistics operations.

is_outlier <- function(col) {
  quantiles = quantile(df$col_1, probs = c(0.25, 0.5, 0.75), na.rm = T)
  iqr = quantiles[3] - quantiles[1]
  return(!between(col, quantiles[1] - 1.5 * iqr, quantiles[3] + 1.5 * iqr))
}

df %>%
  filter(is_outlier(col_1))

Here is how this works:

  • Inside filter() we can apply a function to one or more columns so long as the function returns a logical vector that has as many values as there are rows in the data frame. Rows where the value is TRUE will be returned.
  • For this example, we wrote a custom function is_outlier() in which we isolated the logic for checking whether a value is an outlier. The function takes as input a numeric column and returns a logical vector of the same length where the value is TRUE if the corresponding numerical value is deemed an outlier and FALSE otherwise.
  • The logic to identify outliers is following the common heuristic of treating any value is that is greater than 1.5 times the interquartile range (above the third quartile or below the first quartile) as an outlier.

Missing Values

By default, filter() excludes any rows for which the columns involved in the filtering expression have a missing value NA. In this section we cover how to change this behavior and include rows with missing values along with the rows that satisfy the filtering condition.

In this example, we wish to return rows of the data frame df for which the column col_1 is either less than zero or missing (NA).

df_2 = df %>% 
  filter(col_1 < 0 | is.na(col_1))

Here is how this works:

  • By default filter() drops any rows for which the filtering condition evaluates to NA.
  • To include rows for which col_1 is NA, we add an or | clause to the condition where we explicitly check if the value of col_1 is missing via is.na(col_1).
R
I/O