Custom Filtering

We wish to apply filtering logic that is more involved than a simple comparison.

We will cover three scenarios:

  • Aggregation where we cover how to filter a data frame via a logical expression that involves an aggregation of one or more columns.
  • Transformation where we cover how to filter a data frame via a logical expression that involves a transformation of one or more columns.
  • Non-Vectorized Transformation where we cover how to filter a data frame via a logical expression that involves a non-vectorized transformation of one or more columns.
  • Custom Function where we cover isolating filtering logic into a function that we call from within loc[].

It is worth noting that, to reduce mistakes and improve readability, it is usually preferable to split a complex filtering operation into two steps: a data transformation step that creates a new column, followed by a simple filtering step that compares against the values of that column.

Sometimes, though, we do not wish to create additional columns, and in those situations filtering directly is the appropriate approach. It still helps to construct the transformation as a column first (perhaps on a subset of the data) to verify that it works as expected before executing it as part of the filtering operation.
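
The two-step approach described above can be sketched as follows (the data frame and values here are made up for illustration):

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({'col_1': [1, 5, 3, 9]})

# Step 1: a transformation step that creates a new column
df['z_score'] = (df['col_1'] - df['col_1'].mean()) / df['col_1'].std()

# Step 2: a simple filtering step comparing against that column
df_2 = df.loc[df['z_score'] > 1]
```

The intermediate column makes it easy to inspect the transformed values before filtering on them.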

Aggregation

We wish to filter a data frame, i.e. return a subset of rows, via a logical expression that involves an aggregation.

In this example, we wish to return all rows for which the value of a numerical column is larger than the value of the mean for that column.

df.loc[df['col_1'] > df['col_1'].mean()]

Here is how this works:

  • We can compute an aggregation of one or more columns as part of a logical filtering expression passed to loc[].
  • The mean() of the column ‘col_1’ is computed.
  • Each individual value of the column ‘col_1’ is then compared with the value of the mean.
  • Rows for which the expression df['col_1'] > df['col_1'].mean() evaluates to True (i.e. the value of ‘col_1’ is larger than the mean value of ‘col_1’) are returned.
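
A minimal runnable sketch of the above, using a made-up data frame:

```python
import pandas as pd

# Hypothetical data; the mean of col_1 is 25
df = pd.DataFrame({'col_1': [10, 20, 30, 40]})

# Keep only rows where col_1 exceeds its mean
df_2 = df.loc[df['col_1'] > df['col_1'].mean()]
```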

Transformation

We wish to filter a data frame, i.e. return a subset of rows, via a logical expression that involves a transformation of one or more of the data frame’s columns.

In this example, we wish to return all rows for which the standardized value of a numerical column is larger than 1. We compute the standardized value (or z-score) by subtracting the mean and dividing by the standard deviation of the variable (the column).

df.loc[(df['col_1'] - df['col_1'].mean()) / df['col_1'].std() > 1]

Here is how this works:

  • We can compute a transformation of one or more columns as part of a logical filtering expression passed to loc[].
  • The standardized value (z-score) is computed by subtracting the mean (computed via mean()) and dividing by the standard deviation (computed via std()).
  • The resulting z-scores are compared with 1.
  • Rows for which the comparison evaluates to True (i.e. the value of the z-score is greater than one) are returned.
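
The inline version can be verified on a small, made-up data frame:

```python
import pandas as pd

# Hypothetical data; only the value 10 has a z-score above 1
df = pd.DataFrame({'col_1': [1, 2, 3, 10]})

# Filter on the z-score computed inline within loc[]
df_2 = df.loc[(df['col_1'] - df['col_1'].mean()) / df['col_1'].std() > 1]
```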

Non-Vectorized Transformation

In some situations, the filtering logic we wish to carry out cannot be applied in a vectorized manner column-wise; rather, it needs to be applied in a non-vectorized manner to each row individually.

In this example, we wish to filter rows where the mean of the values of the columns ‘col_1’ and ‘col_2’ is greater than 0.

df_2 = df.loc[(df
        .apply(lambda x: np.mean([x['col_1'], x['col_2']]), axis=1)
        .gt(0))]

Here is how this works:

  • As covered in Custom Filtering, we can run a function inside loc[] to subset rows so long as it results in a Series of logical True or False values with the same length as the number of rows in the data frame.
  • We use apply() while setting axis=1 to compute the mean of the values of ‘col_1’ and ‘col_2’ for each row.
  • We compare the output of apply() with 0 via gt(0) (a function form of the greater than operator covered in more detail in Numerical Operations) to get a logical Series that has a value of True where the mean of the values of the columns ‘col_1’ and ‘col_2’ is greater than 0 and False otherwise.
  • The resulting logical Series is then passed to loc[] to return the rows corresponding to values of True.
  • See Non-Vectorized Transformation for deeper coverage of non-vectorized operations. All the scenarios covered there can also be applied for filtering.
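
The steps above can be checked end to end on a small, made-up data frame:

```python
import numpy as np
import pandas as pd

# Hypothetical data; only the first row has a positive row-wise mean
df = pd.DataFrame({'col_1': [1, -2, 3], 'col_2': [1, -2, -5]})

# apply() with axis=1 computes the mean of col_1 and col_2 row by row
df_2 = df.loc[(df
        .apply(lambda x: np.mean([x['col_1'], x['col_2']]), axis=1)
        .gt(0))]
```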

Alternative: Select then Apply

df_2 = df.loc[(df[['col_1', 'col_2']]
        .apply(np.mean, axis=1)
        .gt(0))]

Here is how this works:

  • In this particular case we can pass the values of 'col_1' and 'col_2' as a Series to the np.mean() function.
  • To do so, we select the columns we wish to pass to the function via basic indexing in df[['col_1', 'col_2']].
  • We then apply() the np.mean() function while setting axis=1 to specify that apply() should act on rows one at a time.
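
A runnable sketch of the select-then-apply variant, with an extra made-up column to show that selection only affects the computation, not the returned rows:

```python
import numpy as np
import pandas as pd

# Hypothetical data; col_3 is not part of the filtering logic
df = pd.DataFrame({'col_1': [1, -2, 3],
                   'col_2': [1, -2, -5],
                   'col_3': [9, 9, 9]})

# Select only col_1 and col_2, then apply np.mean to each row
df_2 = df.loc[(df[['col_1', 'col_2']]
        .apply(np.mean, axis=1)
        .gt(0))]
```

Note that the returned rows still contain all columns, including col_3; the column selection applies only to the mean computation.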

Custom Function

If the filtering logic gets too complicated or if we wish to execute the same logic multiple times (more than twice), it is often a good idea to isolate the logic into a function which would then be called for filtering.

In this example, we wish to return all rows for which the value of a particular column is deemed an outlier so we can inspect them. We use the rudimentary definition that a point is an outlier if it is more than 1.5 times the interquartile range (IQR) above the third quartile or below the first quartile. See Descriptive Statistics for a description of these and other descriptive statistics operations.

def is_outlier(col):
    quantiles = np.quantile(col, [0.25, 0.5, 0.75])
    iqr = quantiles[2] - quantiles[0]
    return (~col.between(quantiles[0] - 1.5 * iqr, quantiles[2] + 1.5 * iqr))

df_2 = df.loc[lambda x: is_outlier(x['col_1'])]

Here is how this works:

  • The logic for evaluating whether a data point is an outlier is isolated in the is_outlier() function.
  • The is_outlier() function takes a numerical column as input and returns a Boolean Series of the same size that is True for values that are outliers and False otherwise.
  • The Boolean Series resulting from computing lambda x: is_outlier(x['col_1']) is passed to loc[], which then returns the subset of rows of the DataFrame (in this example df) corresponding to values of True, i.e. rows where the value of the column ‘col_1’ is an outlier.
  • We used the function numpy.quantile() from the NumPy package to compute the quantile values.
  • We compute the interquartile range (IQR) by subtracting the value of the first quantile from that of the third quantile (as per the definition of IQR).
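
A quick end-to-end check of the function (repeating its definition so the snippet is self-contained; the data is made up, with 100 sitting far outside the bulk of the values):

```python
import numpy as np
import pandas as pd

def is_outlier(col):
    quantiles = np.quantile(col, [0.25, 0.5, 0.75])
    iqr = quantiles[2] - quantiles[0]
    return (~col.between(quantiles[0] - 1.5 * iqr, quantiles[2] + 1.5 * iqr))

# Hypothetical data; only 100 lies outside the 1.5 * IQR fences
df = pd.DataFrame({'col_1': [1, 2, 3, 4, 5, 100]})
df_2 = df.loc[lambda x: is_outlier(x['col_1'])]
```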