Relationship Specification

In an implicit filtering scenario, we wish to specify whether to AND or OR the logical values resulting from applying one or more logical expression(s) to each of a set of columns.

This section is complemented by

  • Column Selection where we cover how to select the columns to each of which we will apply filtering logic.
  • Function Specification where we cover how to specify one or more logical expressions or predicate functions (functions that return True or False) to apply to the selected set of columns

AND

We wish to filter rows for which a logical expression is True for all of a selected set of columns.

In this example, we wish to filter the rows of the data frame df for which the value of every column whose name contains the string ‘cvr’ is less than 0.1.

df_2 = df.loc[(df
               .loc[:, lambda x: x.columns.str.contains('cvr_', regex=False)]
               .apply(lambda x: x < 0.1)
               .all(axis=1))]

Here is how this works:

  • The data frame method all() returns True if all elements along the designated axis (rows or columns) of the data frame it's called on is True.
  • We define the axis that all() will evaluate along by setting the argument axis where axis=0 evaluates along columns and axis=1 evaluates along rows.
  • In the code above, we use apply() to apply the row filtering logic to each of the selected columns. The result is a data frame of logical True or False values where a value is True if the corresponding column meets the filtering condition (which here is being < 0.1) for the corresponding row.
  • We use all() to combine the logical values for each row such that the output is True if the filtering condition evaluated to True for all the selected columns.
  • Note: We should check that column selection returns some columns. If no columns are returned (i.e. column selection returns an empty data frame), all rows will be returned which is often not the desired behavior. This is because when applied to an empty (or NA) row or column all() returns True.

OR

We wish to filter rows for which a logical expression is True for any of a selected set of columns.

In this example, we wish to filter the rows of the data frame df for which the value of any column whose name contains the string ‘cvr’ is less than 0.1.

df_2 = df.loc[(df
               .loc[:, lambda x: x.columns.str.contains('cvr_', regex=False)]
               .apply(lambda x: x < 0.1)
               .any(axis=1))]

Here is how this works:

  • The data frame method any() returns True if any elements along the designated axis (rows or columns) of the data frame it's called on is True.
  • We define the axis that any() will evaluate along by setting the argument axis where axis=0 evaluates along columns and axis=1 evaluates along rows.
  • In the code above, we use apply() to apply the row filtering logic to each of the selected columns. The result is a data frame of logical True or False values where a value is True if the corresponding column meets the filtering condition (which here is being < 0.1) for the corresponding row.
  • We use any() to combine the logical values for each row such that the output is True if the filtering condition evaluated to True for any of the selected columns.
PYTHON
I/O