Selecting by Multiple Conditions

We wish to select columns that meet a logical combination of multiple conditions.

In Selecting by Name Pattern, Selecting by Data Type, and Selecting by Data Criteria, we covered the three most common implicit column selection patterns. In this section we cover how to combine multiple logical conditions to fulfill more elaborate column selection scenarios.

We will look at three approaches for selecting columns by multiple conditions that together cover a wide range of possible situations.

  1. Logical Operations: If each of the column selection operations we wish to carry out returns a logical Series, we can use logical operations to combine those logical Series.
  2. Sequential Application: If some of the column selection operations we wish to apply act on the input data frame and return another data frame that has a subset of selected columns, such as select_dtypes(), we can apply the column selection operations as a sequence of selection functions e.g. select_dtypes() followed by loc[].
  3. Set Operations: If the operations involved each return a vector of column names, we can combine them together via set operations (intersection, union, difference). For example, say we have three sets of column names A, B, and C and we wish to select columns whose names are in set A or set B but not in set C.

Logical Operations

If the column selection operations we wish to carry out return a logical Series (as opposed to column names), we can use logical operations to combine those logical Series.

In this example, we wish to select columns that are likely to hold boolean (True or False) data even if those columns are not assigned a boolean data type. To do so, we select string columns that take fewer than 5 distinct values or that contain words likely to denote a logical type (e.g. Yes or No). We exclude any column where more than 10% of the values are missing, and finally we include any columns that already have the boolean data type. This is a plausible scenario in a data cleaning context.

df.loc[:, lambda x:
          (x.dtypes == 'object')
          & ((x.nunique() < 5) | x.isin(['Yes', 'No', 'True', 'False']).any())
          & ~ x.isna().mean().gt(0.1)
          | (x.dtypes == 'bool')]

Here is how this works:

  • Each operation is applied to the entire DataFrame and returns a Series of boolean values whose length equals the number of columns, where True corresponds to a column to be selected.
  • We then apply boolean logic: & (and), | (or), as well as ~ (not) to combine those boolean Series and produce the required logic. See Chapter X for more on Logical Operations.
  • Note that the boolean operations are applied to the entire series in a vectorized manner (i.e. the boolean logic is applied to the elements at the corresponding index of every series).
  • The operations are as follows:
    • x.dtypes == 'object' checks whether the column is of a string data type.
    • x.nunique() < 5 checks whether the column takes fewer than 5 distinct values.
    • x.isin(['Yes', 'No', 'True', 'False']).any() checks whether the column contains any of these values.
    • x.isna().mean().gt(0.1) checks whether more than 10% of the column's values are missing.
    • x.dtypes == 'bool' checks whether the column is of a boolean data type.
  • Combining column selection conditions via logical operations is the “natural” approach in Pandas.
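Here is a minimal runnable sketch of this approach on a small hypothetical DataFrame (the column names and values are assumptions chosen for illustration):

```python
import pandas as pd

# Hypothetical data: 'active' is a string column with few values,
# 'name' is a string column with many values, 'is_member' is boolean,
# and 'score' is numeric.
df = pd.DataFrame({
    'active':    ['Yes', 'No', 'Yes', 'No', 'Yes', 'No'],    # string, 2 values
    'name':      ['Ann', 'Bob', 'Cal', 'Dee', 'Eli', 'Flo'], # string, 6 values
    'is_member': [True, False, True, False, True, False],    # boolean dtype
    'score':     [1.5, 2.0, 3.5, 4.0, 5.5, 6.0],             # numeric
})

likely_bool = df.loc[:, lambda x:
    (x.dtypes == 'object')
    & ((x.nunique() < 5) | x.isin(['Yes', 'No', 'True', 'False']).any())
    & ~x.isna().mean().gt(0.1)
    | (x.dtypes == 'bool')]

print(list(likely_bool.columns))  # ['active', 'is_member']
```

Note that 'name' is excluded because it has 6 distinct values and none of them match the list of boolean-like words, while 'is_member' is included through the final | (or) condition on the boolean data type.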

Sequential Application

If some of the column selection operations we wish to apply act on the input data frame and return another data frame that has a subset of selected columns, such as select_dtypes(), we can apply the column selection operations as a sequence of selection functions.

In this example, the objective is to select columns that are of a numeric data type (integer or float) and whose name starts with the string ‘col_’.

df_2 = df \
    .select_dtypes('number') \
    .loc[:, lambda x: x.columns.str.startswith('col_')]

Here is how this works:

  • We apply two column selection operations in sequence.
  • The first operation select_dtypes('number') returns only columns whose data type is numeric (integer and float).
  • The second operation loc[:, lambda x: x.columns.str.startswith('col_')] takes the output from the first operation and of those numerical columns returns only those columns whose name starts with ‘col_’.
  • As a whole, the pipeline returns columns of the DataFrame df that are both of a numeric data type and whose name starts with ‘col_’.
  • This approach is especially useful when we wish to apply multiple conditions where each is applied via a function that returns a different data type, e.g. a boolean Series, vs a Series of selected column names, vs a DataFrame of the selected columns (such as select_dtypes()).
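The sequential pipeline above can be exercised on a small hypothetical DataFrame (the column names are assumptions chosen to show each case):

```python
import pandas as pd

# Hypothetical data covering the three relevant cases.
df = pd.DataFrame({
    'col_a': [1, 2, 3],        # numeric and matches the prefix -> selected
    'col_b': ['x', 'y', 'z'],  # matches the prefix but not numeric -> excluded
    'other': [1.0, 2.0, 3.0],  # numeric but wrong prefix -> excluded
})

# First filter by data type, then filter the surviving columns by name.
df_2 = df \
    .select_dtypes('number') \
    .loc[:, lambda x: x.columns.str.startswith('col_')]

print(list(df_2.columns))  # ['col_a']
```

Because loc[] is given a lambda, the name check runs against the columns that survived select_dtypes(), not against the original df, which is what makes the steps compose cleanly.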

Set Operations

If some of the column selection operations we wish to carry out return column labels (as opposed to a logical Series), we can use set operations to combine those sets of column names.

In this example, we wish to select columns that match any one or more of three criteria: (1) has the name timestamp or latlong; (2) has a name that ends with _id; (3) is of a numeric data type, has more than 10% missing values, and does not have a name starting with old_. This is a plausible scenario in a data cleaning context where we extract columns of a given type and keep columns that allow us to create meaningful groups for inspection.

df_2 = df.loc[:, lambda x:
           x.select_dtypes('number').columns
           .intersection(x.columns[x.isna().mean().gt(0.1)])
           .difference(x.columns[x.columns.str.startswith('old_')])
           .union(x.columns[x.columns.str.endswith('_id')])
           .union(['timestamp', 'latlong'])]

Here is how this works:

  • The individual column selection conditions in this example are:
    • select_dtypes('number')
    • x.isna().mean().gt(0.1)
    • x.columns.str.startswith('old_')
    • x.columns.str.endswith('_id')
    • ['timestamp', 'latlong']
  • We can use the primary set operations, intersection, union, and difference, to combine multiple column selection conditions in any way we need to select the columns we want.
  • If one of the conditions we wish to apply is carried out by a function that returns a Series of boolean True or False values, such as str.endswith() in this example, we can convert it to a set of column names (an Index, technically) on which set operations can be carried out by indexing the columns with the boolean Series, i.e. DataFrame.columns[boolean_series], e.g. x.columns[x.columns.str.endswith('_id')].
  • We could have gone without a lambda function, i.e. df.loc[:, df.select_dtypes('number').columns .., and that would have worked, but it would cause an error if loc[] is applied later in a chain after, say, some new columns were created via assign().
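Putting the set operations together, here is a runnable sketch on a small hypothetical DataFrame (the column names and values are assumptions; each column is chosen to trigger one branch of the logic):

```python
import numpy as np
import pandas as pd

# Hypothetical data covering each selection criterion.
df = pd.DataFrame({
    'timestamp': ['2021-01-01', '2021-01-02'],  # selected by name
    'latlong':   ['0,0', '1,1'],                # selected by name
    'user_id':   ['u1', 'u2'],                  # selected by the '_id' suffix
    'age':       [np.nan, 30.0],                # numeric, 50% missing -> selected
    'old_score': [np.nan, 1.0],                 # numeric and missing, but the
                                                # 'old_' prefix excludes it
    'height':    [170.0, 180.0],                # numeric, no missing -> excluded
})

df_2 = df.loc[:, lambda x:
           x.select_dtypes('number').columns
           .intersection(x.columns[x.isna().mean().gt(0.1)])
           .difference(x.columns[x.columns.str.startswith('old_')])
           .union(x.columns[x.columns.str.endswith('_id')])
           .union(['timestamp', 'latlong'])]

print(sorted(df_2.columns))  # ['age', 'latlong', 'timestamp', 'user_id']
```

Note that Index set operations such as union() may return the labels in sorted order rather than their original order in the DataFrame, which is why the result is printed sorted here.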