Selecting by Data Criteria

We wish to select columns whose data meets certain criteria.

Oftentimes we wish to select columns whose data satisfies certain conditions e.g. the percentage of missing values is below 10%.

Essentially, we apply a predicate operation (i.e. an operation that returns True or False) to each column of the data frame and select the columns for which the function evaluates to True. There are two common scenarios:

  1. Data Type Agnostic: The predicate function we are applying is valid for all data types e.g. checking for missing values.
  2. Data Type Dependent: The predicate function we are applying is only valid for particular data types. In this case, we need to first select columns of the appropriate data type before we run the predicate function on each of those columns. For example, checking that the mean is greater than a certain value is only valid for numeric and logical data types.

For detailed coverage of data-type-specific operations, see the respective data type operations section.

Data Type Agnostic

We wish to select columns whose data satisfies a certain condition where the condition is applicable to all data types.

In this example, we wish to return columns where the percentage of missing values to the total number of values is less than 10%.

df_2 = df.loc[:, lambda x: x.isna().mean().lt(0.1)]

Here is how this works:

  • loc[] can take a callable with one argument, to which the calling data frame (or Series) is passed, that returns an output valid for indexing, i.e. column names or a boolean vector.
  • We could have gone without a lambda function, as in df.loc[:, df.isna().mean().lt(0.1)], and that would have worked, but it would cause an error if loc[] is applied later in a chain after, say, some new columns were created via assign().
  • lt() is a convenience function for the less than operation < that allows cleaner, easier-to-chain code. We cover numerical comparison functions in Chapter X.
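To see this in action, here is a minimal sketch using a small, hypothetical data frame in which column 'b' has 50% missing values and is therefore excluded:

```python
import numpy as np
import pandas as pd

# Hypothetical data frame: 'b' has 2 of 4 values missing (50%)
df = pd.DataFrame({
    'a': [1, 2, 3, 4],
    'b': [1.0, np.nan, np.nan, 4.0],
    'c': ['w', 'x', 'y', 'z'],
})

# Keep only columns whose proportion of missing values is below 10%
df_2 = df.loc[:, lambda x: x.isna().mean().lt(0.1)]

print(list(df_2.columns))  # ['a', 'c']
```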

Alternatively,

We can use a named function instead of a lambda function:

def col_select_fun(df):
    return df.isna().mean().lt(0.1)

df_2 = df.loc[:, lambda x: col_select_fun(x)]

Here is how this works:

  • We isolate the logic into a function (here called col_select_fun()) that we call from loc[].
  • A lambda function is a good fit when the logic is simple, e.g. the one-liner above. When more elaborate logic is involved, it is better for reliability and readability to isolate the logic into a separate function.
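As a side note, since col_select_fun() takes the data frame as its only argument and returns a boolean vector, it can be passed to loc[] directly, without the lambda wrapper. A minimal sketch (the data frame here is hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 3, 4],
    'b': [1.0, np.nan, np.nan, 4.0],  # 50% missing, will be dropped
    'c': ['w', 'x', 'y', 'z'],
})

def col_select_fun(df):
    # True for columns with less than 10% missing values
    return df.isna().mean().lt(0.1)

# loc[] calls the function itself with the data frame as argument
df_2 = df.loc[:, col_select_fun]
```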

Data Type Dependent

We wish to select columns whose data satisfies a certain condition where the condition is applicable only to columns of particular data types. In this scenario, we need to first select the columns of the appropriate data type(s) and then apply the function to those.

In this example, we wish to return numeric columns where the mean is greater than 1.

df_2 = (df
  .select_dtypes('number')
  .loc[:, lambda x: x.mean().gt(1)])

Here is how this works:

  • The data frame resulting from applying select_dtypes() to df comprises only numeric columns.
  • That resulting numeric column data frame is passed through .loc[] to the lambda function (referred to in this example as x).
  • x.mean() returns a series where each value corresponds to the mean of one of the numeric columns.
  • DataFrame.mean() is a convenience method of Pandas DataFrames that spares us having to iterate over each column. See Aggregation.
  • The comparison gt(1) is equivalent to > 1 and results in a Series of Boolean values where we have True for numeric columns whose mean is greater than 1 and False otherwise.
  • If the column selection logic is too complex to fit as an inline lambda function, we can use a standalone function (like we show in the alternative above).
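Putting the steps above together, here is a minimal illustration with a hypothetical data frame containing both numeric and string columns:

```python
import pandas as pd

df = pd.DataFrame({
    'a': [0, 1, 2],        # mean 1.0 -> not greater than 1
    'b': [2, 3, 4],        # mean 3.0 -> greater than 1
    'c': ['x', 'y', 'z'],  # non-numeric, removed by select_dtypes()
})

# Restrict to numeric columns, then keep those with mean > 1
df_2 = (df
  .select_dtypes('number')
  .loc[:, lambda x: x.mean().gt(1)])

print(list(df_2.columns))  # ['b']
```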