We wish to select columns whose data meets certain criteria.
Oftentimes we wish to select columns whose data satisfies certain conditions e.g. the percentage of missing values is below 10%.
Essentially, we apply a predicate operation (i.e. an operation that returns True
or False
) to each column of the data frame and select the Columns for which the function evaluates to True
. There are two common scenarios:
For a detailed coverage of data type specific operations, see the respective data type operations section.
We wish to select columns whose data satisfies a certain condition where the condition is applicable to all data types.
In this example, we wish to return columns where the percentage of missing values to the total number of values is less than 10%.
df_2 = df.loc[:, lambda x: x.isna().mean().lt(0.1)]
Here is how this works:
loc[]
can take a callable function with one argument to which the calling data frame (or Series
) is passed and that returns an output that is valid for indexing i.e. column names or a boolean vector.df.loc[:, df.isna().mean().lt(0.1)]
and that would have worked but it is would cause an error if loc[]
is applied later in a chain after, say, some new columns were created via assign()
.lt()
is a convenience function for the less than operation <
that allows cleaner and easier to chain code. We cover numerical comparison functions in Chapter X.Alternatively,
We can use a function instead of a lambda function
def col_select_fun(df):
return df.isna().mean().lt(0.1)
df_2 = df.loc[:, lambda x: col_select_fun(x)]
Here is how this works:
col_select_fun()
) that we call from loc[]
.lambda
function is a good fit when the logic is simple, e.g. the simple one liner above. In case more elaborate logic is involved, it is better for feasibility, reliability, and readability to isolate the logic into a separate function.We wish to select columns whose data satisfies a certain condition where the condition is applicable only to columns of particular data types. In this scenario, we need to first select the columns of the appropriate data type(s) and then apply the function to those.
In this example, we wish to return numeric columns where the mean is less than 1.
df_2 = (df
.select_dtypes('number')
.loc[:, lambda x: x.mean().gt(1)])
Here is how this works:
select_dtypes()
to df
is comprised of only numeric columns..loc[]
to the lambda function (referred to in this example as x
).x.mean()
returns a series where each value corresponds to the mean of one of the numeric columns.DataFrame.mean()
is a convenience method of Pandas DataFrames that spares us having to iterate over each column. See Aggregation.gt(1)
is equivalent to > 1
and results in a Series of Boolean values where we have True
for numeric columns whose mean is greater than 1
and False
otherwise.lambda
function, we can use a standalone function (like we show in the alternative above).