We wish to select columns that meet a logical combination of multiple conditions.
In Selecting by Name Pattern, Selecting by Data Type, and Selecting by Data Criteria, we covered the three most common implicit column selection patterns. In this section we cover how to combine multiple logical conditions to fulfill more elaborate column selection scenarios.
We will look at three approaches for selecting columns by multiple conditions that together cover a wide range of possible situations.
- If the column selection operations we wish to carry out return a logical Series, we can use logical operations to combine those logical Series.
- If the column selection operations return a DataFrame with a subset of selected columns, such as select_dtypes(), we can apply the column selection operations as a sequence of selection functions e.g. select_dtypes() followed by loc[].
- If the column selection operations return column labels, we can use set operations to combine those sets of column names.

If the column selection operations we wish to carry out return a logical Series (as opposed to column names), we can use logical operations to combine those logical Series.
In this example, we wish to select columns that are likely to hold boolean (True or False) data even if those columns are not assigned a boolean data type. To do so, we decided to select string columns whose values take fewer than 5 possible values or that contain words that are likely to denote a logical type (e.g. Yes or No). We wish to exclude any column where more than 10% of the values are missing, and finally we wish to include any columns that have the logical data type. This is a plausible scenario in a data cleaning context.
df.loc[:, lambda x:
(x.dtypes == 'object')
& ((x.nunique() < 5) | x.isin(['Yes', 'No', 'True', 'False']).any())
& ~ x.isna().mean().gt(0.1)
| (x.dtypes == 'bool')]
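To see the selection in action, here is a minimal runnable sketch that applies the same expression to a small made-up DataFrame (the column names and values are invented for illustration):

```python
import pandas as pd

# Made-up example data: column names and values are invented for illustration
df = pd.DataFrame({
    'active': ['Yes', 'No', 'Yes', 'No', 'Yes', 'No'],    # low-cardinality Yes/No strings
    'city': ['NY', 'LA', 'SF', 'TX', 'FL', 'WA'],         # too many distinct values
    'notes': ['ok', None, None, 'ok', None, None],        # too many missing values
    'is_member': [True, False, True, False, True, False], # already boolean
    'amount': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],             # numeric, not a string column
})

likely_bool = df.loc[:, lambda x:
    (x.dtypes == 'object')
    & ((x.nunique() < 5) | x.isin(['Yes', 'No', 'True', 'False']).any())
    & ~ x.isna().mean().gt(0.1)
    | (x.dtypes == 'bool')]

print(list(likely_bool.columns))
```

Here 'active' is kept because it is a low-cardinality string column of Yes/No values, and 'is_member' because it already has the boolean data type; 'city' (too many distinct values), 'notes' (too many missing values), and 'amount' (numeric, not string) are dropped.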
Here is how this works:
- We use loc[] with a lambda (a callable) that returns a Series of boolean values of the same length as the number of columns in the DataFrame, where True corresponds to columns to be selected.
- We use the logical operators & (and), | (or), as well as ~ (not) to combine those boolean Series and produce the required logic. See Chapter X for more on Logical Operations.
- x.dtypes == 'object' checks whether the column is of a string data type.
- x.nunique() < 5 checks whether the column takes fewer than 5 possible values.
- x.isin(['Yes', 'No', 'True', 'False']).any() checks whether the column contains any of these values.
- x.isna().mean().gt(0.1) checks whether the column has more than 10% missing values.
- x.dtypes == 'bool' checks whether the column is of a boolean data type.

If some of the column selection operations we wish to apply act on the input DataFrame and return another DataFrame that has a subset of selected columns, such as select_dtypes(), we can apply the column selection operations as a sequence of selection functions.
In this example, the objective is to select columns that are of a numeric data type (integer or float) and whose name starts with the string 'col_'.
df_2 = df \
.select_dtypes('number') \
.loc[:, lambda x: x.columns.str.startswith('col_')]
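As a quick check, here is the same chain applied to a small made-up DataFrame (the column names and values are invented for illustration):

```python
import pandas as pd

# Made-up example data: column names and values are invented for illustration
df = pd.DataFrame({
    'col_a': [1, 2, 3],        # numeric, name starts with 'col_'  -> kept
    'col_b': [0.1, 0.2, 0.3],  # numeric, name starts with 'col_'  -> kept
    'col_c': ['x', 'y', 'z'],  # name matches but not numeric      -> dropped
    'other': [10, 20, 30],     # numeric but name does not match   -> dropped
})

df_2 = df \
    .select_dtypes('number') \
    .loc[:, lambda x: x.columns.str.startswith('col_')]

print(list(df_2.columns))
```

Only col_a and col_b survive both filters: col_c fails the data type condition and other fails the name condition.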
Here is how this works:
- select_dtypes('number') returns only columns whose data type is numeric (integer or float).
- loc[:, lambda x: x.columns.str.startswith('col_')] takes the output from the first operation and, of those numerical columns, returns only those columns whose name starts with 'col_'.
- The result assigned to df_2 contains the columns of the DataFrame df that are both of a numeric data type and whose name starts with 'col_'.
- Note that the operations in the sequence may differ in what they return: a logical Series, vs a Series of selected column names, vs a DataFrame of the selected columns (such as select_dtypes()).

If some of the column selection operations we wish to carry out return column labels (as opposed to a logical Series), we can use set operations to combine those sets of column names.
In this example, we wish to select columns that match any one or more of three criteria: (1) has the name timestamp or latlong; (2) has a name that ends with _id; (3) is of a numeric data type, has more than 10% missing values, and does not have a name starting with old_. This is a plausible scenario in a data cleaning context where we extract columns of a given type and keep columns that allow us to create meaningful groups for inspection.
df_2 = df.loc[:, lambda x :
x.select_dtypes('number').columns
.intersection(x.columns[x.isna().mean().gt(0.1)])
.difference(x.columns[x.columns.str.startswith('old_')])
.union(x.columns[x.columns.str.endswith('_id')])
.union(['timestamp', 'latlong'])]
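A minimal runnable sketch of the set-operation approach on a small made-up DataFrame (the column names and values are invented for illustration):

```python
import pandas as pd
import numpy as np

# Made-up example data: column names and values are invented for illustration
df = pd.DataFrame({
    'timestamp': ['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04'],
    'latlong': ['0,0', '1,1', '2,2', '3,3'],
    'user_id': [1, 2, 3, 4],
    'score': [1.0, np.nan, 3.0, 4.0],      # numeric, 25% missing       -> kept
    'old_score': [np.nan, 2.0, 3.0, 4.0],  # numeric, 25% missing, but
                                           # name starts with 'old_'    -> dropped
    'name': ['a', 'b', 'c', 'd'],          # matches no criterion       -> dropped
})

df_2 = df.loc[:, lambda x:
    x.select_dtypes('number').columns
    .intersection(x.columns[x.isna().mean().gt(0.1)])
    .difference(x.columns[x.columns.str.startswith('old_')])
    .union(x.columns[x.columns.str.endswith('_id')])
    .union(['timestamp', 'latlong'])]

print(sorted(df_2.columns))
```

The result contains latlong, score, timestamp, and user_id: score enters through the intersection of the numeric and mostly-missing sets, old_score is removed by the difference, and user_id, timestamp, and latlong are added back by the two unions. Note that Index set operations sort their result by default, so the selected columns may not appear in their original order.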
Here is how this works:
- select_dtypes('number') returns the columns of a numeric data type; its .columns attribute gives their names as an Index.
- x.isna().mean().gt(0.1) flags columns with more than 10% missing values.
- x.columns.str.startswith('old_') flags columns whose name starts with old_.
- x.columns.str.endswith('_id') flags columns whose name ends with _id.
- ['timestamp', 'latlong'] is a literal list of column names to include.
- We use the set operations intersection, union, and difference to combine multiple column selection conditions in any way we need to select the columns we want.
- If an operation returns a Series of boolean True or False values, such as str.endswith() in this example, we can convert that to a set of column names (an Index actually) on which set operations can be carried out by passing (indexing) the boolean Series into DataFrame.columns[boolean_series] e.g. x.columns[x.columns.str.endswith('_id')].
- We could have referenced df directly e.g. df.loc[:, df.select_dtypes('number').columns .. and that would have worked, but it would cause an error if loc[] is applied later in a chain after, say, some new columns were created via assign().
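The difference between referencing df directly and passing a lambda shows up as soon as the chain changes the columns. A small made-up example:

```python
import pandas as pd

# Made-up example data for illustration
df = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})

# The lambda receives the intermediate DataFrame, so it sees the
# column 'c' created earlier in the chain:
out = (df
       .assign(c=[3.0, 4.0])
       .loc[:, lambda x: x.select_dtypes('number').columns])
print(list(out.columns))

# Referencing df directly uses the *original* frame: here it silently
# misses the new column 'c', and it would raise a KeyError if earlier
# steps in the chain had dropped one of the selected columns:
out2 = df.assign(c=[3.0, 4.0]).loc[:, df.select_dtypes('number').columns]
print(list(out2.columns))
```

Using the lambda keeps the selection logic tied to whatever DataFrame flows through the chain, which is what makes long method chains safe to rearrange.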