At times we wish to exclude one or more columns and retain the rest. At other times the criteria to describe the columns that we wish to exclude are easier to formulate than their complement (the criteria for columns we wish to retain). In such situations we need to exclude some columns and return the rest. There are four common column exclusion scenarios which we cover in this section:
We wish to exclude one or more columns of a data frame given their names and to return the data frame’s remaining columns.
In this example, we wish to exclude the columns named col_1
and col_5
and return the remaining columns.
df_2 = df.drop(columns=['col_1', 'col_5'])
Here is how this works:
loc[]
is not capable of dropping columns. Instead, we need to use the drop()
method of Pandas data frames.drop()
method has a columns
argument to which we can pass the names of the columns we wish to exclude as a list.axis=1
like so df_2 = df.drop(['col_1', 'col_2'], axis = 1)
.KeyError
.Alternatively,
df_2 = df.loc[:,
lambda x: x.columns.difference(['col_1', 'col_5'])]
Here is how this works:
difference()
to the columns Index
to remove the columns that we wish to exclude; here col_1
and col_5
.df[df.columns.difference(['col_1', 'col_5'])]
drop()
approach would fail, the difference()
approach would ignore names that do not match any actual column names.lambda
function inside loc[]
so it is robust to use in a chain where columns may have been altered in previous steps. If that is not the case we can refer to the data frame, here df
, directly without a lambda
function.We wish to exclude one or more columns of a data frame given their positions (where 1 is the left most column) and to return the data frame’s remaining columns.
In this example, we wish to exclude the columns at the first and fifth positions and return the remaining columns.
df_2 = df.drop(columns=df.columns[[0, 1]])
Here is how this works:
iloc[]
, which we have used in Selecting by Position, is not capable of dropping columns. Instead, we use the drop()
method of a Pandas data frame.drop()
expects column names (labels), therefore we select the column names for the columns at the positions we wish to drop via df.columns[[0,1]]
.df.columns
returns the column names of a data frame (as an index).We wish to exclude columns that match a condition and return the data frame’s remaining columns.
In this example, we exclude columns that start with old_
and return the remaining columns.
df_2 = df.loc[:, lambda x: ~ x.columns.str.startswith('col_')]
Here is how this works:
~
prior to the logical expression inside loc[]
to toggle True values to False and vice versa, hence selecting the complement of the columns matching the logical expression.lambda
function inside loc[]
so it is robust to use in a chain where columns may have been altered in previous steps. If that is not the case we can refer to the data frame, here df
, directly without a lambda
function.We wish to exclude columns of one or more data types and return the data frame’s remaining columns.
In this example, we wish to exclude both columns of a numeric data type and of a logical data type.
df.select_dtypes(exclude=["number", "boolean"])
Here is how this works:
select_dtypes()
in Selection by Data Type to select columns of one or more data type(s). To do so we passed references to the intended data type(s) to the include
argument of select_dtypes()
. include
is the default argument of select_dtypes()
so we didn’t have to specify it explicitly.select_dtypes()
has another argument called exclude
which works in the same way as include but drops columns of the matching data type(s).exclude
a single data type or a list of data types. See Selection by Data Type for how to refer to the most common data types.