Exclude Columns

At times we wish to exclude one or more columns and retain the rest. At other times the criteria to describe the columns that we wish to exclude are easier to formulate than their complement (the criteria for columns we wish to retain). In such situations we need to exclude some columns and return the rest. There are four common column exclusion scenarios which we cover in this section:

  1. Exclude by Name: Exclude one or more columns given their names.
  2. Exclude by Position: Exclude one or more columns given their positions.
  3. Select Complement: Exclude columns that do not match a condition.
  4. Exclude Data Type: Exclude columns of one or more data types.

By Name

We wish to exclude one or more columns of a data frame given their names and to return the data frame’s remaining columns.

In this example, we wish to exclude the columns named col_1 and col_5 and return the remaining columns.

df_2 = df.drop(columns=['col_1', 'col_5'])

Here is how this works:

  • loc[] is not capable of dropping columns. Instead, we need to use the drop() method of Pandas data frames.
  • The drop() method has a columns argument to which we can pass the names of the columns we wish to exclude as a list.
  • Alternatively, we could pass the column names without the columns argument and set axis=1 like so df_2 = df.drop(['col_1', 'col_2'], axis = 1).
  • The column names passed to drop must exist. If they do not, we get a KeyError.

Alternatively,

df_2 = df.loc[:, 
       lambda x: x.columns.difference(['col_1', 'col_5'])]

Here is how this works:

  • We are selecting all columns except the columns that we wish to exclude.
  • We apply the method difference() to the columns Index to remove the columns that we wish to exclude; here col_1 and col_5.
  • We can apply the same approach with the bracket operator like so: df[df.columns.difference(['col_1', 'col_5'])]
  • This approach can be used in cases where the list of columns to be excluded may contain column names that do not exist in the data drame. While the drop() approach would fail, the difference() approach would ignore names that do not match any actual column names.
  • We use a lambda function inside loc[] so it is robust to use in a chain where columns may have been altered in previous steps. If that is not the case we can refer to the data frame, here df, directly without a lambda function.

By Position

We wish to exclude one or more columns of a data frame given their positions (where 1 is the left most column) and to return the data frame’s remaining columns.

In this example, we wish to exclude the columns at the first and fifth positions and return the remaining columns.

df_2 = df.drop(columns=df.columns[[0, 1]])

Here is how this works:

  • iloc[], which we have used in Selecting by Position, is not capable of dropping columns. Instead, we use the drop() method of a Pandas data frame.
  • drop() expects column names (labels), therefore we select the column names for the columns at the positions we wish to drop via df.columns[[0,1]].
  • df.columns returns the column names of a data frame (as an index).

Select Complement

We wish to exclude columns that match a condition and return the data frame’s remaining columns.

In this example, we exclude columns that start with old_ and return the remaining columns.

df_2 = df.loc[:, lambda x: ~ x.columns.str.startswith('col_')]

Here is how this works:

  • We simply add ~ prior to the logical expression inside loc[] to toggle True values to False and vice versa, hence selecting the complement of the columns matching the logical expression.
  • We use a lambda function inside loc[] so it is robust to use in a chain where columns may have been altered in previous steps. If that is not the case we can refer to the data frame, here df, directly without a lambda function.

Exclude Data Type

We wish to exclude columns of one or more data types and return the data frame’s remaining columns.

In this example, we wish to exclude both columns of a numeric data type and of a logical data type.

df.select_dtypes(exclude=["number", "boolean"])

Here is how this works:

  • We used select_dtypes() in Selection by Data Type to select columns of one or more data type(s). To do so we passed references to the intended data type(s) to the include argument of select_dtypes(). include is the default argument of select_dtypes() so we didn’t have to specify it explicitly.
  • select_dtypes() has another argument called exclude which works in the same way as include but drops columns of the matching data type(s).
  • We can pass to exclude a single data type or a list of data types. See Selection by Data Type for how to refer to the most common data types.
PYTHON
I/O