Implicit Column Selection

Often we wish to select columns not by spelling out their names or positions, but by criteria that the desired columns satisfy. The three most common scenarios are:

  1. Name Pattern, where we cover how to select columns whose names match a given pattern, e.g. columns whose name contains the string ‘_id’.
  2. Data Type, where we cover how to select columns of one or more data types, e.g. columns with a numeric data type.
  3. Data Criteria, where we cover how to select columns whose data satisfies a certain condition, e.g. columns where the percentage of missing values is below 10%.
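
As a quick orientation, the three scenarios might look like the sketch below. The data frame df and its column names are made up for illustration, and select_dtypes() is used here as a convenient built-in for the data type case; the dedicated sections may approach each scenario differently.

```python
import numpy as np
import pandas as pd

# Hypothetical example data; the column names are assumptions for illustration.
df = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "order_id": [10, 11, 12, 13],
    "city": ["Oslo", None, None, "Rome"],
    "country": ["NO", "IT", "IT", "NO"],
    "amount": [9.5, 3.0, np.nan, 7.25],
})

# 1. Name pattern: columns whose name contains the string '_id'.
by_name = df.loc[:, df.columns.str.contains("_id")]

# 2. Data type: columns with a numeric data type.
by_dtype = df.select_dtypes(include="number")

# 3. Data criteria: columns where fewer than 10% of the values are missing.
by_data = df.loc[:, df.isna().mean() < 0.10]

print(list(by_name.columns))
print(list(by_dtype.columns))
print(list(by_data.columns))
```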

In Multiple Conditions, we cover how to combine multiple conditions in different ways to realize more complex column selection logic.

Implicit column selection works in two steps:

  1. Identification: We write logic (as a named function or an anonymous function) that acts on each column of a data frame to check whether certain conditions are satisfied, and that returns one of three types of output:
    • A boolean vector with one element per column of the data frame, where an element is True if the corresponding column is to be selected.
    • The names of the columns to be selected.
    • The positions of the columns to be selected.
  2. Extraction: The output of step 1 is then passed to a selection operator, e.g. loc[], which extracts the columns from the original data frame.
    • If the output of step 1 is a list of column names or a boolean Series, we pass it to loc[].
    • If the output of step 1 is a list of column positions, we pass it to iloc[].
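
The two steps can be sketched as follows. This is a minimal illustration under assumptions: the data frame df and the helper has_id_suffix() are hypothetical, and the names and positions are derived from the boolean Series only to show that all three output types lead to the same selection.

```python
import pandas as pd

# Hypothetical example data.
df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "name": ["Ann", "Bob", "Cy"],
    "order_id": [7, 8, 9],
})

def has_id_suffix(col):
    """Return True if the column's name ends with '_id' (illustrative helper)."""
    # apply() passes each column as a Series; its name is the column label.
    return col.name.endswith("_id")

# Step 1: identification, expressed as each of the three output types.
mask = df.apply(has_id_suffix)                           # boolean Series indexed by column name
names = mask[mask].index.tolist()                        # names of the columns to select
positions = [i for i, keep in enumerate(mask) if keep]   # positions of the columns to select

# Step 2: extraction.
via_mask = df.loc[:, mask]              # boolean Series -> loc[]
via_names = df.loc[:, names]            # column names -> loc[]
via_positions = df.iloc[:, positions]   # column positions -> iloc[]
```

All three results contain the same columns; which form of output to produce in step 1 is mostly a matter of what the identification logic yields most naturally.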

loc[] and iloc[] are sufficient to handle all implicit column selection scenarios as described above. For completeness, it is worth noting that:

  • We can use the bracket operator [] only if the output of step 1 (identification) is column names.
  • Although it is common to use loc[] for boolean column indexing (i.e. selecting columns via a boolean Series), it is possible to use iloc[] if we convert the Series of boolean values into a NumPy array via to_numpy().
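
Both alternatives can be sketched briefly. The data frame df and the 50% missing-value threshold below are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical example data; None becomes NaN in these numeric columns.
df = pd.DataFrame({
    "a": [1, None, 3],
    "b": [None, None, 6],
    "c": [7, 8, 9],
})

# Identification: True for columns where fewer than half of the values are missing.
mask = df.isna().mean() < 0.5

# Bracket operator: works once the mask has been turned into column names.
names = mask[mask].index.tolist()
via_brackets = df[names]

# iloc[]: works once the boolean Series has been converted to a NumPy array.
via_iloc = df.iloc[:, mask.to_numpy()]
```

Passing the boolean Series itself to iloc[] raises an error, which is why the to_numpy() conversion is needed.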