Column Selection

We wish to identify the columns on each of which we will apply filtering logic.

We will cover the following scenarios

  • All Columns where we cover how to apply a logical expression to all columns of a data frame and return the rows for which any of the columns satisfy the expression.
  • Explicit Selection where we cover how to apply a logical expression to each of a set of explicitly selected columns of a data frame (e.g. by spelling out the names of the columns of interest) and return the rows for which any of the columns satisfy the expression.
  • Implicit Selection where we cover how to apply a logical expression to each of a set of implicitly selected columns of a data frame (e.g. by selecting columns whose names contain a certain substring) and return the rows for which any of the columns satisfy the expression.

This section is complemented by

  • Function Specification where we cover how to specify one or more logical expressions or functions to apply to the selected set of columns
  • Relationship Specification where we cover how to combine the results of applying the specified function(s) to the specified column(s) in either an AND manner or an OR manner.

All Columns

We wish to apply a logical expression to every column and to return any row for which any column satisfies that logical expression.

In this example, we wish to return any row in the data frame df for which any column has a missing value NA.

df.loc[df.apply(pd.isna).any(axis=1)]

Here is how this works:

  • In df.apply(pd.isna) we go over each value of the data frame df and check if it is missing. The output is a data frame of the same size as df and where a value is True if the corresponding value in the data frame df is missing (see Missing Values).
  • We use any() with axis=1 to logically combine the values for each row of the data frame of logical True or False values produced by apply(pd.isna). The output is True for a row if the filtering condition evaluates to True for any column.
  • Extensions:
    • The counterpart of any() is all() which requires that the logical expression evaluates to TRUE for all columns. See Relationship Selection.
    • We get a patchy index i.e. rows have their original index value. To get a fresh index, we add .reset_index(drop=True). See Patchy Index under Common Scenarios.
    • If we are filtering rows as part of a chain of data manipulation operations and need to refer to a row that was created earlier in the same chain, we need to refer to it via a lambda function. See New Columns under Common Scenarios.

Alternatively:

df.loc[df.isna().any(axis=1)]

Here is how this works:

  • If the predicate function we wish to apply happens to be available as a data frame method like isna() in this example, we can apply it directly without the need for apply(). We do so here via df.isna().
  • We then apply any() with axis=1 to identify rows where the condition evaluates to True for any of the columns.

Explicit Selection

We wish to apply a logical expression to a set of explicitly specified column and to return any row for which any of those columns satisfies the logical expression.

In this example, we wish to return any row in the data frame df for which any of the columns ‘col_1' , ‘col_3’, or ‘col_5' has a missing value NA.

df_2 = df.loc[(df[['col_1', 'col_3', 'col_5']]
               .apply(pd.isna)
               .any(axis=1))]

Here is how this works:

  • In df[['col_1', 'col_3', 'col_5']], we select the columns of interest by name. See Basic Selection for a detailed coverage of explicit column selection scenarios.
  • The rest of the code works as described in the “All Columns” scenario above.

Implicit Selection

We wish to apply a logical expression to a set of implicitly specified column and to return any row for which any of those columns satisfies the logical expression. Implicit column selection is when we do not spell out the column names or positions explicitly but rather identify the columns via a property of their name or their data.

In this example, we wish to return any row in the data frame df for which any column whose name starts with the substring ‘cvr_’ is missing.

df_2 = df.loc[(df
               .loc[:, df.columns.str.contains('cvr_', regex=False)]
               .apply(pd.isna)
               .any(axis=1))]

Here is how this works:

  • We use df.columns.str.contains('cvr_', regex=False) to select all columns whose name starts with the substring ‘cvr_’. See Implicit Selection for a coverage of the most common scenarios of implicit column selection including by name pattern, data type, and Criteria satisfied by the column’s data.
  • The rest of the code works as described in the “All Columns” scenario above.
PYTHON
I/O