Common Filtering Scenarios

In this section, we cover the most common scenarios of filtering data frame rows by column values.

Logical Column

We wish to filter by a logical column to return the rows where the logical column is True.

In this example, we wish to filter the rows of the data frame df where the logical column ‘col_3’ has a value of True.

df_2 = df.loc[df['col_3']]

Here is how this works:

  • A column of a logical data type (comprised of True and / or False values) can be passed directly to loc[] which would return the rows where the passed column (in this example: col_3) takes a value of True.
  • Although there are several ways to filter the rows of a data frame by a logical condition on column values, loc[] is the preferred approach in Pandas.
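To make this concrete, here is a minimal, self-contained sketch with made-up data (the column names and values are illustrative):

```python
import pandas as pd

# Hypothetical data frame with a logical column 'col_3'
df = pd.DataFrame({
    'col_1': [10, 20, 30, 40],
    'col_3': [True, False, True, False],
})

# Passing the logical column directly to loc[] keeps the rows where it is True
df_2 = df.loc[df['col_3']]
```

Here df_2 contains only the first and third rows, where col_3 is True.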

Compare to Scalar

We wish to obtain (filter) the rows of a data frame where the value of a particular column meets a condition that involves comparison to a scalar value.

In this example, we wish to obtain the rows of the data frame df where the numerical column ‘col_1’ takes a value greater than 0.

df_2 = df.loc[df['col_1'] > 0]

Here is how this works:

  • The expression df['col_1'] > 0 yields a logical Series where values of col_1 that are greater than 0 are True and all other values (those less than or equal to 0) are False.
  • The output logical Series of evaluating df['col_1'] > 0 is passed to the first argument of loc[] to subset (extract) the corresponding rows.
  • In this example we used a greater than comparison >. In general we can use any comparison operation on numerical columns including smaller than <, equal ==, greater than or equal >=, or smaller than or equal <=. See Numerical Operations for more details.
  • In addition to applying comparison operations to numerical columns, we can apply the appropriate comparison operations to non-numeric columns, i.e. String, Logical, Factor, and Date-Time. We cover those in their respective operations sections.
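As a small runnable illustration (the data is made up):

```python
import pandas as pd

# Hypothetical numerical column
df = pd.DataFrame({'col_1': [-2, 0, 3, 5]})

# df['col_1'] > 0 is a logical Series; loc[] keeps the rows where it is True
df_2 = df.loc[df['col_1'] > 0]
```

Only the rows with values 3 and 5 survive; note that 0 itself is excluded since the comparison is strict.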

Compare Columns

We wish to return rows where a comparison between two of the data frame’s columns evaluates to True.

In this example, we wish to return rows where the value of the column col_1 is larger than the value of another column col_2.

df_2 = df.loc[df['col_1'] > df['col_2']]

Here is how this works:

  • For each row, the value of col_1 is compared to the value of col_2. If the value of col_1 is larger than the value of col_2, the expression evaluates to True; otherwise, it evaluates to False.
  • The outcome of evaluating df['col_1'] > df['col_2'] is a Series of boolean True or False values.
  • That logical Series is passed to loc[] which returns the subset of rows of a data frame (in this example: df) corresponding to where the boolean Series is True. In this example, that would be rows where the value of col_1 is larger than the value of col_2.
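A minimal sketch with illustrative data:

```python
import pandas as pd

# Hypothetical data frame with two comparable numerical columns
df = pd.DataFrame({
    'col_1': [5, 1, 7, 2],
    'col_2': [3, 4, 2, 2],
})

# Row-wise comparison of the two columns yields a logical Series
df_2 = df.loc[df['col_1'] > df['col_2']]
```

The last row (2 vs 2) is excluded because the comparison is strict; use >= to keep ties.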

Complement

We wish to obtain rows where a condition is not satisfied.

In this example, we wish to obtain rows where the value of the string column ‘col_2’ is neither ‘a’ nor ‘c’.

No Missing Values

df_2 = df.loc[~ df['col_2'].isin(['a','c'])]

Here is how this works:

  • We use the isin() operator to check whether each value of ‘col_2’ is either ‘a’ or ‘c’. The output is a Series of logical values that is True if the corresponding element of ‘col_2’ is a member of the list and False otherwise. We cover checking for membership in a list in detail in List Operations.
  • We use the complement operator ~ to invert the logical values resulting from isin() so that we have a value of True for rows where the value of ‘col_2’ is neither ‘a’ nor ‘c’.
  • That logical Series is passed to loc[] which returns the subset of rows of a data frame df corresponding to where the boolean Series is True.
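A self-contained sketch with made-up data:

```python
import pandas as pd

# Hypothetical string column with no missing values
df = pd.DataFrame({
    'col_1': [1, 2, 3, 4],
    'col_2': ['a', 'b', 'c', 'd'],
})

# ~ inverts the mask from isin(), keeping rows where col_2 is neither 'a' nor 'c'
df_2 = df.loc[~df['col_2'].isin(['a', 'c'])]
```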

Missing Values

df_2 = (df
        .dropna(subset=['col_2'])
        .loc[lambda x: ~x['col_2'].isin(['a','c'])])

Here is how this works:

  • In pandas, any comparison with a missing value (np.nan) evaluates to False. We would not know whether a False is due to the value not meeting the condition or to the value being missing.
  • This is not a concern when we wish to return the rows where the logical expression evaluates to True. It becomes an issue, however, when we take the complement, since rows with missing values would then be returned as if they satisfied the condition.
  • If we do not want rows with missing values for the columns used in the filtering expression, we should drop the missing values via dropna() before comparing. We cover working with missing values in Missing Values.
  • Note that, inside a chain, the mask passed to loc[] should be built with a lambda (see New Columns below) so that it is evaluated against the data frame after dropna(); a mask built directly from the original df would not align with the reduced index.
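The following self-contained sketch (with illustrative data) shows the pattern:

```python
import pandas as pd
import numpy as np

# Hypothetical string column containing a missing value
df = pd.DataFrame({
    'col_1': [1, 2, 3, 4],
    'col_2': ['a', 'b', np.nan, 'd'],
})

# Drop rows where col_2 is missing, then take the complement of isin();
# the lambda builds the mask from the already-dropped frame so indices align
df_2 = (df
        .dropna(subset=['col_2'])
        .loc[lambda x: ~x['col_2'].isin(['a', 'c'])])
```

df_2 holds the ‘b’ and ‘d’ rows; the row with the missing value is excluded rather than silently misclassified.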

New Columns

If we are filtering rows as part of a chain of data manipulation operations and need to refer to a column that was created earlier in the same chain, we need to refer to it via a lambda function.

df_2 = (df
 .assign(col_2 = 2 * df['col_1'])
 .loc[lambda x: x['col_2'] > 0])

Here is how this works:

  • We can’t refer to columns that were just created in the same chain through the original data frame. We would get an error if we tried to run: df.assign(col_2 = 2 * df['col_1']).loc[df['col_2'] > 0] because df itself has no column col_2.
  • To refer to a column that was created earlier in the same chain, we refer to it via a lambda function, whose argument is the data frame as it exists at that point in the chain.
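A runnable sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'col_1': [-1, 2, 3]})

# col_2 is created inside the chain, so loc[] refers to it via a lambda;
# the lambda's argument x is the data frame as it exists at that step
df_2 = (df
        .assign(col_2=2 * df['col_1'])
        .loc[lambda x: x['col_2'] > 0])
```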

Patchy Index

The rows we get as output from a row filtering operation, like in the scenarios above, retain the indices they had in the original data frame, resulting in a “patchy” index. While this is appropriate in some situations, often we wish the data frame that results from filtering to have a fresh index.

df_2 = (df
    .loc[df['col_1'] > 0]
    .reset_index(drop=True))

Here is how this works:

  • We use the function reset_index() after loc[] to give the new data frame df_2 a fresh index that starts at 0 and increments by 1 for each row.
  • If we do not set the drop parameter as True, the original index is kept in the data frame as a new column.
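A small self-contained sketch (illustrative data):

```python
import pandas as pd

df = pd.DataFrame({'col_1': [-1, 5, -2, 7]})

# Filtering keeps the original row labels (here 1 and 3);
# reset_index(drop=True) replaces them with a fresh 0, 1, ... index
df_2 = (df
        .loc[df['col_1'] > 0]
        .reset_index(drop=True))
```

Without drop=True, the old labels 1 and 3 would be kept as a new column named “index”.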