Filter Groups By Column Value

We have a grouped data frame, and we wish to apply the row filtering logic to each group separately.

In this example, we have a data frame df that is grouped by the column col_1 and we wish to filter rows where the value of the column col_2 is greater than the value of the mean of col_2 for the group.

df_2 = (df
 .groupby('col_1')
 .apply(lambda x: x.loc[x['col_2'] > x['col_2'].mean()]))

Here is how this works:

  • loc[] can’t be applied to DataFrameGroupBy object. If we attempt to apply it, we get a no attribute error.
  • In order to apply a filter to each group of a grouped data frame (a DataFrameGroupBy object), the preferred approach is to use apply().
  • Inside apply() we use a lambda function to which each group is passed as a DataFrame (in this example referred to as x).
  • We can then apply our filtering logic to each group (a DataFrame) via loc[] like we would filter any regular DataFrame.
  • In this example, we compare the value of x['col_2'] to the mean of x['col_2'] for the group which is computed via x['col_2'].mean().
  • For each group, rows where the value of col_2 is larger than the mean value of col_2 yield True and are returned by loc[].
  • The rows returned from each group are then appended together in the same order as the groups and the resulting DataFrame is returned.

Alternatively:

We can perform grouped filtering by performing a grouped transformation inside loc[].

df_2 = df.loc[
    df['col_2'] >  df.groupby('col_1')['col_2'].transform('mean')]

Here is how this works:

  • We use groupby() and transform() to perform a grouped transformation where we compute the mean for each group.
  • transform() repeats the value produced for each group as many times as there are rows in each group. The result is a Series with the same number of rows as the original data frame (in this example df) where the value is the same for each group. See Grouped Transformation.
  • We compare that resulting Series of mean group values with col_2 and return True where col_2 takes a value greater than the mean value for the group. The corresponding columns are then returned by loc[].
  • This solution may be an appropriate approach when we wish to apply a filter in a grouped manner while we do not necessarily have to start from a data frame that is already grouped; i.e. we do not have to start with a DataFrameGroupBy object.
PYTHON
I/O