Filter Groups

Filter a grouped data frame to keep only the groups that meet certain criteria. Note that in this scenario, we are interested in filtering entire groups and not individual rows.

In this example, we wish to keep only groups that have more than one row and where the sum of values of the column col_2 is above 0 for the group.

df_2 = (df
        .groupby('col_1')
        .filter(lambda x: (len(x) > 1) & (sum(x['col_2']) > 0)))

Here is how this works:

  • We use the method filter() of DataFrameGroupBy objects to filter groups.
  • filter() expects a function that when applied to each group (a sub data frame), returns True or False.
  • Groups for which the function returns True are retained and groups for which the function returns False are dropped.
  • In the code above we pass the filtering conditional expression as a lambda function to filter().
  • In len(x) > 1, we check if the current group has more than 1 row.
  • In sum(x['col_2']) > 0, we check if the sum of values of col_2 is greater than 0 for the current group.
  • Groups for which both conditions are True, i.e. that have more than 1 row and that have a sum of values for col_2 greater than 0, are retained (included in the output).

Alternatively:

We can perform grouped filtering by performing a grouped transformation inside loc[].

g_df = df.groupby('col_1')
df.loc[g_df.transform('size').gt(1) & 
       g_df['col_2'].transform('sum').gt(0)]

Here is how this works:

  • We use groupby() and transform() to perform a grouped transformations which we then apply logical conditions to inside loc[].
  • transform() repeats the value produced for each group as many times as there are rows in each group. The result is a Series with the same number of rows as the original data frame (in this example df) where the value is the same for each group.
  • In g_df = df.groupby('col_1'), we group the data frame df by the values of the column col_1. We store the output DataFrameGroupBy object in the object g_df to use later and spare ourselves the code repetition.
  • In g_df.transform('size').gt(1) we compute the size of each group of the grouped data frame g_df and compare that with the value 1. This expression returns True for rows belonging to groups where the group size is greater than 1.
  • In g_df['col_2'].transform('sum').gt(0) we compute the sum of values of the column col_2 for each group of the grouped data frame g_df and compare that with the value 0. This expression returns True for rows belonging to groups where the sum of values of the column col_2 is greater than 0.
  • Rows where both conditions are True, i.e. that belong to groups with a number of rows more than 1 and where the sum of values for col_2 is greater than 0, are retained (included in the output).
  • This may be an appropriate approach when we wish to apply a filter in a grouped manner while we do not necessarily have to start from a data frame that is already grouped; i.e. we do not have to start with a DataFrameGroupBy object.
PYTHON
I/O