Filter a grouped data frame to keep only the groups that meet certain criteria. Note that in this scenario, we are interested in filtering entire groups and not individual rows.
In this example, we wish to keep only groups that have more than one row and where the sum of values of the column col_2
is above 0 for the group.
df_2 = (df
.groupby('col_1')
.filter(lambda x: (len(x) > 1) & (sum(x['col_2']) > 0)))
Here is how this works:
filter()
of DataFrameGroupBy
objects to filter groups.filter()
expects a function that when applied to each group (a sub data frame), returns True
or False
.True
are retained and groups for which the function returns False
are dropped.lambda
function to filter()
.len(x) > 1
, we check if the current group has more than 1 row.sum(x['col_2']) > 0
, we check if the sum of values of col_2
is greater than 0 for the current group.True
, i.e. that have more than 1 row and that have a sum of values for col_2
greater than 0, are retained (included in the output).Alternatively:
We can perform grouped filtering by performing a grouped transformation inside loc[]
.
g_df = df.groupby('col_1')
df.loc[g_df.transform('size').gt(1) &
g_df['col_2'].transform('sum').gt(0)]
Here is how this works:
groupby()
and transform()
to perform a grouped transformations which we then apply logical conditions to inside loc[]
.transform()
repeats the value produced for each group as many times as there are rows in each group. The result is a Series
with the same number of rows as the original data frame (in this example df
) where the value is the same for each group.g_df = df.groupby('col_1')
, we group the data frame df
by the values of the column col_1
. We store the output DataFrameGroupBy
object in the object g_df
to use later and spare ourselves the code repetition.g_df.transform('size').gt(1)
we compute the size of each group of the grouped data frame g_df
and compare that with the value 1. This expression returns True
for rows belonging to groups where the group size is greater than 1.g_df['col_2'].transform('sum').gt(0)
we compute the sum of values of the column col_2
for each group of the grouped data frame g_df
and compare that with the value 0. This expression returns True
for rows belonging to groups where the sum of values of the column col_2
is greater than 0.True
, i.e. that belong to groups with a number of rows more than 1 and where the sum of values for col_2
is greater than 0, are retained (included in the output).DataFrameGroupBy
object.