We have a grouped data frame, and we wish to apply row filtering logic to each group separately.
In this example, we have a data frame df that is grouped by the column col_1, and we wish to keep the rows where the value of col_2 is greater than the mean of col_2 for the group.
df_2 = (df
    .groupby('col_1')
    .apply(lambda x: x.loc[x['col_2'] > x['col_2'].mean()]))
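To make the solution concrete, here is a minimal sketch on made-up data. Only the names df, df_2, col_1 and col_2 come from the text; the values and the helper variables grouped and loc_failed are assumptions chosen for illustration. It also checks that accessing loc directly on the grouped object raises an AttributeError.

```python
import pandas as pd

# Made-up sample data; only the column names come from the text.
df = pd.DataFrame({
    'col_1': ['a', 'a', 'a', 'b', 'b', 'b'],
    'col_2': [1, 2, 6, 10, 20, 60],
})

grouped = df.groupby('col_1')

# loc is not available on the grouped object itself.
try:
    grouped.loc
    loc_failed = False
except AttributeError:
    loc_failed = True

# Filter each group separately: keep rows above the group's own mean.
# Group 'a' has mean 3 and group 'b' has mean 30.
# Note: some pandas versions emit a deprecation warning here because
# apply() operates on the grouping column as well.
df_2 = grouped.apply(lambda x: x.loc[x['col_2'] > x['col_2'].mean()])

print(loc_failed)
print(df_2['col_2'].tolist())
```

With pandas' default settings, the result of apply() is typically indexed by the group key followed by the original row label, so it may be worth calling reset_index() or passing group_keys=False if the original index is needed.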
Here is how this works:
loc[] cannot be applied to a DataFrameGroupBy object. If we attempt to apply it, we get a "no attribute" error.
To filter the rows of each group (that is, to filter a DataFrameGroupBy object), the preferred approach is to use apply().
Inside apply(), we use a lambda function to which each group is passed as a DataFrame (referred to as x in this example).
We filter each group (a DataFrame) via loc[], just as we would filter any regular DataFrame.
We compare x['col_2'] to the mean of x['col_2'] for the group, which is computed via x['col_2'].mean().
Rows where col_2 is larger than the group mean of col_2 yield True and are returned by loc[].
The filtered groups are combined and a single DataFrame is returned.
Alternatively:
We can perform grouped filtering by using a grouped transformation inside loc[].
df_2 = df.loc[
    df['col_2'] > df.groupby('col_1')['col_2'].transform('mean')]
Here is how this works:
We use groupby() and transform() to perform a grouped transformation where we compute the mean of col_2 for each group.
transform() repeats the value produced for each group as many times as there are rows in that group. The result is a Series with the same number of rows as the original data frame (df in this example), in which every row of a group holds the same value. See Grouped Transformation.
We compare this Series of group means with col_2, which yields True where col_2 takes a value greater than the mean for its group. The corresponding rows are then returned by loc[].
Note that loc[] is applied here to the original data frame df, not to a DataFrameGroupBy object.
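To see the intermediate steps of the transform() approach, here is a minimal sketch on made-up data. The sample values and the names group_means and mask are assumptions introduced for illustration; they do not appear in the original solution.

```python
import pandas as pd

# Made-up sample data; only the column names come from the text.
df = pd.DataFrame({
    'col_1': ['a', 'a', 'a', 'b', 'b', 'b'],
    'col_2': [1, 2, 6, 10, 20, 60],
})

# One group mean per row, aligned with df's index.
group_means = df.groupby('col_1')['col_2'].transform('mean')
print(group_means.tolist())   # [3.0, 3.0, 3.0, 30.0, 30.0, 30.0]

# Boolean mask: True where a row exceeds the mean of its own group.
mask = df['col_2'] > group_means
print(mask.tolist())          # [False, False, True, False, False, True]

# loc[] keeps the rows where the mask is True.
df_2 = df.loc[mask]
print(df_2['col_2'].tolist()) # [6, 60]
```

Because the mask is aligned with df, the filtered rows keep their original index labels (2 and 5 in this sample).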