We wish to aggregate data in a way that can only be produced by carrying out multiple aggregation operations in sequence.
An example is computing the monthly average of number of daily events. To do that we would need to carry out to aggregations in sequence:
We will cover two common scenarios of sequential aggregation:
We wish to aggregate data in a way that can only be produced by carrying out multiple aggregation operations in sequence.
In this example, for each value of the column col_1
in the data frame df
, we wish to obtain the average number of rows and average sum of values of col_3
for groups defined by the combinations of col_1
and col_2
.
df_2 = (df
.groupby(['col_1', 'col_2'], as_index=False)
.agg(
count = ('col_3', 'size'),
value = ('col_3', 'sum'))
.groupby('col_1', as_index=False)
.agg(
avg_count=('count', 'mean'),
avg_value=('value', 'mean')))
Here is how this works:
col_1
and col_2
df
by the values in the col_1
and col_2
columns using the groupby()
function.col_3
column via the first call to agg()
.col_1
and col_2
, and two columns count
and value
.col_1
col_1
.agg()
.col_1
and two columns avg_count
and avg_value
.groupby()
we set as_index=False
so the grouping columns are retained as columns in the output and not pushed into the Index
. While this is often convenient, as in this case, there are situations when letting groupby()
create an Index
from the grouping columns is the way to go. We cover some such scenarios in Grouping.We wish to compute a lower resolution aggregation of a higher resolution aggregation of data.
In this example, we wish to aggregate the data frame df
over a date time column col_1
such that for each day, we obtain the average hourly number of rows and average hourly sum of values of the column col_2
.
df_3 = (df
.resample('H', on='col_1')
.agg(
count=('col_2', 'size'),
value=('col_2', 'sum'))
.resample('D')
.agg(
count=('count', 'mean'),
value=('value', 'mean'))
.reset_index())
Here is how this works:
resample('H', on='col_1')
, we use the resample()
function to group the rows in hourly intervals extracted from the values of the column col_1
.resample()
as an equivalent to group_by()
that makes it easier to group by different spans of time. We cover resample()
in some detail in Date Time Operations.col_2
via the first call to agg()
.col_1
column, and the values in the hour_count
and hour_value
columns are the count and the sum of values of the column col_2
for each hour, respectively.resample('D')
resample()
because the output of the first call to resample()
has a DateTimeIndex()
which holds the hours.agg()
.avg_count
and avg_value
.reset_index()
method to reset the index of the resulting dataframe, which moves the DateTimeIndex
back into a column col_1
and replaces that with a default sequential index starting from 0.