We wish to aggregate data in a way that can only be produced by carrying out multiple aggregation operations in sequence.

An example is computing the monthly average of the number of daily events. To do that, we would need to carry out two aggregations in sequence:

- Group by day and then compute the number of events for each day.
- Group the output of the first step by month and then compute the average.
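The two steps above can be sketched as follows. The dataframe and the column name `event_time` are made up for illustration; each row is assumed to be one event timestamp.

```python
import pandas as pd

# Hypothetical event log; the column name event_time is an assumption.
events = pd.DataFrame({
    'event_time': pd.to_datetime([
        '2023-01-01 09:00', '2023-01-01 17:30', '2023-01-02 11:00',
        '2023-02-10 08:00', '2023-02-10 12:00',
        '2023-02-11 15:00', '2023-02-11 16:00',
    ])
})

# Step 1: group by day and count the events on each day.
daily_counts = events.groupby(events['event_time'].dt.date).size()

# Step 2: group the daily counts by month and average them.
monthly_avg = daily_counts.groupby(
    pd.to_datetime(daily_counts.index).to_period('M')).mean()
```

Here January has daily counts of 2 and 1 (average 1.5), and February has daily counts of 2 and 2 (average 2.0).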

We will cover two common scenarios of sequential aggregation:

- **Grouping Column Subset:** We start by aggregating over a set of columns, then we aggregate over a subset of that original set. For instance, we aggregate by location and category and compute the number of events, then group by location alone and compute the median number of events per category for that location.
- **Lower Time Resolution:** A common sequential aggregation scenario is sequential aggregation over time, where we first aggregate over a smaller time span (higher resolution), then over a larger time span (lower resolution). A common example is to compute the average over, say, a year of a sum computed over each month. In a commercial setting, this may be average monthly customer spend.

We wish to aggregate data in a way that can only be produced by carrying out multiple aggregation operations in sequence.

In this example, for each value of the column `col_1` in the data frame `df`, we wish to obtain the average number of rows and the average sum of values of `col_3` for groups defined by the combinations of `col_1` and `col_2`.

```
df_2 = (df
    .groupby(['col_1', 'col_2'], as_index=False)
    .agg(
        count=('col_3', 'size'),
        value=('col_3', 'sum'))
    .groupby('col_1', as_index=False)
    .agg(
        avg_count=('count', 'mean'),
        avg_value=('value', 'mean')))
```
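As a quick check, here is the recipe running on a tiny made-up dataframe:

```python
import pandas as pd

# Made-up data for illustration.
df = pd.DataFrame({
    'col_1': ['a', 'a', 'a', 'b', 'b'],
    'col_2': ['x', 'x', 'y', 'x', 'y'],
    'col_3': [1, 2, 3, 4, 5],
})

df_2 = (df
    .groupby(['col_1', 'col_2'], as_index=False)
    .agg(
        count=('col_3', 'size'),      # rows per (col_1, col_2) group
        value=('col_3', 'sum'))       # sum of col_3 per group
    .groupby('col_1', as_index=False)
    .agg(
        avg_count=('count', 'mean'),  # average group size per col_1
        avg_value=('value', 'mean'))) # average group sum per col_1
```

For `col_1 == 'a'`, the (`col_1`, `col_2`) groups have sizes 2 and 1 and sums 3 and 3, so `avg_count` is 1.5 and `avg_value` is 3.0.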

Here is how this works:

- We perform a sequence of two data aggregation operations as follows:
  - First Aggregation: By `col_1` and `col_2`
    - We first group the rows in `df` by the values in the `col_1` and `col_2` columns using the `groupby()` function.
    - Then, for each group, we compute the number of rows in the group (i.e., the size) and the sum of the values in the `col_3` column via the first call to `agg()`.
    - The output of the first aggregation is a new dataframe with one row for each unique combination of values of the columns `col_1` and `col_2`, and two columns `count` and `value`.
  - Second Aggregation: By `col_1`
    - We group the resulting dataframe by the values of the column `col_1`.
    - Then, for each group, we compute the average count and average sum via the second call to `agg()`.
    - The output of the second aggregation is a new dataframe with one row for each unique value of the column `col_1` and two columns `avg_count` and `avg_value`.
- Note that in both calls to `groupby()` we set `as_index=False` so the grouping columns are retained as columns in the output and not pushed into the `Index`. While this is often convenient, as in this case, there are situations when letting `groupby()` create an `Index` from the grouping columns is the way to go. We cover some such scenarios in Grouping.

We wish to compute a lower resolution aggregation of a higher resolution aggregation of data.

In this example, we wish to aggregate the data frame `df` over a date time column `col_1` such that for each day, we obtain the average hourly number of rows and the average hourly sum of values of the column `col_2`.

```
df_3 = (df
    .resample('H', on='col_1')
    .agg(
        count=('col_2', 'size'),
        value=('col_2', 'sum'))
    .resample('D')
    .agg(
        avg_count=('count', 'mean'),
        avg_value=('value', 'mean'))
    .reset_index())
```

Here is how this works:

- We perform a sequence of two data aggregation operations at successively lower time resolutions as follows.
  - First Aggregation: By Hour
    - In `resample('H', on='col_1')`, we use the `resample()` function to group the rows into hourly intervals extracted from the values of the column `col_1`.
    - Consider `resample()` as an equivalent to `groupby()` that makes it easier to group by different spans of time. We cover `resample()` in some detail in Date Time Operations.
    - For each group, we compute the number of rows in the group (i.e., the size) and the sum of the values of the column `col_2` via the first call to `agg()`.
    - The output of the first aggregation is a new dataframe with one row for each hour covered by `col_1`, where the values in the `count` and `value` columns are the count and the sum of values of the column `col_2` for each hour, respectively.
  - Second Aggregation: By Day
    - We group the resulting dataframe into daily intervals, i.e. by the day on which each hour occurred, via `resample('D')`.
    - Note that we don't need to pass a column name to the second call to `resample()` because the output of the first call to `resample()` has a `DatetimeIndex` which holds the hours.
    - We then compute the average count and average value for each day via the second call to `agg()`.
    - The output is a new dataframe with one row for each unique day, and two columns `avg_count` and `avg_value`.
- We use the `reset_index()` method to reset the index of the resulting dataframe, which moves the `DatetimeIndex` back into a column `col_1` and replaces it with a default sequential index starting from 0.
- For a more detailed coverage of date time operations, see Date Time Operations.
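Running the recipe on a small made-up frame shows the numbers concretely. One behavior worth noting: `resample()` emits empty bins for hours with no rows (count 0, sum 0), and those zeros are included in the daily averages.

```python
import pandas as pd

# Made-up event data: timestamps in col_1, values in col_2.
df = pd.DataFrame({
    'col_1': pd.to_datetime([
        '2023-05-01 00:10', '2023-05-01 00:40',  # hour 00 on May 1
        '2023-05-01 01:20',                      # hour 01 on May 1
        '2023-05-02 09:15', '2023-05-02 09:45',  # hour 09 on May 2
    ]),
    'col_2': [10, 20, 30, 40, 50],
})

df_3 = (df
    .resample('H', on='col_1')     # hourly bins, empty hours get 0
    .agg(
        count=('col_2', 'size'),
        value=('col_2', 'sum'))
    .resample('D')                 # daily bins over the hourly frame
    .agg(
        avg_count=('count', 'mean'),
        avg_value=('value', 'mean'))
    .reset_index())
```

May 1 spans 24 hourly bins with 3 rows and a value sum of 60 in total, so `avg_count` is 3/24 = 0.125 and `avg_value` is 60/24 = 2.5; May 2 spans only the 10 bins up to the last timestamp, giving 0.2 and 9.0.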
