Sequential Aggregation

We wish to aggregate data in a way that can only be produced by carrying out multiple aggregation operations in sequence.

An example is computing the monthly average of number of daily events. To do that we would need to carry out to aggregations in sequence:

Group by day and then compute the number of events for each day.
Group the output of the first step by month and then compute the average.

We will cover two common scenarios of sequential aggregation:

Grouping Column Subset: We start by aggregating by a set of columns then we aggregate by a subset of that original set. For instance, we aggregate by location and category and compute the number of events then group by location alone and compute the median number of events per category for that location.
Lower Time Resolution: A common sequential aggregation scenario is sequential aggregation over time, where we first aggregate over a smaller time span (higher resolution) then over a larger time span (lower resolution). A common example is to compute the average value over say a year of a sum computed over a month. In a commercial setting, this may be average monthly customer spend.

Grouping Column Subset

We wish to aggregate data in a way that can only be produced by carrying out multiple aggregation operations in sequence.

In this example, for each value of the column col_1 in the data frame df, we wish to obtain the average number of rows and average sum of values of col_3 for groups defined by the combinations of col_1 and col_2.

df_2 = (df
    .groupby(['col_1', 'col_2'], as_index=False)
    .agg(
        count = ('col_3', 'size'),
        value = ('col_3', 'sum'))
    .groupby('col_1', as_index=False)
    .agg(
        avg_count=('count', 'mean'),
        avg_value=('value', 'mean')))

Here is how this works:

We perform a sequence of two data aggregation operations as follows:
First Aggregation: By col_1 and col_2
- We first group the rows in df by the values in the col_1 and col_2 columns using the groupby() function.
- Then, for each group, we compute the number of rows in the group (i.e., the size) and the sum of the values in the col_3 column via the first call to agg() .
- The output of the first aggregation is a new dataframe with one row for each unique combination of values of the columns col_1 and col_2, and two columns count and value.
Second Aggregation: By col_1
- We group the resulting dataframe by the values of the column col_1.
- Then, for each group we compute the average count and average sum for each group via the second call to agg().
- The output of the second aggregation is a new dataframe with one row for each unique value of the column col_1 and two columns avg_count and avg_value.
Note in both calls to groupby() we set as_index=False so the grouping columns are retained as columns in the output and not pushed into the Index. While this is often convenient, as in this case, there are situations when letting groupby() create an Index from the grouping columns is the way to go. We cover some such scenarios in Grouping.

Lower Time Resolution

We wish to compute a lower resolution aggregation of a higher resolution aggregation of data.

In this example, we wish to aggregate the data frame df over a date time column col_1 such that for each day, we obtain the average hourly number of rows and average hourly sum of values of the column col_2.

df_3 = (df
  .resample('H', on='col_1')
  .agg(
      count=('col_2', 'size'),
      value=('col_2', 'sum'))
  .resample('D')
  .agg(
      count=('count', 'mean'),
      value=('value', 'mean'))
  .reset_index())

Here is how this works:

We perform a sequence of two data aggregation operations at successively lower time resolutions as follows.
First Aggregation: By Hour
- In resample('H', on='col_1'), we use the resample() function to group the rows in hourly intervals extracted from the values of the column col_1.
- Consider resample() as an equivalent to group_by() that makes it easier to group by different spans of time. We cover resample() in some detail in Date Time Operations.
- For each group, we compute the number of rows in the group (i.e., the size) and the sum of the values of the column col_2 via the first call to agg().
- The output of the first aggregation is a new dataframe with one row for each unique hour in the col_1 column, and the values in the hour_count and hour_value columns are the count and the sum of values of the column col_2 for each hour, respectively.
Second Aggregation: By Day
- We group the resulting dataframe in daily intervals i.e. by the day on which each hour occurred via resample('D')
- Note that we didn't need to pass a column name to the second call to resample() because the output of the first call to resample() has a DateTimeIndex() which holds the hours.
- We then compute the average count and average value for each day via the second call to agg().
- The output is a new dataframe with one row for each unique day, and two columns avg_count and avg_value.
We use the reset_index() method to reset the index of the resulting dataframe, which moves the DateTimeIndex back into a column col_1 and replaces that with a default sequential index starting from 0.
For a more detailed coverage of date time operations, see Date Time Operations.

Optima.io Reference beta

Sequential Aggregation

Grouping Column Subset

Lower Time Resolution