Sequential Aggregation

We wish to aggregate data in a way that can only be produced by carrying out multiple aggregation operations in sequence.

An example is computing the monthly average of the number of daily events. To do that, we would need to carry out two aggregations in sequence:

  1. Group by day and then compute the number of events for each day.
  2. Group the output of the first step by month and then compute the average.
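The two steps above can be sketched as follows. This is a minimal sketch; the events data frame and its when column are hypothetical, and floor_date() from lubridate is used to truncate timestamps to the day and month.

```r
library(dplyr)
library(lubridate)

# Hypothetical event log: one row per event with its date
events = tibble(when = as.Date('2024-01-01') + c(0, 0, 1, 1, 1, 35))

monthly_avg = events %>%
  group_by(day = floor_date(when, unit = 'day')) %>%
  summarize(daily_events = n()) %>%                 # step 1: events per day
  group_by(month = floor_date(day, unit = 'month')) %>%
  summarize(avg_daily_events = mean(daily_events))  # step 2: monthly average
# January has days with 2 and 3 events -> average 2.5; February has one day with 1 event
```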

We will cover two common scenarios of sequential aggregation:

  1. Grouping Column Subset: We start by aggregating by a set of columns, then we aggregate by a subset of that original set. For instance, we aggregate by location and category and compute the number of events, then group by location alone and compute the median number of events per category for that location.
  2. Lower Time Resolution: A common scenario is sequential aggregation over time, where we first aggregate over a smaller time span (higher resolution), then over a larger time span (lower resolution). A common example is computing the average, over say a year, of a sum computed over each month. In a commercial setting, this may be average monthly customer spend.
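The monthly-spend example can be sketched as follows. The transactions data frame and its customer, when, and amount columns are hypothetical; we sum spend per customer per month, then average those monthly totals per customer.

```r
library(dplyr)
library(lubridate)

# Hypothetical transaction log: one row per purchase
transactions = tibble(
  customer = c('a', 'a', 'a', 'b'),
  when     = as.Date(c('2024-01-05', '2024-01-20', '2024-02-03', '2024-01-10')),
  amount   = c(10, 20, 30, 5))

avg_monthly_spend = transactions %>%
  group_by(customer, month = floor_date(when, unit = 'month')) %>%
  summarize(monthly_spend = sum(amount), .groups = 'drop_last') %>%  # total per customer-month
  summarize(avg_monthly_spend = mean(monthly_spend))                 # average over months
# customer 'a' spends 30 in January and 30 in February -> average 30; 'b' spends 5 in January
```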

Grouping Column Subset

We wish to aggregate by a set of columns and then aggregate the result by a subset of those columns.

In this example, for each value of the column col_1 in the data frame df, we wish to obtain the average number of rows and average sum of values of col_3 for groups defined by the combinations of col_1 and col_2.

library(dplyr)

df_2 = df %>%
  group_by(col_1, col_2) %>%        # first aggregation: by col_1 and col_2
  summarize(
    count = n(),
    value = sum(col_3)) %>%
  group_by(col_1) %>%               # second aggregation: by col_1 alone
  summarize(
    avg_count = mean(count),
    avg_value = mean(value))

Here is how this works:

  • We perform a sequence of two data aggregation operations as follows:
  • First Aggregation: By col_1 and col_2
    • We first group the rows in df by the values in the col_1 and col_2 columns using the group_by() function.
    • Then, for each group, we compute the number of rows in the group (i.e., the count) and the sum of the values in the col_3 column via the first call to summarize().
    • The output of the first aggregation is a new dataframe with one row for each unique combination of values of the columns col_1 and col_2, and two new columns count and value.
  • Second Aggregation: By col_1
    • We group the resulting dataframe by the values of the column col_1.
    • Then, for each group, we compute the average count and the average sum via the second call to summarize().
    • The output of the second aggregation is a new dataframe with one row for each unique value of the column col_1 and two new columns avg_count and avg_value.
  • Note that, by default, summarize() drops the last grouping column (also known as a grouping level), so we could skip the second call to group_by() and get the same result. We kept it here for clarity. See Grouping.
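To make the two steps concrete, here is the same pipeline on a small hypothetical data frame. Since summarize() drops the last grouping level by default, the second call to group_by() is omitted here; .groups = 'drop_last' makes that behavior explicit.

```r
library(dplyr)

# Hypothetical data frame with two grouping columns and one value column
df = tibble(
  col_1 = c('x', 'x', 'x', 'y'),
  col_2 = c('p', 'p', 'q', 'p'),
  col_3 = c(1, 2, 3, 4))

df_2 = df %>%
  group_by(col_1, col_2) %>%
  summarize(count = n(), value = sum(col_3), .groups = 'drop_last') %>%
  summarize(avg_count = mean(count), avg_value = mean(value))
# 'x' has groups (x,p) with count 2, sum 3 and (x,q) with count 1, sum 3
#   -> avg_count 1.5, avg_value 3
# 'y' has one group (y,p) with count 1, sum 4 -> avg_count 1, avg_value 4
```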

Lower Time Resolution

We wish to compute a lower resolution aggregation of a higher resolution aggregation of data.

In this example, we wish to aggregate the data frame df over a date time column col_1 such that for each day, we obtain the average hourly number of rows and average hourly sum of values of the column col_2.

library(dplyr)
library(lubridate)

df_2 = df %>%
  group_by(col_1_hour = round_date(col_1, unit = 'hour')) %>%    # first aggregation: by hour
  summarize(
    hour_count = n(),
    hour_value = sum(col_2)) %>%
  group_by(col_1_day = round_date(col_1_hour, unit = 'day')) %>% # second aggregation: by day
  summarize(
    avg_count = mean(hour_count),
    avg_value = mean(hour_value))

Here is how this works:

  • We perform a sequence of two data aggregation operations at successively lower time resolutions as follows.
  • First Aggregation: By Hour
    • We first group the rows in df by the values in the column col_1, rounded to the nearest hour using the round_date() function from the lubridate package.
    • We give this newly created grouping variable the name col_1_hour. We discuss creating new grouping columns inside group_by() in Grouping.
    • For each group, we compute the number of rows in the group (i.e., the count) and the sum of the values of the column col_2 via the first call to summarize().
    • The output of the first aggregation is a new dataframe with one row for each unique hour in the col_1 column, and the values in the hour_count and hour_value columns are the count and the sum of values of the column col_2 for each hour, respectively.
  • Second Aggregation: By Day
    • We group the resulting dataframe by the day on which each hour occurred.
    • We then compute the average count and average value for each day via the second call to summarize().
    • The output is a new dataframe with one row for each unique day and two new columns avg_count and avg_value.
  • For a more detailed coverage of date time operations, see Date Time Operations.
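Here is the same hour-then-day pipeline run on a small hypothetical data frame, so the intermediate hourly rows and the final daily averages can be traced by hand:

```r
library(dplyr)
library(lubridate)

# Hypothetical timestamped data: two rows in one hour, one in the next hour,
# and one on the following day
df = tibble(
  col_1 = as.POSIXct('2024-01-01 00:10:00', tz = 'UTC') + c(0, 60, 3600, 86400),
  col_2 = c(1, 2, 3, 4))

df_2 = df %>%
  group_by(col_1_hour = round_date(col_1, unit = 'hour')) %>%
  summarize(hour_count = n(), hour_value = sum(col_2)) %>%
  group_by(col_1_day = round_date(col_1_hour, unit = 'day')) %>%
  summarize(avg_count = mean(hour_count), avg_value = mean(hour_value))
# Day 1 has hours with counts 2 and 1 (sums 3 and 3) -> avg_count 1.5, avg_value 3
# Day 2 has one hour with count 1 (sum 4)           -> avg_count 1,   avg_value 4
```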