Sequential Aggregation

We wish to aggregate data in a way that can only be produced by carrying out multiple aggregation operations in sequence.

An example is computing the monthly average of the number of daily events. To do that, we would need to carry out two aggregations in sequence (a sketch follows the list below):

  1. Group by day and then compute the number of events for each day.
  2. Group the output of the first step by month and then compute the average.
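For illustration, here is a minimal sketch of this two-step aggregation, assuming a hypothetical table events with a datetime column event_time and BigQuery-style DATETIME_TRUNC (the table and column names are assumptions, not part of the original schema):

WITH daily_counts AS
         (SELECT DATETIME_TRUNC(event_time, DAY) AS event_day,
                 COUNT(1)                        AS day_count
          FROM events
          -- One row per day: the daily event count.
          GROUP BY DATETIME_TRUNC(event_time, DAY))
SELECT DATETIME_TRUNC(event_day, MONTH) AS event_month,
       AVG(day_count)                   AS avg_daily_count
FROM daily_counts
-- One row per month: the average of the daily counts in that month.
GROUP BY DATETIME_TRUNC(event_day, MONTH);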

We will cover two common scenarios of sequential aggregation:

  1. Grouping Column Subset: We start by aggregating by a set of columns, then we aggregate the result by a subset of that original set. For instance, we aggregate by location and category and compute the number of events, then group by location alone and compute the median number of events per category for that location (see the sketch after this list).
  2. Lower Time Resolution: A common scenario is sequential aggregation over time, where we first aggregate over a smaller time span (higher resolution) and then over a larger time span (lower resolution). A typical example is computing the average over, say, a year of a sum computed over a month; in a commercial setting, this may be the average monthly customer spend.
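As a sketch of the first scenario, assuming a hypothetical table events with columns location and category, and using BigQuery's APPROX_QUANTILES to obtain an approximate median (dialects differ in how they expose a median aggregate):

WITH location_category_counts AS
         (SELECT location,
                 category,
                 COUNT(1) AS event_count
          FROM events
          -- One row per (location, category) pair with its event count.
          GROUP BY location, category)
SELECT location,
       -- APPROX_QUANTILES(x, 2) returns [min, approximate median, max].
       APPROX_QUANTILES(event_count, 2)[OFFSET(1)] AS median_event_count
FROM location_category_counts
GROUP BY location;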

Grouping Column Subset

We wish to aggregate data by a set of grouping columns and then aggregate the result by a subset of those columns.

In this example, for each value of the column col_1 in the table table_1, we wish to obtain the average number of rows and average sum of values of col_3 for groups defined by the combinations of col_1 and col_2.

WITH table_2 AS
         (SELECT col_1,
                 col_2,
                 COUNT(1)   AS row_count,
                 SUM(col_3) AS col_3_sum
          FROM table_1
          GROUP BY col_1, col_2)
SELECT col_1,
       AVG(row_count) AS avg_count,
       AVG(col_3_sum) AS avg_value
FROM table_2
GROUP BY col_1;

Here is how this works:

  • We perform a sequence of two data aggregation operations as follows:
  • First Aggregation: By col_1 and col_2
    • We first group the rows in table_1 by the values in the col_1 and col_2 columns using the GROUP BY clause.
    • Then, for each group, we compute the number of rows in the group (i.e., the count) and the sum of the values in the col_3 column; this is done in the table_2 CTE.
    • The output of the first aggregation is a new table with one row for each unique combination of values of the columns col_1 and col_2, and two columns row_count and col_3_sum.
  • Second Aggregation: By col_1
    • We group the resulting table by the values of the column col_1.
    • Then, for each group, we compute the average count and the average sum.
    • The output of the second aggregation is a new table with one row for each unique value of the column col_1 and two columns avg_count and avg_value.
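The same result can be obtained without a CTE by placing the first aggregation in a subquery in the FROM clause. Here is an equivalent sketch:

SELECT col_1,
       AVG(row_count) AS avg_count,
       AVG(col_3_sum) AS avg_value
FROM (SELECT col_1,
             col_2,
             COUNT(1)   AS row_count,
             SUM(col_3) AS col_3_sum
      FROM table_1
      -- First aggregation: one row per (col_1, col_2) combination.
      GROUP BY col_1, col_2) AS table_2
-- Second aggregation: one row per value of col_1.
GROUP BY col_1;

A CTE is often the more readable choice when the sequence grows beyond two steps, since each step gets a name.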

Lower Time Resolution

We wish to compute a lower resolution aggregation of a higher resolution aggregation of data.

In this example, we wish to aggregate the table table_1 over a date time column col_1 such that for each day, we obtain the average hourly number of rows and average hourly sum of values of the column col_2.

WITH table_2 AS
         (SELECT DATETIME_TRUNC(col_1, HOUR) AS col_1_hour,
                 COUNT(1)                    AS hour_count,
                 SUM(col_2)                  AS hour_value
          FROM table_1
          GROUP BY DATETIME_TRUNC(col_1, HOUR))

SELECT DATETIME_TRUNC(col_1_hour, DAY) AS col_1_day,
       AVG(hour_count)                 AS avg_count,
       AVG(hour_value)                 AS avg_value
FROM table_2
GROUP BY DATETIME_TRUNC(col_1_hour, DAY);

Here is how this works:

  • We perform a sequence of two data aggregation operations at successively lower time resolutions as follows.
  • First Aggregation: By Hour
    • We first group the rows in table_1 by the values in the column col_1, truncated to the hour using the DATETIME_TRUNC(datetime, HOUR) function.
    • We repeat this expression in the GROUP BY clause because, in many SQL dialects, we cannot refer to a column alias created in the same SELECT statement.
    • For each group, we compute the number of rows in the group (i.e. the count) and the sum of the values of the column col_2.
    • The output of the first aggregation is a new table with one row for each unique hour in the col_1 column, and the values in the hour_count and hour_value columns are the count and the sum of values of the column col_2 for each hour, respectively.
  • Second Aggregation: By Day
    • We group the resulting table by the day on which each hour occurred.
    • We then compute the average count and average value for each day.
    • The output is a new table with one row for each unique day, and two columns avg_count and avg_value.
  • For a more detailed coverage of date time operations, see Date Time Operations.
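To tie this back to the introduction, the average monthly customer spend example combines a grouping column with a time resolution. Here is a hypothetical sketch, assuming a table transactions with columns customer_id, transaction_time, and amount (these names are assumptions for illustration only):

WITH monthly_spend AS
         (SELECT customer_id,
                 DATETIME_TRUNC(transaction_time, MONTH) AS spend_month,
                 SUM(amount)                             AS month_total
          FROM transactions
          -- One row per customer per month: that month's total spend.
          GROUP BY customer_id, DATETIME_TRUNC(transaction_time, MONTH))
SELECT customer_id,
       DATETIME_TRUNC(spend_month, YEAR) AS spend_year,
       AVG(month_total)                  AS avg_monthly_spend
FROM monthly_spend
-- One row per customer per year: the average of that year's monthly totals.
GROUP BY customer_id, DATETIME_TRUNC(spend_month, YEAR);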