Conditional Aggregation

We wish to carry out data aggregation operations conditionally. In other words, the data aggregation operation we wish to carry out is not simply applied to all rows, rather rows or groups are treated differently based on one or more conditions.

In this section, we will cover the following conditional aggregation scenarios:

  1. Subset Column where we show how to include only a subset of a column’s values in the aggregation operation e.g. aggregate a numeric column by computing the sum of the positive values only.
  2. Filter Group where we show how to combine filtering with aggregation to first filter a group and then carry out aggregation operations on the remaining rows; e.g. compute aggregations for each group only for the n rows where a particular column has its largest values.
  3. Conditional Choice of Aggregation Function where we cover how to pick the aggregation function to use for a group based on a condition and return the outputs for the groups as a single summary column; e.g. use a different aggregation function based on the size of the group.
  4. Conditional Transformation where we cover how to combine conditional transformation and aggregation to conditionally alter data and then carry out the aggregation operation on the altered data; e.g. apply a multiplier that depends on the value before taking the sum.

Subset Column

We wish to include only a subset of a column’s values in the aggregation operation

In this example, we wish to summarize the table table_1 grouped by the values of the column col_1 by computing two data aggregations: (1) The sum of positive values of col_2 for each group and (2) the sum of values of col_2 for the rows in each group where the values in the col_3 column are equal to "b".

SELECT col_1,
       SUM(CASE WHEN col_2 > 0 THEN col_2 END)   col_2_s1,
       SUM(CASE WHEN col_3 = 'b' THEN col_2 END) col_2_s2,
FROM table_1
GROUP BY col_1;

Here is how this works:

  • GROUP BY accepts one or more columns and then assigns the rows of the table to a group according to the values of the grouping column(s). See Summary Table.
  • The CASE expression is being used to determine which rows should be included in each sum.
  • We use the CASE expression to subset the values of the column being aggregated i.e. to select only the values that meet a particular condition:
    • In WHEN col_2 > 0, we select only the positive values of column col_2.
    • In WHEN col_3 = 'b', we select only the values of column col_2 that have corresponding values of b in the column col_3.
    • THEN is used to return the value we want to aggregate on based on the condition.

Filter Group

We wish carry out data aggregation operations on a subset of the rows of each group.

In this example, we wish to summarize the table table_1 grouped by the values of the column col_1 by computing two data aggregations: (1) The sum of the values of the column col_3 for the rows where the column col_2 takes its three largest values and (2) the mean of the values of the column col_4 also for the rows where the column col_2 takes its three largest values.

WITH table_2 AS (SELECT *
                 FROM table_1
                     QUALIFY ROW_NUMBER() OVER
                         (PARTITION BY col_1 ORDER BY col_2 DESC ) <= 3)
SELECT col_1, SUM(col_3) col_3_sum, AVG(col_4) col_3_avg
FROM table_2
GROUP BY col_1;

Here is how this works:

  • We start by filtering the rows within each group where the column col_2 has its 3 largest values using QUALIFY ROW_NUMBER(). See Filtering Groups by Rank.
  • Since ROW_NUMBER() is a window function and cannot be used with a GROUP BY, we do the operation over two steps.
  • We then carry out the desired aggregation in the second step using a GROUP BY.
    • The values of col_3_sum are computed by using the SUM() function to compute the sum of the values in column col_3 for each group.
    • The values of col_4_mean are computed by using the AVG() function to compute the mean of the values in column col_4 for each group.

Aggregation Function

We wish to use a pick the aggregation function to use for a group based on a condition and return the outputs for the groups as a single summary column.

In this example, we wish to summarize the table table_1 grouped by the values of the column col_1 by computing a data aggregation operation that depends on the size of the group. If the group size is less than 3 we return 0.5 times the group size but if the group size is greater than 3 we return the sum of 0.1 times the values of the column col_2.

SELECT col_1,
       CASE
           WHEN COUNT(1) < 3 THEN 0.5 * COUNT(1)
           ELSE SUM(0.1 * col_2) END AS col_2_s
FROM table_1
GROUP BY col_1;

Here is how this works:

  • GROUP BY accepts one or more columns and then assigns the rows of the table to a group according to the values of the grouping column(s). See Summary Table.
  • The aggregation we wish to perform for each group depends on the size of the group. We use the CASE expression to decide on the appropriate aggregation function to execute for each group according the value of the condition as follows:
    • If the number of rows in a group is less than 3, then we return 0.5 * COUNT(1), where COUNT(1) is the number of rows in the group.
    • If the number of rows is greater than or equal to 3, then we return SUM(0.1 * col_2), which is the sum of the values in the col_2 column multiplied by 0.1.
    • We cover the user CASE expression and other constructs for conditional transformation in Conditional Transformation.

Conditional Transformation

We wish to conditionally alter the data and then carry out an aggregation operation on the altered data.

In this example, we wish to summarize the table table_1 grouped by the values of the column col_1 by computing the weighted sum of the values of the column col_2 where we wish to have values smaller than 10 be given a weight of 2.

SELECT col_1,
       SUM(CASE
               WHEN col_2 < 10 THEN col_2 * 2
               ELSE col_2 END) col_2_s
FROM table_1
GROUP BY col_1

Here is how this works:

  • GROUP BY accepts one or more columns and then assigns the rows of the table to a group according to the values of the grouping column(s). See Summary Table.
  • We wish to modify the data based on a condition and then perform the data aggregation operation on the modified data.
    • We use CASE expressoin to check if each value is less than 10. If it is, the value is multiplied by 2. Otherwise, the original value is used. We cover this in detail in Conditional Transformation.
    • Then, the SUM() method is called to sum up all the values in the group (after they have been multiplied by 2 or not), and this sum is returned as the value for that group.
SQL
I/O