Summarize Numeric by 2 Factors

We wish to get a summary of a numeric column (e.g. the mean and standard deviation) for each group where the groups are defined by the combination of values of two categorical columns.

Mean

We wish to compute the mean of a numerical column over groups defined by two categorical columns.

In this example, we wish to compute the mean of the numeric column col_1 for each group where the groups are defined by the values of the two categorical columns col_10 and col_12.

BEGIN
    DECLARE v_pivot_column STRING;
    SET v_pivot_column =
            (SELECT STRING_AGG(DISTINCT CONCAT('"', CAST(col_12 AS STRING), '"')),
             FROM refcon.dataset.table_1);

    EXECUTE IMMEDIATE
        FORMAT("""
                  SELECT *
                  FROM (SELECT col_10,
                               col_12,
                               AVG(col_1) AS col_1_avg
                        FROM refcon.dataset.table_1
                        GROUP BY col_10, col_12) PIVOT
                         (avg(col_1_avg) AS col_12 FOR CAST(col_12 AS STRING) IN (%s));
                             """, v_pivot_column);
END;

Here is how this works:

  • We wrap the value of the col_12 with " and then use STRING_AGG DISTINCT to get the unique values for col_12 in the following format :"0","3"`.
  • We use the v_pivot_column to pivot col_12 values to columns and keep col_10 as rows.
  • We use AVG(col_1) to calculate the mean of col_1 per combination.

Sum

We wish to compute the sum of a numerical column over groups defined by two categorical columns.

In this example, we wish to compute the sum of the numeric column col_1 for each group where the groups are defined by the values of the two categorical columns col_10 and col_12.

BEGIN
    DECLARE v_pivot_column STRING;
    SET v_pivot_column =
            (SELECT STRING_AGG(DISTINCT CONCAT('"', CAST(col_12 AS STRING), '"')),
             FROM refcon.dataset.table_1);

    EXECUTE IMMEDIATE
        FORMAT("""
                  SELECT *
                  FROM (SELECT col_10,
                               col_12,
                               SUM(col_1) AS col_1_sum
                        FROM refcon.dataset.table_1
                        GROUP BY col_10, col_12) PIVOT
                          (avg(col_1_sum) AS col_12 FOR CAST(col_12 AS STRING) IN (%s));
                             """, v_pivot_column);
END;

This works similarly to the "Mean" scenario above, except we use SUM.

Proportion

We wish to obtain the ratio between the sum of values of a numeric variable for each group to the total sum of values of the numeric variable where the groups are defined by two categorical columns.

In this example, we compute the ratio of the sum of values of a numeric column col_1 for each group defined by col_10 and col_12 to the total sum of values of col_1.

BEGIN
    DECLARE v_pivot_column STRING;
    SET v_pivot_column =
            (SELECT STRING_AGG(DISTINCT CONCAT('"', CAST(col_12 AS STRING), '"')),
             FROM refcon.dataset.table_1);

    EXECUTE IMMEDIATE 
        FORMAT("""
                  SELECT *
                  FROM (SELECT col_10,
                               col_12,
                               SUM(col_1) / (SUM(SUM(col_1)) OVER ()) AS pct
                        FROM refcon.dataset.table_1
                        GROUP BY col_10, col_12) PIVOT
                         (avg(pct) AS col_12 FOR CAST(col_12 AS STRING) IN (%s));
                             """, v_pivot_column);
END;

This works similarly to the "Mean" scenario above, except we use SUM(col_1) / (SUM(SUM(col_1)) OVER ()) to calculate the proportion.

Multiple Functions

We wish to compute multiple aggregation functions on a numeric column for each group where the groups are defined by two columns.

In this example, we compute sum and mean for a numeric column col_1 for each group where the groups are defined by col_10 and col_12.

SELECT col_10,
       col_12,
       AVG(col_1) AS col_1_avg,
       SUM(col_1) AS col_1_sum
FROM refcon.dataset.table_1
GROUP BY col_10, col_12;

Here is how this works:

  • We GROUP BY col_10 and col_12.
  • In this scenario, we apply two aggregation functions SUM() and AVG().
  • We chose not to pivot because the code to do so in this situation is a bit too involved for an inspection context (see Reshaping).
SQL
I/O