Summarize Numeric by Factor

We wish to get a summary of a numeric column (e.g. the mean and standard deviation) for each group where the groups are defined by the values of a categorical column.

Mean

We wish to compute the mean of a numerical column over groups defined by one categorical column.

In this example, we wish to compute the mean of the numeric column col_1 for each group where the groups are defined by the values of the categorical column col_10.

SELECT col_10, AVG(col_1) AS result
FROM refcon.dataset.table_1
GROUP BY col_10;

Here is how this works:

  • We use the GROUP BY clause to group by col_10.
  • AVG(col_1) calculates the average of rows for each value of col_10.
  • AVG ignores null values by default.

Sum

We wish to compute the sum of values of a numerical column over groups defined by one categorical column.

In this example, we wish to compute the sum of the values of the numeric column col_1 for each group where the groups are defined by the values of the categorical column col_10.

SELECT col_10, SUM(col_1) AS result
FROM refcon.dataset.table_1
GROUP BY col_10;

Here is how this works:

This works similarly to the "Mean" scenario above but we use SUM() instead of AVG().

Proportion

We wish to obtain the ratio between the sum of values of a numeric variable for each group to the total sum of values of the numeric variable where the groups are defined by a grouping variable.

In this example, we compute the ratio of the sum of values of a numeric column col_1 for each group defined by col_10 to the total sum of values of col_1.

SELECT col_10,
    SUM(col_1) / (SUM(SUM(col_1)) OVER ()) AS proportion
FROM refcon.dataset.table_1
GROUP BY col_10;

here is how this works: - The GROUP BY is executed first, so we have one group per col_10 value. - SUM(col_1) gets calculated and is used as the nominator for each group. - SUM(SUM(col_1)) OVER () can be understood as SUM(___) OVER (). Where the input to the SUM is the per-group SUM(col_1) which was computed in the previous step. Because we didn't specify anything in the OVER clause, the whole aggregated table will be considered, resulting in a sum of sums (which is equiv to the previous solution).

Alternative approach

In this approach, we use a WITH clause to pre-compute the col_1_total, which we then use as the denominator

WITH col_1_stats AS
         (SELECT SUM(col_1) AS col_1_total
          FROM refcon.dataset.table_1)
SELECT col_10,
       SUM(col_1) / (SELECT col_1_total FROM col_1_stats) AS proportion
FROM refcon.dataset.table_1
GROUP BY col_10;

Here is how this works:

  • We first calculate 'col_1' by using SUM function in col_1_stats CTE.
  • We then group by 'col_10' and divide by the total sum of 'col_1' (using a subquery).
SQL
I/O