We wish to get a summary of a numeric column (e.g. the mean and standard deviation) for each group where the groups are defined by the values of a categorical column.
We wish to compute the mean of a numerical column over groups defined by one categorical column.
In this example, we wish to compute the mean of the numeric column col_1
for each group where the groups are defined
by the values of the categorical column col_10
.
SELECT col_10, AVG(col_1) AS result
FROM refcon.dataset.table_1
GROUP BY col_10;
Here is how this works:
GROUP BY
clause to group by col_10
.AVG(col_1)
calculates the average of rows for each value of col_10
.AVG
ignores null values by default.We wish to compute the sum of values of a numerical column over groups defined by one categorical column.
In this example, we wish to compute the sum of the values of the numeric column col_1
for each group where the groups
are defined by the values of the categorical column col_10
.
SELECT col_10, SUM(col_1) AS result
FROM refcon.dataset.table_1
GROUP BY col_10;
Here is how this works:
This works similarly to the "Mean" scenario above but we use SUM()
instead of AVG()
.
We wish to obtain the ratio between the sum of values of a numeric variable for each group to the total sum of values of the numeric variable where the groups are defined by a grouping variable.
In this example, we compute the ratio of the sum of values of a numeric column col_1
for each group defined
by col_10
to the total sum of values of col_1
.
SELECT col_10,
SUM(col_1) / (SUM(SUM(col_1)) OVER ()) AS proportion
FROM refcon.dataset.table_1
GROUP BY col_10;
here is how this works:
- The GROUP BY
is executed first, so we have one group per col_10
value.
- SUM(col_1)
gets calculated and is used as the nominator for each group.
- SUM(SUM(col_1)) OVER ()
can be understood as SUM(___) OVER ()
. Where the input to the SUM
is the per-group
SUM(col_1)
which was computed in the previous step. Because we didn't specify anything in the OVER
clause,
the whole aggregated table will be considered, resulting in a sum of sums (which is equiv to the previous solution).
In this approach, we use a WITH
clause to pre-compute the col_1_total, which we then use as the denominator
WITH col_1_stats AS
(SELECT SUM(col_1) AS col_1_total
FROM refcon.dataset.table_1)
SELECT col_10,
SUM(col_1) / (SELECT col_1_total FROM col_1_stats) AS proportion
FROM refcon.dataset.table_1
GROUP BY col_10;
Here is how this works:
SUM
function in col_1_stats
CTE.