In this page, we will cover the functions to use to carry out some of the most common data aggregation operations.
In its simplest form, a data aggregation function is one which takes multiple individual values and returns a single summary value; e.g. sum.
We have extended sections where we cover the data manipulation operations relevant to each data type quite extensively (see the data type operation sections in the navigation bar). Think of this page as a brief overview of the most common aggregation functions for each data type with links to the respective operations section for a coverage.
Note: In this page, we demonstrate the data aggregation operations in a scenario where a table
is grouped by one column. We can drop the GROUP BY
to aggregate over the entire table (without
grouping). We can also group by multiple columns or in myriad other ways (
See Grouping).
We wish to count the number of rows or the number of unique values in a group.
In this example, we wish to count the number of rows in each group and the number of unique values
of the column col_2
in each group. We wish to carry out the aggregation on the table table_1
grouped by the column col_1
.
SELECT col_1,
COUNT(1) AS row_count,
COUNT(DISTINCT col_2) col_2_vals
FROM table_1
GROUP BY col_1;
Here is how this works:
GROUP BY
clause groups the rows of a table by the values in one or more columns. The rows
are then assigned to a group based on the values in the grouping column(s). In this example, the
rows are being grouped by the values in the col_1 column.COUNT(1)
gives the current group size.
See Counting.COUNT(DISTINCT col_2)
gives the number of unique values.
See Uniqueness.We wish to aggregate data by computing an arithmatic operation, e.g. sum, for each group.
In this example, we wish to compute the sum of values of the column col_2
for each group and the
sum of the product of the values of the columns col_3
and col_4
for each group. We wish to carry
out the aggregation on the table table_1
grouped by the column col_1
.
SELECT col_1,
SUM(col_2) col_2_sum,
SUM(col_3 * col_4) col_3_4_ws
FROM table_1
GROUP BY col_1;
Here is how this works:
GROUP BY
clause groups the rows of a table by the values in one or more columns. The rows
are then assigned to a group based on the values in the grouping column(s). In this example, the
rows are being grouped by the values in the col_1 column.SUM()
to compute the sum of values of col_2
.SUM(col_3 * col_4)
, we first multiply the values of col_3
and col_4
for each row then
sum the results (for each group).We wish to aggregate data by computing summary statistics, e.g. the mean, for each group.
In this example we wish to compute the mean, and standard deviation of values of the
column col_2
for each group. We wish to carry out the aggregation on the table table_1
grouped
by the column col_1
.
SELECT col_1,
AVG(col_2) AS mean_col_2,
STDDEV(col_2) AS stddev_col_2
FROM table_1
GROUP BY col_1;
Here is how this works:
GROUP BY
clause groups the rows of a table by the values in one or more columns. The rows
are then assigned to a group based on the values in the grouping column(s). In this example, the
rows are being grouped by the values in the col_1 column.AVG()
and STDDEV()
to compute the mean, and standard
deviation of values of the column col_3
.We wish to aggregate data by computing the maximum or minimum value of a given column for each group.
In this example we wish to compute the minimum value of the numeric column col_3
for each group
and the maximum value of the numeric column col_4
for each group. We wish to carry out the
aggregation on the table table_1
grouped by the column col_1
.
SELECT col_1,
MIN(col_3) col_3_min,
MAX(col_4) col_4_max,
FROM table_1
GROUP BY col_1;
Here is how this works:
GROUP BY
clause groups the rows of a table by the values in one or more columns. The rows
are then assigned to a group based on the values in the grouping column(s). In this example, the
rows are being grouped by the values in the col_1 column.MIN()
returns the minimum value.MAX()
returns the maximum value.MAX()
and the MIN()
belong to the larger class of ranking operations, which we cover
in Ranking.Unlike R and Python, some operations in SQL are considered a window function and cannot be used in an aggregation scenario. Median is a very common example and in this example we provide a blueprint for doing aggregations on top of window functions.
In this example we wish to compute the median of values of the column col_2
for each group. We
wish to carry out the aggregation on the table table_1
grouped by the column col_1
.
WITH table_2 AS (SELECT *, PERCENTILE_DISC(col_2, 0.5) OVER (PARTITION BY col_1) col_1_2_median
FROM table_1)
SELECT col_1,
MAX(col_1_2_median) col_2_median
FROM table_2
GROUP BY col_1;
Here is how this works:
table_2
CTE , the PERCENTILE_DISC()
function is used to compute the median value (the 50th
percentile) of the col_2
column, partitioned by col_1
. This means that the median value is
calculated separately for each unique value of col_1
, and the resulting value is assigned to
the col_1_2_median
column for each row in table_2
.SELECT
statement we group the results by col_1
. The MAX()
function is used to
select the maximum value of col_1_2_median
for each group. We can replace MAX
by MIN
or AVG
since the value is of col_1_2_median is constant.We wish to perform data aggregation operations on a column of a string data type.
In this example, we wish to summarize the table table_1
over groups defined by the column col_1
.
The aggregation operations we wish to carry out are: (1) Concatenate the unique values of the
column col_2
per group into a comma separated string. (2) Concatenate the unique values of the
column col_2
that matches a given regular expression for each group.
SELECT col_1,
STRING_AGG(DISTINCT col_2, ',') col_2_distinct,
STRING_AGG(DISTINCT CASE WHEN col_2 LIKE '%X%' THEN col_2 END, ',') col_2_match
FROM table_1
GROUP BY col_1;
GROUP BY
clause groups the rows of a table by the values in one or more columns. The rows
are then assigned to a group based on the values in the grouping column(s). In this example, the
rows are being grouped by the values in the col_1 column.STRING_AGG()
function is used to concatenate a list of strings, separated by a specified
delimiter. We use DISTINCT
to only concatenate unique values.col_2_match
includes only those values of col_2
that match a certain condition, specified
using
a CASE
expression. In this case, the condition is that the col_2
value contains the
character 'X'
.
If the condition is met, the col_2 value is included in the concatenated string. Otherwise, it is
not included.We wish to perform data aggregation operations on a column of a logical data type; i.e. one that
takes a value of TRUE
or FALSE
.
In this example, for each group, where the groups are defined by the value of the column col_1
, we
wish to obtain the number and ratio of rows where the column col_2
is TRUE
.
SELECT col_1,
SUM(CAST(col_2 AS int64)) col_2_true_count,
AVG(CAST(col_2 AS int64)) col_2_true_rate
FROM table_1
GROUP BY col_1;
Here is how this works:
GROUP BY
clause groups the rows of a table by the values in one or more columns. The rows
are then assigned to a group based on the values in the grouping column(s). In this example, the
rows are being grouped by the values in the col_1 column.col_2
to integer inside SUM()
to obtain the number of TRUE
values. Summing
a boolean column that is casted as an integer is basically counting the number of TRUE
values.
When casting a boolean to an integer, TRUE
is regarded as 1
and FALSE
is regarded as 0
.AVG()
to obtain the proportion of TRUE
values of col_2
for each group.
Since TRUE
is regarded as 1
and FALSE
is regarded as 0
, taking the mean of a logical
column is basically taking the ratio between the number of TRUE
values and the total number of
values in the input vector.SUM
and AVG
ignore NULL values. This will
affect the results of the ratio in this scenario. If we wish to include missing values see
alternative solution below.Alternatively:
SELECT col_1,
COUNT(CASE WHEN col_2 THEN 1 END) col_2_true_count,
COUNT(CASE WHEN col_2 THEN 1 END) / COUNT(1) col_2_true_rate
FROM table_1
GROUP BY col_1;
Here is how this works:
col_2_true_count
column is computed using the COUNT()
function, which counts the number of
rows in each group defined by col_1.CASE
expression is used to count only those rows where col_2
is TRUE
.col_2_true_rate
is the count of TRUE
values divided by total row count. COUNT(1)
will count
rows where the value of col_2 is NULL
.We wish to group a table by some unit or interval of time e.g. group by year or weekday.
In this example, we wish to group the table table_1
by weekday inferred from the date-time
column col_1
. We then wish to compute the number of rows in each group and the number of unique
hours that col_1
takes in each group.
SELECT FORMAT_DATE('%A', col_1) AS weekday_name,
COUNT(1) row_count,
COUNT(DISTINCT EXTRACT(HOUR FROM col_1)) AS hour_count,
FROM table_1
GROUP BY FORMAT_DATE('%A', col_1);
Here is how this works:
weekday_name
column is computed using the FORMAT_DATE()
function, which formats a date value
as a string using a specified format string. In this case, the format string is '%A', which
returns the full name of the weekday (e.g., "Monday").COUNT(1)
to obtain the number of rows in each group.EXTRACT()
function extracts the hour part of the col_1
date value as an integer.COUNT(DISTINCT )
to count the number of unique values of EXTRACT(HOUR FROM col_1)
for
each group.