Common Operations

In this page, we will cover the functions to use to carry out some of the most common data aggregation operations.

In its simplest form, a data aggregation function is one which takes multiple individual values and returns a single summary value; e.g. sum.

We have extended sections where we cover the data manipulation operations relevant to each data type quite extensively (see the data type operation sections in the navigation bar). Think of this page as a brief overview of the most common aggregation functions for each data type with links to the respective operations section for a coverage.

Note: In this page, we demonstrate the data aggregation operations in a scenario where a table is grouped by one column. We can drop the GROUP BY to aggregate over the entire table (without grouping). We can also group by multiple columns or in myriad other ways ( See Grouping).

Counting

We wish to count the number of rows or the number of unique values in a group.

In this example, we wish to count the number of rows in each group and the number of unique values of the column col_2 in each group. We wish to carry out the aggregation on the table table_1 grouped by the column col_1.

SELECT col_1,
       COUNT(1) AS           row_count,
       COUNT(DISTINCT col_2) col_2_vals

FROM table_1
GROUP BY col_1;

Here is how this works:

The GROUP BY clause groups the rows of a table by the values in one or more columns. The rows are then assigned to a group based on the values in the grouping column(s). In this example, the rows are being grouped by the values in the col_1 column.
COUNT(1) gives the current group size. See Counting.
COUNT(DISTINCT col_2) gives the number of unique values. See Uniqueness.

Arithmatic

We wish to aggregate data by computing an arithmatic operation, e.g. sum, for each group.

In this example, we wish to compute the sum of values of the column col_2 for each group and the sum of the product of the values of the columns col_3 and col_4 for each group. We wish to carry out the aggregation on the table table_1 grouped by the column col_1.

SELECT col_1,
       SUM(col_2)         col_2_sum,
       SUM(col_3 * col_4) col_3_4_ws

FROM table_1
GROUP BY col_1;

Here is how this works:

The GROUP BY clause groups the rows of a table by the values in one or more columns. The rows are then assigned to a group based on the values in the grouping column(s). In this example, the rows are being grouped by the values in the col_1 column.
We then use the built-in function SUM() to compute the sum of values of col_2.
In SUM(col_3 * col_4), we first multiply the values of col_3 and col_4 for each row then sum the results (for each group).
For a coverage of arithmatic operations, see Arithmatic Operations.

Summary Statistics

We wish to aggregate data by computing summary statistics, e.g. the mean, for each group.

In this example we wish to compute the mean, and standard deviation of values of the column col_2 for each group. We wish to carry out the aggregation on the table table_1 grouped by the column col_1.

SELECT col_1,
       AVG(col_2)    AS mean_col_2,
       STDDEV(col_2) AS stddev_col_2
FROM table_1
GROUP BY col_1;

Here is how this works:

The GROUP BY clause groups the rows of a table by the values in one or more columns. The rows are then assigned to a group based on the values in the grouping column(s). In this example, the rows are being grouped by the values in the col_1 column.
We then use the built-in functions AVG() and STDDEV()to compute the mean, and standard deviation of values of the column col_3.
For a coverage of arithmatic operations, see Arithmatic Operations.
For a coverage of summary statistics, see Summary Statistics

Range

We wish to aggregate data by computing the maximum or minimum value of a given column for each group.

In this example we wish to compute the minimum value of the numeric column col_3 for each group and the maximum value of the numeric column col_4 for each group. We wish to carry out the aggregation on the table table_1 grouped by the column col_1.

SELECT col_1,
       MIN(col_3)                         col_3_min,
       MAX(col_4)                         col_4_max,
FROM table_1
GROUP BY col_1;

Here is how this works:

The GROUP BY clause groups the rows of a table by the values in one or more columns. The rows are then assigned to a group based on the values in the grouping column(s). In this example, the rows are being grouped by the values in the col_1 column.
As their name implies, given a vector of values:
- MIN() returns the minimum value.
- MAX() returns the maximum value.
MAX() and the MIN() belong to the larger class of ranking operations, which we cover in Ranking.
Note that here we are aggregating the values of individual columns in a group, to filter entire rows in a group based on the rank of the values of some columns, see Filtering Groups by Rank.

Window Functions

Unlike R and Python, some operations in SQL are considered a window function and cannot be used in an aggregation scenario. Median is a very common example and in this example we provide a blueprint for doing aggregations on top of window functions.

In this example we wish to compute the median of values of the column col_2 for each group. We wish to carry out the aggregation on the table table_1 grouped by the column col_1.

WITH table_2 AS (SELECT *, PERCENTILE_DISC(col_2, 0.5) OVER (PARTITION BY col_1) col_1_2_median
                 FROM table_1)
SELECT col_1,
       MAX(col_1_2_median) col_2_median
FROM table_2
GROUP BY col_1;

Here is how this works:

In table_2 CTE , the PERCENTILE_DISC() function is used to compute the median value (the 50th percentile) of the col_2 column, partitioned by col_1. This means that the median value is calculated separately for each unique value of col_1, and the resulting value is assigned to the col_1_2_median column for each row in table_2.
The last SELECT statement we group the results by col_1. The MAX() function is used to select the maximum value of col_1_2_median for each group. We can replace MAX by MIN or AVG since the value is of col_1_2_median is constant.

String

We wish to perform data aggregation operations on a column of a string data type.

In this example, we wish to summarize the table table_1 over groups defined by the column col_1. The aggregation operations we wish to carry out are: (1) Concatenate the unique values of the column col_2 per group into a comma separated string. (2) Concatenate the unique values of the column col_2 that matches a given regular expression for each group.

SELECT col_1,
       STRING_AGG(DISTINCT col_2, ',') col_2_distinct,
       STRING_AGG(DISTINCT CASE WHEN col_2 LIKE '%X%' THEN col_2 END, ',') col_2_match
FROM table_1
GROUP BY col_1;

The GROUP BY clause groups the rows of a table by the values in one or more columns. The rows are then assigned to a group based on the values in the grouping column(s). In this example, the rows are being grouped by the values in the col_1 column.
STRING_AGG() function is used to concatenate a list of strings, separated by a specified delimiter. We use DISTINCT to only concatenate unique values.
col_2_match includes only those values of col_2 that match a certain condition, specified using a CASE expression. In this case, the condition is that the col_2 value contains the character 'X'. If the condition is met, the col_2 value is included in the concatenated string. Otherwise, it is not included.
For a coverage of string operations (including regular expressions), see String Operations.

Logical

We wish to perform data aggregation operations on a column of a logical data type; i.e. one that takes a value of TRUE or FALSE.

In this example, for each group, where the groups are defined by the value of the column col_1, we wish to obtain the number and ratio of rows where the column col_2 is TRUE.

SELECT col_1,
       SUM(CAST(col_2 AS int64)) col_2_true_count,
       AVG(CAST(col_2 AS int64)) col_2_true_rate
FROM table_1
GROUP BY col_1;

Here is how this works:

The GROUP BY clause groups the rows of a table by the values in one or more columns. The rows are then assigned to a group based on the values in the grouping column(s). In this example, the rows are being grouped by the values in the col_1 column.
We cast col_2 to integer inside SUM() to obtain the number of TRUE values. Summing a boolean column that is casted as an integer is basically counting the number of TRUE values. When casting a boolean to an integer, TRUE is regarded as 1 and FALSE is regarded as 0.
Similarly, we use AVG() to obtain the proportion of TRUE values of col_2 for each group. Since TRUE is regarded as 1 and FALSE is regarded as 0, taking the mean of a logical column is basically taking the ratio between the number of TRUE values and the total number of values in the input vector.
This approach will ignore missing values since both SUM and AVG ignore NULL values. This will affect the results of the ratio in this scenario. If we wish to include missing values see alternative solution below.
For a coverage of logical operations, see Logical Operations

Alternatively:

SELECT col_1,
       COUNT(CASE WHEN col_2 THEN 1 END)            col_2_true_count,
       COUNT(CASE WHEN col_2 THEN 1 END) / COUNT(1) col_2_true_rate
FROM table_1
GROUP BY col_1;

Here is how this works:

col_2_true_count column is computed using the COUNT() function, which counts the number of rows in each group defined by col_1.
The CASE expression is used to count only those rows where col_2 is TRUE.
col_2_true_rate is the count of TRUE values divided by total row count. COUNT(1) will count rows where the value of col_2 is NULL.

Date Time Aggregation

We wish to group a table by some unit or interval of time e.g. group by year or weekday.

In this example, we wish to group the table table_1 by weekday inferred from the date-time column col_1. We then wish to compute the number of rows in each group and the number of unique hours that col_1 takes in each group.

SELECT FORMAT_DATE('%A', col_1)                 AS weekday_name,
       COUNT(1)                                    row_count,
       COUNT(DISTINCT EXTRACT(HOUR FROM col_1)) AS hour_count,
FROM table_1
GROUP BY FORMAT_DATE('%A', col_1);

Here is how this works:

weekday_name column is computed using the FORMAT_DATE() function, which formats a date value as a string using a specified format string. In this case, the format string is '%A', which returns the full name of the weekday (e.g., "Monday").
We use COUNT(1) to obtain the number of rows in each group.
The EXTRACT() function extracts the hour part of the col_1 date value as an integer.
We use COUNT(DISTINCT ) to count the number of unique values of EXTRACT(HOUR FROM col_1) for each group.
For a coverage of date time operations, see Date Time Operations.

Optima.io Reference beta