We wish to get a summary of a numeric column, so we may gain some insight into the data it holds.
We wish to generate common summary statistics for a numeric column e.g. the mean or the standard deviation. While we can compute each of those statistics one by one (see Variations below), it would be efficient during data inspection to use a single function that given a numeric column, computes the common summary statistics.
CREATE OR REPLACE PROCEDURE refcon.dataset.column_summary(v_table_name STRING,
v_column_name STRING)
BEGIN
EXECUTE IMMEDIATE
FORMAT("""
SELECT DISTINCT COUNT(1) OVER win row_count,
PERCENTILE_CONT(value, 0.25) OVER win AS col_q1,
PERCENTILE_CONT(value, 0.50) OVER win AS col_q2,
PERCENTILE_CONT(value, 0.75) OVER win AS col_q3,
AVG(value) OVER win AS col_avg,
STDDEV_POP(value) OVER win AS col_std,
MAX(value) OVER win AS col_max,
MIN(value) OVER win AS col_min,
FROM refcon.dataset.%s unpivot (VALUE FOR col IN (%s))
WINDOW win AS (PARTITION BY col);""",
v_table_name,
v_column_name);
END;
CALL refcon.dataset.column_summary('table_1', 'col_1');
Here is how this works:
table_1
to the procedure column_summary()
.column_summary()
is a great convenience. With one command, we get a consolidated report that has
the most common
summary statistics like row count, mean, standard deviation, minimum value, maximum value, and
percentiles.Consolidated summaries are great as an early step in the data inspection process. Often, however, we are interested in a particular summary statistic that may not be covered by the consolidated summary or the consolidated summary may be a bit too overwhelming. Say we just care about knowing the mean of a particular numeric column.
SELECT AVG(col_1) AS col_1_avg
FROM refcon.dataset.table_1;
Here is how this works:
AVG
function.AVG
ignores null values when calculating the mean.We wish to obtain the sum of all values of a numeric column.
SELECT SUM(col_1) AS col_1_sum
FROM refcon.dataset.table_1;
Here is how this works:
SUM
function.SUM
ignores null values when calculating the total.We can glean an understanding of the distribution of a numerical column by partitioning the range into it in a set of categories (binning) and then computing the frequency of occurrence of those categories in the data.
BEGIN
DECLARE v_bin_count INT64;
DECLARE v_bin_width INT64;
SET v_bin_count = 4;
SET v_bin_width =
(SELECT SAFE_CAST(MAX(col_1) / v_bin_count AS int64) AS nn
FROM refcon.dataset.table_1);
SELECT FLOOR(col_1 / v_bin_width) bin, COUNT(1) result
FROM refcon.dataset.table_1
GROUP BY bin
ORDER BY bin;
END;
Here is how it works:
v_bin_count
and v_bin_width
.v_bin_width
is calculated by dividing the maximum value of col_1 by the number of bins we
desire.col_1
by the bin width and use FLOOR
to get the bin the value belongs to.GROUP BY
and count the number of rows per bin using COUNT(1)
.Alternatively:
WITH bins AS (SELECT SAFE_CAST(MAX(col_1) / 4 AS int64) AS bin_width
FROM refcon.dataset.table_1)
SELECT FLOOR(col_1 / (SELECT bin_width FROM bins)) bin, COUNT(1) result
FROM refcon.dataset.table_1
GROUP BY bin
ORDER BY bin;
This works similarly to the above-mentioned solution, except we don't use variables and calculate the bin width in a CTE.
We wish to check what proportion of the values of a numerical variable is higher or lower than a
given value. In this
example, we wish to know the percentage of rows where the value of col_1
is greater than 0
.
SELECT COUNT(CASE WHEN col_1 > 1500 THEN col_1 END) / COUNT(1) result
FROM refcon.dataset.table_1;
Here is how it works:
COUNT
with a CASE WHEN
to count the number of rows where col_1
is bigger than 1500.If we have a numeric column encoded as a string column, we need to convert the data type to numeric before we can run numeric operations such as the average.
SELECT AVG(SAFE_CAST(col_12 AS FLOAT64)) col_12_avg
FROM refcon.dataset.table_1;
Here is how it works:
SAFE_CAST
to cast a STRING
column to FLOAT64
. SAFE_CAST will not through an error if
we have a string
value that cannot be converted to a float value.CAST(col_12 AS FLOAT64)
.AVG
function to calculate the mean.