Summarize Numeric Column

We wish to get a summary of a numeric column, so we may gain some insight into the data it holds.

Column Summary

We wish to generate common summary statistics for a numeric column e.g. the mean or the standard deviation. While we can compute each of those statistics one by one (see Variations below), it would be efficient during data inspection to use a single function that given a numeric column, computes the common summary statistics.

CREATE OR REPLACE PROCEDURE refcon.dataset.column_summary(v_table_name STRING,
                                                           v_column_name STRING)
BEGIN
    EXECUTE IMMEDIATE
        FORMAT("""
                  SELECT DISTINCT  COUNT(1) OVER win                        row_count,
                                   PERCENTILE_CONT(value, 0.25) OVER win AS col_q1,
                                   PERCENTILE_CONT(value, 0.50) OVER win AS col_q2,
                                   PERCENTILE_CONT(value, 0.75) OVER win AS col_q3,
                                   AVG(value) OVER win                   AS col_avg,
                                   STDDEV_POP(value) OVER win            AS col_std,
                                   MAX(value) OVER win                   AS col_max,
                                   MIN(value) OVER win                   AS col_min,
                  FROM refcon.dataset.%s unpivot (VALUE FOR col IN (%s))
                                           WINDOW win AS (PARTITION BY col);""",
               v_table_name,
               v_column_name);

END;

CALL refcon.dataset.column_summary('table_1', 'col_1');

Here is how this works:

  • We pass the table name table_1 to the procedure column_summary().
  • column_summary() is a great convenience. With one command, we get a consolidated report that has the most common summary statistics like row count, mean, standard deviation, minimum value, maximum value, and percentiles.
  • We can customize the summary by modifying the query inside the procedure.

Mean

Consolidated summaries are great as an early step in the data inspection process. Often, however, we are interested in a particular summary statistic that may not be covered by the consolidated summary or the consolidated summary may be a bit too overwhelming. Say we just care about knowing the mean of a particular numeric column.

SELECT AVG(col_1) AS col_1_avg
FROM refcon.dataset.table_1;

Here is how this works:

  • We pass the column name we are interested in to AVG function.
  • AVG ignores null values when calculating the mean.

Sum

We wish to obtain the sum of all values of a numeric column.

SELECT SUM(col_1) AS col_1_sum
FROM refcon.dataset.table_1;

Here is how this works:

  • We pass the column name we are interested in to SUM function.
  • SUM ignores null values when calculating the total.

Distribution

We can glean an understanding of the distribution of a numerical column by partitioning the range into it in a set of categories (binning) and then computing the frequency of occurrence of those categories in the data.

BEGIN

    DECLARE v_bin_count INT64;
    DECLARE v_bin_width INT64;

    SET v_bin_count = 4;
    SET v_bin_width =
            (SELECT SAFE_CAST(MAX(col_1) / v_bin_count AS int64) AS nn
             FROM refcon.dataset.table_1);

    SELECT FLOOR(col_1 / v_bin_width) bin, COUNT(1) result
    FROM refcon.dataset.table_1
    GROUP BY bin
    ORDER BY bin;
END;

Here is how it works:

  • First, we declare and set our variables v_bin_count and v_bin_width.
  • v_bin_width is calculated by dividing the maximum value of col_1 by the number of bins we desire.
  • We divide col_1 by the bin width and use FLOOR to get the bin the value belongs to.
  • We aggregate by the bin using GROUP BY and count the number of rows per bin using COUNT(1).

Alternatively:

WITH bins AS (SELECT SAFE_CAST(MAX(col_1) / 4 AS int64) AS bin_width
              FROM refcon.dataset.table_1)

SELECT FLOOR(col_1 / (SELECT bin_width FROM bins)) bin, COUNT(1) result
FROM refcon.dataset.table_1
GROUP BY bin
ORDER BY bin;

This works similarly to the above-mentioned solution, except we don't use variables and calculate the bin width in a CTE.

Thresholding

We wish to check what proportion of the values of a numerical variable is higher or lower than a given value. In this example, we wish to know the percentage of rows where the value of col_1 is greater than 0.

SELECT COUNT(CASE WHEN col_1 > 1500 THEN col_1 END) / COUNT(1) result
FROM refcon.dataset.table_1;

Here is how it works:

  • We use COUNT with a CASE WHEN to count the number of rows where col_1 is bigger than 1500.
  • We divide by the count of the total rows in the table to get the percentage of rows.

Data Type Setting

If we have a numeric column encoded as a string column, we need to convert the data type to numeric before we can run numeric operations such as the average.

SELECT AVG(SAFE_CAST(col_12 AS FLOAT64)) col_12_avg
FROM refcon.dataset.table_1;

Here is how it works:

  • We use SAFE_CAST to cast a STRING column to FLOAT64. SAFE_CAST will not through an error if we have a string value that cannot be converted to a float value.
  • If we wish to throw an error, we can use CAST(col_12 AS FLOAT64).
  • We pass the casted value to the AVG function to calculate the mean.
SQL
I/O