Grouped Transformation

Quite often we need to apply data transformation operations to groups of a table individually. We call this Grouped Transformation. A common grouped transformation scenario is replacing missing values with the mean or the median for the group. Another common grouped transformation scenario is scaling the data by subtracting the group’s mean and dividing by the group’s standard deviation.

WITH col_stats AS
         (SELECT col_10,
                 AVG(col_1)    AS col_1_avg,
                 STDDEV(col_1) AS col_1_std,
                 MAX(col_3)    AS col_3_max,
                 AVG(col_3)    AS col_3_avg
          FROM refcon.dataset.table_1
          GROUP BY col_10)

SELECT t.*,
       SAFE_DIVIDE(col_1 - col_1_avg,
                   col_1_std) AS col_13,
       col_3_max              AS col_14,
       col_6 - col_3_avg      AS col_15
FROM refcon.dataset.table_1 t
         LEFT JOIN col_stats s ON s.col_10 = t.col_10;

Here is how this works:

  • To perform grouped data transformation operations, we first create the aggregates we will use in the transformation in a CTE.
  • In col_stats we group by col_10 and calculate all the columns we will use in our transformation.
  • We use a LEFt JOIN on col_10 to combine col_stats with table_1.
  • Inside the SELECT statement, we can carry out all the typical data transformation scenarios, such as those we cover in Basic Transformation. In particular:
    • We can use any one or more columns as inputs to a data transformation operation.
    • If an operation returns a single scalar value e.g. MAX(col_3) AS col_3_max, the scalar value will be replicated as many times as the number of rows in the group.
SQL
I/O