Summarize Factor by Factor

We wish to look at the variation in one categorical column against another categorical column. This is often referred to as cross tabulation.

Unique Cases

We wish to know what are the distinct combinations that the values of two categorical columns take in a table.

In this example, we wish to get the unique combinations of the values of col_10 and col_12.

SELECT col_10, col_12
FROM refcon.dataset.table_1
GROUP BY col_10, col_12;

Here is how this works:

  • We group by col_10 and col_12 to get all unique combinations.
  • Note: We can group by any number of columns as per the needs of the situation.

Alternatively:

SELECT DISTINCT col_10, col_12
FROM refcon.dataset.table_1

Here is how this works:

  • We use 'DISTINCT' to get the unique combinations.
  • Note: We can list any number of columns after DISTINCT as per the needs of the situation.

Frequency

We wish to know the number of times each combination of values of categorical columns occurs in a table.

In this example, we wish to get the number of times each unique combination of the values of col_10 and col_12 occurs in a table.

SELECT col_10, col_12, COUNT(1) result
FROM refcon.dataset.table_1
GROUP BY col_10, col_12;

This works similarly to the Unique Cases scenario above, we just count the number of rows per combination using COUNT(1).

Cross Table

If we wish to get the results as a cross table, we can use the solution below.

BEGIN
    DECLARE v_pivot_column STRING;
    SET v_pivot_column =
            (SELECT STRING_AGG(DISTINCT CONCAT('"', CAST(col_12 AS STRING), '"')),
             FROM refcon.dataset.table_1);

    EXECUTE IMMEDIATE FORMAT("""
                                SELECT *
                                FROM (SELECT col_10,
                                             col_12
                                      FROM refcon.dataset.table_1) PIVOT
                                         (count(1) AS col_12
                                                FOR CAST(col_12 AS STRING) IN (%s))
                             """, v_pivot_column);
END;

Here is how this works:

  • We wrap the value of the col_12 with " and then use STRING_AGG DISTINCT to get the unique values for col_12 in the following format: "0","3".
  • We use the v_pivot_column to pivot col_12 values to columns and keep col_10 as rows.

Proportion

We wish to know the proportion (percentage or density) of the total number of rows (observations) that take each possible combination of values of two columns.

To compute a proportion we need to designate what is it that we are comparing i.e. what the numerator and denominator are. In this situation, the numerator is the frequency of each combination of values of the two categorical columns. The denominator, however, can take one of three forms:

  • on Rows: We divide by the sum of values for the row. In other words, we wish to know: of the rows where col_1 == a , what proportion (percent) of those rows have col_2 == b (essentially the conditional probability of col_2 == b given that col_1 == a).
  • on Columns: We divide by the sum of values for the column. In other words, we wish to know of the rows where col_2 == b, what proportion (percent) of those rows have col_1 == a (essentially the conditional probability of col_1 == a given that col_2 == b).
  • on Table: We divide by the sum of values for the entire table. In other words, we wish to know of the total number of rows, what proportion (percent) have col_1 == a and col_2==b.

on Rows

We wish to get the proportion of each combination of values of two columns relative to the first column.

In this example, we compute the proportions of col_10 and col_12 combinations relative to col_10.

SELECT col_10,
       col_12,
       COUNT(1) / (SUM(COUNT(1)) OVER (PARTITION BY col_10)) AS pct
FROM refcon.dataset.table_1
GROUP BY col_10, col_12;

Here is how this works:

  • We use (SUM(COUNT(1)) OVER (PARTITION BY col_10)) window function to calculate the total per unique value for col_10.
  • We Group by col_10 and col_12 and divide the result of the window function.

Cross Table

In this example, we compute a cross table between col_10 and col_12 and obtain the proportions of combinations relative to col_10.

BEGIN
    DECLARE v_pivot_column STRING;
    SET v_pivot_column =
            (SELECT STRING_AGG(DISTINCT CONCAT('"', CAST(col_12 AS STRING), '"')),
             FROM refcon.dataset.table_1);

    EXECUTE IMMEDIATE
        FORMAT("""
                  SELECT *
                  FROM (SELECT col_10,
                               col_12,
                               COUNT(1) / (SUM(COUNT(1)) OVER (PARTITION BY col_10)) AS pct
                        FROM refcon.dataset.table_1
                        GROUP BY col_10, col_12) PIVOT
                            (avg(pct) AS col_12 FOR CAST(col_12 AS STRING) IN (%s));
                             """, v_pivot_column);
END;

This works similarly to the "Frequency Cross Table" scenario above.

on Columns

We wish to get the proportion of each combination of values of two columns relative to the second column.

In this example, we compute the proportions of col_10 and col_12 combinations relative to col_12.

SELECT col_10,
       col_12,
       COUNT(1) / (SUM(COUNT(1)) OVER (PARTITION BY col_12)) AS pct
FROM refcon.dataset.table_1
GROUP BY col_10, col_12;

This works similarly to the "On Rows" scenario above.

Cross Table

In this example, we compute a cross table between col_12 and col_10 and obtain the proportions of combinations relative to col_10.

BEGIN
    DECLARE v_pivot_column STRING;
    SET v_pivot_column = (SELECT STRING_AGG(DISTINCT CONCAT('"', CAST(col_12 AS STRING), '"')),
                          FROM refcon.dataset.table_1);

    EXECUTE IMMEDIATE
        FORMAT("""
                  SELECT *
                  FROM (SELECT col_10,
                               col_12,
                               COUNT(1) / (SUM(COUNT(1)) OVER (PARTITION BY col_12)) AS pct
                        FROM refcon.dataset.table_1
                        GROUP BY col_10, col_12) PIVOT
                          (avg(pct) AS col_12 FOR CAST(col_12 AS STRING) IN (%s));
                             """, v_pivot_column);
END;

This works similarly to the "Frequency Cross Table" scenario above.

on Table

We wish to get the proportion of each combination of values of two columns relative to the total number of rows in the table.

In this example, we compute the proportions of col_10 and col_12 combinations relative to the number of rows in the table.

SELECT col_12,
       col_10,
       COUNT(1) / (SUM(COUNT(1)) OVER ()) AS pct
FROM refcon.dataset.table_1
GROUP BY col_12, col_10

Cross Table In this example, we compute a cross table between col_12 and col_10 and obtain the proportions of combinations relative to the number of rows in the table.

BEGIN
    DECLARE v_pivot_column STRING;
    SET v_pivot_column = (SELECT STRING_AGG(DISTINCT CONCAT('"', CAST(col_12 AS STRING), '"')),
                          FROM refcon.dataset.table_1);

    EXECUTE IMMEDIATE 
        FORMAT("""
                  SELECT *
                  FROM (SELECT col_10,
                               col_12,
                               COUNT(1) / (SUM(COUNT(1)) OVER ()) AS pct
                               FROM refcon.dataset.table_1
                               GROUP BY col_10, col_12) PIVOT
                               (avg(pct) AS col_12 FOR CAST(col_12 AS STRING) IN (%s));
                             """, v_pivot_column);
END;

This works similarly to the "Frequency Cross Table" scenario above.

Rounding

We wish to set a level of precision for the percentages computed.

In this example, we set the level of precision to 2 decimal places i.e. 0.xx .

SELECT col_12,
       col_10,
       ROUND(COUNT(1) / (SUM(COUNT(1)) OVER ()), 2) AS pct
FROM refcon.dataset.table_1
GROUP BY col_12, col_10
SQL
I/O