Selecting by Multiple Conditions

We wish to select columns that meet a logical combination of multiple conditions.

We wish to apply a complex logical expression of multiple conditions to each column individually and return any columns for which the expression evaluates to TRUE.

In this example, we wish to select columns that are string and have less than 5 possible values and less than 10% missing values. We also wish to include any columns that have the logical data type.

BEGIN
    DECLARE v_threshold FLOAT64;
    DECLARE v_columns STRING;

    SET v_threshold = 0.1;
    SET v_columns =
            (WITH key_value AS
                      (SELECT SPLIT(kv, ':')[OFFSET(0)] column_name,
                              SPLIT(kv, ':')[OFFSET(1)] column_value
                       FROM refcon.dataset.table_1 t,
                            UNNEST(SPLIT(REGEXP_REPLACE(TO_JSON_STRING(t), r'[{}"]', ''))) kv),
                  table_columns AS
                      (SELECT t.column_name,
                              COUNT(DISTINCT CASE
                                                 WHEN data_type = 'STRING'
                                                     AND column_value <> 'null'
                                                     THEN column_value END) unique_count,
                              LOGICAL_OR(CASE
                                             WHEN data_type = 'BOOL'
                                                 THEN TRUE
                                             ELSE FALSE END)                is_bool,
                              SAFE_DIVIDE(
                                      COUNT(CASE
                                                WHEN column_value = 'null'
                                                    THEN column_value END),
                                      COUNT(1))                             missing_value_ratio

                       FROM key_value t
                                INNER JOIN refcon.dataset.INFORMATION_SCHEMA.COLUMNS c
                                           ON c.column_name = t.column_name AND
                                              c.table_name = 'table_1'
                       WHERE data_type IN ('STRING', 'BOOL')
                       GROUP BY t.column_name)

             SELECT STRING_AGG(column_name) AS columns
             FROM table_columns
             WHERE (unique_count < 5 AND missing_value_ratio < 0.1)
                OR is_bool);
    EXECUTE IMMEDIATE FORMAT("""SELECT %s
                                FROM refcon.dataset.table_1
                             """, v_columns);
END;

Here is how this works:

  • First we transform our table to key value pairs:

    • We do that by converting the whole table to a json string using TO_JSON_STRING function. The output will be one string where each row is a json object.
    • We then remove '{}' and '"' from the output string and split by ',' so we get col:value.
    • We use SPLIT and OFFSET to get the first value as the column name and the second as the column value.
  • We calculate the necessary aggregations to be used in our conditions:

    • We join with INFORMATION_SCHEMA.COLUMNS and keep only the columns with STRING or BOOL types.
    • We get the unique count by using COUNT(DISTINCT). We exclude null to ignore missing values. This assumes that no rows had the value null in any string column before the transformation.
    • We use LOGICAL_OR to return TRUE when the column data type is BOOL.
    • We calculate the missing values ratio but counting the rows where the value equals null. If the value of the string column was 'null' as a string before transformation, it will also be counted. This needs to be handled if we only need to count actual missing values.
    • We divide by COUNT(1) to get the ratio
  • We use AND and OR to combine our conditions and keep columns that match the condition.

  • We apply STRING_AGG to get the column names as a comma separated string. i.e. col_1, col_5, col_6 and set v_columns variable.
  • We use EXECUTE IMMEDIATE to execute our dynamic query using the v_table and v_columns variables.
  • We achieve this using the FORMAT function which formats a data type expression as a string. %s will be replaced with the variable values in the order of listing the variables.
SQL
I/O