Selecting by Data Criteria

We wish to select columns whose data meets certain criteria.

Oftentimes we wish to select columns whose data satisfies certain conditions e.g. the percentage of missing values is below 10%.

Essentially, we apply a predicate operation (i.e. an operation that returns TRUE or FALSE) to each column of the table and select the Columns for which the function evaluates to TRUE. There are two common scenarios:

  1. Data Type Agnostic: The predicate function we are applying is valid for all data types e.g. checking for missing values.
  2. Data Type Dependent: The predicate function we are applying is only valid for particular data types. In this case, we need to first select columns of the appropriate data type before we run the predicate function on each of those columns e.g. checking that the mean is greater than a certain value is only valid for numeric and logical data types.

For detailed coverage of data type specific operations, see the respective data type operations section.

Data Type Agnostic

We wish to select columns whose data satisfies a certain condition where the condition is applicable to all data types.

In this example, we wish to return columns where the percentage of missing values to the total number of values is less than 10%.

BEGIN
    DECLARE v_columns STRING;

    SET v_columns =
            (WITH key_value AS
                      (SELECT SPLIT(kv, ':')[OFFSET(0)] column_name,
                              SPLIT(kv, ':')[OFFSET(1)] column_value
                       FROM refcon.dataset.table_1 t,
                            UNNEST(SPLIT(
                                    REGEXP_REPLACE(
                                            TO_JSON_STRING(t), r'[{}"]', ''))) kv),
                  table_columns AS
                      (SELECT column_name,
                              SAFE_DIVIDE(
                                      COUNT(CASE
                                                WHEN column_value = 'null'
                                                    THEN column_value END),
                                      COUNT(1)) missing_vlaue_ratio
                       FROM key_value t
                       GROUP BY column_name)
             SELECT STRING_AGG(column_name) AS columns
             FROM table_columns
             WHERE missing_vlaue_ratio < 0.1);
    EXECUTE IMMEDIATE FORMAT("""SELECT %s
                                FROM refcon.dataset.table_1
                             """, v_columns);
END;

Here is how this works:

  • First we transform our table to key value pairs:

    • We do that by converting the whole table to a json string using TO_JSON_STRING function. The output will be one string where each row from the original table is now a json object.
    • We then remove '{}' and '"' from the output string and split by ',' so we get col:value in each row.
    • We use SPLIT and OFFSET to get the first value as the column name and the second as the column value.
  • Get table columns with less than 10% missing values:

    • We calculate the missing values ratio by counting the rows where the value equals null. If the value of the string column was 'null' as a string before transformation, it will also be counted. This needs to be handled if we only need to count actual missing values.
    • We divide by COUNT(1) to get the ratio, and then we only keep columns with missing_vlaue_ratio < 0.1.
    • We apply STRING_AGG to get the column names as a comma separated string. i.e. col_1, col_5, col_6 and set v_columns variable.
  • We use EXECUTE IMMEDIATE to execute our dynamic query using the v_table and v_columns variables.

  • We achieve this using the FORMAT function which formats a data type expression as a string. %s will be replaced with the variable values in the order of listing the variables.

Data Type Dependent

We wish to select columns whose data satisfies a certain condition where the condition is applicable only to columns of particular data types. In this scenario, we need to first select the columns of the appropriate data type(s) and then apply the function to those.

In this example, we wish to return numeric columns where the mean is bigger than 1.

BEGIN
    DECLARE v_data_types ARRAY <STRING>;
    DECLARE v_columns STRING;

    SET v_data_types = ['INT64', 'NUMERIC', 'BIGNUMERIC', 'FLOAT64'];
    SET v_columns =
            (WITH key_value AS
                      (SELECT SPLIT(kv, ':')[OFFSET(0)] column_name,
                              SPLIT(kv, ':')[OFFSET(1)] column_value
                       FROM refcon.dataset.table_1 t,
                            UNNEST(SPLIT(
                                    REGEXP_REPLACE(
                                            TO_JSON_STRING(t), r'[{}"]', ''))) kv),
                  table_columns AS
                      (SELECT t.column_name,
                              AVG(SAFE_CAST(column_value AS BIGNUMERIC)) column_avg
                       FROM key_value t
                                INNER JOIN refcon.dataset.INFORMATION_SCHEMA.COLUMNS c
                                           ON c.column_name = t.column_name AND
                                              c.table_name = 'table_1'
                       WHERE data_type IN UNNEST(v_data_types)
                       GROUP BY t.column_name)

             SELECT STRING_AGG(column_name) AS columns
             FROM table_columns
             WHERE column_avg > 1);
    EXECUTE IMMEDIATE FORMAT("""SELECT %s
                                FROM refcon.dataset.table_1
                             """, v_columns);
END;

Here is how this works:

  • First we transform our table to key value pairs in. This works similarly to the scenario above.

  • Get table columns where the mean is bigger than 1:

    • We join with INFORMATION_SCHEMA.COLUMNS and keep only the columns with numeric types.
    • We use SAFE_CAST to cast all columns from string back to numeric. We cast all types to BIGNUMERIC as it is inclusive of all smaller numeric types.
    • We use AVG() to calculate the mean for each column.
    • We only keep columns with column_avg > 1 and then we apply STRING_AGG to get the column names as a comma separated string. i.e. col_1, col_5, col_6 and set v_columns variable.
  • We use EXECUTE IMMEDIATE to execute our dynamic query using the v_table and v_columns variables.

  • We achieve this using the FORMAT function which formats a data type expression as a string. %s will be replaced with the variable values in the order of listing the variables.
SQL
I/O