We wish to select columns that meet a logical combination of multiple conditions.
We wish to apply a complex logical expression of multiple conditions to each column individually and return any columns
for which the expression evaluates to TRUE
.
In this example, we wish to select columns that are string and have less than 5 possible values and less than 10% missing values. We also wish to include any columns that have the logical data type.
BEGIN
DECLARE v_threshold FLOAT64;
DECLARE v_columns STRING;
SET v_threshold = 0.1;
SET v_columns =
(WITH key_value AS
(SELECT SPLIT(kv, ':')[OFFSET(0)] column_name,
SPLIT(kv, ':')[OFFSET(1)] column_value
FROM refcon.dataset.table_1 t,
UNNEST(SPLIT(REGEXP_REPLACE(TO_JSON_STRING(t), r'[{}"]', ''))) kv),
table_columns AS
(SELECT t.column_name,
COUNT(DISTINCT CASE
WHEN data_type = 'STRING'
AND column_value <> 'null'
THEN column_value END) unique_count,
LOGICAL_OR(CASE
WHEN data_type = 'BOOL'
THEN TRUE
ELSE FALSE END) is_bool,
SAFE_DIVIDE(
COUNT(CASE
WHEN column_value = 'null'
THEN column_value END),
COUNT(1)) missing_value_ratio
FROM key_value t
INNER JOIN refcon.dataset.INFORMATION_SCHEMA.COLUMNS c
ON c.column_name = t.column_name AND
c.table_name = 'table_1'
WHERE data_type IN ('STRING', 'BOOL')
GROUP BY t.column_name)
SELECT STRING_AGG(column_name) AS columns
FROM table_columns
WHERE (unique_count < 5 AND missing_value_ratio < 0.1)
OR is_bool);
EXECUTE IMMEDIATE FORMAT("""SELECT %s
FROM refcon.dataset.table_1
""", v_columns);
END;
Here is how this works:
First we transform our table to key value pairs:
TO_JSON_STRING
function. The output will be one
string where each row is a json object.'{}'
and '"'
from the output string and split by ','
so we get col:value.SPLIT
and OFFSET
to get the first value as the column name and the second as the column value.We calculate the necessary aggregations to be used in our conditions:
INFORMATION_SCHEMA.COLUMNS
and keep only the columns with STRING
or BOOL
types.COUNT(DISTINCT)
. We exclude null
to ignore missing values. This assumes
that no rows had the value null
in any string column before the transformation.LOGICAL_OR
to return TRUE
when the column data type is BOOL
.null
. If the value of the
string column was 'null' as a string before transformation, it will also be counted. This needs to be handled if
we only need to count actual missing values.COUNT(1)
to get the ratioWe use AND
and OR
to combine our conditions and keep columns that match the condition.
STRING_AGG
to get the column names as a comma separated string. i.e. col_1, col_5, col_6
and
set v_columns
variable.EXECUTE IMMEDIATE
to execute our dynamic query using the v_table
and v_columns
variables.FORMAT
function which formats a data type expression as a string. %s
will be replaced
with the variable values in the order of listing the variables.