We wish to select columns whose data meets certain criteria.
Oftentimes we wish to select columns whose data satisfies certain conditions e.g. the percentage of missing values is below 10%.
Essentially, we apply a predicate operation (i.e. an operation that returns TRUE
or FALSE
) to each column of the
table and select the Columns for which the function evaluates to TRUE
. There are two common scenarios:
For detailed coverage of data type specific operations, see the respective data type operations section.
We wish to select columns whose data satisfies a certain condition where the condition is applicable to all data types.
In this example, we wish to return columns where the percentage of missing values to the total number of values is less than 10%.
BEGIN
DECLARE v_columns STRING;
SET v_columns =
(WITH key_value AS
(SELECT SPLIT(kv, ':')[OFFSET(0)] column_name,
SPLIT(kv, ':')[OFFSET(1)] column_value
FROM refcon.dataset.table_1 t,
UNNEST(SPLIT(
REGEXP_REPLACE(
TO_JSON_STRING(t), r'[{}"]', ''))) kv),
table_columns AS
(SELECT column_name,
SAFE_DIVIDE(
COUNT(CASE
WHEN column_value = 'null'
THEN column_value END),
COUNT(1)) missing_vlaue_ratio
FROM key_value t
GROUP BY column_name)
SELECT STRING_AGG(column_name) AS columns
FROM table_columns
WHERE missing_vlaue_ratio < 0.1);
EXECUTE IMMEDIATE FORMAT("""SELECT %s
FROM refcon.dataset.table_1
""", v_columns);
END;
Here is how this works:
First we transform our table to key value pairs:
TO_JSON_STRING
function. The output will be one
string where each row from the original table is now a json object.'{}'
and '"'
from the output string and split by ','
so we get col:value in each row.SPLIT
and OFFSET
to get the first value as the column name and the second as the column value.Get table columns with less than 10% missing values:
null
. If the value of the
string column was 'null'
as a string before transformation, it will also be counted. This needs to be handled if
we only need to count actual missing values.COUNT(1)
to get the ratio, and then we only keep columns with missing_vlaue_ratio < 0.1
.STRING_AGG
to get the column names as a comma separated string. i.e. col_1, col_5, col_6
and
set v_columns
variable.We use EXECUTE IMMEDIATE
to execute our dynamic query using the v_table
and v_columns
variables.
FORMAT
function which formats a data type expression as a string. %s
will be replaced
with the variable values in the order of listing the variables.We wish to select columns whose data satisfies a certain condition where the condition is applicable only to columns of particular data types. In this scenario, we need to first select the columns of the appropriate data type(s) and then apply the function to those.
In this example, we wish to return numeric columns where the mean is bigger than 1.
BEGIN
DECLARE v_data_types ARRAY <STRING>;
DECLARE v_columns STRING;
SET v_data_types = ['INT64', 'NUMERIC', 'BIGNUMERIC', 'FLOAT64'];
SET v_columns =
(WITH key_value AS
(SELECT SPLIT(kv, ':')[OFFSET(0)] column_name,
SPLIT(kv, ':')[OFFSET(1)] column_value
FROM refcon.dataset.table_1 t,
UNNEST(SPLIT(
REGEXP_REPLACE(
TO_JSON_STRING(t), r'[{}"]', ''))) kv),
table_columns AS
(SELECT t.column_name,
AVG(SAFE_CAST(column_value AS BIGNUMERIC)) column_avg
FROM key_value t
INNER JOIN refcon.dataset.INFORMATION_SCHEMA.COLUMNS c
ON c.column_name = t.column_name AND
c.table_name = 'table_1'
WHERE data_type IN UNNEST(v_data_types)
GROUP BY t.column_name)
SELECT STRING_AGG(column_name) AS columns
FROM table_columns
WHERE column_avg > 1);
EXECUTE IMMEDIATE FORMAT("""SELECT %s
FROM refcon.dataset.table_1
""", v_columns);
END;
Here is how this works:
First we transform our table to key value pairs in. This works similarly to the scenario above.
Get table columns where the mean is bigger than 1:
INFORMATION_SCHEMA.COLUMNS
and keep only the columns with numeric types.BIGNUMERIC
as it is
inclusive of all smaller numeric types.AVG()
to calculate the mean for each column.column_avg > 1
and then we apply STRING_AGG
to get the column names as a comma
separated string. i.e. col_1, col_5, col_6
and set v_columns
variable.We use EXECUTE IMMEDIATE
to execute our dynamic query using the v_table
and v_columns
variables.
FORMAT
function which formats a data type expression as a string. %s
will be replaced
with the variable values in the order of listing the variables.