Missing Values

On this page, we look at how to inspect missing values in a table. In particular, we will cover the following common scenarios:

  • Summary of the state of missing values:
    • Any: Are there any missing values?
    • Count: How many missing values are there?
    • Proportion: What percent of values is missing?

For each of these scenarios, we look at how to summarize the state of missing values over the entire table, over a particular column, and over each column.

  • Look at the data around missing values:
    • Get Rows: Return rows that have missing values.
    • Get Columns: Return columns that have missing values.

Note: Our coverage here will assume that missing values are encoded correctly as null. Occasionally, that may not be the case, and we would need to encode missing values as such before checking for them. We will cover encoding missing values in General Operations.

Dealing with missing values is an extensive topic. In General Operations , we will cover more aspects of missing value management including dropping missing values, imputing missing values, and missing value encoding.

Any

We wish to know if there are any missing values in the first place.

Table

We wish to know if a table has any missing values.

BEGIN
    DECLARE v_null_condition STRING;
    SET v_null_condition =
            (SELECT STRING_AGG(CONCAT(column_name, ' IS NULL '), 'OR ')
             FROM refcon.dataset.INFORMATION_SCHEMA.COLUMNS
             WHERE table_name = 'table_1');
    EXECUTE IMMEDIATE FORMAT("""
                                SELECT LOGICAL_OR(%s)
                                FROM refcon.dataset.table_1
                             """, v_null_condition);

END;

Here is how this works:

  • First, we want to construct a null condition over all columns in the table.
  • INFORMATION_SCHEMA provides metadata for each table in the dataset and COLUMNS view returns the info for each column in the table.
  • We CONCAT the column name with IS NULL to get col_1 IS NULL.
  • We use STRING_AGG with OR as a separator to get col_1 IS NULL OR col_2 IS NULL OR ...
  • We use EXECUTE IMMEDIATE to execute our dynamic query using the v_null_condition variable.
  • We achieve this using the FORMAT function which formats a data type expression as a string. %s will be replaced with the generated null condition.
  • LOGICAL_OR aggregation function will return true if the condition evaluates to true at least once.

Particular Column

We wish to know if a particular column has any missing values.

SELECT LOGICAL_OR(col_10 IS NULL) has_null_values
FROM refcon.dataset.table_1;

Here is how this works:

  • LOGICAL_OR aggregation function will return true if the condition col_10 IS NULL evaluates to true at least once.

Each Column

We wish to know which columns of the table have missing values.

WITH key_value AS (SELECT SPLIT(kv, ':')[OFFSET(0)] column_name,
                          SPLIT(kv, ':')[OFFSET(1)] column_value
                   FROM refcon.dataset.table_1 t,
                        UNNEST(SPLIT(
                                REGEXP_REPLACE(
                                        TO_JSON_STRING(t), r'[{}"]', ''))) kv)

SELECT column_name, LOGICAL_OR(column_value = 'null') has_null_values
FROM key_value
GROUP BY column_name;

Here is how this works:

  • First, we transform our table into key-value pairs:

    • We do that by converting the whole table to a json string using TO_JSON_STRING function. The output will be one string where each row from the original table is now a json object.
    • We then remove '{}' and '"' from the output string and split by ',' so we get col:value in each row.
    • We use SPLIT and OFFSET to get the first value as the column name and the second as the column value.
  • Get table columns with missing values:

    • We use LOGICAL_OR to check if any row has the value null. If the value of the string column was 'null' as a string before transformation, it will also be counted. This needs to be handled if we only need to count actual missing values.

Count

We wish to know how many missing values there are.

Table

We wish to obtain the number of missing values in a table.

BEGIN
    DECLARE v_null_condition STRING;
    SET v_null_condition =
            (SELECT STRING_AGG(CONCAT(column_name, ' IS NULL '), 'OR ')
             FROM refcon.dataset.INFORMATION_SCHEMA.COLUMNS
             WHERE table_name = 'table_1');
    EXECUTE IMMEDIATE FORMAT("""
                                SELECT COUNT(CASE WHEN %s THEN 1 END)
                                FROM refcon.dataset.table_1
                             """, v_null_condition);

END;

Here is how this works:

This works similarly to the "Any: Table" scenario described above, except we use COUNT.

Particular Column

We wish to get the number of missing values in a particular column.

SELECT COUNT(1) null_value_count
FROM refcon.dataset.table_1
WHERE col_5 IS NULL;

Here is how this works:

  • We filter rows where col_5 IS NULL.
  • We use COUNT(1) to get the count of rows.

Each Column

We wish to obtain the number of missing values per column for each column in a table.

WITH key_value AS (SELECT SPLIT(kv, ':')[OFFSET(0)] column_name,
                          SPLIT(kv, ':')[OFFSET(1)] column_value
                   FROM refcon.dataset.table_1 t,
                        UNNEST(SPLIT(
                                REGEXP_REPLACE(
                                        TO_JSON_STRING(t), r'[{}"]', ''))) kv)

SELECT column_name, COUNT(CASE WHEN column_value = 'null' THEN column_value END) null_value_count
FROM key_value
GROUP BY column_name

Here is how this works:

This works similarly to the "Any: Each Column" scenario described above, except we use COUNT.

Proportion

We wish to get the proportion (percentage) of values that are missing.

Table

We wish to know the proportion of a table’s values that are missing.

WITH key_value AS (SELECT SPLIT(kv, ':')[OFFSET(0)] column_name,
                          SPLIT(kv, ':')[OFFSET(1)] column_value
                   FROM refcon.dataset.table_1 t,
                        UNNEST(SPLIT(
                                REGEXP_REPLACE(
                                        TO_JSON_STRING(t), r'[{}"]', ''))) kv)

SELECT COUNT(CASE WHEN column_value = 'null' THEN column_value END)
           /
       COUNT(1) null_value_proportion
FROM key_value

Here is how this works:

This works similarly to the "Any: Each Column" scenario described above, except we calculate the proportion without grouping by any column.

Particular Column

We wish to know the proportion of a column’s values that are missing.

SELECT COUNT(CASE WHEN col_5 IS NULL THEN 1 END)
           /
       COUNT(1) null_value_proportion
FROM refcon.dataset.table_1

Here is how this works:

  • We count rows where col_5 IS NULL using COUNT (CASE WHEN).
  • We divide by COUNT(1) to get the proportion of rows.

Each Column

We wish to obtain the proportion of missing values for each column in a table.

WITH key_value AS (SELECT SPLIT(kv, ':')[OFFSET(0)] column_name,
                          SPLIT(kv, ':')[OFFSET(1)] column_value
                   FROM refcon.dataset.table_1 t,
                        UNNEST(SPLIT(
                                REGEXP_REPLACE(
                                        TO_JSON_STRING(t), r'[{}"]', ''))) kv)

SELECT column_name,
       COUNT(CASE WHEN column_value = 'null' THEN column_value END)
           /
       COUNT(1) null_value_proportion
FROM key_value
GROUP BY column_name
ORDER BY null_value_proportion DESC 

Here is how this works:

This works similarly to the "Any: Each Column" scenario described above, except we calculate the proportion.

Get Rows

Any

We wish to get rows where any column has a missing value.

BEGIN
    DECLARE v_null_condition STRING;
    SET v_null_condition =
            (SELECT STRING_AGG(CONCAT(column_name, ' IS NULL '), 'OR ')
             FROM refcon.dataset.INFORMATION_SCHEMA.COLUMNS
             WHERE table_name = 'table_1');
    EXECUTE IMMEDIATE FORMAT("""
                                SELECT *
                                FROM refcon.dataset.table_1
                                WHERE %s
                             """, v_null_condition);

END;

Here is how this works:

This works similarly to the "Any: Table" scenario described above, except we filter based on the null condition.

Count

Get rows where more than n columns have missing values. In this example, we extract any row where more than 1 columns have missing values.

BEGIN
    DECLARE v_null_condition STRING;
    SET v_null_condition =
            (SELECT STRING_AGG(CONCAT('CAST (', column_name, ' IS NULL AS INT64 )'), '+ ')
             FROM refcon.dataset.INFORMATION_SCHEMA.COLUMNS
             WHERE table_name = 'table_1');
    EXECUTE IMMEDIATE FORMAT("""
                                SELECT *
                                FROM refcon.dataset.table_1
                                WHERE %s > 1
                             """, v_null_condition);

END;

Here is how this works:

  • First, we want to construct a null condition over all columns in the table.
  • INFORMATION_SCHEMA provides metadata for each table in the dataset and COLUMNS view returns the info for each column in the table.
  • We CONCAT the column name with IS NULL to get col_1 IS NULL.
  • We cast the boolean result to an integer. i.e. TRUE =1 and FALSE=0.
  • We use STRING_AGG with + as a separator to get CAST(col_1 IS NULL AS INT64) + CAST(col_2 IS NULL AS INT64) OR ...
  • We use EXECUTE IMMEDIATE to execute our dynamic query using the v_null_condition variable.
  • We achieve this using the FORMAT function which formats a data type expression as a string. %s will be replaced with the generated null condition.
  • v_null_condition will be bigger than 1 if at least two columns have missing values for a particular row.

Get Columns

Get columns where more than a given percent of values are missing. In this example, we get columns where more than 10% of the values are missing.

BEGIN
    DECLARE v_columns STRING;

    SET v_columns =
            (WITH key_value AS
                      (SELECT SPLIT(kv, ':')[OFFSET(0)] column_name,
                              SPLIT(kv, ':')[OFFSET(1)] column_value
                       FROM refcon.dataset.table_1 t,
                            UNNEST(SPLIT(
                                    REGEXP_REPLACE(
                                            TO_JSON_STRING(t), r'[{}"]', ''))) kv),
                  table_columns AS
                      (SELECT column_name,
                              SAFE_DIVIDE(
                                      COUNT(CASE
                                                WHEN column_value = 'null'
                                                    THEN column_value END),
                                      COUNT(1)) missing_vlaue_ratio
                       FROM key_value t
                       GROUP BY column_name)
             SELECT STRING_AGG(column_name) AS columns
             FROM table_columns
             WHERE missing_vlaue_ratio < 0.1);
    EXECUTE IMMEDIATE FORMAT("""SELECT %s
                                FROM refcon.dataset.table_1
                             """, v_columns);
END;

Here is how this works:

  • First, we transform the table into key-value pairs similar to the "Any: Each Column" scenario described above.
  • Then we calculate the missing value proportion similar to the "Proportion: Each Column" scenario described above.
  • We filter the columns with missing_vlaue_ratio < 0.1 and use STRING_AGG to get a comma seperated list of column names. i.e. col_1,col_2,...
  • We use EXECUTE IMMEDIATE to execute our dynamic query using the v_columns variable.
  • We achieve this using the FORMAT function which formats a data type expression as a string. %s will be replaced with the generated column names.
SQL
I/O