On this page, we look at how to inspect missing values in a table. In particular, we will cover the following common scenarios:
For each of these scenarios, we look at how to summarize the state of missing values over the entire table, over a particular column, and over each column.
Note: Our coverage here will assume that missing values are encoded correctly as null
.
Occasionally, that may not be
the case, and we would need to encode missing values as such before checking for them. We will cover
encoding missing
values in General Operations.
Dealing with missing values is an extensive topic. In General Operations , we will cover more aspects of missing value management including dropping missing values, imputing missing values, and missing value encoding.
We wish to know if there are any missing values in the first place.
Table
We wish to know if a table has any missing values.
BEGIN
DECLARE v_null_condition STRING;
SET v_null_condition =
(SELECT STRING_AGG(CONCAT(column_name, ' IS NULL '), 'OR ')
FROM refcon.dataset.INFORMATION_SCHEMA.COLUMNS
WHERE table_name = 'table_1');
EXECUTE IMMEDIATE FORMAT("""
SELECT LOGICAL_OR(%s)
FROM refcon.dataset.table_1
""", v_null_condition);
END;
Here is how this works:
INFORMATION_SCHEMA
provides metadata for each table in the dataset and COLUMNS
view returns
the info for each
column in the table.CONCAT
the column name with IS NULL
to get col_1 IS NULL
.STRING_AGG
with OR
as a separator to get col_1 IS NULL OR col_2 IS NULL OR ..
.EXECUTE IMMEDIATE
to execute our dynamic query using the v_null_condition
variable.FORMAT
function which formats a data type expression as a string. %s
will be replaced
with the generated null condition.LOGICAL_OR
aggregation function will return true if the condition evaluates to true at least
once.Particular Column
We wish to know if a particular column has any missing values.
SELECT LOGICAL_OR(col_10 IS NULL) has_null_values
FROM refcon.dataset.table_1;
Here is how this works:
LOGICAL_OR
aggregation function will return true if the condition col_10 IS NULL
evaluates to
true at least once.Each Column
We wish to know which columns of the table have missing values.
WITH key_value AS (SELECT SPLIT(kv, ':')[OFFSET(0)] column_name,
SPLIT(kv, ':')[OFFSET(1)] column_value
FROM refcon.dataset.table_1 t,
UNNEST(SPLIT(
REGEXP_REPLACE(
TO_JSON_STRING(t), r'[{}"]', ''))) kv)
SELECT column_name, LOGICAL_OR(column_value = 'null') has_null_values
FROM key_value
GROUP BY column_name;
Here is how this works:
First, we transform our table into key-value pairs:
TO_JSON_STRING
function. The
output will be one
string where each row from the original table is now a json object.'{}'
and '"'
from the output string and split by ','
so we get col:value
in each row.SPLIT
and OFFSET
to get the first value as the column name and the second as the
column value.Get table columns with missing values:
LOGICAL_OR
to check if any row has the value null
. If the value of the
string column was 'null'
as a string before transformation, it will also be counted. This
needs to be handled if
we only need to count actual missing values.We wish to know how many missing values there are.
Table
We wish to obtain the number of missing values in a table.
BEGIN
DECLARE v_null_condition STRING;
SET v_null_condition =
(SELECT STRING_AGG(CONCAT(column_name, ' IS NULL '), 'OR ')
FROM refcon.dataset.INFORMATION_SCHEMA.COLUMNS
WHERE table_name = 'table_1');
EXECUTE IMMEDIATE FORMAT("""
SELECT COUNT(CASE WHEN %s THEN 1 END)
FROM refcon.dataset.table_1
""", v_null_condition);
END;
Here is how this works:
This works similarly to the "Any: Table" scenario described above, except we use COUNT
.
Particular Column
We wish to get the number of missing values in a particular column.
SELECT COUNT(1) null_value_count
FROM refcon.dataset.table_1
WHERE col_5 IS NULL;
Here is how this works:
col_5 IS NULL
.COUNT(1)
to get the count of rows.Each Column
We wish to obtain the number of missing values per column for each column in a table.
WITH key_value AS (SELECT SPLIT(kv, ':')[OFFSET(0)] column_name,
SPLIT(kv, ':')[OFFSET(1)] column_value
FROM refcon.dataset.table_1 t,
UNNEST(SPLIT(
REGEXP_REPLACE(
TO_JSON_STRING(t), r'[{}"]', ''))) kv)
SELECT column_name, COUNT(CASE WHEN column_value = 'null' THEN column_value END) null_value_count
FROM key_value
GROUP BY column_name
Here is how this works:
This works similarly to the "Any: Each Column" scenario described above, except we use COUNT
.
We wish to get the proportion (percentage) of values that are missing.
Table
We wish to know the proportion of a table’s values that are missing.
WITH key_value AS (SELECT SPLIT(kv, ':')[OFFSET(0)] column_name,
SPLIT(kv, ':')[OFFSET(1)] column_value
FROM refcon.dataset.table_1 t,
UNNEST(SPLIT(
REGEXP_REPLACE(
TO_JSON_STRING(t), r'[{}"]', ''))) kv)
SELECT COUNT(CASE WHEN column_value = 'null' THEN column_value END)
/
COUNT(1) null_value_proportion
FROM key_value
Here is how this works:
This works similarly to the "Any: Each Column" scenario described above, except we calculate the proportion without grouping by any column.
Particular Column
We wish to know the proportion of a column’s values that are missing.
SELECT COUNT(CASE WHEN col_5 IS NULL THEN 1 END)
/
COUNT(1) null_value_proportion
FROM refcon.dataset.table_1
Here is how this works:
col_5 IS NULL
using COUNT (CASE WHEN)
.COUNT(1)
to get the proportion of rows.Each Column
We wish to obtain the proportion of missing values for each column in a table.
WITH key_value AS (SELECT SPLIT(kv, ':')[OFFSET(0)] column_name,
SPLIT(kv, ':')[OFFSET(1)] column_value
FROM refcon.dataset.table_1 t,
UNNEST(SPLIT(
REGEXP_REPLACE(
TO_JSON_STRING(t), r'[{}"]', ''))) kv)
SELECT column_name,
COUNT(CASE WHEN column_value = 'null' THEN column_value END)
/
COUNT(1) null_value_proportion
FROM key_value
GROUP BY column_name
ORDER BY null_value_proportion DESC
Here is how this works:
This works similarly to the "Any: Each Column" scenario described above, except we calculate the proportion.
Any
We wish to get rows where any column has a missing value.
BEGIN
DECLARE v_null_condition STRING;
SET v_null_condition =
(SELECT STRING_AGG(CONCAT(column_name, ' IS NULL '), 'OR ')
FROM refcon.dataset.INFORMATION_SCHEMA.COLUMNS
WHERE table_name = 'table_1');
EXECUTE IMMEDIATE FORMAT("""
SELECT *
FROM refcon.dataset.table_1
WHERE %s
""", v_null_condition);
END;
Here is how this works:
This works similarly to the "Any: Table" scenario described above, except we filter based on the null condition.
Count
Get rows where more than n
columns have missing values. In this example, we extract any row where
more than 1
columns
have missing values.
BEGIN
DECLARE v_null_condition STRING;
SET v_null_condition =
(SELECT STRING_AGG(CONCAT('CAST (', column_name, ' IS NULL AS INT64 )'), '+ ')
FROM refcon.dataset.INFORMATION_SCHEMA.COLUMNS
WHERE table_name = 'table_1');
EXECUTE IMMEDIATE FORMAT("""
SELECT *
FROM refcon.dataset.table_1
WHERE %s > 1
""", v_null_condition);
END;
Here is how this works:
INFORMATION_SCHEMA
provides metadata for each table in the dataset and COLUMNS
view returns
the info for each
column in the table.CONCAT
the column name with IS NULL
to get col_1 IS NULL
.TRUE =1
and FALSE=0
.STRING_AGG
with +
as a separator to
get CAST(col_1 IS NULL AS INT64) + CAST(col_2 IS NULL AS INT64) OR ..
.EXECUTE IMMEDIATE
to execute our dynamic query using the v_null_condition
variable.FORMAT
function which formats a data type expression as a string. %s
will be replaced
with the generated null condition.v_null_condition
will be bigger than 1 if at least two columns have missing values for a
particular row.Get columns where more than a given percent of values are missing. In this example, we get columns
where more than 10%
of the values are missing.
BEGIN
DECLARE v_columns STRING;
SET v_columns =
(WITH key_value AS
(SELECT SPLIT(kv, ':')[OFFSET(0)] column_name,
SPLIT(kv, ':')[OFFSET(1)] column_value
FROM refcon.dataset.table_1 t,
UNNEST(SPLIT(
REGEXP_REPLACE(
TO_JSON_STRING(t), r'[{}"]', ''))) kv),
table_columns AS
(SELECT column_name,
SAFE_DIVIDE(
COUNT(CASE
WHEN column_value = 'null'
THEN column_value END),
COUNT(1)) missing_vlaue_ratio
FROM key_value t
GROUP BY column_name)
SELECT STRING_AGG(column_name) AS columns
FROM table_columns
WHERE missing_vlaue_ratio < 0.1);
EXECUTE IMMEDIATE FORMAT("""SELECT %s
FROM refcon.dataset.table_1
""", v_columns);
END;
Here is how this works:
missing_vlaue_ratio < 0.1
and use STRING_AGG
to get a comma
seperated list of column
names. i.e. col_1,col_2,...EXECUTE IMMEDIATE
to execute our dynamic query using the v_columns
variable.FORMAT
function which formats a data type expression as a string. %s
will be replaced
with the generated column names.