Drop Missing

We wish to drop rows that contain missing values.

We cover the following scenarios:

  • Selected Columns: Drop rows where any or all of a selected set of columns has a missing value.
  • Any Column: Drop rows where any column has a missing value.
  • Empty Rows: Drop rows where all columns have missing values.
  • Absolute Threshold: Drop rows that have more than a specified number of columns with missing values.

Selected Columns

We wish to drop rows where any column from a selected set of columns has a missing value.

In this example, we wish to drop any rows from the table table_1 where either the column col_1 or the column col_2 has a missing value.

SELECT *
FROM table_1
WHERE NOT (col_2 IS NULL OR col_3 IS NULL);

Here is how this works:

  • The standard solution for identifying missing values is the IS NULL expression. It returns TRUE if the value of col_1 is missing and FALSE otherwise.
  • We use the OR to check if col_2 or col_3 has missing value. This returns TRUE for rows where neither of the columns has a missing value. We use WHERE NOT to select rows that complement this condition. See Common Scenarios: Complement and Multiple Filters: OR.
  • The output table will have the same columns as the input table table_1 but with any row, where col_2 or col_3 has a missing value, dropped.
  • An equivalent condition is col_2 IS NOT NULL AND col_3 IS NOT NULL.

Extension: Each Selected Column

We wish to drop rows if and only if all the values of the specified columns are missing.

SELECT *
FROM table_1
WHERE col_2 IS NULL
  AND col_3 IS NULL;

Here is how this works:

  • The standard solution for identifying missing values is the IS NULL expression. It returns TRUE if the value of col_1 is missing and FALSE otherwise.
  • We use the AND to check if both col_2 and col_3 has missing value. This returns TRUE for rows where both of the columns has a missing value. We use WHERE to select rows that meet this condition. See Multiple Filters: AND.
  • The output table will have the same columns as the input table table_1 but with any row where all the specified columns, which here are col_1 and col_2, are missing is dropped.

Any Columns

We wish to drop rows where any column has a missing value.

SELECT *
FROM table_1
WHERE TO_JSON_STRING(table_1) NOT LIKE '%:null%';

Here is how this works:

  • The TO_JSON_STRING function is called on each row in table_1, which converts the row to a JSON string representation. For example in a table with 3 columns , the row will be repressed as "{"col_1":"str value","col_2":null,"col_3": int_value}", where col_2 here has a missing value.
  • Each row is represented as a single string, and we use NOT LIKE '%:null%' to return TRUE if it does not contain the substring :null,. See String Detecting: Complement.
  • We use the substring :null to avoid including string columns that contain the substring 'null'.
  • The output table will have the same columns as the input table table_1 but with any row that has one or more missing values dropped.

Empty Rows

We wish to drop rows where all columns have missing value (a value of NULL).

SELECT *
FROM table_1
WHERE (SELECT COUNT(*)
       FROM UNNEST(REGEXP_EXTRACT_ALL(TO_JSON_STRING(table_1), ':null'))) < 3

Here is how this works:

  • If we have a small number of columns we can use the same approach in the 'Selected Columns' scenario above.
  • If the number of columns is too large we can use TO_JSON_STRING() function to convert each row to a Json string.
  • We then use REGEXP_EXTRACT_ALL() to extract all matches for the pattern ':null'. We use the substring :null to avoid including string columns that contain the substring 'null'.
  • REGEXP_EXTRACT_ALL() returns an array of matches, we UNNSET the array and count the elements which in this case will be the number of columns with missing values.
  • We only keep rows where the number of column with missing values is less than the number of columns in the table. In this example, 3. See Dimensions for a coverage of how to get the number of columns dynamically.
  • The output table will have the same columns as the input table table_1 but with any row where all columns have missing values NULL is dropped.

Absolute Threshold

We wish to drop rows that have more than a specified number of columns with missing values.

In this example, we wish to retain rows where the values of less than three columns are missing NULL.

SELECT *
FROM table_1
WHERE (SELECT COUNT(*)
       FROM UNNEST(REGEXP_EXTRACT_ALL(TO_JSON_STRING(table_1), ':null'))) < 3

Here is how this works:

This works similar to the Empty Rows scenario above except we filter using any number instead of the number of columns in the table.

SQL
I/O