Fill Missing

We wish to fill in missing values with suitable values. This is often referred to as imputation.

In this section, we cover the most common approaches to filling in missing values, which are:

  • Replace: Fill in missing values with a static or summary value, such as the mean of the column.
  • Pad: Fill in missing values with the previous or the next value.
  • Coalesce: Fill in the missing values in a column with the corresponding values from another column of the same size.

Replace

We wish to fill in missing values in a column of a table with a static or a summary value.

In this example, we wish to replace any NULL values in the column col_2 of the table table_1 with 0.

SELECT col_1,
       IFNULL(col_2, 0) col_2
FROM table_1;

Here is how this works:

  • We use the function IFNULL() to replace NULLs with a specified value.
  • In this example, the function IFNULL() takes as input:
    • The column whose missing values we wish to replace, which in this case is col_2.
    • The value that we want to use to fill in the missing values, which here is 0.

Alternative: COALESCE

SELECT col_1,
       COALESCE(col_2, 0) col_2
FROM table_1;

Here is how this works:

This works similarly to the primary solution, except we use COALESCE instead of IFNULL. We will describe COALESCE in more details in Coalesce section.

Extension: Replace with Summary

SELECT col_1,
       IFNULL(col_2, (SELECT AVG(col_2) FROM table_1)) col_2
FROM table_1;

Here is how this works:

  • We use IFNULL() to replace missing values as described above. However, instead of passing a constant value, we compute and pass the mean of the values of the column col_2 using the subquery (SELECT AVG(col_2) FROM table_1).
  • Note that AVG() ignores missing values by default.

Extension: Replace with Group Summary

SELECT col_1,
       IFNULL(col_2, AVG(col_2) OVER (PARTITION BY col_1)) col_2
FROM table_1;

Here is how this works:

  • This works similarly to the primary solution above except that we use AVG() as a window function and partition by col_1.
  • In particular, calling AVG(col_2) OVER (PARTITION BY col_1) will compute the mean value of col_2 for each group. Then IFNULL() will use that group-wise mean value to fill in missing values in the respective group.

Pad

We wish to fill in missing values with the previous non-missing value or the next non-missing value.

Previous

In this example, we wish to fill in missing values in the column col_2 with the first prior non-missing value according to the order of column col_1.

SELECT col_1,
       IFNULL(col_2, LAST_VALUE(col_2 IGNORE NULLS)
                                OVER (ORDER BY col_1 ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)) AS col_2
FROM table_1;

Here is how this works:

  • We use the function IFNULL() to fill in missing values in col_2, with the first previous non-missing value using LAST_VALUE() window function.
  • LAST_VALUE is a window function that returns the last value in a window of rows. The IGNORE NULLS option is to skip any NULL values and return the last non-null value instead.
  • OVER is used to define the window within which the function should operate. In this case, we are using ORDER BY col_1 to specify that the rows should be ordered by the values in col_1. The ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING clause specifies the range of rows to include in the window. In this example, include the current row and all the rows that come before it.
  • UNBOUNDED PRECEDING means that the window starts with the first row in the result set, and 1 PRECEDING means that the window includes all rows up to and including the row immediately preceding the current row.
  • This means that LAST_VALUE will return the last non-null value that appears in any row before the current row, in the order defined by ORDER BY col_1.
  • The output table will have the same columns and the same rows as the table table_1 but where any missing value in the column col_2 is filled with the first previous non-missing value.

Next

In this example, we wish to fill in missing values in the column col_2 with the first next non-missing value according to the order of column col_1.

SELECT col_1,
       IFNULL(col_2, FIRST_VALUE(col_2 IGNORE NULLS)
                                 OVER (ORDER BY col_1 ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING)) AS col_2
FROM table_1;

Here is how this works:

  • We use the function IFNULL() to fill in missing values in col_2, with the first bext non-missing value using FIRST_VALUE() window function.
  • FIRST_VALUE is a window function that returns the first value in a window of rows. The IGNORE NULLS option is to skip any NULL values and return the furst non-null value instead.
  • OVER is used to define the window within which the function should operate. In this case, we are using ORDER BY col_1 to specify that the rows should be ordered by the values in col_1. The ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING) clause specifies the range of rows to include in the window. In this example, include the current row and all the rows that come after it.
  • 1 FOLLOWING means that the window starts with the next row after the current one, and UNBOUNDED FOLLOWING means that the window includes all rows after the current one.
  • This means that FIRST_VALUE will return the first non-null value that appears in any row after the current row, in the order defined by ORDER BY col_1.

Extension: Grouped Padding

In this example, we wish to fill in missing values in the column col_3 with the first next non-missing value in its group where the groups are defined by the value of the column col_1. Next value is according to the order of column col_2

SELECT col_1,
       col_2,
       COALESCE(col_3,
                FIRST_VALUE(col_3 IGNORE NULLS)
                            OVER (PARTITION BY col_1 ORDER BY col_2 ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING)) AS col_3
FROM table_1;

Here is how this works:

  • This works similarly to the primary solution above except that we use FIRST PARTITION BY col_1 in the OVER() clause.
  • In particular, calling FIRST_VALUE() with OVER (PARTITION BY col_1 ..) will fill in missing values in the column col_3 with other values from within the group.

Coalesce

We wish to fill in the missing values in a column or a vector with the corresponding values from another column; a process often referred to as coalescing.

In this example, we wish to fill in any missing values in the column col_2 with the corresponding values of the column col_3.

SELECT col_1,
       COALESCE(col_2, col_3) AS col_2,
       col_3
FROM table_1;

Here is how this works:

  • We use the function COALESCE() to fill in any missing values in the column col_2 with the corresponding values of the column col_3.
  • The output table will have the same rows and columns of the input table table_1 but with any missing values in the column col_2 with the corresponding values of the column col_3.

Extension: Coalesce with Multiple Columns

SELECT col_1,
       COALESCE(col_2, col_3, col_4) AS col_2,
       col_3,
       col_4
FROM table_1;

Here is how this works:

  • This works similarly to the primary solution above except that we pass more than two columns to COALESCE.
  • It will replace the missing value in col_2 with the first non-missing value of the remaining columns. In this example, if col_3 is not NULL, it will fill col_2 with the value of col_3, otherwise it will use the value in col_4.

Extension: Coalesce with Different tables

We wish to have missing values in one table be filled with corresponding values from another table.

SELECT t1.col_1,
       COALESCE(t1.col_2, t2.col_2) AS col_2,
       COALESCE(t1.col_3, t2.col_3) AS col_3,
FROM table_1 AS t1
         LEFT JOIN table_2 AS t2 ON t1.col_1 = t2.col_1;

Here is how this works:

  • We first left join the two tables table_1 and table_2 by appropriate join columns, which here is col_1. See Joining.
  • When coalescing two different data structures, we need to ensure that both columns are aligned. The left join operation ensures that that is the case.
  • We then use COALESCE() and use the tables aliases to select which column to coalesce. In this example, we coalesce col_2 and col_3 from table_1 with col_2 and col_3 from table_2.
SQL
I/O