We wish to fill in missing values with suitable values. This is often referred to as imputation.
In this section, we cover the most common approaches to filling in missing values, which are:
We wish to fill in missing values in a column of a table with a static or a summary value.
In this example, we wish to replace any NULL
values in the column col_2
of the table table_1
with 0.
SELECT col_1,
IFNULL(col_2, 0) col_2
FROM table_1;
Here is how this works:
IFNULL()
to replace NULL
s with a specified value.IFNULL()
takes as input:col_2
.Alternative: COALESCE
SELECT col_1,
COALESCE(col_2, 0) col_2
FROM table_1;
Here is how this works:
This works similarly to the primary solution, except we use COALESCE
instead of IFNULL
. We will
describe COALESCE
in more details in Coalesce section.
Extension: Replace with Summary
SELECT col_1,
IFNULL(col_2, (SELECT AVG(col_2) FROM table_1)) col_2
FROM table_1;
Here is how this works:
IFNULL()
to replace missing values as described above. However, instead of passing a
constant value, we compute and pass the mean of the values of the column col_2
using the
subquery (SELECT AVG(col_2) FROM table_1)
.AVG()
ignores missing values by default.Extension: Replace with Group Summary
SELECT col_1,
IFNULL(col_2, AVG(col_2) OVER (PARTITION BY col_1)) col_2
FROM table_1;
Here is how this works:
AVG()
as a window function
and partition by col_1
.AVG(col_2) OVER (PARTITION BY col_1)
will compute the mean value
of col_2
for each group. Then IFNULL()
will use that group-wise mean value to fill in missing
values in the respective group.We wish to fill in missing values with the previous non-missing value or the next non-missing value.
Previous
In this example, we wish to fill in missing values in the column col_2
with the first prior
non-missing value according to the order of column col_1
.
SELECT col_1,
IFNULL(col_2, LAST_VALUE(col_2 IGNORE NULLS)
OVER (ORDER BY col_1 ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)) AS col_2
FROM table_1;
Here is how this works:
IFNULL()
to fill in missing values in col_2
, with the first previous
non-missing value using LAST_VALUE()
window function.LAST_VALUE
is a window function that returns the last value in a window of rows.
The IGNORE NULLS
option is to skip any NULL
values and return the last non-null value instead.OVER
is used to define the window within which the function should operate. In this case, we are
using ORDER BY col_1
to specify that the rows should be ordered by the values in col_1
. The
ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
clause specifies the range of rows to include
in the window. In this example, include the current row and all the rows that come before it.UNBOUNDED PRECEDING
means that the window starts with the first row in the result set, and
1 PRECEDING
means that the window includes all rows up to and including the row immediately
preceding the current row.LAST_VALUE
will return the last non-null value that appears in any row before
the current row, in the order defined by ORDER BY col_1
.table_1
but where any
missing value in the column col_2
is filled with the first previous non-missing value.Next
In this example, we wish to fill in missing values in the column col_2
with the first next
non-missing value according to the order of column col_1
.
SELECT col_1,
IFNULL(col_2, FIRST_VALUE(col_2 IGNORE NULLS)
OVER (ORDER BY col_1 ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING)) AS col_2
FROM table_1;
Here is how this works:
IFNULL()
to fill in missing values in col_2
, with the first bext
non-missing value using FIRST_VALUE()
window function.FIRST_VALUE
is a window function that returns the first value in a window of rows.
The IGNORE NULLS
option is to skip any NULL
values and return the furst non-null value
instead.OVER
is used to define the window within which the function should operate. In this case, we are
using ORDER BY col_1
to specify that the rows should be ordered by the values in col_1
. The
ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING)
clause specifies the range of rows to include
in the window. In this example, include the current row and all the rows that come after it.1 FOLLOWING
means that the window starts with the next row after the current one,
and UNBOUNDED FOLLOWING
means that the window includes all rows after the current one.FIRST_VALUE
will return the first non-null value that appears in any row after
the current row, in the order defined by ORDER BY col_1
.Extension: Grouped Padding
In this example, we wish to fill in missing values in the column col_3
with the first next
non-missing value in its group where the groups are defined by the value of the column col_1
. Next
value is according to the order of column col_2
SELECT col_1,
col_2,
COALESCE(col_3,
FIRST_VALUE(col_3 IGNORE NULLS)
OVER (PARTITION BY col_1 ORDER BY col_2 ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING)) AS col_3
FROM table_1;
Here is how this works:
PARTITION BY col_1
in the OVER()
clause.FIRST_VALUE()
with OVER (PARTITION BY col_1 ..)
will fill in missing
values in the column col_3
with other values from within the group.We wish to fill in the missing values in a column or a vector with the corresponding values from another column; a process often referred to as coalescing.
In this example, we wish to fill in any missing values in the column col_2
with the corresponding
values of the column col_3
.
SELECT col_1,
COALESCE(col_2, col_3) AS col_2,
col_3
FROM table_1;
Here is how this works:
COALESCE()
to fill in any missing values in the column col_2
with the corresponding values of the column col_3
.table_1
but
with any missing values in the column col_2
with the corresponding values of the column col_3
.Extension: Coalesce with Multiple Columns
SELECT col_1,
COALESCE(col_2, col_3, col_4) AS col_2,
col_3,
col_4
FROM table_1;
Here is how this works:
COALESCE
.col_2
with the first non-missing value of the remaining
columns. In this example, if col_3
is not NULL, it will fill col_2
with the value of col_3
,
otherwise it will use the value in col_4
.Extension: Coalesce with Different tables
We wish to have missing values in one table be filled with corresponding values from another table.
SELECT t1.col_1,
COALESCE(t1.col_2, t2.col_2) AS col_2,
COALESCE(t1.col_3, t2.col_3) AS col_3,
FROM table_1 AS t1
LEFT JOIN table_2 AS t2 ON t1.col_1 = t2.col_1;
Here is how this works:
table_1
and table_2
by appropriate join columns, which here
is col_1
. See Joining.COALESCE()
and use the tables aliases to select which column to coalesce. In this
example, we coalesce col_2
and col_3
from table_1 with col_2
and col_3
from table_2
.