We wish to fill in missing values with suitable values. This is often referred to as imputation.
In this section, we cover the most common approaches to filling in missing values, which are:
We wish to fill in missing values in a column of a data frame with a static or a summary value.
In this example, we wish to replace any NA
values in the column col_2
of the data frame df
with 0.
df_2 = df %>%
mutate(
col_2 = replace_na(col_2, 0)
)
Here is how this works:
replace_na()
from tidyr
to replace NA
s with a specified value.replace_na()
takes as input:col_2
and is referred to directly since we are operating within mutate()
.Extension: Replace with Summary
df_2 = df %>%
mutate(
col_2 = replace_na(col_2, mean(col_2, na.rm=TRUE))
)
Here is how this works:
replace_na()
inside mutate()
to replace missing values as described above. However, instead of passing a constant value, we compute and pass the mean of the values of the column col_2
via mean(col_2, na.rm=TRUE)
.na.rm=TRUE
since we know that the column has missing values (which we are trying to fill in) and mean()
would just return NA
unless we set na.rm=TRUE
.Extension: Replace with Group Summary
df_2 = df %>%
group_by(col_1) %>%
mutate(
col_2 = replace_na(col_2, mean(col_2, na.rm=TRUE))
)
Here is how this works:
replace_na()
(inside mutate()
) after we run group_by()
on the input data frame so that replace_na()
acts on each group separately.mean(col_2, na.rm=TRUE)
will compute the mean value of col_2
for each group. Then replace_na()
will use that group-wise mean value to fill in missing values in the respective group.Extension: Entire Data Frame
df_2 = df %>%
mutate(across(everything(), ~replace_na(., 0)))
Here is how this works:
across()
inside mutate()
to apply the transformation ~replace_na(., 0)
to multiple columns. We select all columns via everything()
. See Implicit Transformation.Extension: Multiple Columns with Different Replacements
We wish to replace all missing elements in columns col_1
, col_2
and col_3
, with ‘z’, 0, and -1 respectively.
replacements = list(col_1 = 'z', col_2 = 0, col_3 = -1)
df_2 = df %>%
replace_na(replacements)
Here is how this works:
In order to replace missing values in different columns with different values, we can pass to replace_na()
a list of named values where the name is the column name and the value is what we wish to use to replace missing values in the named column.
We wish to fill in missing values with the previous non-missing value or the next non-missing value.
In this example, we wish to fill in missing values in the column col_2
with the first prior non-missing value assuming the data frame is ordered by the column col_1
.
df_2 = df %>%
arrange(col_1) %>%
fill(col_2)
Here is how this works:
fill()
from tidyr
to fill in missing values in one or more specified column(s), which here is col_2
, with the first previous non-missing value.fill()
has an argument .direction
(note the dot .
in the name) that specifies whether:.direction=”down”
(the default and what we use in this example).direction=”up”
.col_1
in arrange(col_1)
before we run fill(col_2)
.df_2
will have the same columns and the same rows as the data frame df
but where any missing value in the column col_2
is filled with the first previous non-missing value.Extension: Grouped Padding
In this example, we wish to fill in missing values in the column col_3
with the first prior non-missing value in its group where the groups are defined by the value of the column col_1
. We are assuming that the data frame is ordered by the column col_2
.
df_2 = df %>%
group_by(col_1) %>%
arrange(col_2) %>%
fill(col_3, .direction = "downup")
Here is how this works:
fill()
after we run group_by()
on the input data frame so that fill()
acts on each group separately.fill(col_3)
after group_by(col_1)
will fill in missing values in the column col_3
with other values from within the group..direction
of fill()
to .direction = "downup"
specifies that NAs are filled with the first previous non-missing values then NAs at the beginning of each group will be filled with the first following non-NA value. We can switch this behavior by setting .direction = "updown"
.col_2
in arrange(col_2)
before we run fill(col_3)
.We wish to fill in the missing values in a column or a vector with the corresponding values from another column or vector of the same size; a process often referred to as coalescing.
In this example, we wish to fill in any missing values in the column col_2
with the corresponding values of the column col_3
.
df_2 = df %>%
mutate(col_2 = coalesce(col_2, col_3))
Here is how this works:
coalesce()
from dplyr
to fill in any missing values in the column col_2
with the corresponding values of the column col_3
.coalesce()
finds the first non-missing value at each position.df_2
will have the same rows and columns of the input data frame df
but with any missing values in the column col_2
with the corresponding values of the column col_3
.Extension: Coalesce with Different Data Frames
We wish to have missing values in one data frame be filled with corresponding values from another data frame.
df = df_1 %>%
left_join(df_2, by = "col_1", suffix = c('_x', '_y')) %>%
mutate(
col_2 = coalesce(col_2_x, col_2_y),
col_3 = coalesce(col_3_x, col_3_y)) %>%
select(-matches("_[xy]"))
Here is how this works:
df_1
and df_2
by appropriate join columns, which here is col_1
. We also specify how columns with the same name should be renamed by setting the suffix
argument. See Joining.coalesce()
inside mutate()
as described in the primary solution above.df_1
data frame but with columns any missing values in columns col_2
and col_3
filled with the corresponding values form the second data frame df_2
._x
and _y
) without explicitly naming the pair of columns.We wish to fill in missing values by linearly interpolating between the surrounding values.
In this example, we wish to fill in missing values in the column col_2
by linearly interpolating between the surrounding non-missing values assuming the data frame is sorted by the column col_1
.
library(imputeTS)
df_2 = df %>%
arrange(col_1) %>%
mutate(col_2 = na_interpolation(col_2))
Here is how this works:
na_interpolation()
from the package imputeTS
to fill in missing values with a linear interpolation between surrounding non-missing values.zoo
, forecast
and mice
, we recommend na_interpolation()
from imputeTS
because it is the most tidyverse-like; i.e. it fits well in dplyr
chains and requires no overhead (some of the options require special syntax).df_2
will have the same rows and columns as the input data frame df
but with missing values of the column col_2
filled by linearly interpolating between the surrounding non-missing values.Extension: Grouped Interpolation
We wish to fill in missing values by linearly interpolating between the surrounding values. We wish to have the interpolation carried out inside each group separately.
library(imputeTS)
df_2 = df %>%
group_by(col_1) %>%
arrange(col_2) %>%
mutate(col_3 = na_interpolation(col_3))
Here is how this works:
group_by()
prior to mutate()
so that interpolation is carried out on each group separately.Extension: Non-Linear Interpolation
We wish to fill in missing values via a non-linear interpolation between the surrounding values.
library(imputeTS)
df_2 = df %>%
mutate(col_2 = na_interpolation(col_2, option = "spline"))
Here is how this works:
option
argument of na_interpolation()
to option = "spline"
to perform a non-linear spline interpolation.imputeTS
package here for a description of other non-linear interpolation options for filling in missing values.We wish to complete a data frame by adding rows that hold any missing combinations of the values of a selected set of columns.
In this example, we wish to have the data frame df
include rows that cover all combinations of the values of columns col_1
and col_2
. We wish to fill in the missing values of the column col_3
is any newly created rows with 0.
df_2 = df %>%
complete(col_1, col_2, fill = list(col_3 = 0))
Here is how this works:
complete()
to add rows to the data frame df
so that all combinations of the values of the specified columns, which here are col_1
and col_2
, are present.complete()
takes as input:df
and is passed via the pipe %>%
col_1
and col_2
complete()
the values to use to fill in the missing values of any columns that we did not complete over. In this case, there is one such column col_3
and we wish to fill in the missing values in any newly added rows with the integer 0.df_2
will have the same columns as df
but with rows for all possible combinations of the values of the columns col_1
and col_2
with the value of col_3
set to 0 for any rows that were added to the input data frame df
.Extension: Complete per Group
We wish to complete a grouped data frame by adding rows that hold any missing combinations of the values of a selected set of columns that exist in the group.
In this example, we wish to have the data frame df
include rows that cover all combinations of the values of columns col_2
and col_3
that exist in each group, where the data frame is grouped by the column col_1
.
df_2 = df %>%
group_by(col_1) %>%
complete(col_2, col_3)
Here is how this works:
complete()
after we run group_by()
on the input data frame so that complete()
acts on each group separately.complete(col_2, col_3)
after group_by(col_1)
will add rows for missing combinations of the values of col_2
and col_3
within each group.df
besides col_1
, col_2
and col_3
with have a value of NA
for all the newly added rows.df_2
will have the same columns as df
but will have rows for all possible combinations of the values of the columns col_2
and col_3
that exist in each group (grouped by col_1
). The values of other columns will be NA
in any newly created rows.