Fill Missing

We wish to fill in missing values with suitable values. This is often referred to as imputation.

In this section, we cover the most common approaches to filling in missing values, which are:

  • Replace: Fill in missing values with a static or summary value, such as the mean of the column.
  • Pad: Fill in missing values with the previous or the next value.
  • Coalesce: Fill in the missing values in a column or a vector with the corresponding values from another column or vector of the same size.
  • Interpolate: Fill in the missing value by interpolating between the surrounding values. There are many interpolation methods such as linear interpolation and spline interpolation.
  • Complete: Complete a data frame with missing combinations of data.

Replace

We wish to fill in missing values in a column of a data frame with a static or a summary value.

In this example, we wish to replace any NA values in the column col_2 of the data frame df with 0.

df_2 = df %>% 
  mutate(
    col_2 = replace_na(col_2, 0)
  )

Here is how this works:

  • We use the function replace_na() from tidyr to replace NAs with a specified value.
  • In this example, the function replace_na() takes as input:
    • The column whose missing values we wish to replace, which in this case is col_2 and is referred to directly since we are operating within mutate().
    • The value that we want to use to fill in the missing values, which here is 0.
  • See “Extension: Entire Data Frame” below for how to apply this to all columns at once.

Extension: Replace with Summary

df_2 = df %>% 
  mutate(
    col_2 = replace_na(col_2, mean(col_2, na.rm=TRUE))
  )

Here is how this works:

  • We use replace_na() inside mutate() to replace missing values as described above. However, instead of passing a constant value, we compute and pass the mean of the values of the column col_2 via mean(col_2, na.rm=TRUE).
  • Note that we must set na.rm=TRUE since we know that the column has missing values (which we are trying to fill in) and mean() would just return NA unless we set na.rm=TRUE.
  • See Implicit Transformation for how to apply this to many columns without spelling each transformation out explicitly.

Extension: Replace with Group Summary

df_2 = df %>% 
  group_by(col_1) %>%
  mutate(
    col_2 = replace_na(col_2, mean(col_2, na.rm=TRUE))
  )

Here is how this works:

  • This works similarly to the primary solution above except that we call replace_na() (inside mutate()) after we run group_by() on the input data frame so that replace_na() acts on each group separately.
  • In particular, calling mean(col_2, na.rm=TRUE) will compute the mean value of col_2 for each group. Then replace_na() will use that group-wise mean value to fill in missing values in the respective group.
  • See Implicit Grouped Transformation for how to apply this to many columns.

Extension: Entire Data Frame

df_2 = df %>%
  mutate(across(everything(), ~replace_na(., 0)))

Here is how this works:

  • We use across() inside mutate() to apply the transformation ~replace_na(., 0) to multiple columns. We select all columns via everything(). See Implicit Transformation.
  • This assumes that the column data types are the same as the replacement value. If that is not the case, see “Extension: Multiple Columns with Different Replacements” below.

Extension: Multiple Columns with Different Replacements

We wish to replace all missing elements in columns col_1, col_2 and col_3, with ‘z’, 0, and -1 respectively.

replacements = list(col_1 = 'z', col_2 = 0, col_3 = -1)

df_2 = df %>%
  replace_na(replacements)

Here is how this works:

In order to replace missing values in different columns with different values, we can pass to replace_na() a list of named values where the name is the column name and the value is what we wish to use to replace missing values in the named column.

Pad

We wish to fill in missing values with the previous non-missing value or the next non-missing value.

In this example, we wish to fill in missing values in the column col_2 with the first prior non-missing value assuming the data frame is ordered by the column col_1.

df_2 = df %>%   
  arrange(col_1) %>%   
  fill(col_2)

Here is how this works:

  • We use the function fill() from tidyr to fill in missing values in one or more specified column(s), which here is col_2, with the first previous non-missing value.
  • Note that the function fill() has an argument .direction (note the dot . in the name) that specifies whether:
    • a missing value is filled with the first previous non-missing value .direction=”down” (the default and what we use in this example)
    • or is filled with the first following non-missing value .direction=”up”.
  • It is often important to ensure that the data frame is ordered in an appropriate way before we fill missing values with previous or next values. In this example, we do that by first arranging in ascending order of col_1 in arrange(col_1) before we run fill(col_2).
  • The output data frame df_2 will have the same columns and the same rows as the data frame df but where any missing value in the column col_2 is filled with the first previous non-missing value.

Extension: Grouped Padding

In this example, we wish to fill in missing values in the column col_3 with the first prior non-missing value in its group where the groups are defined by the value of the column col_1. We are assuming that the data frame is ordered by the column col_2.

df_2 = df %>%
  group_by(col_1) %>%
  arrange(col_2) %>%
  fill(col_3, .direction = "downup")

Here is how this works:

  • This works similarly to the primary solution above except that we call fill() after we run group_by() on the input data frame so that fill() acts on each group separately.
  • In particular, calling fill(col_3) after group_by(col_1) will fill in missing values in the column col_3 with other values from within the group.
  • Setting the argument .direction of fill() to .direction = "downup" specifies that NAs are filled with the first previous non-missing values then NAs at the beginning of each group will be filled with the first following non-NA value. We can switch this behavior by setting .direction = "updown".
  • It is often important to ensure that the data frame is ordered in an appropriate way before we fill in missing values with previous or next values. In this example, we do that by first arranging in ascending order of col_2 in arrange(col_2) before we run fill(col_3).

Coalesce

We wish to fill in the missing values in a column or a vector with the corresponding values from another column or vector of the same size; a process often referred to as coalescing.

In this example, we wish to fill in any missing values in the column col_2 with the corresponding values of the column col_3.

df_2 = df %>%
  mutate(col_2 = coalesce(col_2, col_3))

Here is how this works:

  • We use the function coalesce() from dplyr to fill in any missing values in the column col_2 with the corresponding values of the column col_3.
  • Given a set of vectors, coalesce() finds the first non-missing value at each position.
  • The output data frame df_2 will have the same rows and columns of the input data frame df but with any missing values in the column col_2 with the corresponding values of the column col_3.

Extension: Coalesce with Different Data Frames

We wish to have missing values in one data frame be filled with corresponding values from another data frame.

df = df_1 %>%  
  left_join(df_2, by = "col_1", suffix = c('_x', '_y')) %>%  
  mutate(    
    col_2 = coalesce(col_2_x, col_2_y),    
    col_3 = coalesce(col_3_x, col_3_y)) %>%  
  select(-matches("_[xy]"))

Here is how this works:

  • We first left join the two data frames df_1 and df_2 by appropriate join columns, which here is col_1. We also specify how columns with the same name should be renamed by setting the suffix argument. See Joining.
  • When coalescing two different data structures, we need to ensure that both vectors are aligned. The left join operation ensures that that is the case.
  • We then use coalesce() inside mutate() as described in the primary solution above.
  • Finally, we drop the original columns, so we are left with a data frame of the same rows and columns as the original df_1 data frame but with columns any missing values in columns col_2 and col_3 filled with the corresponding values form the second data frame df_2.
  • See Dynamic Transformation for how to extend this to apply to all pairs of columns (_x and _y) without explicitly naming the pair of columns.

Interpolate

We wish to fill in missing values by linearly interpolating between the surrounding values.

In this example, we wish to fill in missing values in the column col_2 by linearly interpolating between the surrounding non-missing values assuming the data frame is sorted by the column col_1.

library(imputeTS)

df_2 = df %>%  
  arrange(col_1) %>%
  mutate(col_2 = na_interpolation(col_2))

Here is how this works:

  • We use the function na_interpolation() from the package imputeTS to fill in missing values with a linear interpolation between surrounding non-missing values.
  • Missing values at the beginning or the end of the column (or vector) are filled with the first and last non-missing values respectively.
  • To fill in the missing values in a column via interpolation, we should ensure that three assumptions on the column being interpolated are met:
    1. It is numeric
    2. The values are equally spaced
    3. It is sorted in an appropriate manner
  • While there are interpolation functions in base R, as well as other packages such as zoo, forecast and mice, we recommend na_interpolation() from imputeTS because it is the most tidyverse-like; i.e. it fits well in dplyr chains and requires no overhead (some of the options require special syntax).
  • The output data frame df_2 will have the same rows and columns as the input data frame df but with missing values of the column col_2 filled by linearly interpolating between the surrounding non-missing values.

Extension: Grouped Interpolation

We wish to fill in missing values by linearly interpolating between the surrounding values. We wish to have the interpolation carried out inside each group separately.

library(imputeTS)

df_2 = df %>%  
  group_by(col_1) %>%
  arrange(col_2) %>%
  mutate(col_3 = na_interpolation(col_3))

Here is how this works:

  • We call group_by() prior to mutate() so that interpolation is carried out on each group separately.
  • The first and last non-missing values are repeated to fill any missing values at the beginning or end of the group respectively.

Extension: Non-Linear Interpolation

We wish to fill in missing values via a non-linear interpolation between the surrounding values.

library(imputeTS)

df_2 = df %>%  
  mutate(col_2 = na_interpolation(col_2, option = "spline"))

Here is how this works:

  • This works similarly to the primary solution above except that we pass to the option argument of na_interpolation() to option = "spline" to perform a non-linear spline interpolation.
  • See the documentation of the imputeTS package here for a description of other non-linear interpolation options for filling in missing values.

Complete

We wish to complete a data frame by adding rows that hold any missing combinations of the values of a selected set of columns.

In this example, we wish to have the data frame df include rows that cover all combinations of the values of columns col_1 and col_2. We wish to fill in the missing values of the column col_3 is any newly created rows with 0.

df_2 = df %>% 
    complete(col_1, col_2, fill = list(col_3 = 0))

Here is how this works:

  • In this example, we use complete() to add rows to the data frame df so that all combinations of the values of the specified columns, which here are col_1 and col_2, are present.
  • The function complete() takes as input:
    • The data frame that we wish to complete, which here is df and is passed via the pipe %>%
    • The names of the columns whose combinations we wish to cover, which here are col_1 and col_2
  • In addition, we pass to complete() the values to use to fill in the missing values of any columns that we did not complete over. In this case, there is one such column col_3 and we wish to fill in the missing values in any newly added rows with the integer 0.
  • The output data frame df_2 will have the same columns as df but with rows for all possible combinations of the values of the columns col_1 and col_2 with the value of col_3 set to 0 for any rows that were added to the input data frame df.

Extension: Complete per Group

We wish to complete a grouped data frame by adding rows that hold any missing combinations of the values of a selected set of columns that exist in the group.

In this example, we wish to have the data frame df include rows that cover all combinations of the values of columns col_2 and col_3 that exist in each group, where the data frame is grouped by the column col_1.

df_2 = df %>%  
  group_by(col_1) %>%  
  complete(col_2, col_3)

Here is how this works:

  • This works similarly to the primary solution above except that we call complete() after we run group_by() on the input data frame so that complete() acts on each group separately.
  • In particular, calling complete(col_2, col_3) after group_by(col_1) will add rows for missing combinations of the values of col_2 and col_3 within each group.
  • The value of any other columns in the input data frame df besides col_1, col_2 and col_3 with have a value of NA for all the newly added rows.
  • The output data frame df_2 will have the same columns as df but will have rows for all possible combinations of the values of the columns col_2 and col_3 that exist in each group (grouped by col_1). The values of other columns will be NA in any newly created rows.
R
I/O