Drop Missing

We wish to drop rows or columns that contain missing values.

We cover the following scenarios:

  • Rows: Drop rows with missing values from a data frame.
    • Any Column: Drop rows where any column has a missing value.
    • Selected Columns: Drop rows where any or all of a selected set of columns has a missing value.
    • Empty Rows: Drop rows where all columns have missing values.
    • Absolute Threshold: Drop rows that have more than a specified number of columns with missing values.
    • Relative Threshold: Drop rows that have more than a specified percent of columns with missing values.
  • Columns: Drop columns with missing values from a data frame.
    • Any Row: Drop columns where any row has a missing value.
    • Empty Columns: Drop columns where all values are missing.
    • Absolute Threshold: Drop columns that have more than a specified number of missing values.
    • Relative Threshold: Drop columns that have more than a specified percent of missing values.

Note that this section assumes that missing values are encoded as NA. See Missing Value Encoding.

Rows

We wish to drop rows with missing values from a data frame.

Any Columns

We wish to drop rows where any column has a missing value.

df_2 = df %>% drop_na()

Here is how this works:

  • We use the function drop_na() from tidyr to drop rows where any column has a missing value.
  • The output data frame df_2 will have the same columns as the input data frame df but with any row that has one or more missing values dropped.

Selected Columns

We wish to drop rows where any column from a selected set of columns has a missing value.

In this example, we wish to drop any rows from the data frame df where either the column col_1 or the column col_2 has a missing value.

df_2 = df %>% 
    drop_na(col_1, col_2) 

Here is how this works:

  • We pass to drop_na() the names of the set of columns that we wish to limit the check for missing values to.
  • The output data frame df_2 will have the same columns as the input data frame df but with any row, where one or more of the specified columns has a missing value, dropped.

Extension: Each Selected Column

We wish to drop rows if and only if all the values of the specified columns are missing.

df_2 = df %>%   
    filter(!if_all(c(col_1, col_2), is.na))

Here is how this works:

  • We use if_all() to apply is.na() to each of the columns of interest, which here are col_1 and col_2, and to AND the results. We then pass the compliment ! of the output to filter() to drop rows where all the specified columns are missing. See Implicit Filtering.
  • The output data frame df_2 will have the same columns as the input data frame df but with any row where all the specified columns, which here are col_1 and col_2, are missing is dropped.

Empty Rows

We wish to drop rows where all columns have missing value (a value of NA).

df_2 = df %>%
   filter(!if_all(everything(), is.na))

Here is how this works:

  • We use if_all() to apply is.na() to each of the columns of interest, which here are all columns hence everything(), and to AND the results. We then pass the compliment ! of the output to filter() to drop rows where all the specified columns are missing. See Implicit Filtering.
  • The output data frame df_2 will have the same columns as the input data frame df but with any row where all columns have missing values NA is dropped.

Alternative: via the Janitor Package

library(janitor)
df_2 = df %>%
    remove_empty(which = 'rows')

Here is how this works:

  • We can use the convenient remove_empty() function, provided by the janitor package, to drop empty rows.
  • By default remove_empty() drops both empty rows and columns. To specify that only empty rows be dropped, we set the argument which to which = “rows”.
  • The output data frame df_2 will have the same columns as the input data frame df but with any row where all columns have missing values NA is dropped.

Absolute Threshold

We wish to drop rows that have more than a specified number of columns with missing values.

In this example, we wish to retain rows where the values of less than three columns are missing NA.

df_2 = df %>%
   filter(rowSums(is.na(.)) < 3)

Here is how this works:

  • In is.na(.), we produce a matrix of the same size as the data frame that is piped into filter() and referred to via ., where an element is TRUE if the corresponding element of the data frame is missing and FALSE otherwise.
  • We then use rowSums() from base R to compute the sum of each row of the logical matrix resulting from is.na() to produce a vector that has as many elements as the rows of the data frame and where each element holds the number of NA values in the corresponding row as an integer. See Working with Logical Data.
  • We then compare with the threshold, which here is < 3, inside filter() to retain only rows where the values of less than three columns are missing NA.
  • The output data frame df_2 will have the same columns as the input data frame df but with only rows where the values of less than three columns are missing NA are retained.

Relative Threshold

We wish to drop rows that have more than a specified percent of columns with missing values.

In this example, we wish to retain rows where the values of less than half the columns are missing NA.

df_2 = df %>%
   filter(rowMeans(is.na(.)) < 0.5)

Here is how this works:

This works similarly to the “Absolute Threshold” scenario above, except that we use rowMeans() to compute the proportion of columns that have missing values for each row. See Working with Logical Data.

Columns

We wish to drop rows with missing values from a data frame.

Any Row

We wish to drop columns where any row has a missing value.

df_2 = df %>%
    select(where(~!any(is.na(.))))

Here is how this works:

  • We use where(), a helper of select(), to compute a predicate function, i.e. one that returns TRUE or FALSE, for each column of the data frame df. See Implicit Selection.
  • The anonymous function is ~!any(is.na(.)) is applied to each column and returns TRUE if any of the column's values is NA.
  • The output data frame df_2 has the same number of rows as the input data frame df but where any column, that has one or more missing values NA, is removed.

Empty Columns

We wish to drop columns that are entirely NA; i.e. we wish to drop columns that take a value of NA for all rows.

df_2 = df %>%
    select(where(~!all(is.na(.))))

Here is how this works:

This works similarly to the “Any Row” scenario above except that we use all() to select columns where all values are missing (i.e.NA).

Alternative: via the Janitor Package

library(janitor)
df_2 = df %>%
    remove_empty(which = 'cols')

Here is how this works:

  • We can use the convenient remove_empty() function, provided by the janitor package, to drop empty columns.
  • By default remove_empty() drops both empty rows and columns. To specify that only empty columns be dropped, we set the argument which to which = “cols”.
  • The output data frame df_2 will have the same columns as the input data frame df but with any column where all values are missing (i.e. NA) is dropped.

Absolute Threshold

We wish to drop columns that have more than a specified number of missing values.

In this example, we wish to retain columns where there is less than three missing values.

df_2 = df %>%   
    select(where(~sum(is.na(.)) < 3))

Here is how this works:

  • This works similarly to the “Any Row” scenario above except that we use a different anonymous predicate function.
  • The anonymous predicate function is ~sum(is.na(.)) < 3 and is applied to each column individually. It returns TRUE for columns that have less than 3 NA values.
  • The sum of a vector of logical values is the number of TRUE values. See Working with Logical Data.
  • The output data frame df_2 has the same number of rows as the input data frame df but where any column, that has three or more missing values (i.e. NA), is removed.

Relative Threshold

We wish to drop columns with more than a specified percentage of missing values.

In this example, we wish to retain columns where less than half the values are missing.

df_2 = df %>% 
    select(where(~mean(is.na(.x)) < 0.5))

Here is how this works:

This works similarly to the “Absolute Threshold” scenario above, except that we use mean() to compute the proportion of missing values for each column. See Working with Logical Data.

R
I/O