We wish to drop rows or columns that contain missing values.
We cover the following scenarios:
Note that this section assumes that missing values are encoded as NA
. See Missing Value Encoding.
We wish to drop rows with missing values from a data frame.
Any Columns
We wish to drop rows where any column has a missing value.
df_2 = df %>% drop_na()
Here is how this works:
drop_na()
from tidyr
to drop rows where any column has a missing value.df_2
will have the same columns as the input data frame df
but with any row that has one or more missing values dropped.Selected Columns
We wish to drop rows where any column from a selected set of columns has a missing value.
In this example, we wish to drop any rows from the data frame df
where either the column col_1
or the column col_2
has a missing value.
df_2 = df %>%
drop_na(col_1, col_2)
Here is how this works:
drop_na()
the names of the set of columns that we wish to limit the check for missing values to.df_2
will have the same columns as the input data frame df
but with any row, where one or more of the specified columns has a missing value, dropped.Extension: Each Selected Column
We wish to drop rows if and only if all the values of the specified columns are missing.
df_2 = df %>%
filter(!if_all(c(col_1, col_2), is.na))
Here is how this works:
if_all()
to apply is.na()
to each of the columns of interest, which here are col_1
and col_2
, and to AND the results. We then pass the compliment !
of the output to filter()
to drop rows where all the specified columns are missing. See Implicit Filtering.df_2
will have the same columns as the input data frame df
but with any row where all the specified columns, which here are col_1
and col_2
, are missing is dropped.Empty Rows
We wish to drop rows where all columns have missing value (a value of NA
).
df_2 = df %>%
filter(!if_all(everything(), is.na))
Here is how this works:
if_all()
to apply is.na()
to each of the columns of interest, which here are all columns hence everything()
, and to AND the results. We then pass the compliment !
of the output to filter()
to drop rows where all the specified columns are missing. See Implicit Filtering.df_2
will have the same columns as the input data frame df
but with any row where all columns have missing values NA
is dropped.Alternative: via the Janitor Package
library(janitor)
df_2 = df %>%
remove_empty(which = 'rows')
Here is how this works:
remove_empty()
function, provided by the janitor
package, to drop empty rows.remove_empty()
drops both empty rows and columns. To specify that only empty rows be dropped, we set the argument which
to which = “rows”
.df_2
will have the same columns as the input data frame df
but with any row where all columns have missing values NA
is dropped.Absolute Threshold
We wish to drop rows that have more than a specified number of columns with missing values.
In this example, we wish to retain rows where the values of less than three columns are missing NA
.
df_2 = df %>%
filter(rowSums(is.na(.)) < 3)
Here is how this works:
is.na(.)
, we produce a matrix of the same size as the data frame that is piped into filter()
and referred to via .
, where an element is TRUE
if the corresponding element of the data frame is missing and FALSE
otherwise.rowSums()
from base R to compute the sum of each row of the logical matrix resulting from is.na()
to produce a vector that has as many elements as the rows of the data frame and where each element holds the number of NA
values in the corresponding row as an integer. See Working with Logical Data.< 3
, inside filter()
to retain only rows where the values of less than three columns are missing NA
.df_2
will have the same columns as the input data frame df
but with only rows where the values of less than three columns are missing NA
are retained.Relative Threshold
We wish to drop rows that have more than a specified percent of columns with missing values.
In this example, we wish to retain rows where the values of less than half the columns are missing NA
.
df_2 = df %>%
filter(rowMeans(is.na(.)) < 0.5)
Here is how this works:
This works similarly to the “Absolute Threshold” scenario above, except that we use rowMeans()
to compute the proportion of columns that have missing values for each row. See Working with Logical Data.
We wish to drop rows with missing values from a data frame.
Any Row
We wish to drop columns where any row has a missing value.
df_2 = df %>%
select(where(~!any(is.na(.))))
Here is how this works:
where()
, a helper of select()
, to compute a predicate function, i.e. one that returns TRUE
or FALSE
, for each column of the data frame df
. See Implicit Selection.~!any(is.na(.))
is applied to each column and returns TRUE
if any of the column's values is NA
.df_2
has the same number of rows as the input data frame df
but where any column, that has one or more missing values NA
, is removed.Empty Columns
We wish to drop columns that are entirely NA
; i.e. we wish to drop columns that take a value of NA
for all rows.
df_2 = df %>%
select(where(~!all(is.na(.))))
Here is how this works:
This works similarly to the “Any Row” scenario above except that we use all()
to select columns where all values are missing (i.e.NA
).
Alternative: via the Janitor Package
library(janitor)
df_2 = df %>%
remove_empty(which = 'cols')
Here is how this works:
remove_empty()
function, provided by the janitor
package, to drop empty columns.remove_empty()
drops both empty rows and columns. To specify that only empty columns be dropped, we set the argument which
to which = “cols”
.df_2
will have the same columns as the input data frame df
but with any column where all values are missing (i.e. NA
) is dropped.Absolute Threshold
We wish to drop columns that have more than a specified number of missing values.
In this example, we wish to retain columns where there is less than three missing values.
df_2 = df %>%
select(where(~sum(is.na(.)) < 3))
Here is how this works:
~sum(is.na(.)) < 3
and is applied to each column individually. It returns TRUE
for columns that have less than 3 NA
values.TRUE
values. See Working with Logical Data.df_2
has the same number of rows as the input data frame df
but where any column, that has three or more missing values (i.e. NA
), is removed.Relative Threshold
We wish to drop columns with more than a specified percentage of missing values.
In this example, we wish to retain columns where less than half the values are missing.
df_2 = df %>%
select(where(~mean(is.na(.x)) < 0.5))
Here is how this works:
This works similarly to the “Absolute Threshold” scenario above, except that we use mean()
to compute the proportion of missing values for each column. See Working with Logical Data.