On this page, we look at how to inspect missing values in a data frame. In particular, we will cover the following common scenarios:
Summary of the state of missing values:
For each of these scenarios, we look at how to summarize the state of missing values over the entire data frame, over a particular column, and over each column.
Look at the data around missing values:
Note: Our coverage here will assume that missing values are encoded correctly as NA
. Occasionally, that may not be the case and we would need to encode missing values as such before checking for them. We will cover encoding missing values in General Operations.
Dealing with missing values is an extensive topic. In General Operations, we will cover more aspects of missing value management including dropping missing values, imputing missing values, and missing value encoding.
We wish to know if there are any missing values in the first place.
Data Frame
We wish to know if a data frame has any missing values.
df %>% is.na() %>% any()
Here is how this works:
df
to the function is.na()
from base R.is.na()
returns a boolean matrix of the same size (dimensions) as the given data frame (here df
) where it takes a value of TRUE
if the corresponding value is missing and FALSE
otherwise.is.na()
to any()
which returns TRUE
if one or more values of the input are TRUE
(i.e. there are any missing values) and FALSE
if none of the values of the input are TRUE (i.e. no missing values).Particular Column
We wish to know if a particular column has any missing values.
df %>% pull(col_1) %>% is.na() %>% any()
Here is how this works:
pull()
to extract the particular column we are interested in inspecting.Each Column
We wish to know which columns of the data frame have missing values.
df %>% map_dfc(is.na) %>% map_lgl(any)
Here is how this works:
map_dfc()
from purrr
(tidyverse
package) to apply is.na()
to each column of the data frame df
. The output is a data frame of boolean values of the same size as the input data frame (here df
).map_lgl()
to apply any()
to each column of the boolean data frame returned by map_dfc(is.na)
. The output is a boolean vector with one boolean value for each column of the data frame df
that is TRUE
if the column has any missing values and FALSE
otherwise.We wish to know how many missing values there are.
Data Frame
We wish to obtain the number of missing values in a data frame.
df %>% is.na() %>% sum()
Here is how this works:
sum()
instead of any()
.is.na()
to sum()
to obtain the number of missing values in the data frame df
.TRUE
values (which in this case is the number of missing values) because sum()
regards TRUE
as 1
and FALSE
as 0
.Particular Column
We wish to get the number of missing values in a particular column.
df %>% pull(col_1) %>% is.na() %>% sum()
Here is how this works:
pull()
to extract the particular column we are interested in inspecting.Each Column
We wish to obtain the number of missing values per column for each column in a data frame.
df %>% map_dfc(is.na) %>% map_dbl(sum) %>% sort(decreasing = TRUE)
Here is how this works:
any()
, we use sum()
to count the number of missing values per column.sort(decreasing = TRUE)
.We wish to get the proportion (percentage) of values that are missing.
Data Frame
We wish to know the proportion of a data frame’s values that are missing.
df %>% is.na() %>% mean()
Here is how this works:
mean()
instead of any()
.is.na()
to mean()
to obtain the proportion of missing values in the data frame df
.mean()
regards TRUE
as 1
and FALSE
as 0
. Therefore, taking the mean of a boolean vector is basically taking the ratio between the number of TRUE
values (which in this case is the number of missing values) and the total number of values in the data frame df
.Particular Column
We wish to know the proportion of a column’s values that are missing.
df %>% pull(col_1) %>% is.na() %>% mean()
Here is how this works:
pull()
to extract the particular column we are interested in inspecting.Each Column
We wish to obtain the proportion of missing values for each column in a data frame.
df %>% map_dfc([is.na](http://is.na/)) %>% map_dbl(mean) %>% sort(decreasing = TRUE)
Here is how this works:
sum()
, we use mean()
to count the number of missing values per column.sort(decreasing = TRUE)
to sort the output in descending order.Any
We wish to get rows where any column has a missing value.
df %>% filter(if_any(everything(), is.na))
Here is how this works:
df
to filter()
to select rows meeting a certain condition.if_any()
applies a given predicate function (a function that returns TRUE or FALSE in this case is.na()
) to a selection of columns and combines the results into a single boolean vector with as many elements as there are rows in the input data frame.everything()
to select all columns i.e. to include all columns when applying is.na()
to check for missing values.Count
Get rows where more than n
columns have missing values. In this example we extract any row where more than 3
columns have missing values.
df %>% filter(pmap_int(., ~sum(is.na(c(...))))>3)
Here is how this works:
pmap_int()
to iterate over each row of the data frame and count the number of missing values. The output of pmap_int()
is a vector of integer with as many elements as there are rows in the data frame df
.pmap_int()
with the threshold of interest which in this case is 3
. The output is a boolean vector which is passed to filter()
to get the rows where the number of missing values is more than 3.pmap_int()
is part of the pmap
family of functions from the package purrr
which map a function over the values of n
lists (here one list corresponds to each column of the data frame).~sum([is.na](http://is.na/)(c(...)))
which counts the number of missing values (NAs
) in each row of the data frame.Get columns where more than a given percent of values are missing. In this example, we get columns where more than 10%
of the values are missing.
df %>% select(where(~ mean(is.na(.x)) > 0.1))
Here is how this works:
df
to select()
to select columns meeting a certain condition.~
to iterate over all columns and check if more than 10%
of the values of the column are missing via ~ mean(is.na(.x)) > 0.1
. Here .x
denotes the current column being evaluated.where()
returns the column names of those for which ~ mean(is.na(.x)) > 0.1
returns TRUE
which is used by select()
to return the columns as a data frame.