Missing Values

On this page, we look at how to inspect missing values in a data frame. In particular, we will cover the following common scenarios:

  • Summary of the state of missing values:

    • Any: Are there any missing values?
    • Count: How many missing values are there?
    • Proportion: What percent of values is missing?

    For each of these scenarios, we look at how to summarize the state of missing values over the entire data frame, over a particular column, and over each column.

  • Look at the data around missing values:

    • Get Rows: Return rows that have missing values.
    • Get Columns: Return columns that have missing values.

Note: Our coverage here will assume that missing values are encoded correctly as NA. Occasionally, that may not be the case and we would need to encode missing values as such before checking for them. We will cover encoding missing values in General Operations.

Dealing with missing values is an extensive topic. In General Operations, we will cover more aspects of missing value management including dropping missing values, imputing missing values, and missing value encoding.

Any

We wish to know if there are any missing values in the first place.

Data Frame

We wish to know if a data frame has any missing values.

df %>% is.na() %>% any()

Here is how this works:

  • We pass a data frame df to the function is.na() from base R.
  • is.na() returns a boolean matrix of the same size (dimensions) as the given data frame (here df) where it takes a value of TRUE if the corresponding value is missing and FALSE otherwise.
  • We pass the boolean matrix returned by is.na() to any() which returns TRUE if one or more values of the input are TRUE (i.e. there are any missing values) and FALSE if none of the values of the input are TRUE (i.e. no missing values).

Particular Column

We wish to know if a particular column has any missing values.

df %>% pull(col_1) %>% is.na() %>% any()

Here is how this works:

  • We use pull() to extract the particular column we are interested in inspecting.
  • The rest of the expression works as described above.

Each Column

We wish to know which columns of the data frame have missing values.

df %>% map_dfc(is.na) %>% map_lgl(any)

Here is how this works:

  • We use map_dfc() from purrr (tidyverse package) to apply is.na() to each column of the data frame df. The output is a data frame of boolean values of the same size as the input data frame (here df).
  • We then use map_lgl() to apply any() to each column of the boolean data frame returned by map_dfc(is.na). The output is a boolean vector with one boolean value for each column of the data frame df that is TRUE if the column has any missing values and FALSE otherwise.

Count

We wish to know how many missing values there are.

Data Frame

We wish to obtain the number of missing values in a data frame.

df %>% is.na() %>% sum()

Here is how this works:

  • This works similarly to Any described above except that we use sum() instead of any().
  • We pass the boolean matrix returned by is.na() to sum() to obtain the number of missing values in the data frame df.
  • Summing a boolean vector is basically counting the number of TRUE values (which in this case is the number of missing values) because sum() regards TRUE as 1 and FALSE as 0.

Particular Column

We wish to get the number of missing values in a particular column.

df %>% pull(col_1) %>% is.na() %>% sum()

Here is how this works:

  • We use pull() to extract the particular column we are interested in inspecting.
  • The rest of the expression works as described above.

Each Column

We wish to obtain the number of missing values per column for each column in a data frame.

df %>% map_dfc(is.na) %>% map_dbl(sum) %>% sort(decreasing = TRUE)

Here is how this works:

  • This works similarly to the “Each Column” scenario under “Any” above except that instead of any(), we use sum() to count the number of missing values per column.
  • It is often helpful to sort the number of missing values for each column in descending order. We do that via sort(decreasing = TRUE).

Proportion

We wish to get the proportion (percentage) of values that are missing.

Data Frame

We wish to know the proportion of a data frame’s values that are missing.

df %>% is.na() %>% mean()

Here is how this works:

  • This works similarly to Any described above except that we use mean() instead of any().
  • We pass the boolean matrix returned by is.na() to mean() to obtain the proportion of missing values in the data frame df.
  • mean() regards TRUE as 1 and FALSE as 0. Therefore, taking the mean of a boolean vector is basically taking the ratio between the number of TRUE values (which in this case is the number of missing values) and the total number of values in the data frame df.

Particular Column

We wish to know the proportion of a column’s values that are missing.

df %>% pull(col_1) %>% is.na() %>% mean()

Here is how this works:

  • We use pull() to extract the particular column we are interested in inspecting.
  • The rest of the expression works as described above.

Each Column

We wish to obtain the proportion of missing values for each column in a data frame.

df %>% map_dfc([is.na](http://is.na/)) %>% map_dbl(mean) %>% sort(decreasing = TRUE)

Here is how this works:

  • This works similarly to the “Each Column” scenario under “Count” above except that instead of sum(), we use mean() to count the number of missing values per column.
  • We use sort(decreasing = TRUE) to sort the output in descending order.

Get Rows

Any

We wish to get rows where any column has a missing value.

df %>% filter(if_any(everything(), is.na))

Here is how this works:

  • We pass the data frame df to filter() to select rows meeting a certain condition.
  • if_any() applies a given predicate function (a function that returns TRUE or FALSE in this case is.na()) to a selection of columns and combines the results into a single boolean vector with as many elements as there are rows in the input data frame.
  • We use everything() to select all columns i.e. to include all columns when applying is.na() to check for missing values.
  • What we did here is called Implicit Filtering which we cover in detail in Implicit Filtering.
  • Here we inspected for missing values in all columns, see Selecting for ways to select only a subset of columns.

Count

Get rows where more than n columns have missing values. In this example we extract any row where more than 3 columns have missing values.

df %>% filter(pmap_int(., ~sum(is.na(c(...))))>3)

Here is how this works:

  • We use pmap_int() to iterate over each row of the data frame and count the number of missing values. The output of pmap_int() is a vector of integer with as many elements as there are rows in the data frame df.
  • We compare the values in the integer vector returned by pmap_int() with the threshold of interest which in this case is 3. The output is a boolean vector which is passed to filter() to get the rows where the number of missing values is more than 3.
  • pmap_int() is part of the pmap family of functions from the package purrr which map a function over the values of n lists (here one list corresponds to each column of the data frame).
  • In this case the function is ~sum([is.na](http://is.na/)(c(...))) which counts the number of missing values (NAs) in each row of the data frame.
  • What we did here is called non vectorized filtering with implicit column selection which cover here . Please see that section for a deeper coverage of how this expression works.

Get Columns

Get columns where more than a given percent of values are missing. In this example, we get columns where more than 10% of the values are missing.

df %>% select(where(~ mean(is.na(.x)) > 0.1))

Here is how this works:

  • We pass the data frame df to select() to select columns meeting a certain condition.
  • We use an anonymous function declared via ~ to iterate over all columns and check if more than 10% of the values of the column are missing via ~ mean(is.na(.x)) > 0.1. Here .x denotes the current column being evaluated.
  • where() returns the column names of those for which ~ mean(is.na(.x)) > 0.1 returns TRUE which is used by select() to return the columns as a data frame.
  • What we did here is called Selecting implicitly by data criteria which we cover in detail in Selecting by Data Criteria.
R
I/O