Identify Missing

We wish to identify which values of a data frame or a vector are missing.

This section is complemented by Inspecting Missing Values, where we cover checking whether any missing values exist, counting missing values, and extracting rows with missing values.

Column

We wish to know which values of a column are missing.

In this example, we wish to create a new column col_1_is_na that has a value of TRUE where the corresponding value of the column col_1 is missing.

df_2 = df %>% 
  mutate(col_1_is_na = is.na(col_1))

Here is how this works:

  • The standard solution for identifying missing values is the is.na() function from base R.
  • Inside any of the dplyr verbs (filter(), mutate(), etc.), we can pass is.na() the name of the column whose elements we want to check for NA.
  • The output of is.na() will be a vector of the same length as the input column, which here is col_1, where an element is TRUE if the corresponding element of the input column is NA.
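
To make the behavior concrete, here is a minimal sketch that runs the solution above on a small made-up data frame. The sample values are assumptions chosen for illustration; only the column name col_1 comes from the example.

```r
library(dplyr)

# Hypothetical sample data: col_1 is missing in rows 2 and 4.
df = tibble(
  col_1 = c(1.5, NA, 3.2, NA),
  col_2 = c("a", "b", NA, "d")
)

# Flag the missing values of col_1 in a new logical column.
df_2 = df %>%
  mutate(col_1_is_na = is.na(col_1))

# One logical value per row: FALSE TRUE FALSE TRUE
df_2$col_1_is_na
```

Note that the NA in col_2 (row 3) has no effect on col_1_is_na, because is.na() only sees the column passed to it.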

Data Frame

We wish to know which values of a data frame are missing.

df %>% is.na()

Here is how this works:

  • To identify which values of a data frame are missing, we pass the data frame to is.na().
  • The output of is.na() is a matrix of logical values of the same dimensions as the input data frame, which here is df, where an element is TRUE if the corresponding element of the input data frame is NA.
  • Post-processing is sometimes easier if we convert the output of is.na() from a matrix to a data frame, which we can do via df %>% is.na() %>% as_tibble().
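
The following sketch illustrates the matrix output and the conversion to a tibble. The data frame and the names na_matrix and na_df are hypothetical, introduced only for this illustration.

```r
library(dplyr)

# Hypothetical sample data with missing values in both columns.
df = tibble(
  col_1 = c(1, NA, 3),
  col_2 = c("a", NA, NA)
)

# is.na() on a data frame returns a logical matrix with the same dimensions.
na_matrix = df %>% is.na()
is.matrix(na_matrix)  # TRUE
dim(na_matrix)        # 3 rows, 2 columns

# Convert to a tibble so that dplyr verbs can be applied to the result.
na_df = df %>% is.na() %>% as_tibble()
na_df
```

The column names of the original data frame carry over to both the matrix and the tibble, so the logical values can still be matched back to their source columns.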

Incomplete Row

We wish to determine which rows of a data frame have a missing value.

df_2 = df %>%
  mutate(
    is_incomplete = !complete.cases(.)
  )

Here is how this works:

  • We use the function complete.cases() from base R to identify whether a row of a data frame has any missing values.
  • The output of complete.cases() is a logical vector with one element per row of the data frame, where a value is TRUE if the corresponding row has no missing values. We negate it with ! to flag the incomplete rows.
  • In order to call complete.cases() on the piped data frame in a chain, we refer to it via the dot placeholder (.) of the %>% pipe.
  • See Drop Missing for more detailed coverage of identifying rows with missing values.
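
As a rough illustration, the sketch below applies the solution to a small made-up data frame. It also shows a variant that avoids the dot placeholder by using pick(everything()); this variant is an addition for illustration, not part of the original solution, and assumes dplyr 1.1.0 or later, where pick() is available.

```r
library(dplyr)

# Hypothetical sample data: row 2 and row 3 each contain one NA.
df = tibble(
  col_1 = c(1, NA, 3),
  col_2 = c("a", "b", NA),
  col_3 = c(TRUE, FALSE, TRUE)
)

# Original approach: refer to the piped data frame via the dot placeholder.
df_2 = df %>%
  mutate(is_incomplete = !complete.cases(.))

# Illustrative variant: pick(everything()) returns all current columns
# as a data frame, so no dot placeholder is needed.
df_3 = df %>%
  mutate(is_incomplete = !complete.cases(pick(everything())))

df_2$is_incomplete  # FALSE TRUE TRUE
```

The variant can be handy with the native pipe |>, which does not provide the . placeholder in the same way %>% does.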

Extension: Selected Columns

We wish to determine which rows of a data frame have a missing value for any of a selected set of columns.

df_2 = df %>%  
  mutate(
    is_incomplete = !complete.cases(pick(col_2, col_3))  
  )

Here is how this works:

  • We use the dplyr helper pick() to obtain a data frame that contains a subset of the columns of the data frame being piped in the chain, which in this case is a data frame containing the columns col_2 and col_3.
  • complete.cases() is then executed on this sub-data-frame and returns TRUE for rows where none of the selected columns has a missing value. Negating the result with ! flags the rows where at least one of the selected columns is missing.
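
A short sketch on made-up data shows how restricting the check to col_2 and col_3 changes the result. The sample values are assumptions; note in particular that a missing value in col_1 is ignored.

```r
library(dplyr)

# Hypothetical sample data: each of the first three rows has one NA,
# each in a different column.
df = tibble(
  col_1 = c(NA, 2, 3, 4),
  col_2 = c("a", NA, "c", "d"),
  col_3 = c(TRUE, FALSE, NA, TRUE)
)

# Only col_2 and col_3 are considered when checking for missing values.
df_2 = df %>%
  mutate(
    is_incomplete = !complete.cases(pick(col_2, col_3))
  )

# Row 1 is not flagged because its only NA is in col_1.
df_2$is_incomplete  # FALSE TRUE TRUE FALSE
```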

Row

We wish to know which values of a row of a data frame are missing.

In this example, we wish to know which columns are missing in the first row of the data frame df.

is.na(df)[1,]

Here is how this works:

  • We apply is.na() to the data frame df and then use [1,] to extract the first row of the matrix.
  • The equivalent piped solution is df %>% is.na() %>% `[`(1, ), where the backticks let us call the subsetting operator [ as a regular function.
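
The sketch below runs both forms on a small made-up data frame and also lists the names of the columns that are missing in the first row. The data and the names first_row_na and piped are hypothetical, used only for illustration.

```r
library(dplyr)

# Hypothetical sample data: the first row is missing col_1 and col_3.
df = tibble(
  col_1 = c(NA, 2),
  col_2 = c("a", NA),
  col_3 = c(NA, TRUE)
)

# Extract the first row of the logical matrix as a named logical vector.
first_row_na = is.na(df)[1, ]
first_row_na

# Equivalent piped form, calling the subsetting operator as a function.
piped = df %>% is.na() %>% `[`(1, )

# Names of the columns that are missing in the first row.
names(first_row_na)[first_row_na]  # "col_1" "col_3"
```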