Missing Values

On this page, we look at how to inspect missing values in a data frame. In particular, we will cover the following common scenarios:

  • Summary of the state of missing values:

    • Any: Are there any missing values?
    • Count: How many missing values are there?
    • Proportion: What percent of values is missing?

    For each of these scenarios, we look at how to summarize the state of missing values over the entire data frame, over each column, and over a particular column.

  • Look at the data around missing values:

    • Get Rows: Return rows that have missing values.
    • Get Columns: Return columns that have missing values.

Note: Our coverage here will assume that missing values are encoded correctly as np.nan . Occasionally, that may not be the case and we would need to encode missing values as such before checking for them. We will cover encoding missing values in General Operations.

Dealing with missing values is an extensive topic. In General Operations, we will cover more aspects of missing value management including dropping missing values, imputing missing values, and missing value encoding.

Any

We wish to know if there are any missing values in the first place.

Data Frame

We wish to know if a data frame has any missing values.

df.isna().any().any()

Here is how this works:

  • Pandas data frames have an isna() method which returns a boolean data frame of the same size as the original data frame where values are True if the corresponding value in the original data frame is missing.
  • The any() method of Pandas data frames acts on columns and returns Series that has a single boolean value for each column that is True if any of the values in the respective column are True and False if none is True; i.e. True if any value in the corresponding column is missing.
  • The second any() acts on the Series of per column booleans to produce a single boolean for the entire data frame which is True if any of the columns has any missing values and False otherwise.

Each Column

We wish to know which columns of the data frame have missing values.

df.isna().any().sort_values(ascending=False)

Here is how this works:

  • As described above, df.isna().any() returns a Series that has one boolean value for each column of the data frame df that is True if the column has any missing values and False otherwise.
  • If there are many columns, it may be difficult to see which ones have missing values. A good idea is to sort in descending order so columns with missing values show up first. We do that by applying sort_values() to the boolean Series generated by df.isna().any() while setting the argument ascending=False so True, corresponding to columns with missing values, comes before False.

Particular Column

We wish to know if a particular column has any missing values.

df['col_1'].isna().any()

Here is how this works:

  • We select the column of interest (here col_1) via df['col_1'] which returns the column of interest as a Series.
  • We then apply the isna() method of Pandas Series to get a boolean Series that is True wherever the column has any missing values and False otherwise.
  • We then apply the any() method to the boolean Series produced by df['col_1'].isna() which returns a single boolean value that is True if the column col_1 has any missing values and False otherwise.

Count

We wish to know how many missing values there are.

Data Frame

We wish to obtain the number of missing values in a data frame.

df.isna().sum().sum()

Here is how this works:

  • The isna() method of Pandas data frames returns a boolean data frame of the same size as the original data frame where values are True if the corresponding value in the original data frame is missing.
  • The sum() method of Pandas data frames sums over columns returning a Series that has many elements as the input data frame has columns and where each element is the sum of the elements of one column.
  • We are summing logical True or False values where a True is taken as 1 and a False as 0.
  • The second sum() acts on the Series of per column sums to produce a single integer for the entire data frame which is the total number of missing values in the data frame.

Each Column

We wish to obtain the number of missing values per column for each column in a data frame.

df.isna().sum().sort_values(ascending=False)

Here is how this works:

  • df.isna().sum() works as described above.
  • It is usually helpful to see the columns with the most missing values first. To sort in descending order of number of missing values per column we applysort_values(ascending=False) to the output of df.isna().sum().

Particular Column

We wish to get the number of missing values in a particular column.

df['col_1'].isna().sum()

Here is how this works:

  • We select the column of interest (here col_1) via df['col_1'] which returns the column of interest as a Series.
  • We then apply the isna() method of Pandas Series to get a boolean Series that is True wherever the column has any missing values and False otherwise.
  • We then apply the sum() method to the boolean Series produced by df['col_1'].isna() which returns a single integer value that is the number of missing values in the column of interest.

Proportion

We wish to get the proportion (percentage) of values that are missing.

Data Frame

We wish to know the proportion of a data frame’s values that are missing.

df.isna().stack().mean()

Here is how this works:

  • The isna() method of Pandas data frames returns a boolean data frame of the same size as the original data frame where values are True if the corresponding value in the original data frame is missing.
  • The stack() method of Pandas data frames reshapes the data frame of booleans into a single column to which mean() can then be applied.
  • Generally, the mean of the means of subsets is not the same as the mean of the whole set. However, they are the same if the subsets are the same sample size which is the case here because even if there is missing data in the original data frame, the data frame of booleans resulting from applying isna() will not have missing data. So we could have safely gone with df.isna().mean().mean(). That said it is good practice in general to apply statistical summary functions once to the whole set (the population) if what is intended is a summary of the whole set (the population).

Each Column

We wish to obtain the proportion of missing values for each column in a data frame.

df.isna().mean().sort_values(ascending=False)

Here is how this works:

  • df.isna() works as described above.
  • The mean() method of Pandas data frames computes the mean value for each column and returns a Series that has as many elements as there are columns.
  • Taking the mean of a boolean vector is equivalent to taking the ratio of True values to the total number of values.
  • It is instructive to sort in descending order of proportion of missing values per column. We do that via sort_values(ascending=False).

Particular Column

We wish to know the proportion of a column’s values that are missing.

df['col_1'].isna().mean()

Here is how this works:

  • df['col_1'].isna() works as described above.
  • We then apply the mean() method to the boolean Series produced by df['col_1'].isna() which returns the proportion of values of the column col_1 that are missing.

Get Rows

Any

We wish to get rows where any column has a missing value.

df.loc[df.isna().any(axis=1)]

Here is how this works:

  • df.isna() works as described above.
  • We pass to any() the argument axis=1 so it may apply to the rows (instead of the default that is columns) of the boolean data frame resulting from df.na() and returns a Series of boolean values that has as many rows as the data frame df.
  • The boolean Series resulting from any() is passed as input to loc[] to filter the data frame df (for more details see Filtering).

Count

Get rows where more than n columns have missing values. In this example we extract any row where more than 3 columns have missing values.

df.loc[df.isna().sum(axis=1).gt(3)]

Here is how this works:

  • df.isna() works as described above.
  • We pass to sum() the argument axis=1 so it may apply to the rows (instead of the default that is columns) of the boolean data frame resulting from df.na() and returns a Series that has one value for each row in the data frame df that is the number of missing values in the corresponding row.
  • We then apply gt(3) to the Series of integer values returned by sum(axis=1) to get a boolean Series that is True where a row has more than 3 missing values and False otherwise.
  • The boolean Series resulting from gt(3) is passed as input to loc[] to filter the data frame df (for more details see Filtering).

Get Columns

Get columns where more than a given percent of values are missing. In this example, we get columns where more than 10% of the values are missing.

df.loc[:, df.isnull().mean().gt(0.1)]

Here is how this works:

  • df.isnull().mean() returns a Series that has as many elements as the columns of the data frame df and where each element is the proportion of missing values in the corresponding column.
  • We then apply a greater than gt(0.1) comparison which returns True if the proportion of missing values in the corresponding column is greater than 10%.
  • We pass the output from df.isnull().mean().gt(0.1) to the second argument of loc[] to select the columns where the value is True.
  • What we did here is called Selecting implicitly by data criteria which we cover in detail in Selecting by Data Criteria.
PYTHON
I/O