On this page, we look at how to inspect missing values in a data frame. In particular, we will cover the following common scenarios:
Summary of the state of missing values:
For each of these scenarios, we look at how to summarize the state of missing values over the entire data frame, over each column, and over a particular column.
Look at the data around missing values:
Note: Our coverage here will assume that missing values are encoded correctly as np.nan
. Occasionally, that may not be the case and we would need to encode missing values as such before checking for them. We will cover encoding missing values in General Operations.
Dealing with missing values is an extensive topic. In General Operations, we will cover more aspects of missing value management including dropping missing values, imputing missing values, and missing value encoding.
We wish to know if there are any missing values in the first place.
Data Frame
We wish to know if a data frame has any missing values.
df.isna().any().any()
Here is how this works:
isna()
method which returns a boolean data frame of the same size as the original data frame where values are True
if the corresponding value in the original data frame is missing.any()
method of Pandas data frames acts on columns and returns Series
that has a single boolean value for each column that is True
if any of the values in the respective column are True
and False
if none is True
; i.e. True
if any value in the corresponding column is missing.any()
acts on the Series
of per column booleans to produce a single boolean for the entire data frame which is True
if any of the columns has any missing values and False
otherwise.Each Column
We wish to know which columns of the data frame have missing values.
df.isna().any().sort_values(ascending=False)
Here is how this works:
df.isna().any()
returns a Series
that has one boolean value for each column of the data frame df
that is True
if the column has any missing values and False
otherwise.sort_values()
to the boolean Series generated by df.isna().any()
while setting the argument ascending=False
so True
, corresponding to columns with missing values, comes before False
.Particular Column
We wish to know if a particular column has any missing values.
df['col_1'].isna().any()
Here is how this works:
col_1
) via df['col_1']
which returns the column of interest as a Series
.isna()
method of Pandas Series
to get a boolean Series that is True
wherever the column has any missing values and False
otherwise.any()
method to the boolean Series
produced by df['col_1'].isna()
which returns a single boolean value that is True
if the column col_1
has any missing values and False
otherwise.We wish to know how many missing values there are.
Data Frame
We wish to obtain the number of missing values in a data frame.
df.isna().sum().sum()
Here is how this works:
isna()
method of Pandas data frames returns a boolean data frame of the same size as the original data frame where values are True
if the corresponding value in the original data frame is missing.sum()
method of Pandas data frames sums over columns returning a Series
that has many elements as the input data frame has columns and where each element is the sum of the elements of one column.True
is taken as 1
and a False
as 0
.sum()
acts on the Series
of per column sums to produce a single integer for the entire data frame which is the total number of missing values in the data frame.Each Column
We wish to obtain the number of missing values per column for each column in a data frame.
df.isna().sum().sort_values(ascending=False)
Here is how this works:
df.isna().sum()
works as described above.sort_values(ascending=False)
to the output of df.isna().sum()
.Particular Column
We wish to get the number of missing values in a particular column.
df['col_1'].isna().sum()
Here is how this works:
col_1
) via df['col_1']
which returns the column of interest as a Series
.isna()
method of Pandas Series
to get a boolean Series that is True
wherever the column has any missing values and False
otherwise.sum()
method to the boolean Series
produced by df['col_1'].isna()
which returns a single integer value that is the number of missing values in the column of interest.We wish to get the proportion (percentage) of values that are missing.
Data Frame
We wish to know the proportion of a data frame’s values that are missing.
df.isna().stack().mean()
Here is how this works:
isna()
method of Pandas data frames returns a boolean data frame of the same size as the original data frame where values are True
if the corresponding value in the original data frame is missing.stack()
method of Pandas data frames reshapes the data frame of booleans into a single column to which mean()
can then be applied.isna()
will not have missing data. So we could have safely gone with df.isna().mean().mean()
. That said it is good practice in general to apply statistical summary functions once to the whole set (the population) if what is intended is a summary of the whole set (the population).Each Column
We wish to obtain the proportion of missing values for each column in a data frame.
df.isna().mean().sort_values(ascending=False)
Here is how this works:
df.isna()
works as described above.mean()
method of Pandas data frames computes the mean value for each column and returns a Series
that has as many elements as there are columns.True
values to the total number of values.sort_values(ascending=False)
.Particular Column
We wish to know the proportion of a column’s values that are missing.
df['col_1'].isna().mean()
Here is how this works:
df['col_1'].isna()
works as described above.mean()
method to the boolean Series
produced by df['col_1'].isna()
which returns the proportion of values of the column col_1
that are missing.Any
We wish to get rows where any column has a missing value.
df.loc[df.isna().any(axis=1)]
Here is how this works:
df.isna()
works as described above.any()
the argument axis=1
so it may apply to the rows (instead of the default that is columns) of the boolean data frame resulting from df.na()
and returns a Series
of boolean values that has as many rows as the data frame df
.Series
resulting from any()
is passed as input to loc[]
to filter the data frame df
(for more details see Filtering).Count
Get rows where more than n
columns have missing values. In this example we extract any row where more than 3
columns have missing values.
df.loc[df.isna().sum(axis=1).gt(3)]
Here is how this works:
df.isna()
works as described above.sum()
the argument axis=1
so it may apply to the rows (instead of the default that is columns) of the boolean data frame resulting from df.na()
and returns a Series that has one value for each row in the data frame df
that is the number of missing values in the corresponding row.gt(3)
to the Series
of integer values returned by sum(axis=1)
to get a boolean Series
that is True
where a row has more than 3 missing values and False
otherwise.Series
resulting from gt(3)
is passed as input to loc[]
to filter the data frame df
(for more details see Filtering).Get columns where more than a given percent of values are missing. In this example, we get columns where more than 10%
of the values are missing.
df.loc[:, df.isnull().mean().gt(0.1)]
Here is how this works:
df.isnull().mean()
returns a Series
that has as many elements as the columns of the data frame df
and where each element is the proportion of missing values in the corresponding column.gt(0.1)
comparison which returns True
if the proportion of missing values in the corresponding column is greater than 10%.df.isnull().mean().gt(0.1)
to the second argument of loc[]
to select the columns where the value is True
.