We wish to drop rows or columns that contain missing values.
We cover the following scenarios:
Note that this section assumes that missing values are encoded in one of the valid missing value encodings e.g. as Numpy nan
values. See Missing Value Encoding.
We wish to drop rows with missing values from a data frame.
Any Column
We wish to drop rows where any column has a missing value.
df_2 = df.dropna()
Here is how this works:
dropna()
method of Pandas data frames to drop rows containing any missing values.df_2
will have the same columns as the input data frame df
but with any row that has one or more missing values dropped.Selected Columns
We wish to drop rows where any column from a selected set of columns has a missing value.
df_2 = df.dropna(subset = ['col_1', 'col_2'])
Here is how this works:
subset
argument of drop_na()
a list of the names of the columns that we wish to limit the check for missing values to.df_2
will have the same columns as the input data frame df
but with any row, where one or more of the specified columns has a missing value, dropped.Extension: Each Selected Column
We wish to drop rows if and only if all the values of the specified columns are missing.
df_2 = df.dropna(subset = ['col_1', 'col_2'], how='all')
Here is how this works:
dropna()
has an argument how
that may take one of two values:how=’any’
is the default and specifies that a row is dropped if any of the selected columns are missinghow=’all’
specifies that a row is dropped if all of the selected columns are missing.how=’all’
since we wish to drop a row if and only if all of the selected columns, which in this case are col_1
and col_2
, are missing.df_2
will have the same columns as the input data frame df
but with any row where all the specified columns, which here are col_1
and col_2
, are missing is dropped.Empty Rows
We wish to drop rows where all columns have a missing value.
df_2 = df.dropna(how='all')
Here is how this works:
how
of dropna()
to how=’all’
so a row is dropped if all columns have missing values, i.e. the row is dropped if it is empty.df_2
will have the same columns as the input data frame df
but with any row where all columns have missing values NA
is dropped.Absolute Threshold
We wish to drop rows that have more than a specified number of columns with missing values.
In this example, we wish to drop rows where the values of more than 2 columns are missing.
df_2 = df.dropna(thresh=len(df.columns) - 2)
Here is how this works:
thresh
argument of dropna()
.thresh
argument of dropna()
refers to the number of non-NA values. To set a threshold on the number of missing values, we use len(df.columns) - 2
.df_2
will have the same columns as the input data frame df
but with only rows where the values of more than 2 columns are missing.Relative Threshold
We wish to drop rows that have more than a specified percent of columns with missing values.
In this example, we wish to drop rows where the values of more than half the columns are missing.
df_2 = df.dropna(thresh=0.5 * len(df.columns))
Here is how this works:
thresh
argument of dropna()
.len(df.columns)
.We wish to drop rows with missing values from a data frame.
Any Rows
We wish to drop columns where any row has a missing value.
df_2 = df.dropna(axis=1)
Here is how this works:
dropna()
of Pandas data frames with the argument axis
set to axis=1
to drop columns with missing values.df_2
has the same number of rows as the input data frame df
but where any column, that has one or more missing values, is removed.Empty Columns
We wish to drop columns that are entirely NA
; i.e. we wish to drop columns that take a value of NA
for all rows.
df_2 = df.dropna(axis=1, how='all')
Here is how this works:
We use the method dropna()
with its axis
argument set to axis=1
to drop columns and how
argument set to how=’all’
to require that all a column’s values must be missing to be dropped.
Absolute Threshold
We wish to drop columns that have more than a specified number of missing values.
In this example, we wish to drop columns where more than 2 values are missing.
df_2 = df.dropna(axis=1, thresh=len(df) - 2)
Here is how this works:
thresh
of dropna()
refers to the number of non-NA values. To set a threshold in terms of the number of missing values, we use len(df) - 2
.df_2
has the same number of rows as the input data frame df
but where any column that has more than 2 missing values removed.Relative Threshold
We wish to drop columns with more than a specified percentage of missing values.
In this example, we wish to drop columns where more than half the values are missing.
df_2 = df.dropna(axis=1, thresh=0.5 * len(df))
Here is how this works:
thresh
argument of dropna()
.len(df)
.We wish to drop missing values from a Series.
In this example, we wish to drop missing values from the column col_1
before we obtain the proportion of values that are positive (values higher than zero).
df['col_1'].dropna().gt(0).mean()
Here is how this works:
dropna()
method of Pandas Series
objects to drop missing values fromdf['col_1']
.gt(0)
to return a logical vector that is True
for positive values and False
otherwise.mean()
to obtain the proportion of positive values.False
which may cause subsequent operations to yield wrong results. In this example, applying the greater than gt()
function to a missing value returns False
which would be indistinguishable from a False
obtained because a value is negative.Extension: List
input_list = [1, 2, None, 4, 5, np.nan, "", "NA"]
missing_values = [None, np.nan, "", "NA"]
filtered_list = [x for x in input_list if x not in missing_values]
Here is how this works:
missing_values
, we specify the encodings of missing values that we wish to drop from the list.input_list
and returnd another list filtered_list
.