Drop Missing

We wish to drop rows or columns that contain missing values.

We cover the following scenarios:

  • Rows: Drop rows with missing values from a data frame.
    • Any Column: Drop rows where any column has a missing value.
    • Selected Columns: Drop rows where any or all of a selected set of columns has a missing value.
    • Empty Rows: Drop rows where all columns have missing values.
    • Absolute Threshold: Drop rows that have more than a specified number of columns with missing values.
    • Relative Threshold: Drop rows that have more than a specified percent of columns with missing values.
  • Columns: Drop columns with missing values from a data frame.
    • Any Row: Drop columns where any row has a missing value.
    • Empty Columns: Drop columns where all values are missing.
    • Absolute Threshold: Drop columns that have more than a specified number of missing values.
    • Relative Threshold: Drop columns with more than a specified percentage of missing values.
  • Series Elements: Drop missing values from a Series.

Note that this section assumes that missing values are encoded in one of the valid missing value encodings e.g. as Numpy nan values. See Missing Value Encoding.

Rows

We wish to drop rows with missing values from a data frame.

Any Column

We wish to drop rows where any column has a missing value.

df_2 = df.dropna()

Here is how this works:

  • We use the dropna()method of Pandas data frames to drop rows containing any missing values.
  • The output data frame df_2 will have the same columns as the input data frame df but with any row that has one or more missing values dropped.

Selected Columns

We wish to drop rows where any column from a selected set of columns has a missing value.

df_2 = df.dropna(subset = ['col_1', 'col_2'])

Here is how this works:

  • We pass to the subset argument of drop_na() a list of the names of the columns that we wish to limit the check for missing values to.
  • The output data frame df_2 will have the same columns as the input data frame df but with any row, where one or more of the specified columns has a missing value, dropped.

Extension: Each Selected Column

We wish to drop rows if and only if all the values of the specified columns are missing.

df_2 = df.dropna(subset = ['col_1', 'col_2'], how='all')

Here is how this works:

  • The method dropna() has an argument how that may take one of two values:
    • how=’any’ is the default and specifies that a row is dropped if any of the selected columns are missing
    • how=’all’ specifies that a row is dropped if all of the selected columns are missing.
  • In this scenario, we set how=’all’ since we wish to drop a row if and only if all of the selected columns, which in this case are col_1 and col_2, are missing.
  • The output data frame df_2 will have the same columns as the input data frame df but with any row where all the specified columns, which here are col_1 and col_2, are missing is dropped.

Empty Rows

We wish to drop rows where all columns have a missing value.

df_2 = df.dropna(how='all')

Here is how this works:

  • We set the argument how of dropna() to how=’all’ so a row is dropped if all columns have missing values, i.e. the row is dropped if it is empty.
  • The output data frame df_2 will have the same columns as the input data frame df but with any row where all columns have missing values NA is dropped.

Absolute Threshold

We wish to drop rows that have more than a specified number of columns with missing values.

In this example, we wish to drop rows where the values of more than 2 columns are missing.

df_2 = df.dropna(thresh=len(df.columns) - 2)

Here is how this works:

  • To specify a threshold on the number of missing values to tolerate in a row, we set the thresh argument of dropna().
  • The thresh argument of dropna() refers to the number of non-NA values. To set a threshold on the number of missing values, we use len(df.columns) - 2.
  • The output data frame df_2 will have the same columns as the input data frame df but with only rows where the values of more than 2 columns are missing.

Relative Threshold

We wish to drop rows that have more than a specified percent of columns with missing values.

In this example, we wish to drop rows where the values of more than half the columns are missing.

df_2 = df.dropna(thresh=0.5 * len(df.columns))

Here is how this works:

  • To specify a threshold on the number of missing values to tolerate in a row, we set the thresh argument of dropna().
  • To set a threshold in terms of the proportion of columns that have non-missing values for each row, we multiply the target proportion, which in this case is 0.5, with the number of columns len(df.columns).

Columns

We wish to drop rows with missing values from a data frame.

Any Rows

We wish to drop columns where any row has a missing value.

df_2 = df.dropna(axis=1)

Here is how this works:

  • We use the method dropna() of Pandas data frames with the argument axis set to axis=1 to drop columns with missing values.
  • The output data frame df_2 has the same number of rows as the input data frame df but where any column, that has one or more missing values, is removed.

Empty Columns

We wish to drop columns that are entirely NA; i.e. we wish to drop columns that take a value of NA for all rows.

df_2 = df.dropna(axis=1, how='all')

Here is how this works:

We use the method dropna() with its axis argument set to axis=1 to drop columns and how argument set to how=’all’ to require that all a column’s values must be missing to be dropped.

Absolute Threshold

We wish to drop columns that have more than a specified number of missing values.

In this example, we wish to drop columns where more than 2 values are missing.

df_2 = df.dropna(axis=1, thresh=len(df) - 2)

Here is how this works:

  • The argument thresh of dropna() refers to the number of non-NA values. To set a threshold in terms of the number of missing values, we use len(df) - 2.
  • The output data frame df_2 has the same number of rows as the input data frame df but where any column that has more than 2 missing values removed.

Relative Threshold

We wish to drop columns with more than a specified percentage of missing values.

In this example, we wish to drop columns where more than half the values are missing.

df_2 = df.dropna(axis=1, thresh=0.5 * len(df))

Here is how this works:

  • To specify a threshold on the number of missing values to tolerate in a row, we set the thresh argument of dropna().
  • To set a threshold in terms of the proportion of non-missing values needed to retain a column, we multiply the target proportion, which in this case is 0.5, with the number of columns len(df).

Series Elements

We wish to drop missing values from a Series.

In this example, we wish to drop missing values from the column col_1 before we obtain the proportion of values that are positive (values higher than zero).

df['col_1'].dropna().gt(0).mean()

Here is how this works:

  • We use the dropna() method of Pandas Series objects to drop missing values fromdf['col_1'].
  • We then use gt(0) to return a logical vector that is True for positive values and False otherwise.
  • Finally, we use mean() to obtain the proportion of positive values.
  • It is worth noting that when predicate functions are applied to missing values we may get False which may cause subsequent operations to yield wrong results. In this example, applying the greater than gt() function to a missing value returns False which would be indistinguishable from a False obtained because a value is negative.

Extension: List

input_list = [1, 2, None, 4, 5, np.nan, "", "NA"]
missing_values = [None, np.nan, "", "NA"]
filtered_list = [x for x in input_list if x not in missing_values]

Here is how this works:

  • In missing_values, we specify the encodings of missing values that we wish to drop from the list.
  • We then use a list comprehension to drop missing values from the list input_list and returnd another list filtered_list.
PYTHON
I/O