Duplicates

We wish to identify and act on duplicate rows in a data frame.

This section is complemented by Inspecting Duplicates, where we cover how to:

  • Obtain the duplicate rows.
  • Count duplicate rows.
  • Calculate the proportion of duplicate rows.

Identify Duplicates

We wish to identify which rows are duplicates of other rows.

Entire Rows

We wish to identify duplicate rows in a data frame. A row is considered a duplicate if it has the same values for all columns as a row that appears earlier in the data frame.

In this example, we wish to add to the data frame df a new logical column named is_duplicate that is True for rows that are duplicates of rows that appeared earlier and False otherwise.

df_2 = df\
  .assign(is_duplicate = df.duplicated())

Here is how this works:

The duplicated() method of Pandas data frames checks a data frame, which here is df, for duplicate rows. It returns a logical (True or False) value for each row, indicating whether the row is a duplicate or not.
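To make this concrete, here is a minimal sketch on hypothetical data where the last row repeats the first:

```python
import pandas as pd

# Hypothetical data: the last row repeats the first row exactly.
df = pd.DataFrame({
    'col_1': ['a', 'b', 'a'],
    'col_2': [1, 2, 1],
})

df_2 = df.assign(is_duplicate=df.duplicated())

print(df_2['is_duplicate'].tolist())  # [False, False, True]
```

Note that only the later occurrence is flagged; the first occurrence of a repeated row is not considered a duplicate.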

Selected Columns

We wish to identify duplicate rows in a data frame. A row is considered a duplicate if it has the same values for a selected subset of columns as a row appearing earlier in the data frame.

df_2 = df\
  .assign(is_duplicate = df.duplicated(subset=['col_1', 'col_2']))

Here is how this works:

This code works similarly to the “Entire Rows” scenario above except that we pass to the subset argument of the duplicated() method a list of only the columns that define a unique row, which in this case is subset=['col_1', 'col_2'].
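As a sketch on hypothetical data, rows that agree on col_1 and col_2 are flagged even if they differ in other columns:

```python
import pandas as pd

# Hypothetical data: rows 0 and 2 agree on col_1 and col_2 but differ on col_3.
df = pd.DataFrame({
    'col_1': ['a', 'b', 'a'],
    'col_2': [1, 2, 1],
    'col_3': ['x', 'y', 'z'],
})

df_2 = df.assign(is_duplicate=df.duplicated(subset=['col_1', 'col_2']))

print(df_2['is_duplicate'].tolist())  # [False, False, True]
```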

Extension: Duplicate Details

For each row, we wish to compute (1) whether it is a duplicate of a row that showed up earlier, (2) its occurrence index, e.g. that this is the 3rd time the row shows up, and (3) the total number of times the row occurs in the data frame.

In this example, we wish to compute the three duplicate details specified in the previous paragraph. A row is considered a duplicate if it has the same values for columns col_1 and col_2 as a row appearing earlier in the data frame.

df_2 = df.copy()
df_g = df.groupby(['col_1', 'col_2'])
# Select one of the grouping columns so transform('size') returns a Series.
df_2['dupe_count'] = df_g['col_1'].transform('size')
df_2['dupe_index'] = df_g.cumcount() + 1
df_2['is_duplicate'] = df_2['dupe_index'] > 1

Here is how this works:

  • We group by the set of columns whose values determine whether a row is a duplicate, which in this case are col_1 and col_2.
  • dupe_count tells us how many times each combination of the values of col_1 and col_2 occurs in the data frame.
  • dupe_index gives the occurrence index of the row within its group, e.g. a value of 3 means this is the 3rd time the combination shows up.
  • is_duplicate is a logical column that is True if the row is a duplicate of a row that showed up earlier and False otherwise.
  • The output data frame df_2 is a copy of the input data frame df with three columns added: dupe_count, dupe_index, and is_duplicate.
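The steps above can be sketched end-to-end on hypothetical data where the combination ('a', 1) occurs three times:

```python
import pandas as pd

# Hypothetical data: ('a', 1) occurs three times, ('b', 2) once.
df = pd.DataFrame({
    'col_1': ['a', 'a', 'b', 'a'],
    'col_2': [1, 1, 2, 1],
})

df_2 = df.copy()
df_g = df.groupby(['col_1', 'col_2'])
# Select one of the grouping columns so transform('size') returns a Series.
df_2['dupe_count'] = df_g['col_1'].transform('size')
df_2['dupe_index'] = df_g.cumcount() + 1
df_2['is_duplicate'] = df_2['dupe_index'] > 1

print(df_2['dupe_count'].tolist())    # [3, 3, 1, 3]
print(df_2['dupe_index'].tolist())    # [1, 2, 1, 3]
print(df_2['is_duplicate'].tolist())  # [False, True, False, True]
```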

Extension: Check for a 1-to-1 Relationship

We wish to identify cases where one column does not have a one-to-one relationship with another column.

df_2 = df.groupby('col_1', as_index=False)['col_2'].agg(lambda x: x.nunique() == 1)

Here is how this works:

  • We group by the first of the two columns, which here is col_1.
  • We then check whether the second column, which here is col_2, takes exactly one unique value within each group.
  • Intuitively, if there is a one-to-one relationship between col_1 and col_2, then each unique value of col_1 should correspond to exactly 1 unique value of col_2. Rows of df_2 where the check is False identify values of col_1 that violate the relationship.
  • Note that each value of col_1 mapping to a single value of col_2 does not imply that each value of col_2 maps to a single value of col_1. To verify a full one-to-one relationship, repeat the check with the roles of the two columns swapped.
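Here is a minimal sketch on hypothetical data where one value of col_1 breaks the relationship:

```python
import pandas as pd

# Hypothetical mapping: 'a' always pairs with 1, but 'b' pairs with both 2 and 3.
df = pd.DataFrame({
    'col_1': ['a', 'a', 'b', 'b'],
    'col_2': [1, 1, 2, 3],
})

df_2 = df.groupby('col_1', as_index=False)['col_2'].agg(lambda x: x.nunique() == 1)

print(df_2['col_2'].tolist())  # [True, False]; 'b' violates the relationship
```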

Drop Duplicates

We wish to drop (remove) duplicate rows in a data frame.

Entire Rows

We wish to drop (remove) duplicate rows in a data frame. A row is considered a duplicate if it has the same values for all columns as a row that appears earlier in the data frame.

df_2 = df.drop_duplicates()

Here is how this works:

  • The drop_duplicates() method, as the name suggests, removes duplicate rows from a data frame, which here is df.
  • The result df_2 is a new data frame that contains the unique rows of the original data frame df.
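As a minimal sketch on hypothetical data, the fully repeated row is dropped and the unique rows are kept:

```python
import pandas as pd

# Hypothetical data with one fully repeated row.
df = pd.DataFrame({
    'col_1': ['a', 'b', 'a'],
    'col_2': [1, 2, 1],
})

df_2 = df.drop_duplicates()

print(df_2.shape[0])  # 2
```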

Selected Columns

We wish to remove duplicate rows in a data frame. A row is considered a duplicate if it has the same values for a selected subset of columns as a row appearing earlier in the data frame.

df_2 = df.drop_duplicates(subset=['col_1', 'col_2'])

Here is how this works:

  • We use drop_duplicates() to remove duplicate rows from a data frame.
  • In order to limit the definition of a duplicate to a subset of columns, we pass those columns to the subset argument of drop_duplicates() which we do here via subset=['col_1', 'col_2'].
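A sketch on hypothetical data shows that the first occurrence within each col_1, col_2 combination is the one retained:

```python
import pandas as pd

# Hypothetical data: rows 0 and 1 agree on col_1 and col_2 but not col_3.
df = pd.DataFrame({
    'col_1': ['a', 'a', 'b'],
    'col_2': [1, 1, 2],
    'col_3': ['x', 'y', 'z'],
})

df_2 = df.drop_duplicates(subset=['col_1', 'col_2'])

print(df_2['col_3'].tolist())  # ['x', 'z']
```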

Extension: Keep Last

When dropping duplicates, the default behavior is to keep the first occurrence and drop the rest. In this example, we show how to keep the last occurrence instead, which is often needed in practice, e.g. when later rows hold more recent values.

df_2 = df.drop_duplicates(subset=['col_1', 'col_2'], keep='last')

Here is how this works:

  • By default, drop_duplicates() keeps the first occurrence of a row and drops any following duplicate rows.
  • In order to keep the last occurrence rather than the first, we set the keep argument of drop_duplicates() to keep='last'.
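Here is a minimal sketch on hypothetical data, using a third column to show which occurrence survives:

```python
import pandas as pd

# Hypothetical data: the first two rows agree on col_1 and col_2;
# col_3 distinguishes which occurrence survives.
df = pd.DataFrame({
    'col_1': ['a', 'a', 'b'],
    'col_2': [1, 1, 2],
    'col_3': ['old', 'new', 'only'],
})

df_2 = df.drop_duplicates(subset=['col_1', 'col_2'], keep='last')

print(df_2['col_3'].tolist())  # ['new', 'only']
```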