Duplicates

On this page, we look at how to inspect duplication in a data frame. We will cover three scenarios:

  1. We wish to know the number of rows where duplication occurs.
  2. We wish to know the proportion (percent) of rows where duplication occurs.
  3. We wish to get the actual duplicates (to inspect or further process).

For each scenario, we will look at two variants:

  1. Cases where a row is an exact replica of another row, i.e. all column values are the same.
  2. Cases where a row has column values equal to those of another row for some selected columns.

See Uniqueness for a more detailed coverage of detecting and dealing with duplication in data.

Count

We wish to know how many rows involve duplication.

Entire Rows

We wish to get the number of rows that are a duplicate of another row in a data frame.

df.duplicated().sum()

Here is how this works:

  • The duplicated() method of Pandas data frames returns a boolean vector where an element is True if the row is a duplicate of a row that occurred earlier.
  • We then apply the sum() method to the vector (Series) resulting from duplicated() to count the number of True elements, which is the number of duplicate rows.
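The two steps above can be sketched on a small hypothetical data frame with one fully duplicated row:

```python
import pandas as pd

# A small example frame (hypothetical data); row 2 is an exact copy of row 0.
df = pd.DataFrame({
    'col_1': ['a', 'b', 'a', 'c'],
    'col_2': [1, 2, 1, 3],
})

# duplicated() flags a row as True if an identical row occurred earlier;
# summing the boolean Series counts the flagged rows.
n_duplicates = df.duplicated().sum()
print(n_duplicates)  # 1
```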

Selected Columns

We wish to get the number of rows where a subset of columns (one or more columns) take the same values as in another row in the data frame.

df.duplicated(subset=['col_1', 'col_2', 'col_3']).sum()

Here is how this works:

  • We covered above how duplicated() and sum() work together to count the number of duplicate rows in a data frame.
  • The duplicated() method has a subset argument to which we can pass a list of the columns we wish to consider for duplicate detection. In this example, we consider only columns ['col_1', 'col_2', 'col_3'] in our duplicate detection.
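As a sketch, consider a hypothetical frame where two rows agree on the three selected columns but differ on a fourth:

```python
import pandas as pd

# Hypothetical data: rows 0 and 1 match on col_1, col_2, and col_3,
# while the 'other' column differs and is ignored by the subset check.
df = pd.DataFrame({
    'col_1': ['a', 'a', 'b'],
    'col_2': [1, 1, 2],
    'col_3': ['x', 'x', 'y'],
    'other': [10, 20, 30],
})

n_subset_dupes = df.duplicated(subset=['col_1', 'col_2', 'col_3']).sum()
print(n_subset_dupes)  # 1
```

Note that df.duplicated().sum() (without subset) would return 0 here, since no row is an exact replica of another once 'other' is considered.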

Proportion

We wish to know what proportion (percent) of rows involves duplication.

Entire Rows

We wish to get the proportion of rows of a data frame that are a duplicate of another row in the data frame.

df.duplicated().mean()

Here is how this works:

  • This works similarly to how we obtained the number of duplicated rows above except that we use mean() instead of sum() to obtain the proportion (percent) of duplicated rows.
  • mean() regards True as 1 and False as 0. Therefore, taking the mean of a boolean vector is effectively taking the ratio of the number of True values (which in this case is the number of duplicate rows) to the total number of rows in the data frame df.
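A minimal sketch, using a hypothetical four-row frame in which the last two rows repeat the first two:

```python
import pandas as pd

# Hypothetical data: rows 2 and 3 repeat rows 0 and 1 exactly.
df = pd.DataFrame({
    'col_1': ['a', 'b', 'a', 'b'],
    'col_2': [1, 2, 1, 2],
})

# duplicated() yields [False, False, True, True];
# mean() = 2 True values / 4 rows = 0.5.
prop = df.duplicated().mean()
print(prop)  # 0.5
```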

Selected Columns

We wish to get the proportion (percent) of rows where a subset of columns (one or more columns) take the same values as in another row of the data frame.

df.duplicated(subset=['col_1', 'col_2', 'col_3']).mean()

Here is how this works:

  • df.duplicated().mean() works as described above.
  • We use the subset argument of duplicated() to specify the columns to consider for duplicate detection.
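The same idea on a hypothetical frame where one of four rows repeats another on the selected columns only:

```python
import pandas as pd

# Hypothetical data: row 1 matches row 0 on the three selected columns.
df = pd.DataFrame({
    'col_1': ['a', 'a', 'b', 'c'],
    'col_2': [1, 1, 2, 3],
    'col_3': ['x', 'x', 'y', 'z'],
    'other': [1, 2, 3, 4],
})

# 1 duplicate (on the subset) out of 4 rows = 0.25.
prop = df.duplicated(subset=['col_1', 'col_2', 'col_3']).mean()
print(prop)  # 0.25
```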

Get Rows

We wish to get the actual duplicated rows.

Entire Rows

We wish to get rows that appear more than once in a data frame.

df.loc[df.duplicated()]

Here is how this works:

  • The duplicated() method of Pandas data frames returns a boolean vector that has as many values as the number of rows in the data frame (here df) where an element is True if the row is a duplicate of a row that occurred earlier.
  • We use loc[] to filter rows by the boolean vector generated by duplicated(). See Filtering for more on data frame filtering.
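A minimal sketch on a hypothetical frame. Note that since duplicated() keeps the first occurrence as False by default, the filter returns only the later repeats; duplicated() also accepts keep=False, which flags every member of a duplicated group, should we want the first occurrences as well:

```python
import pandas as pd

# Hypothetical data: row 2 is an exact copy of row 0.
df = pd.DataFrame({
    'col_1': ['a', 'b', 'a'],
    'col_2': [1, 2, 1],
})

# Later repeats only (first occurrence excluded).
dupes = df.loc[df.duplicated()]
print(dupes)  # row 2

# All rows involved in duplication, including first occurrences.
all_dupes = df.loc[df.duplicated(keep=False)]
print(all_dupes)  # rows 0 and 2
```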

Selected Columns

We wish to get cases where a row has column values equal to those of another row for a selected subset of columns.

df.loc[df.duplicated(subset=['col_1', 'col_2', 'col_3'])]

Here is how this works:

  • The filtering of duplicate rows via loc[] works as described above.
  • We use the subset argument of duplicated() to specify the columns to consider for duplicate detection.
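A sketch on a hypothetical frame: the returned rows are complete rows (all columns), even though only the selected columns were compared:

```python
import pandas as pd

# Hypothetical data: row 1 matches row 0 on the three selected columns,
# but carries a different value in 'other'.
df = pd.DataFrame({
    'col_1': ['a', 'a', 'b'],
    'col_2': [1, 1, 2],
    'col_3': ['x', 'x', 'y'],
    'other': [10, 20, 30],
})

dupes = df.loc[df.duplicated(subset=['col_1', 'col_2', 'col_3'])]
print(dupes)  # row 1, with all four columns intact
```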