On this page, we look at how to inspect duplication in a data frame. We will cover three scenarios:
For each scenario, we will look at two variants:
See Uniqueness for a more detailed coverage of detecting and dealing with duplication in data.
We wish to know how many rows involve duplication.
Entire Rows
We wish to get the number of rows that are a duplicate of another row in a data frame.
df.duplicated().sum()
Here is how this works:
duplicated()
method of Pandas data frames returns a boolean vector where an element is True
if the row is a duplicate of a row that occurred earlier.sum()
method to the vector (Series) resulting from duplicated()
to count the number of True
elements which is the number of duplicate rows.Selected Columns
We wish to get the number of rows where a subset of columns (one or more columns) take the same values as in another row in the data frame.
df.duplicated(subset=['col_1', 'col_2', 'col_3']).sum()
Here is how this works:
duplicated()
and sum()
work together to count the number of duplicate rows in a data frame.duplicated()
method has a subset
argument to which we can pass a list of the columns we wish to consider for duplicate detection. In this example, we consider only columns ['col_1', 'col_2', 'col_3']
in our duplicate detection.We wish to know what proportion (percent) of rows involves duplication.
Entire Rows
We wish to get the proportion of rows of a data frame that are a duplicate of another row in the data frame.
df.duplicated().mean()
Here is how this works:
mean()
instead of sum()
to obtain the proportion (percent) of duplicated rows.mean()
regards True
as 1
and False
as 0
. Therefore, taking the mean of a boolean vector is basically taking the ratio between the number of True
values (which in this case is the number of duplicate rows) and the total number of rows in the data frame df
.Selected Columns
We wish to get the proportion (percent) of rows where a subset of columns (one or more columns) take the same values as in another row of the data frame.
df.duplicated(subset=['col_1', 'col_2', 'col_3']).mean()
Here is how this works:
df.duplicated().mean()
works as described above.subset
argument of duplicated()
to specify the columns to consider for duplicate detection.We wish to get the actual duplicated rows.
Entire Rows
We wish to get rows that appear more than once in a data frame.
df.loc[df.duplicated()]
Here is how this works:
duplicated()
method of Pandas data frames returns a boolean vector that has as many values as the number of rows in the data frame (here df
) where an element is True
if the row is a duplicate of a row that occurred earlier.loc[]
to filter rows by the boolean vector generated by duplicated()
. See Filtering for more on data frame filtering.Selected Columns
We wish to get cases where a row has column values equal to those of another row for a selected subset of columns.
df.loc[df.duplicated(subset=['col_1', 'col_2', 'col_3'])]
Here is how this works:
subset
argument of duplicated()
to specify the columns to consider for duplicate detection.