We wish to identify and act on duplicate rows in a data frame.
This section is complemented by Inspecting Duplicates.
We wish to identify which rows are duplicates of other rows.
Entire Rows
We wish to identify duplicate rows in a data frame. A row is considered a duplicate if it has the same values for all columns as a row that appears earlier in the data frame.
In this example, we wish to add to the data frame df a new logical column named is_duplicate that is True for rows that are duplicates of rows that appeared earlier and False otherwise.
df_2 = df\
    .assign(is_duplicate=df.duplicated())
Here is how this works:
The duplicated() method of Pandas data frames checks a data frame, which here is df, for duplicate rows. It returns a logical (True or False) value for each row, indicating whether the row is a duplicate of an earlier row or not.
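To make this concrete, here is a minimal sketch on a made-up data frame (the column names and values are illustrative):

```python
import pandas as pd

# Hypothetical data: the third row repeats the first row exactly
df = pd.DataFrame({'col_1': ['a', 'b', 'a'],
                   'col_2': [1, 2, 1]})

df_2 = df.assign(is_duplicate=df.duplicated())

print(df_2['is_duplicate'].tolist())  # [False, False, True]
```

Only the second occurrence is flagged; the first time a row appears, it is not considered a duplicate.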
Selected Columns
We wish to identify duplicate rows in a data frame. A row is considered a duplicate if it has the same values for a selected subset of columns as a row appearing earlier in the data frame.
df_2 = df\
    .assign(is_duplicate=df.duplicated(subset=['col_1', 'col_2']))
Here is how this works:
This code works similarly to the “Entire Rows” scenario above, except that we pass to the subset argument of the duplicated() method a list of only the columns that define a unique row, which in this case is subset=['col_1', 'col_2'].
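For instance, on a hypothetical data frame where two rows agree on col_1 and col_2 but differ in a third column:

```python
import pandas as pd

# Hypothetical data: rows 0 and 2 match on col_1 and col_2 but not on col_3
df = pd.DataFrame({'col_1': ['a', 'b', 'a'],
                   'col_2': [1, 2, 1],
                   'col_3': [10, 20, 30]})

df_2 = df.assign(is_duplicate=df.duplicated(subset=['col_1', 'col_2']))

# Checking all columns would flag nothing here; the subset check flags row 2
print(df_2['is_duplicate'].tolist())  # [False, False, True]
```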
Extension: Duplicate Details
For each row, we wish to compute (1) whether it is a duplicate of a row that showed up earlier, (2) the index of the duplicate, e.g. the 3rd time the row shows up, and (3) the number of existing duplicates.
In this example, we wish to compute the three duplicate details specified in the previous paragraph. A row is considered a duplicate if it has the same values for columns col_1 and col_2 as a row appearing earlier in the data frame.
df_2 = df.copy()
df_g = df.groupby(['col_1', 'col_2'])
df_2['dupe_count'] = df_g.transform('size')   # size of each (col_1, col_2) group
df_2['dupe_index'] = df_g.cumcount() + 1      # 1-based position within the group
df_2['is_duplicate'] = df_2['dupe_index'] > 1
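As a quick, hypothetical illustration of what these three columns contain:

```python
import pandas as pd

# Hypothetical data: the combination ('a', 1) appears three times
df = pd.DataFrame({'col_1': ['a', 'b', 'a', 'a'],
                   'col_2': [1, 2, 1, 1]})

df_2 = df.copy()
df_g = df.groupby(['col_1', 'col_2'])
df_2['dupe_count'] = df_g.transform('size')   # group size broadcast to each row
df_2['dupe_index'] = df_g.cumcount() + 1      # 1-based position within the group
df_2['is_duplicate'] = df_2['dupe_index'] > 1

print(df_2['dupe_count'].tolist())
print(df_2['dupe_index'].tolist())
print(df_2['is_duplicate'].tolist())
```

Every row of the ('a', 1) group gets dupe_count 3, the occurrences are numbered 1, 2, 3 in order of appearance, and all but the first are flagged as duplicates.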
Here is how this works:
- We group the data frame by the columns that define a duplicate, which here are col_1 and col_2.
- dupe_count tells us how many times each combination of the values of col_1 and col_2 is duplicated.
- dupe_index tells us the index of the duplicate, e.g. the 3rd time the row shows up.
- is_duplicate is a logical column that is True if the row is a duplicate of a row that showed up earlier and False otherwise.
- df_2 is a copy of the input data frame df with three columns added: dupe_count, dupe_index, and is_duplicate.

Extension: Check for a 1-to-1 Relationship
We wish to identify cases where one column does not have a one-to-one relationship with another column.
df_2 = df.groupby('col_1', as_index=False)['col_2'].agg(lambda x: x.nunique() == 1)
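Before stepping through it, here is the check on a small, hypothetical data frame where 'a' always pairs with the same col_2 value but 'b' pairs with two different ones:

```python
import pandas as pd

# Hypothetical data: col_1 'a' always maps to col_2 1, but 'b' maps to 2 and 3
df = pd.DataFrame({'col_1': ['a', 'a', 'b', 'b'],
                   'col_2': [1, 1, 2, 3]})

df_2 = df.groupby('col_1', as_index=False)['col_2'].agg(lambda x: x.nunique() == 1)

print(df_2['col_2'].tolist())  # [True, False]
```

The False flag for 'b' identifies the value of col_1 that breaks the one-to-one relationship.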
Here is how this works:
- We group the data frame by col_1.
- For each group, we check whether there is exactly 1 unique value of col_2.
- If there is a one-to-one relationship between col_1 and col_2, then there should be only 1 unique value of col_2 for each unique value of col_1.
- Note that a one-to-one relationship between the values of col_1 and the values of col_2 does not imply a one-to-one relationship between the values of col_2 and the values of col_1.

We wish to drop (remove) duplicate rows in a data frame.
Entire Rows
We wish to drop (remove) duplicate rows in a data frame. A row is considered a duplicate if it has the same values for all columns as a row that appears higher in the data frame.
df_2 = df.drop_duplicates()
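For example, on a hypothetical data frame whose third row repeats its first:

```python
import pandas as pd

# Hypothetical data: the third row is an exact copy of the first
df = pd.DataFrame({'col_1': ['a', 'b', 'a'],
                   'col_2': [1, 2, 1]})

df_2 = df.drop_duplicates()

print(len(df_2))  # 2 (the repeated row is gone; the first occurrence survives)
```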
Here is how this works:
- The drop_duplicates() function, as the name suggests, removes duplicate rows from a data frame, which here is df.
- df_2 is a new data frame that contains the unique rows of the original data frame df.

Selected Columns
We wish to remove duplicate rows in a data frame. A row is considered a duplicate if it has the same values for a selected subset of columns as a row appearing earlier in the data frame.
df_2 = df.drop_duplicates(subset=['col_1', 'col_2'])
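For instance, with made-up values where two rows agree only on col_1 and col_2:

```python
import pandas as pd

# Hypothetical data: rows 0 and 2 share col_1 and col_2 but differ in col_3,
# so drop_duplicates() with no arguments would keep all three rows
df = pd.DataFrame({'col_1': ['a', 'b', 'a'],
                   'col_2': [1, 2, 1],
                   'col_3': [10, 20, 30]})

df_2 = df.drop_duplicates(subset=['col_1', 'col_2'])

print(df_2['col_3'].tolist())  # [10, 20]
```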
Here is how this works:
- We use drop_duplicates() to remove duplicate rows from a data frame.
- To consider only a subset of columns when identifying duplicates, we pass a list of those columns to the subset argument of drop_duplicates(), which we do here via subset=['col_1', 'col_2'].

Extension: Keep Last
When dropping duplicates, it's common to keep the first occurrence and drop the rest. In this example, we show how to instead keep the last occurrence and drop the rest, which is often needed in practice.
df_2 = df.drop_duplicates(['col_1', 'col_2'], keep='last')
Here is how this works:
- By default, drop_duplicates() keeps the first occurrence of a row and drops any following duplicate rows.
- To keep the last occurrence instead, we set the keep argument of drop_duplicates() to keep='last'.
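To see the difference, here is keep='last' on a hypothetical data frame where the duplicated pair carries different values in a third column:

```python
import pandas as pd

# Hypothetical data: ('a', 1) appears at rows 0 and 2 with different col_3 values
df = pd.DataFrame({'col_1': ['a', 'b', 'a'],
                   'col_2': [1, 2, 1],
                   'col_3': [10, 20, 30]})

df_2 = df.drop_duplicates(['col_1', 'col_2'], keep='last')

# The default keep='first' would retain col_3 == 10; keep='last' retains 30
print(df_2['col_3'].tolist())  # [20, 30]
```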