We wish to identify and act on duplicate rows in a data frame.
This section is structured as follows:
This section is complemented by Inspecting Duplicates, where we cover how to:
We wish to identify which rows are duplicates of other rows.
Entire Rows
We wish to identify duplicate rows in a data frame. A row is considered a duplicate if it has the same values for all columns as a row that appears earlier in the data frame.
In this example, we wish to add to the data frame df
a new logical column with the name is_duplicate
that is TRUE
for rows that are duplicates of rows that appeared earlier and FALSE
otherwise.
df_2 = df %>%
mutate(is_duplicate = duplicated(.))
Here is how this works:
duplicated()
function from base R.duplicated()
expects to get a data frame.df
to duplicated()
from within mutate()
(or any other tidyverse verb), we use the dot .
operator.Selected Columns
We wish to identify duplicate rows in a data frame. A row is considered a duplicate if it has the same values for a selected subset of columns as a row appearing earlier in the data frame.
df_2 = df %>%
mutate(is_duplicated = duplicated(tibble(col_1, col_2)))
Here is how this works:
duplicated()
only to the subset of columns that define a unique row.duplicated()
, we use tibble()
to create a data frame (a tibble) out of the columns.Extension: Duplicate Details
For each row, we wish to compute (1) whether it’s a duplicate of a row that showed up earlier, (2) the index of the duplicate e.g. the 3rd time the row shows up, (3) the number of existing duplicates.
In this example, we wish to compute the three duplicate details specified in the previous paragraph. A row is considered a duplicate if it has the same values for columns col_1
and col_2
as a row appearing earlier in the data frame.
df_2 = df %>%
group_by(col_1, col_2) %>%
mutate(dupe_count = n(),
dup_index = row_number()) %>%
ungroup() %>%
mutate(is_duplicated = dup_index > 1)
Here is how this works:
col_1
and col_2
.dupe_count
tells us how many times each combination of the values of col_1
and col_2
is duplicated.dupe_index
tells us the index of the duplicate e.g. the 3rd time the row shows up.is_duplicate
is a logical column that is TRUE
if the row is a duplicate of a row that showed up earlier and FALSE
otherwise.df_2
is a copy of the input data frame df
with three columns added dupe_count
, dupe_index
, and is_duplicate
.Extension: Check for a 1-to-1 Relationship.
We wish to identify cases where one column does not have a one-to-one relationship with another column.
df_2 = df %>%
group_by(col_1) %>%
summarize(is_one_to_one = n_distinct(col_2) == 1)
Here is how this works:
col_1
.col_2
.col_1
and col_2
then there should be only 1 unique value for col_2
for each unique value of col_1
.col_1
and the values of col_2
does not imply a one-to-one relationship between the values of col_2
and the values of col_1
.We wish to drop (remove) duplicate rows in a data frame.
Entire Rows
We wish to remove duplicate rows in a data frame. A row is considered a duplicate if it has the same values for all columns as a row that appears earlier in the data frame.
df_2 = df %>% distinct()
Here is how this works:
distinct()
function, as the name suggests, removes duplicate rows from a data frame, which here is df
, and is passed to distinct()
via the pipe operator.df_2
is a new data frame that contains the unique rows of the original data frame df
.Alternative: via unique()
df_2 = df %>% unique()
Here is how this works:
This works similarly to the primary solution above, except that we use unique()
from base R instead of distinct()
from dplyr
. We generally recommend distinct()
.
Selected Columns
We wish to remove duplicate rows in a data frame. A row is considered a duplicate if it has the same values for a selected subset of columns as a row appearing earlier in the data frame.
df_2 = df %>%
distinct(col_1, col_2, .keep_all = TRUE)
Here is how this works:
distinct()
to remove duplicate rows from a data frame.distinct()
which we do here via col_1
and col_2
..keep_all
of dinstinct()
to .keep_all = TRUE
so that entire rows are returned, which is typically what we want when dropping duplicate rows. If not used, dinstinct()
will return only the specified columns and drop the rest of the columns.Extension: Keep Last
When dropping duplicates, it’s common to keep the first occurrence and drop the rest. In this example, we show how to keep the last occurrence and drop the rest. This is often needed in practice.
df_2 = df %>%
slice(n():1) %>%
distinct(col_1, col_2, .keep_all = TRUE) %>%
slice(n():1)
Here is how this works:
distinct()
keeps the first occurrence of a row and drops any following duplicate rows.distinct()
as described in the primary solution aboveslice(n():1)
to reverse the order of rows of a data frame.