On this page, we look at how to inspect duplication in a data frame. We will cover three scenarios:
For each scenario, we will look at two variants:
See Uniqueness for a more detailed coverage of detecting and dealing with duplication in data.
We wish to know how many rows involve duplication.
Entire Rows
We wish to get the number of rows that are a duplicate of another row in a data frame.
df %>% duplicated() %>% sum()
Here is how this works:
df
to the function duplicated()
from base R.duplicated()
returns a boolean vector that has as many values as there are rows in the data frame and where it takes a value of TRUE
if a row is duplicated and FALSE
otherwise.duplicated()
to sum()
. Summing a boolean vector is basically counting the number of TRUE
values (which in this case is the number of duplicate rows) because sum()
regards TRUE
as 1
and FALSE
as 0
.Selected Columns
We wish to get the number of rows where a subset of columns (one or more columns) take the same values as in another row in the data frame.
df %>% select(col_1, col_2, col_3) %>% duplicated() %>% sum()
Here is how this works:
select()
to extract a subset of the original data frame df that has only the columns of interest (in this example col_1
, col_2
, and col_3
).duplicated()
and sum()
as described above.We wish to know what proportion (percent) of rows involves duplication.
Entire Rows
We wish to get the proportion of rows of a data frame that are a duplicate of another row in the data frame.
df %>% duplicated() %>% mean()
Here is how this works:
mean()
instead of sum()
.mean()
regards TRUE
as 1
and FALSE
as 0
. Therefore, taking the mean of a boolean vector is basically taking the ratio between the number of TRUE
values (which in this case is the number of duplicate rows) and the total number of rows in the data frame df
.Selected Columns
We wish to get the proportion (percent) of rows where a subset of columns (one or more columns) take the same values as in another row of the data frame.
df %>% select(col_1, col_2, col_3) %>% duplicated() %>% mean()
Here is how this works:
select()
to extract a subset of the original data frame df that has only the columns of interest (in this example col_1
, col_2
, and col_3
).duplicated()
and mean()
as described above.We wish to get the actual duplicated rows.
Entire Rows
We wish to get rows that appear more than once in a data frame.
library(janitor)
df %>% get_dupes()
Here is how this works:
get_dupes()
function from the janitor
package. Given a data frame (here df
), get_dupes()
returns rows that show up more than once.df %>% filter(duplicated(df))
. We prefer the get_dupes()
function because it returns an additional helpful column dupe_count
that tells us how many duplicates there are.Selected Columns
We wish to get cases where a row has column values equal to those of another row for a selected subset of columns.
df %>% get_dupes(col_2, col_2, col_3)
Here is how this works:
get_dupes()
function like described above.get_dupes()
function from the janitor
package can take a set of columns (here col_1
, col_2
and col_3
) and return rows where those columns take the same values.