Duplicates

On this page, we look at how to inspect duplication in a data frame. We will cover three scenarios:

  1. We wish to know the number of rows where duplication occurs.
  2. We wish to know the proportion (percent) of rows where duplication occurs.
  3. We wish to get the actual duplicates (to inspect or further process).

For each scenario, we will look at two variants:

  1. Cases where a row is an exact replica of another row i.e. all column values are the same.
  2. Cases where a row has column values equal to those of another row for some selected columns.

See Uniqueness for a more detailed coverage of detecting and dealing with duplication in data.

Count

We wish to know how many rows involve duplication.

Entire Rows

We wish to get the number of rows that are a duplicate of another row in a data frame.

df %>% duplicated() %>% sum()

Here is how this works:

  • We pass a data frame df to the function duplicated() from base R.
  • duplicated() returns a boolean vector that has as many values as there are rows in the data frame and where it takes a value of TRUE if a row is duplicated and FALSE otherwise.
  • We pass the boolean vector returned by duplicated() to sum(). Summing a boolean vector is basically counting the number of TRUE values (which in this case is the number of duplicate rows) because sum() regards TRUE as 1 and FALSE as 0.

Selected Columns

We wish to get the number of rows where a subset of columns (one or more columns) take the same values as in another row in the data frame.

df %>% select(col_1, col_2, col_3) %>% duplicated() %>% sum()

Here is how this works:

  • We start off by applying select() to extract a subset of the original data frame df that has only the columns of interest (in this example col_1, col_2, and col_3).
  • We then use duplicated() and sum() as described above.

Proportion

We wish to know what proportion (percent) of rows involves duplication.

Entire Rows

We wish to get the proportion of rows of a data frame that are a duplicate of another row in the data frame.

df %>% duplicated() %>% mean()

Here is how this works:

  • This works similarly to the above except that we use mean() instead of sum().
  • mean() regards TRUE as 1 and FALSE as 0. Therefore, taking the mean of a boolean vector is basically taking the ratio between the number of TRUE values (which in this case is the number of duplicate rows) and the total number of rows in the data frame df.

Selected Columns

We wish to get the proportion (percent) of rows where a subset of columns (one or more columns) take the same values as in another row of the data frame.

df %>% select(col_1, col_2, col_3) %>% duplicated() %>% mean()

Here is how this works:

  • We start off by applying select() to extract a subset of the original data frame df that has only the columns of interest (in this example col_1, col_2, and col_3).
  • We then use duplicated() and mean() as described above.

Get Rows

We wish to get the actual duplicated rows.

Entire Rows

We wish to get rows that appear more than once in a data frame.

library(janitor)
df %>% get_dupes()

Here is how this works:

  • We used the get_dupes() function from the janitor package. Given a data frame (here df), get_dupes() returns rows that show up more than once.
  • We could also use df %>% filter(duplicated(df)). We prefer the get_dupes() function because it returns an additional helpful column dupe_count that tells us how many duplicates there are.

Selected Columns

We wish to get cases where a row has column values equal to those of another row for a selected subset of columns.

df %>% get_dupes(col_2, col_2, col_3)

Here is how this works:

  • We use the get_dupes() function like described above.
  • The get_dupes() function from the janitor package can take a set of columns (here col_1, col_2 and col_3) and return rows where those columns take the same values.
R
I/O