Duplicates

We wish to identify and act on duplicate rows in a data frame.

This section is structured as follows:

This section is complemented by Inspecting Duplicates, where we cover how to:

  • Obtain the duplicate rows.
  • Count duplicate rows.
  • Calculate the proportion of duplicate rows.

Identify Duplicates

We wish to identify which rows are duplicates of other rows.

Entire Rows

We wish to identify duplicate rows in a data frame. A row is considered a duplicate if it has the same values for all columns as a row that appears earlier in the data frame.

In this example, we wish to add to the data frame df a new logical column with the name is_duplicate that is TRUE for rows that are duplicates of rows that appeared earlier and FALSE otherwise.

df_2 = df %>% 
  mutate(is_duplicate = duplicated(.))

Here is how this works:

  • We use the duplicated() function from base R.
  • As input, duplicated() expects to get a data frame.
  • To pass the data frame df to duplicated() from within mutate() (or any other tidyverse verb), we use the dot . operator.

Selected Columns

We wish to identify duplicate rows in a data frame. A row is considered a duplicate if it has the same values for a selected subset of columns as a row appearing earlier in the data frame.

df_2 = df %>% 
    mutate(is_duplicated = duplicated(tibble(col_1, col_2)))

Here is how this works:

  • This code works similarly to the “Entire Rows” scenario above except that we apply the function duplicated() only to the subset of columns that define a unique row.
  • In order to combine multiple rows into a data frame that can be passed to duplicated(), we use tibble() to create a data frame (a tibble) out of the columns.

Extension: Duplicate Details

For each row, we wish to compute (1) whether it’s a duplicate of a row that showed up earlier, (2) the index of the duplicate e.g. the 3rd time the row shows up, (3) the number of existing duplicates.

In this example, we wish to compute the three duplicate details specified in the previous paragraph. A row is considered a duplicate if it has the same values for columns col_1 and col_2 as a row appearing earlier in the data frame.

df_2 = df %>%
  group_by(col_1, col_2) %>%
  mutate(dupe_count = n(),
         dup_index = row_number()) %>%
  ungroup() %>%
  mutate(is_duplicated = dup_index > 1)

Here is how this works:

  • We group by the set of columns whose values determine whether a row is a duplicate; which in this case are col_1 and col_2.
  • dupe_count tells us how many times each combination of the values of col_1 and col_2 is duplicated.
  • dupe_index tells us the index of the duplicate e.g. the 3rd time the row shows up.
  • is_duplicate is a logical column that is TRUE if the row is a duplicate of a row that showed up earlier and FALSE otherwise.
  • The output data frame df_2 is a copy of the input data frame df with three columns added dupe_count, dupe_index, and is_duplicate.

Extension: Check for a 1-to-1 Relationship.

We wish to identify cases where one column does not have a one-to-one relationship with another column.

df_2 = df %>% 
    group_by(col_1) %>% 
    summarize(is_one_to_one = n_distinct(col_2) == 1)

Here is how this works:

  • We group by the first column of the two which here is col_1.
  • We then count the number of unique values the second column takes which here is col_2.
  • Intuitively if there is a one-to-one relationship between col_1 and col_2 then there should be only 1 unique value for col_2 for each unique value of col_1.
  • Note that a one-to-one relationship between the values of col_1 and the values of col_2 does not imply a one-to-one relationship between the values of col_2 and the values of col_1.

Drop Duplicates

We wish to drop (remove) duplicate rows in a data frame.

Entire Rows

We wish to remove duplicate rows in a data frame. A row is considered a duplicate if it has the same values for all columns as a row that appears earlier in the data frame.

df_2 = df %>% distinct()

Here is how this works:

  • The distinct() function, as the name suggests, removes duplicate rows from a data frame, which here is df, and is passed to distinct() via the pipe operator.
  • The result df_2 is a new data frame that contains the unique rows of the original data frame df.

Alternative: via unique()

df_2 = df %>% unique()

Here is how this works:

This works similarly to the primary solution above, except that we use unique() from base R instead of distinct() from dplyr. We generally recommend distinct().

Selected Columns

We wish to remove duplicate rows in a data frame. A row is considered a duplicate if it has the same values for a selected subset of columns as a row appearing earlier in the data frame.

df_2 = df %>% 
    distinct(col_1, col_2, .keep_all = TRUE)

Here is how this works:

  • We use distinct() to remove duplicate rows from a data frame.
  • In order to limit the definition of a duplicate to a subset of columns, we pass those columns to distinct() which we do here via col_1 and col_2.
  • We set the argument .keep_all of dinstinct() to .keep_all = TRUE so that entire rows are returned, which is typically what we want when dropping duplicate rows. If not used, dinstinct() will return only the specified columns and drop the rest of the columns.

Extension: Keep Last

When dropping duplicates, it’s common to keep the first occurrence and drop the rest. In this example, we show how to keep the last occurrence and drop the rest. This is often needed in practice.

df_2 = df %>%  
    slice(n():1) %>% 
    distinct(col_1, col_2, .keep_all = TRUE) %>%
    slice(n():1)

Here is how this works:

  • By default, distinct() keeps the first occurrence of a row and drops any following duplicate rows.
  • In order to keep the last occurrence rather than the first, we do this:
    • We flip the data frame reversing the order of rows
    • We then apply distinct() as described in the primary solution above
    • Finally, we reverse the results, so they are in the original order.
  • We use slice(n():1) to reverse the order of rows of a data frame.
R
I/O