On this page, we look at how to inspect duplication in a data frame. We will cover three scenarios:

- We wish to know the number of rows where duplication occurs.
- We wish to know the proportion (percent) of rows where duplication occurs.
- We wish to get the actual duplicates (to inspect or further process).

For each scenario, we will look at two variants:

- Cases where a row is an exact replica of another row i.e. all column values are the same.
- Cases where a row has column values equal to those of another row for some selected columns.

See Uniqueness for a more detailed coverage of detecting and dealing with duplication in data.

We wish to know how many rows involve duplication.

**Entire Rows**

We wish to get the number of rows that are a duplicate of another row in a data frame.

```
df %>% duplicated() %>% sum()
```

Here is how this works:

- We pass a data frame
`df`

to the function`duplicated()`

from base R. `duplicated()`

returns a boolean vector that has as many values as there are rows in the data frame and where it takes a value of`TRUE`

if a row is duplicated and`FALSE`

otherwise.- We pass the boolean vector returned by
`duplicated()`

to`sum()`

. Summing a boolean vector is basically counting the number of`TRUE`

values (which in this case is the number of duplicate rows) because`sum()`

regards`TRUE`

as`1`

and`FALSE`

as`0`

.

**Selected Columns**

We wish to get the number of rows where a subset of columns (one or more columns) take the same values as in another row in the data frame.

```
df %>% select(col_1, col_2, col_3) %>% duplicated() %>% sum()
```

Here is how this works:

- We start off by applying
`select()`

to extract a subset of the original data frame df that has only the columns of interest (in this example`col_1`

,`col_2`

, and`col_3`

). - We then use
`duplicated()`

and`sum()`

as described above.

We wish to know what proportion (percent) of rows involves duplication.

**Entire Rows**

We wish to get the proportion of rows of a data frame that are a duplicate of another row in the data frame.

```
df %>% duplicated() %>% mean()
```

Here is how this works:

- This works similarly to the above except that we use
`mean()`

instead of`sum()`

. `mean()`

regards`TRUE`

as`1`

and`FALSE`

as`0`

. Therefore, taking the mean of a boolean vector is basically taking the ratio between the number of`TRUE`

values (which in this case is the number of duplicate rows) and the total number of rows in the data frame`df`

.

**Selected Columns**

We wish to get the proportion (percent) of rows where a subset of columns (one or more columns) take the same values as in another row of the data frame.

```
df %>% select(col_1, col_2, col_3) %>% duplicated() %>% mean()
```

Here is how this works:

- We start off by applying
`select()`

to extract a subset of the original data frame df that has only the columns of interest (in this example`col_1`

,`col_2`

, and`col_3`

). - We then use
`duplicated()`

and`mean()`

as described above.

We wish to get the actual duplicated rows.

**Entire Rows**

We wish to get rows that appear more than once in a data frame.

```
library(janitor)
df %>% get_dupes()
```

Here is how this works:

- We used the
`get_dupes()`

function from the`janitor`

package. Given a data frame (here`df`

),`get_dupes()`

returns rows that show up more than once. - We could also use
`df %>% filter(duplicated(df))`

. We prefer the`get_dupes()`

function because it returns an additional helpful column`dupe_count`

that tells us how many duplicates there are.

**Selected Columns**

We wish to get cases where a row has column values equal to those of another row for a selected subset of columns.

```
df %>% get_dupes(col_2, col_2, col_3)
```

Here is how this works:

- We use the
`get_dupes()`

function like described above. - The
`get_dupes()`

function from the`janitor`

package can take a set of columns (here`col_1`

,`col_2`

and`col_3`

) and return rows where those columns take the same values.

R