On this page, we look at how to inspect duplication in a data frame. We will cover three scenarios:

- We wish to know the number of rows where duplication occurs.
- We wish to know the proportion (percent) of rows where duplication occurs.
- We wish to get the actual duplicates (to inspect or further process).

For each scenario, we will look at two variants:

- Cases where a row is an exact replica of another row, i.e. all column values are the same.
- Cases where a row has column values equal to those of another row for some selected columns.

See Uniqueness for more detailed coverage of detecting and dealing with duplication in data.

We wish to know how many rows involve duplication.

**Entire Rows**

We wish to get the number of rows that are a duplicate of another row in a data frame.

```python
df.duplicated().sum()
```

Here is how this works:

- The `duplicated()` method of Pandas data frames returns a boolean vector (a Series) where an element is `True` if the corresponding row is a duplicate of a row that occurred earlier.
- We then apply the `sum()` method to the Series returned by `duplicated()` to count the number of `True` elements, which is the number of duplicate rows.
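As a quick sketch with a toy data frame (the column names and values here are purely illustrative):

```python
import pandas as pd

# Toy data frame; the last row exactly repeats the first row.
df = pd.DataFrame({
    "name": ["ann", "bob", "ann"],
    "city": ["nyc", "sf", "nyc"],
})

# duplicated() flags the later occurrence as True; sum() counts the flags.
n_duplicates = df.duplicated().sum()
print(n_duplicates)  # 1
```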

**Selected Columns**

We wish to get the number of rows where a subset of columns (one or more columns) take the same values as in another row in the data frame.

```python
df.duplicated(subset=['col_1', 'col_2', 'col_3']).sum()
```

Here is how this works:

- We covered above how `duplicated()` and `sum()` work together to count the number of duplicate rows in a data frame.
- The `duplicated()` method has a `subset` argument to which we can pass a list of the columns we wish to consider for duplicate detection. In this example, we consider only the columns `['col_1', 'col_2', 'col_3']` in our duplicate detection.
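A small sketch of subset-based counting, with illustrative column names:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["ann", "bob", "ann"],
    "city": ["nyc", "sf", "chi"],
    "age":  [34, 51, 34],
})

# Considering only 'name' and 'age', rows 0 and 2 match even though
# their 'city' values differ.
n_duplicates = df.duplicated(subset=["name", "age"]).sum()
print(n_duplicates)  # 1
```

Note that the entire-row check (`df.duplicated().sum()`) would return `0` here, since no row matches another on all three columns.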

We wish to know what proportion (percent) of rows involves duplication.

**Entire Rows**

We wish to get the proportion of rows of a data frame that are a duplicate of another row in the data frame.

```python
df.duplicated().mean()
```

Here is how this works:

- This works similarly to how we obtained the number of duplicated rows above, except that we use `mean()` instead of `sum()` to obtain the proportion (percent) of duplicated rows.
- `mean()` regards `True` as `1` and `False` as `0`. Taking the mean of a boolean vector is therefore taking the ratio between the number of `True` values (which in this case is the number of duplicate rows) and the total number of rows in the data frame `df`.
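A minimal sketch with a toy data frame (names and values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["ann", "bob", "ann", "ann"],
    "city": ["nyc", "sf", "nyc", "nyc"],
})

# Two of the four rows repeat row 0, so the proportion is 2 / 4 = 0.5.
proportion = df.duplicated().mean()
print(proportion)  # 0.5
```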

**Selected Columns**

We wish to get the proportion (percent) of rows where a subset of columns (one or more columns) take the same values as in another row of the data frame.

```python
df.duplicated(subset=['col_1', 'col_2', 'col_3']).mean()
```

Here is how this works:

- `df.duplicated().mean()` works as described above.
- We use the `subset` argument of `duplicated()` to specify the columns to consider for duplicate detection.

We wish to get the actual duplicated rows.

**Entire Rows**

We wish to get rows that appear more than once in a data frame.

```python
df.loc[df.duplicated()]
```

Here is how this works:

- The `duplicated()` method of Pandas data frames returns a boolean vector that has as many values as the number of rows in the data frame (here `df`), where an element is `True` if the row is a duplicate of a row that occurred earlier.
- We use `loc[]` to filter rows by the boolean vector generated by `duplicated()`. See Filtering for more on data frame filtering.
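A quick sketch with a toy data frame (column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["ann", "bob", "ann"],
    "city": ["nyc", "sf", "nyc"],
})

# Keep only the rows flagged as duplicates of an earlier row;
# here that is the third row (index 2), which repeats the first.
dupes = df.loc[df.duplicated()]
print(dupes)
```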

**Selected Columns**

We wish to get cases where a row has column values equal to those of another row for a selected subset of columns.

```python
df.loc[df.duplicated(subset=['col_1', 'col_2', 'col_3'])]
```

Here is how this works:

- The filtering of duplicate rows works similarly as above.
- We use the `subset` argument of `duplicated()` to specify the columns to consider for duplicate detection.
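A small sketch of subset-based filtering (toy data, illustrative names). As an aside, `duplicated()` also accepts a `keep` argument; passing `keep=False` marks every copy, including the first occurrence, which is handy when inspecting duplicates:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["ann", "bob", "ann"],
    "city": ["nyc", "sf", "chi"],
    "age":  [34, 51, 34],
})

# Rows 0 and 2 share 'name' and 'age'; only the later one is returned.
dupes = df.loc[df.duplicated(subset=["name", "age"])]
print(dupes)

# To see every copy (including the first occurrence), pass keep=False.
all_copies = df.loc[df.duplicated(subset=["name", "age"], keep=False)]
print(all_copies)
```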
