We wish to identify and act on duplicate rows in a table.
This section is structured as follows:
This section is complemented by Inspecting Duplicates, where we cover how to:
We wish to identify which rows are duplicates of other rows.
Entire Rows
We wish to identify duplicate rows in a table. A row is considered a duplicate if it has the same values for all columns as a row that appears earlier in the table.
In this example, we wish to add to the table table_1
a new logical column with the
name is_duplicate
that is TRUE
for rows that are duplicates of rows that appeared earlier
and FALSE
otherwise.
WITH ranked_table AS
(SELECT *,
ROW_NUMBER() OVER (PARTITION BY col_1,col_2,col_3) dupe_index
FROM table_1)
SELECT * EXCEPT ( dupe_index),
CASE WHEN dupe_index = 1 THEN FALSE ELSE TRUE END AS is_duplicate
FROM ranked_table;
Here is how this works:
ROW_NUMBER()
window function to rank each row using all columns in the
table using PARTITION BY col_1,col_2,col_3
.is_duplicate
columns using a CASE WHEN
expression, and we exclude the dupe_index
column from the output using EXCEPT()
.
See Exclude Columns.Selected Columns
We wish to identify duplicate rows in a table. A row is considered a duplicate if it has the same values for a selected subset of columns as a row appearing earlier in the table.
WITH ranked_table AS
(SELECT *,
ROW_NUMBER() OVER (PARTITION BY col_1,col_2) dupe_index
FROM table_1)
SELECT * EXCEPT (dupe_index),
CASE WHEN dupe_index = 1 THEN FALSE ELSE TRUE END AS is_duplicate
FROM ranked_table;
Here is how this works:
PARTITION BY
, in this example PARTITION BY col_1,col_2
.Extension: Duplicate Details
For each row, we wish to compute (1) whether it’s a duplicate of a row that showed up earlier, (2) the index of the duplicate e.g. the 3rd time the row shows up, (3) the number of existing duplicates.
In this example, we wish to compute the three duplicate details specified in the previous paragraph.
A row is considered a duplicate if it has the same values for columns col_1
and col_2
as a row
appearing earlier in the table.
WITH ranked_table AS
(SELECT *,
ROW_NUMBER() OVER (PARTITION BY col_1,col_2) dupe_index
FROM table_1)
SELECT *,
CASE WHEN dupe_index = 1 THEN FALSE ELSE TRUE END AS is_duplicate,
MAX(dupe_index) OVER (PARTITION BY col_1,col_2) AS dupe_count
FROM ranked_table;
Here is how this works:
dupe_index
column and calculate the duplicate count.is_duplicate
is a logical column that is TRUE
if the row is a duplicate of a row that showed
up earlier and FALSE
otherwise.dupe_index
tells us the index of the duplicate e.g. the 3rd time the row shows up.dupe_index
tells us how many times each combination of the values of col_1
and col_2
is duplicated.MAX()
as a window function to get the max dupe_index
per group where each group is
defined by col_1
and col_2
.table_1
with three columns
added dupe_count
, dupe_index
, and is_duplicate
.Extension: Check for a 1-to-1 Relationship.
We wish to identify cases where one column does not have a one-to-one relationship with another column.
SELECT col_1,
CASE WHEN COUNT(DISTINCT col_2) = 1 THEN TRUE ELSE FALSE END col_2
FROM table_1
GROUP BY col_1;
Here is how this works:
col_1
.col_2
.col_1
and col_2
then there should be
only 1 unique value for col_2
for each unique value of col_1
.col_1
and the values of col_2
does
not imply a one-to-one relationship between the values of col_2
and the values of col_1
.We wish to drop (remove) duplicate rows in a table.
Entire Rows
We wish to remove duplicate rows in a table. A row is considered a duplicate if it has the same values for all columns as a row that appears earlier in the table.
SELECT DISTINCT *
FROM table_1;
Here is how this works:
SELECT DISTINCT
statement, as the name suggests, selects only unique rows from a table.table_1
.Selected Columns
We wish to remove duplicate rows in a table. A row is considered a duplicate if it has the same values for a selected subset of columns as a row appearing earlier in the table.
SELECT *
FROM table_1
QUALIFY ROW_NUMBER() OVER (PARTITION BY col_1,col_2) = 1;
Here is how this works:
ROW_NUMBER()
window function to calculate the duplicate index for each row.PARTITION BY
which we do here via col_1
and col_2
.QUALIFY
to filter a table based on the results of a window function. In this example we
only keep rows where duplicate index is 1.