Duplicates

We wish to identify and act on duplicate rows in a table.

This section is structured as follows:

This section is complemented by Inspecting Duplicates, where we cover how to:

  • Obtain the duplicate rows.
  • Count duplicate rows.
  • Calculate the proportion of duplicate rows.

Identify Duplicates

We wish to identify which rows are duplicates of other rows.

Entire Rows

We wish to identify duplicate rows in a table. A row is considered a duplicate if it has the same values for all columns as a row that appears earlier in the table.

In this example, we wish to add to the table table_1 a new logical column with the name is_duplicate that is TRUE for rows that are duplicates of rows that appeared earlier and FALSE otherwise.

WITH ranked_table AS
         (SELECT *,
                 ROW_NUMBER() OVER (PARTITION BY col_1,col_2,col_3) dupe_index
          FROM table_1)
SELECT * EXCEPT ( dupe_index),
       CASE WHEN dupe_index = 1 THEN FALSE ELSE TRUE END AS is_duplicate
FROM ranked_table;

Here is how this works:

  • We use the ROW_NUMBER() window function to rank each row using all columns in the table using PARTITION BY col_1,col_2,col_3.
  • We create a is_duplicate columns using a CASE WHEN expression, and we exclude the dupe_index column from the output using EXCEPT(). See Exclude Columns.

Selected Columns

We wish to identify duplicate rows in a table. A row is considered a duplicate if it has the same values for a selected subset of columns as a row appearing earlier in the table.

WITH ranked_table AS
         (SELECT *,
                 ROW_NUMBER() OVER (PARTITION BY col_1,col_2) dupe_index
          FROM table_1)
SELECT * EXCEPT (dupe_index),
       CASE WHEN dupe_index = 1 THEN FALSE ELSE TRUE END AS is_duplicate
FROM ranked_table;

Here is how this works:

  • This code works similarly to the “Entire Rows” scenario above except that we only use the subset of columns that define a unique row in PARTITION BY, in this example PARTITION BY col_1,col_2.

Extension: Duplicate Details

For each row, we wish to compute (1) whether it’s a duplicate of a row that showed up earlier, (2) the index of the duplicate e.g. the 3rd time the row shows up, (3) the number of existing duplicates.

In this example, we wish to compute the three duplicate details specified in the previous paragraph. A row is considered a duplicate if it has the same values for columns col_1 and col_2 as a row appearing earlier in the table.

WITH ranked_table AS
         (SELECT *,
                 ROW_NUMBER() OVER (PARTITION BY col_1,col_2) dupe_index
          FROM table_1)
SELECT *,
       CASE WHEN dupe_index = 1 THEN FALSE ELSE TRUE END AS is_duplicate,
       MAX(dupe_index) OVER (PARTITION BY col_1,col_2)   AS dupe_count
FROM ranked_table;

Here is how this works:

  • This code works similarly to the “Selected Columns” scenario above except that we keep dupe_index column and calculate the duplicate count.
  • is_duplicate is a logical column that is TRUE if the row is a duplicate of a row that showed up earlier and FALSE otherwise.
  • dupe_index tells us the index of the duplicate e.g. the 3rd time the row shows up.
  • The maximum dupe_index tells us how many times each combination of the values of col_1 and col_2 is duplicated.
  • We use MAX() as a window function to get the max dupe_index per group where each group is defined by col_1 and col_2.
  • The output table is a copy of the input table table_1 with three columns added dupe_count, dupe_index, and is_duplicate.

Extension: Check for a 1-to-1 Relationship.

We wish to identify cases where one column does not have a one-to-one relationship with another column.

SELECT col_1,
       CASE WHEN COUNT(DISTINCT col_2) = 1 THEN TRUE ELSE FALSE END col_2
FROM table_1
GROUP BY col_1;

Here is how this works:

  • We group by the first column of the two which here is col_1.
  • We then count the number of unique values the second column takes which here is col_2.
  • Intuitively if there is a one-to-one relationship between col_1 and col_2 then there should be only 1 unique value for col_2 for each unique value of col_1.
  • Note that a one-to-one relationship between the values of col_1 and the values of col_2 does not imply a one-to-one relationship between the values of col_2 and the values of col_1.

Drop Duplicates

We wish to drop (remove) duplicate rows in a table.

Entire Rows

We wish to remove duplicate rows in a table. A row is considered a duplicate if it has the same values for all columns as a row that appears earlier in the table.

SELECT DISTINCT  *
FROM table_1;

Here is how this works:

  • The SELECT DISTINCT statement, as the name suggests, selects only unique rows from a table.
  • The result table contains the unique rows of the original table table_1.

Selected Columns

We wish to remove duplicate rows in a table. A row is considered a duplicate if it has the same values for a selected subset of columns as a row appearing earlier in the table.

SELECT *
FROM table_1
QUALIFY ROW_NUMBER() OVER (PARTITION BY col_1,col_2) = 1;

Here is how this works:

  • We use ROW_NUMBER() window function to calculate the duplicate index for each row.
  • In order to limit the definition of a duplicate to a subset of columns, we pass those columns to PARTITION BY which we do here via col_1 and col_2.
  • We use QUALIFY to filter a table based on the results of a window function. In this example we only keep rows where duplicate index is 1.
SQL
I/O