Duplicates

On this page, we look at how to inspect duplication in a table. We will cover three scenarios:

  1. We wish to know the number of rows where duplication occurs.
  2. We wish to know the proportion (percent) of rows where duplication occurs.
  3. We wish to get the actual duplicates (to inspect or further process).

For each scenario, we will look at two variants:

  1. Cases where a row is an exact replica of another row i.e. all column values are the same.
  2. Cases where a row has column values equal to those of another row for some selected columns.

See Uniqueness for a more detailed coverage of detecting and dealing with duplication in data.

Count

We wish to know how many rows involve duplication.

Entire Rows

We wish to get the number of rows that are a duplicate of another row in a table.

SELECT (SELECT COUNT(1) row_count
        FROM refcon.dataset.table_1) -
       (SELECT COUNT(1) unique_row_count
        FROM (SELECT DISTINCT *
              FROM refcon.dataset.table_1));

Here is how this works:

  • We count all rows in the table to get row_count.
  • DISTINCT * returns unique rows which we then count to get unique_row_count.
  • We subtract the two values to get the number of duplicate rows in the table.

Selected Columns

We wish to get the number of rows where a subset of columns (one or more columns) take the same values as in another row in the table.

WITH duplicate_rows AS
         (SELECT col_1, col_2, col_3, COUNT(1) occurrence
          FROM refcon.dataset.table_1
          GROUP BY col_1, col_2, col_3
          HAVING COUNT(1) > 1)
SELECT COUNT(1) unique_rows, SUM(occurrence) total_rows
FROM duplicate_rows

Here is how this works:

  • We group by the columns we are interested in which are col_1, col_2, col_3 in this example.
  • HAVING COUNT(1) > 1 will only keep rows where the values of the 3 columns repeat in at least two rows.
  • COUNT(1) gives the count of unique rows with duplicated values, and total_rows is the number of rows that are duplicates of at least one other row (relative to col_1, col_2, col_3).

Proportion

We wish to know what proportion (percent) of rows involves duplication.

Entire Rows

We wish to get the proportion of rows of a table that are a duplicate of another row in the table.

WITH table_rows AS (SELECT COUNT(1) row_count
                    FROM refcon.dataset.table_1),
     unique_rows AS (SELECT COUNT(1) unique_row_count
                     FROM (SELECT DISTINCT *
                           FROM refcon.dataset.table_1))
SELECT ((SELECT row_count FROM table_rows) -
        (SELECT unique_row_count FROM unique_rows))
           /
       (SELECT row_count FROM table_rows)

Here is how this works:

  • This works similarly to the "Entire Rows" scenario above except now we divide by the total row count.

Selected Columns

We wish to get the proportion (percent) of rows where a subset of columns (one or more columns) take the same values as in another row of the table.

WITH table_rows AS (SELECT COUNT(1) row_count
                    FROM refcon.dataset.table_1),
     duplicate_rows AS
         (SELECT col_1, col_2, col_3
          FROM refcon.dataset.table_1
          GROUP BY col_1, col_2, col_3
          HAVING COUNT(1) > 1)
SELECT (SELECT COUNT(1) FROM duplicate_rows) /
       (SELECT row_count FROM table_rows)

Here is how this works:

This works similarly to the "Count: Selected Columns" scenario above, except we calculate proportion.

Get Rows

We wish to get the actual duplicated rows.

Entire Rows

We wish to get rows that appear more than once in a table.

WITH duplicate_rows AS
         (SELECT TO_JSON_STRING(t) row
          FROM refcon.dataset.table_1 t
              QUALIFY ROW_NUMBER() OVER
                  (PARTITION BY TO_JSON_STRING(t)) > 1)
SELECT *
FROM refcon.dataset.table_1 t
WHERE TO_JSON_STRING(t) IN (SELECT row FROM duplicate_rows);

Here is how this works:

  • ROW_NUMBER() returns the index of the current row starting at 1 for the topmost row.
  • We use PARTITION BY TO_JSON_STRING(t) to calculate the index per row. TO_JSON_STRING will convert the full row into a single json.
  • We use QUALIFY clause to filter based on the output of ROW_NUMBER().
  • ROW_NUMBER() > 1 is TRUE any row that is a duplicate of another row.
  • We filter the table by converting each row to a json string and checking if it is one of the duplicate rows.
  • We can use SELECT DISTINCT * if we wish to return the unique rows only.

Selected Columns

We wish to get cases where a row has column values equal to those of another row for a selected subset of columns.

WITH duplicate_rows AS
         (SELECT STRUCT (col_1,col_2,col_3) duplicate_key
          FROM (SELECT col_1, col_2, col_3
                FROM refcon.dataset.table_1
                GROUP BY col_1, col_2, col_3
                HAVING COUNT(1) > 1))
SELECT *
FROM refcon.dataset.table_1
WHERE STRUCT (col_1,col_2,col_3) IN
      (SELECT duplicate_key FROM duplicate_rows);

Here is how this works:

  • We get the duplicate rows in duplicate_rows CTE by grouping by col_1, col_2, col_3.
  • We combine the values of the three columns in a STRUCT.
  • We get the rows that have the same value for these columns using IN operator.
  • We can use SELECT DISTINCT * if we wish to return the unique rows only.
SQL
I/O