On this page, we look at how to inspect duplication in a table. We will cover three scenarios:
For each scenario, we will look at two variants:
See Uniqueness for a more detailed coverage of detecting and dealing with duplication in data.
We wish to know how many rows involve duplication.
Entire Rows
We wish to get the number of rows that are a duplicate of another row in a table.
SELECT (SELECT COUNT(1) row_count
FROM refcon.dataset.table_1) -
(SELECT COUNT(1) unique_row_count
FROM (SELECT DISTINCT *
FROM refcon.dataset.table_1));
Here is how this works:
row_count
.DISTINCT *
returns unique rows which we then count to get unique_row_count
.Selected Columns
We wish to get the number of rows where a subset of columns (one or more columns) take the same values as in another row in the table.
WITH duplicate_rows AS
(SELECT col_1, col_2, col_3, COUNT(1) occurrence
FROM refcon.dataset.table_1
GROUP BY col_1, col_2, col_3
HAVING COUNT(1) > 1)
SELECT COUNT(1) unique_rows, SUM(occurrence) total_rows
FROM duplicate_rows
Here is how this works:
col_1, col_2, col_3
in this example.HAVING COUNT(1) > 1
will only keep rows where the values of the 3 columns repeat in at least two
rows.COUNT(1)
gives the count of unique rows with duplicated values, and total_rows
is the number
of rows that are
duplicates of at least one other row (relative to col_1, col_2, col_3
).We wish to know what proportion (percent) of rows involves duplication.
Entire Rows
We wish to get the proportion of rows of a table that are a duplicate of another row in the table.
WITH table_rows AS (SELECT COUNT(1) row_count
FROM refcon.dataset.table_1),
unique_rows AS (SELECT COUNT(1) unique_row_count
FROM (SELECT DISTINCT *
FROM refcon.dataset.table_1))
SELECT ((SELECT row_count FROM table_rows) -
(SELECT unique_row_count FROM unique_rows))
/
(SELECT row_count FROM table_rows)
Here is how this works:
Selected Columns
We wish to get the proportion (percent) of rows where a subset of columns (one or more columns) take the same values as in another row of the table.
WITH table_rows AS (SELECT COUNT(1) row_count
FROM refcon.dataset.table_1),
duplicate_rows AS
(SELECT col_1, col_2, col_3
FROM refcon.dataset.table_1
GROUP BY col_1, col_2, col_3
HAVING COUNT(1) > 1)
SELECT (SELECT COUNT(1) FROM duplicate_rows) /
(SELECT row_count FROM table_rows)
Here is how this works:
This works similarly to the "Count: Selected Columns" scenario above, except we calculate proportion.
We wish to get the actual duplicated rows.
Entire Rows
We wish to get rows that appear more than once in a table.
WITH duplicate_rows AS
(SELECT TO_JSON_STRING(t) row
FROM refcon.dataset.table_1 t
QUALIFY ROW_NUMBER() OVER
(PARTITION BY TO_JSON_STRING(t)) > 1)
SELECT *
FROM refcon.dataset.table_1 t
WHERE TO_JSON_STRING(t) IN (SELECT row FROM duplicate_rows);
Here is how this works:
ROW_NUMBER()
returns the index of the current row starting at 1 for the topmost row.PARTITION BY TO_JSON_STRING(t)
to calculate the index per row. TO_JSON_STRING
will
convert the full row
into a single json.QUALIFY
clause to filter based on the output of ROW_NUMBER()
.ROW_NUMBER() > 1
is TRUE
any row that is a duplicate of another row.SELECT DISTINCT *
if we wish to return the unique rows only.Selected Columns
We wish to get cases where a row has column values equal to those of another row for a selected subset of columns.
WITH duplicate_rows AS
(SELECT STRUCT (col_1,col_2,col_3) duplicate_key
FROM (SELECT col_1, col_2, col_3
FROM refcon.dataset.table_1
GROUP BY col_1, col_2, col_3
HAVING COUNT(1) > 1))
SELECT *
FROM refcon.dataset.table_1
WHERE STRUCT (col_1,col_2,col_3) IN
(SELECT duplicate_key FROM duplicate_rows);
Here is how this works:
col_1, col_2, col_3
.STRUCT
.IN
operator.SELECT DISTINCT *
if we wish to return the unique rows only.