Unique Values

We wish to obtain or learn about the unique values in data.

In particular, this section covers the following:

  • In Get Unique Values, we cover how to obtain the actual unique values.
  • In Count Unique Value, we cover how to obtain the number of possible unique values.
  • In Occurrence Frequency, we cover how to compute the frequency of occurrence of each unique value.
  • In Occurrence Proportion, we cover how to compute the ratio between the number of occurrences of each unique value to the total number of occurrences.

In each of those, we cover two scenarios:

  1. One Column: The unique values a single column can take.
  2. Multiple Columns: The unique combinations a set of columns can take.

Get Unique Values

We wish to obtain unique values.

One Column

We wish to obtain the unique values in one column of a table.

In this example, we wish to obtain the unique values of the column col_1 of the table table_1.

SELECT DISTINCT col_1
FROM table_1;

Here is how this works:

  • We use the SELECT DISTINCT to obtain all unique values taken by a column of a table.
  • We specify the name of the column whose unique values we wish to obtain after DISTINCT.

Multiple Columns

We wish to obtain the unique combinations of values of a set of columns of a table.

SELECT DISTINCT col_1, col_2
FROM table_1;

Here is how this works:

This works similarly to the solution above, except we pass multiple columns.

Extension: Unique Rows

SELECT DISTINCT *
FROM table_1;

Here is how this works:

We use SELECT DISTINCT * to get unique rows in a table.

Count Unique Values

We wish to obtain the number of possible unique values.

One Column

We wish to obtain the number of possible unique values in a column of a table.

In this example, we have a table table_1 and we wish to obtain the number of unique values of the column col_2 per group, where the groups are specified by the column col_1.

SELECT col_1,
       COUNT(DISTINCT col_2) col_2_n_distinct
FROM table_1
GROUP BY col_1;

Here is how this works:

  • We use GROUP BY clause to obtain a summary for each group. In this example GROUP BY col_1. See Aggregating.
  • We use COUNT(DISTINCT col_2) to compute the number of unique values of the column col_2 for each group.

Multiple Columns

We wish to obtain the number of possible unique combinations of a set of columns of a table.

In this example, we have a table table_1 and we wish to obtain the number of unique combinations of the columns col_2 and col_3 per group, where the groups are specified by the column col_1.

SELECT col_1,
       COUNT(DISTINCT CONCAT(col_2, col_3)) AS col_2_3_n_distinct
FROM table_1
GROUP BY col_1;

Here is how this works:

We can't pass multiple columns to COUNT(DISTINCT ), so we concatenate the values of col_2 and col_3 using CONCAT() function. This will return the number of unique combinations of those columns per group.

Occurrence Frequency

We wish to compute the frequency of occurrence of each unique value.

One Column

We wish to compute the frequency of occurrence of each unique value of a column of a table.

In this example, we wish to obtain the number of occurrences of each unique value of the column col_1.

SELECT col_1,
       COUNT(col_1) n
FROM table_1
GROUP BY col_1
ORDER BY n DESC;

Here is how this works:

  • We use GROUP BY col_1 clause to obtain unique values for col_1. See Aggregating.
  • We use COUNT(col_1) to compute the number of occurrences for each value of the column col_1.
  • Typically, when we look at unique value counts, we wish to see them sorted in descending order of frequency. To do that, we use ORDER BY n DESC.

Extension: Most Frequent Value

We wish to obtain the most frequent value taken by a column of a table.

In this example, we have a table table_1 and we wish to obtain the most frequent value of the column col_2 per group where the groups are defined by the value of the column col_1.

WITH ranked_table AS
         (SELECT col_1,
                 col_2,
                 ROW_NUMBER()
                         OVER (
                             PARTITION BY col_1
                             ORDER BY COUNT(col_2) DESC ) row_rank
          FROM table_1
          GROUP BY col_1, col_2)

SELECT col_1, col_2 AS col_2_mode
FROM ranked_table
WHERE row_rank = 1

Here is how this works:

  • First we GROUP BY col_1, col_2 to get the rank of each value in col_2 based on its frequency. See Aggregating.
  • We do this using ROW_NUMBER() window function where we:
    • PARTITION BY col_1 to get rank per group.
    • ORDER BY COUNT(col_2) DESC to order by frequency. The most frequent value will have 1 as the row number.
  • We then filter ranked_table using WHERE row_rank = 1 to get the most frequent value of col_2 per group.
  • See Ranking for a coverage of computing the ranks of values in a column.

Multiple Columns

We wish to compute the frequency of occurrence of each unique combination of values of a set of columns of a table.

In this example, we wish to obtain the number of occurrences of each unique combination of values of the columns col_1 and col_2.

SELECT col_1,
       col_2,
       COUNT(1) n
FROM table_1
GROUP BY col_1, col_2
ORDER BY n DESC ;

Here is how this works:

  • We use GROUP BY col_1, col_2 clause to obtain unique combinations for col_1 and col_2. See Aggregating.
  • We use COUNT(1) to compute the number of occurrences for each combination.
  • Typically, when we look at unique value counts, we wish to see them sorted in descending order of frequency. To do that, we use ORDER BY n DESC. See Sorting.

Occurrence Proportion

We wish to compute the ratio between the number of occurrences of each unique value to the total number of occurrences.

One Column

We wish to compute the ratio between the number of occurrences of each unique value of a column of a table to the length of the column.

In this example, we wish to obtain the ratio between the number of occurrences of each unique value of the column col_1 of the table table_1 to the length of the column.

SELECT col_1,
       COUNT(1)                              n,
       COUNT(1) / (SUM(COUNT(1)) OVER ()) AS percent
FROM table_1
GROUP BY col_1;

Here is how this works:

  • We use COUNT(1) to obtain the frequency of occurrence of each unique value as described under Occurrence Frequency above.
  • SUM(COUNT(1)) OVER () is a window function that counts the number of rows in the table and ignores the GROUP BY clause.
  • The output is a table with one row for each unique value in col_1. It has three columns: col_1, n, and percent. col_1 holds the unique values, n holds the number of occurrences of the corresponding unique value, and percent holds the proportion of occurrence of the corresponding unique value.

Multiple Columns

We wish to compute the ratio between the number of occurrences of each unique combination of values of a set of columns of a table to the length of the table.

In this example, we wish to obtain the ratio between the number of occurrences of each unique combination of the values of the columns col_1 and col_2 of the table table_1 to the length of the table.

SELECT col_1,
       col_2,
       COUNT(1) / (SUM(COUNT(1)) OVER ()) AS percent
FROM table_1
GROUP BY col_1, col_2;

Here is how this works:

  • This works similarly to the "One Column" solution above except we group by the names of the set of columns whose unique combinations we are interested in.
  • See Inspecting Factor by Factor for how to create a cross-table between two columns.
SQL
I/O