Filter Groups By Value Rank

For each group in a column, we wish to filter rows by the rank of a value of a specific column.

This section is organized as follows:

Largest where we cover how to return the row (or rows) where a particular column has its largest value for each group.
Smallest where we cover how to return the row (or rows) where a particular column has its smallest value for each group.
nth where we cover how to return the row (or rows) where a particular column has its nth largest or smallest value for each group.
Largest Values where we cover how to return the rows where a particular column has its largest values, defined by a count or a proportion, for each group.
Smallest Values where we cover how to return the rows where a particular column has its smallest values, defined by a count or a proportion, for each group.
Range where we cover how to return the rows where a particular column takes a value whose rank falls in a given range of value ranks for each group.
List where we cover how to return the rows where a particular column takes a value whose rank falls in a given list of value ranks for each group.

Largest

We wish to return the row (or rows) where a particular column has its largest value for each group.

In this example, we wish to return the row (or rows) of each group where the column col_1 has its max value for the group. The groups are defined by the column col_10.

SELECT *
FROM refcon.examples.dummy_table
    QUALIFY DENSE_RANK() OVER (PARTITION BY col_10 ORDER BY col_1 DESC ) = 1;

Here is how this works:

DENSE_RANK() returns a ranking of the rows by values of a given column. It is different from regular ranking in that it leaves no gaps between ranks. See Ranking.
We PARTITION BY col_10 to calculate the rank per group and sort by col_1 in a descending manner i.e. rank 1 is the largest, not the smallest value.
The expression QUALIFY = 1 evaluates to TRUE for rows where the rank of the value of col_1 is any of its three largest values for the group.

Alternatively:

WITH group_max AS
         (SELECT col_10, MAX(col_1) col_1_max
          FROM refcon.examples.dummy_table
          GROUP BY col_10)
SELECT *
FROM refcon.examples.dummy_table t
         INNER JOIN group_max m
                    ON t.col_10 = m.col_10
                        AND t.col_1 = m.col_1_max;

Here is how this works:

We calculate the MAX(col_1) in a separate CTE by grouping by col_10.
We inner join group_max with dummy_table on both col_10 and col_1_max. This will only keep rows with a value equal to the max of the group.

Smallest

We wish to return the row (or rows) where a particular column has its smallest value for each group.

In this example, we wish to return the row (or rows) of each group where the column col_1 has its smallest value for the group. The groups are defined by the column col_10.

SELECT *
FROM refcon.examples.dummy_table
    QUALIFY DENSE_RANK() OVER (PARTITION BY col_10 ORDER BY col_1 ) = 1

Here is how this works:

This works similarly to the “Largest” scenario above except that we don't order using DESC so we get a ranking that starts at the smallest value i.e. rank 1 is the smallest, not the largest value.

nth

We wish to return the row (or rows) where a particular column has its nth largest or smallest value for each group.

Largest

In this example, we wish to return the row (or rows) of each group where the column col_1 has its third-largest value for the group. The groups are defined by the column col_10.

SELECT *
FROM refcon.examples.dummy_table
    QUALIFY DENSE_RANK() OVER (PARTITION BY col_10 ORDER BY col_1 DESC) = 3

Here is how this works:

This works similarly to the “Largest” scenario above except we filter using QUALIFY =3.

Smallest

In this example, we wish to return the row (or rows) of each group where the column col_1 has its third-smallest value for the group. The groups are defined by the column col_10.

SELECT *
FROM refcon.examples.dummy_table
    QUALIFY DENSE_RANK() OVER (PARTITION BY col_10 ORDER BY col_1 ) = 3

Here is how this works:

This works similarly to the "Largest" scenario above.
We don't add DESC keyword after the ORDER BY clause to order by col_1 in ascending manner. i.e. rank 1 is the smallest, not the largest value.

Largest Values

We wish to return the rows where a particular column has its largest values, defined by a count or a proportion, for each group.

Count

For each group, we wish to return the rows for which a particular column takes its nth largest values.

In this example, we wish to return the rows from each group where the column col_1 takes its three largest values. The groups are defined by the column col_10.

SELECT *
FROM refcon.examples.dummy_table
    QUALIFY DENSE_RANK() OVER (PARTITION BY col_10 ORDER BY col_1 DESC) <= 3

Here is how this works:

This works similarly to the "Largest" scenario above except we keep rows where the column col_1 takes its three highest values (hence <=3).
How about duplicates?
- If there are no duplicates in the values of the column of interest, QUALIFY will return n rows each with a unique value for the column of interest (col_1 in this example).
- However, if there are duplicates, QUALIFY will return a number of rows that is greater than n and the number of unique values returned will likely be smaller than n.
What to do in the case of duplicates?
- If we wish to get all the rows where the column of interest takes its n largest unique values then the solution above is the way to go. Example input: 2, 2, 2, 3, 4. Example Output: 2, 2, 3, 3. For n = 2
- If the desired behavior is to get the number of rows closest to n including the duplicates, and we do not care that we get n unique values for the column of interest then we use ROW_NUMBER instead of DENSE_RANK. Example input: 2, 2, 2, 3, 4. Example Output: 2, 2. For n = 2.
- If we wish to drop duplicates and return the first row for each of the largest values of the column of interest we use the alternative solution below. We would get n rows each with a unique value for the column of interest just like we would get if there were no duplicates. Example input: 2, 2, 2, 3, 4. Example Output: 2, 3. For n = 2.

Alternatively:

WITH ranked_table AS
         (SELECT *,
                 DENSE_RANK() OVER (PARTITION BY col_10 ORDER BY col_1 DESC) row_rank,
          FROM refcon.examples.dummy_table)

SELECT t.* EXCEPT (row_rank)
FROM ranked_table t
WHERE row_rank <= 3
    QUALIFY ROW_NUMBER() OVER (PARTITION BY col_10,row_rank ORDER BY row_rank) = 1;

Here is how this works:

First, we calculate DENSE_RANK similar to the scenarios above.
The expression row_rank <= 3 evaluates to TRUE for rows where the rank of the value of col_1 is any of its three largest values for the group.
We only keep the first value for each of the largest values by using QUALIFY on the ranked table partition by col_10 and row_rank. Only the first largest value per row_rank will evaluate to TRUE.

Proportion

For each group, we wish to return a proportion of the rows for which a particular column takes its largest values.

In this example, we wish to return rows from each group where the column col_1 takes a value within its 20% largest values. The groups are defined by the column col_10.

SELECT *
FROM refcon.examples.dummy_table
    QUALIFY DENSE_RANK() OVER (PARTITION BY col_10 ORDER BY col_1 DESC)
                < COUNT(DISTINCT col_1) OVER (PARTITION BY col_10) * 0.2;

Here is how this works:

This works similarly to the "Count" scenario above.
COUNT(DISTINCT col_1) OVER (PARTITION BY col_10) returns the number of unique values that the column col_1 takes per group. See Uniqueness.
0.2 * COUNT(DISTINCT col_1) tells us how many values are 20% of the total number of possible values of the column col_1.
The expression after QUALIFY evaluates to TRUE for rows where the rank of the value of col_1 is within the top 20% ranked values.

Smallest Values

We wish to return the rows where a particular column has its smallest values, defined by a count or a proportion, for each group.

In this example, we wish to return the rows from each group where the column col_1 takes its three smallest values. The groups are defined by the column col_10.

SELECT *
FROM refcon.examples.dummy_table
    QUALIFY DENSE_RANK() OVER (PARTITION BY col_10 ORDER BY col_1) <= 3

Here is how this works:

This works similarly to the “Largest Values” scenario above except that we replace and remove DESC keyword.
We can return a proportion rather than a count. See the alternative solution under “Proportion” under “Largest Values” above.

Range

For each group, we wish to return the rows where a particular column takes a value whose rank falls in a given range of value ranks.

In this example, we wish to return rows that take the second through to the fifth ranked possible values of the column col_1. The groups are defined by the column col_10

SELECT *
FROM refcon.examples.dummy_table
    QUALIFY DENSE_RANK() OVER (PARTITION BY col_10 ORDER BY col_1 DESC)
        BETWEEN 2 AND 5

Here is how this works:

This works similarly to the “Largest” scenario above.
The expression BETWEEN 2 AND 5 evaluates to TRUE for rows where the rank of the value of col_1 is between 2 and 5 (inclusive) for the group.

List

For each group, we wish to return the rows where a particular column takes a value whose rank is one of a set of value ranks.

In this example, for each group, we wish to return rows where the column col_1 takes values that are ranked first, third or fifth among the possible values of the column col_1. The groups are defined by the column col_10

SELECT *
FROM refcon.examples.dummy_table
    QUALIFY DENSE_RANK() OVER (PARTITION BY col_10 ORDER BY col_1 DESC) IN (1, 3, 5)

Here is how this works:

This works similarly to the “Largest” scenario above.
The expression IN (1, 3, 5) evaluates to TRUE for rows where the rank of the value of col_1 take on the values 1, 3, 5.

Optima.io Reference beta