Filter Groups By Value Rank

For each group in a grouped data frame, we wish to filter rows by the rank of a value of a specific column.

This section is organized as follows:

  • Largest where we cover how to return the row (or rows) where a particular column has its largest value for each group.
  • Smallest where we cover how to return the row (or rows) where a particular column has its smallest value for each group.
  • nth where we cover how to return the row (or rows) where a particular column has its nth largest or smallest value for each group.
  • Largest Values where we cover how to return the rows where a particular column has its largest values, defined by a count or a proportion, for each group.
  • Smallest Values where we cover how to return the rows where a particular column has its smallest values, defined by a count or a proportion, for each group.
  • Range where we cover how to return the rows where a particular column takes a value whose rank falls in a given range of value ranks for each group.
  • List where we cover how to return the rows where a particular column takes a value whose rank falls in a given list of value ranks for each group.

Largest

We wish to return the row (or rows) where a particular column has its largest value for each group.

In this example, we wish to return the row (or rows) of each group where the column col_2 has its max value for the group. The groups are defined by the column col_1.

df_2 = df %>%
  group_by(col_1) %>%
  slice_max(col_2)

Here is how this works:

  • slice_max() when called on a grouped data frame, returns the row(s) of each group where the given column takes its largest value.
  • Since we are interested in the rows where the column col_2 takes its largest value, we use slice_max(col_2).
  • If more than one row in a group takes the largest value of col_2, we would get back more than one row for that group.

Alternatively,

df_2 = df %>%
  group_by(col_1) %>%
  filter(col_2 == max(col_2))

Here is how this works:

  • filter() when called on a grouped data frame, is applied to each group separately.
  • The expression col_2 == max(col_2) is TRUE for the rows of the current group where the value of col_2 is equal to the largest value that col_2 takes for the group.
  • Solutions based on filter() are often easier to remember and are typically more appropriate when we wish to combine multiple filtering conditions.

Smallest

We wish to return the row (or rows) where a particular column has its smallest value for each group.

In this example, we wish to return the row (or rows) of each group where the column col_2 has its smallest value for the group. The groups are defined by the column col_1.

df_2 = df %>%
  group_by(col_1) %>%
  slice_min(col_2)

Here is how this works:

  • slice_min() when called on a grouped data frame, returns the row(s) of each group where the given column takes its smallest value.
  • Since we are interested in the rows where the column col_2 takes its smallest value, we use slice_min(col_2).
  • If more than one row in a group takes the smallest value of col_2, we would get back more than one row for that group.

Alternatively:

df_2 = df %>%
  group_by(col_1) %>%
  filter(col_2 == min(col_2))

Here is how this works:

  • This works similarly to the alternative solution under the "Largest" scenario above.
  • The expression col_2 == min(col_2) is TRUE for the rows of the current group where the value of col_2 is equal to the smallest value that col_2 takes for the group.

nth

We wish to return the row (or rows) where a particular column has its nth largest or smallest value for each group.

Smallest

In this example, we wish to return the row (or rows) of each group where the column col_2 has its third-smallest value for the group. The groups are defined by the column col_1.

df_2 = df %>%
  group_by(col_1) %>%
  filter(dense_rank(col_2) == 3)

Here is how this works:

  • filter() when called on a grouped data frame, is applied to each group separately.
  • dense_rank() returns a ranking of the rows by values of a given column. It is different from regular ranking in that it leaves no gaps between ranks. See Ranking.
  • The expression dense_rank(col_2) == 3 evaluates to TRUE for rows where the rank of the value of col_2 is 3 i.e. at its third-smallest value.

Largest

In this example, we wish to return the row (or rows) of each group where the column col_2 has its third-largest value for the group. The groups are defined by the column col_1.

df_2 = df %>%
  group_by(col_1) %>%
  filter(dense_rank(desc(col_2)) == 3)

Here is how this works:

This works similarly to the “Smallest” scenario above except that we wrap the column to be ranked in desc() so that we get a ranking that starts at the largest value i.e. rank 1 is the largest not the smallest value.

Largest Values

We wish to return the rows where a particular column has its largest values, defined by a count or a proportion, for each group.

Count

We have a data frame that is grouped by one or more columns. For each group, we wish to return the rows for which a particular column takes its n largest values.

In this example, we wish to return the rows from each group where the column col_2 takes its three largest values. The groups are defined by the column col_1.

df_2 = df %>%
  group_by(col_1) %>%
  slice_max(col_2, n = 3)

Here is how this works:

  • slice_max() when called on a grouped data frame, returns the row(s) of each group where the given column takes its largest value(s).
  • In slice_max(col_2, n = 3) we specify that we wish to return rows where the column col_2 takes its three highest values (hence n=3).
  • How about duplicates?
    • If there are no duplicates in the values of the column of interest, slice_max() will return n rows each with a unique value for the column of interest (col_2 in this example).
    • However, if there are duplicates, slice_max() will return a number of rows that is greater than n and the number of unique values returned will likely be smaller than n.
  • What to do in the case of duplicates?
    • If the desired behavior is to get the number of rows closest to n including the duplicates, and we do not care that we get n unique values for the column of interest then the solution above with slice_max() is the way to go. Example input: 2, 2, 2, 3, 4. Example Output: 2, 2,
    • For n = 2.
    • If we wish to drop duplicates and return the first row for each of the largest values of the column of interest we can set the argument with_ties = FALSE when calling slice_max(). We would get n rows each with a unique value for the column of interest just like we would get if there were no duplicates. Example input: 2, 2, 2, 3, 4. Example Output: 2, 3. For n = 2.
    • If we wish to get all the rows where the column of interest takes its n largest unique values then the alternative solution below is the way to go. Example input: 2, 2, 2, 3, 4. Example Output: 2, 2, 2, 3. For n = 2.

Alternatively:

df_2 = df %>%
  group_by(col_1) %>%
  filter(dense_rank(desc(col_2)) <= 3))

Here is how this works:

  • filter() when called on a grouped data frame, is applied to each group separately.
  • dense_rank() returns a ranking of the rows by values of a given column. It is different from regular ranking in that it leaves no gaps between ranks. See Ranking.
  • We wrap the column to be ranked in desc() so that we get a ranking that starts at the largest value i.e. rank 1 is the largest not the smallest value.
  • The expression dense_rank(col_2) <= 3 evaluates to TRUE for rows where the rank of the value of col_2 is any of its three largest values for the group.

Proportion

We have a data frame that is grouped by one or more columns. For each group, we wish to return a proportion of the rows for which a particular column takes its largest values.

In this example, we wish to return rows from each group where the column col_2 takes a value within its 20% largest values. The groups are defined by the column col_1.

df_2 = df %>%
  group_by(col_1) %>%
  slice_max(col_2, prop = 0.2)

Here is how this works:

  • We use the prop argument of slice_max() to specify the proportion of rows to return (as opposed to using the argument n to specify the number of rows to return).
  • In slice_max(col_2, prop = 0.2) we specify that we wish to return the 20% of the rows where the column col_2 has its largest values.
  • The discussion on duplicates under “Count” applies here too.

Alternatively:

df_2 = df %>%
  group_by(col_1) %>%
  filter(
    dense_rank(desc(col_2)) <= 0.2 * n_distinct(col_2))

Here is how this works:

  • This works similarity to the alternative scenario under “Count” above.
  • n_distinct(col_2) returns the number of unique values that the column col_2 takes. See Uniqueness.
  • 0.2 * n_distinct(col_2) tell us how many values is 20% of the total number of possible values of the column col_2.
  • dense_rank(desc(col_2)) <= 0.2 * n_distinct(col_2) evaluates to TRUE for rows where the rank of the value of col_2 is within the top 20% ranked values.

Smallest Values

We wish to return the rows where a particular column has its smallest values, defined by a count or a proportion, for each group.

In this example, we wish to return the rows from each group where the column col_2 takes its three smallest values. The groups are defined by the column col_1.

df_2 = df %>%
  group_by(col_1) %>%
  slice_min(col_2, n = 3)

Here is how this works:

  • This works similarly to the “Largest Values” scenario above except that we replace slice_max() with slice_min().
  • We can return a proportion rather than a count by using the prop argument of slice_min() instead of the n argument. See “Proportion” under “Largest Values” above.

Alternatively:

df_2 = df %>%
  group_by(col_1) %>%
  filter(dense_rank(col_2) <= 3))

Here is how this works:

  • This works similarly to the alternative solution for the “Count” scenario under “Largest Values” above except that we use dense_rank(col_2) instead of dense_rank(desc(col_2)) because we are looking for the smallest values so ranking in ascending order is appropriate.
  • We can return a proportion rather than a count. See the alternative solution under “Proportion” under “Largest Values” above.

Range

For each group, we wish to return the rows where a particular column takes a value whose rank falls in a given range of value ranks.

In this example, we wish to return rows that take the second through to the fifth ranked possible values of the column col_2. The groups are defined by the column col_1.

df_2 = df %>%
  group_by(col_1) %>%
  filter(between(dense_rank(desc(col_2)), 2, 5))

Here is how this works:

  • filter() when called on a grouped data frame, is applied to each group separately.
  • dense_rank() returns a ranking of the rows by values of a given column. It is different from regular ranking in that it leaves no gaps between ranks. See Ranking.
  • The expression between(dense_rank(desc(col_2)), 2, 5) evaluates to TRUE for rows where the rank of the value of col_2 is between 2 and 5 (inclusive) for the group.
  • We use desc(col_2) so that we get a ranking that starts at the largest value i.e. rank 1 is the largest not the smallest value.

List

For each group, we wish to return the rows where a particular column takes a value whose rank is one of a set of value ranks.

In this example, for each group, we wish to return rows where the column col_2 takes values that are ranked first, third or fifth among the possible values of the column col_2. The groups are defined by the column col_1.

df_2 = df %>%
  group_by(col_1) %>%
  filter(dense_rank(desc(col_2)) %in% c(1, 3, 5))

Here is how this works:

  • This works similarly to the “Range” scenario above.
  • The expression dense_rank(desc(col_2)) %in% c(1, 3, 5) evaluates to TRUE for rows where the rank of the value of col_2 is between 2 and 5 (inclusive) for the group.
R
I/O