For each group in a grouped data frame, we wish to filter rows by the rank of a value of a specific column.
This section is organized as follows:
We wish to return the row (or rows) where a particular column has its largest value for each group.
In this example, we wish to return the row (or rows) of each group where the column col_2
has its
max value for the group. The groups are defined by the column col_1
.
df_2 = df %>%
group_by(col_1) %>%
slice_max(col_2)
Here is how this works:
slice_max()
when called on a grouped data frame, returns the row(s) of each group where the
given column takes its largest value.col_2
takes its largest value, we
use slice_max(col_2)
.col_2
, we would get back more than
one row for that group.Alternatively,
df_2 = df %>%
group_by(col_1) %>%
filter(col_2 == max(col_2))
Here is how this works:
filter()
when called on a grouped data frame, is applied to each group separately.col_2 == max(col_2)
is TRUE
for the rows of the current group where the value
of col_2
is equal to the largest value that col_2
takes for the group.filter()
are often easier to remember and are typically more appropriate when
we wish to combine multiple filtering conditions.We wish to return the row (or rows) where a particular column has its smallest value for each group.
In this example, we wish to return the row (or rows) of each group where the column col_2
has its
smallest value for the group. The groups are defined by the column col_1
.
df_2 = df %>%
group_by(col_1) %>%
slice_min(col_2)
Here is how this works:
slice_min()
when called on a grouped data frame, returns the row(s) of each group where the
given column takes its smallest value.col_2
takes its smallest value, we
use slice_min(col_2)
.col_2
, we would get back more than
one row for that group.Alternatively:
df_2 = df %>%
group_by(col_1) %>%
filter(col_2 == min(col_2))
Here is how this works:
col_2 == min(col_2)
is TRUE
for the rows of the current group where the value
of col_2
is equal to the smallest value that col_2
takes for the group.We wish to return the row (or rows) where a particular column has its nth largest or smallest value for each group.
Smallest
In this example, we wish to return the row (or rows) of each group where the column col_2
has its
third-smallest value for the group. The groups are defined by the column col_1
.
df_2 = df %>%
group_by(col_1) %>%
filter(dense_rank(col_2) == 3)
Here is how this works:
filter()
when called on a grouped data frame, is applied to each group separately.dense_rank()
returns a ranking of the rows by values of a given column. It is different from
regular ranking in that it leaves no gaps between ranks.
See Ranking.dense_rank(col_2) == 3
evaluates to TRUE
for rows where the rank of the value
of col_2
is 3 i.e. at its third-smallest value.Largest
In this example, we wish to return the row (or rows) of each group where the column col_2
has its
third-largest value for the group. The groups are defined by the column col_1
.
df_2 = df %>%
group_by(col_1) %>%
filter(dense_rank(desc(col_2)) == 3)
Here is how this works:
This works similarly to the “Smallest” scenario above except that we wrap the column to be ranked
in desc()
so that we get a ranking that starts at the largest value i.e. rank 1 is the largest not
the smallest value.
We wish to return the rows where a particular column has its largest values, defined by a count or a proportion, for each group.
Count
We have a data frame that is grouped by one or more columns. For each group, we wish to return the rows for which a particular column takes its n largest values.
In this example, we wish to return the rows from each group where the column col_2
takes its three
largest values. The groups are defined by the column col_1
.
df_2 = df %>%
group_by(col_1) %>%
slice_max(col_2, n = 3)
Here is how this works:
slice_max()
when called on a grouped data frame, returns the row(s) of each group where the
given column takes its largest value(s).slice_max(col_2, n = 3)
we specify that we wish to return rows where the column col_2
takes
its three highest values (hence n=3
).slice_max()
will
return n
rows each with a unique value for the column of interest (col_2
in this example).slice_max()
will return a number of rows that is greater
than n
and the number of unique values returned will likely be smaller than n
.n
including the duplicates,
and we do not care that we get n
unique values for the column of interest then the solution
above with slice_max()
is the way to go. Example input: 2, 2, 2, 3, 4. Example Output: 2, 2,n = 2
.with_ties = FALSE
when calling slice_max()
. We
would get n
rows each with a unique value for the column of interest just like we would get
if there were no duplicates. Example input: 2, 2, 2, 3, 4. Example Output: 2, 3. For n = 2
.n
largest unique
values then the alternative solution below is the way to go. Example input: 2, 2, 2, 3, 4.
Example Output: 2, 2, 2, 3. For n = 2
.Alternatively:
df_2 = df %>%
group_by(col_1) %>%
filter(dense_rank(desc(col_2)) <= 3))
Here is how this works:
filter()
when called on a grouped data frame, is applied to each group separately.dense_rank()
returns a ranking of the rows by values of a given column. It is different from
regular ranking in that it leaves no gaps between ranks.
See Ranking.desc()
so that we get a ranking that starts at the largest
value i.e. rank 1 is the largest not the smallest value.dense_rank(col_2) <= 3
evaluates to TRUE
for rows where the rank of the value
of col_2
is any of its three largest values for the group.Proportion
We have a data frame that is grouped by one or more columns. For each group, we wish to return a proportion of the rows for which a particular column takes its largest values.
In this example, we wish to return rows from each group where the column col_2
takes a value
within its 20% largest values. The groups are defined by the column col_1
.
df_2 = df %>%
group_by(col_1) %>%
slice_max(col_2, prop = 0.2)
Here is how this works:
prop
argument of slice_max()
to specify the proportion of rows to return (as
opposed to using the argument n
to specify the number of rows to return).slice_max(col_2, prop = 0.2)
we specify that we wish to return the 20% of the rows where the
column col_2
has its largest values.Alternatively:
df_2 = df %>%
group_by(col_1) %>%
filter(
dense_rank(desc(col_2)) <= 0.2 * n_distinct(col_2))
Here is how this works:
n_distinct(col_2)
returns the number of unique values that the column col_2
takes.
See Uniqueness.0.2 * n_distinct(col_2)
tell us how many values is 20% of the total number of possible values of
the column col_2
.dense_rank(desc(col_2)) <= 0.2 * n_distinct(col_2)
evaluates to TRUE
for rows where the rank
of the value of col_2
is within the top 20% ranked values.We wish to return the rows where a particular column has its smallest values, defined by a count or a proportion, for each group.
In this example, we wish to return the rows from each group where the column col_2
takes its three
smallest values. The groups are defined by the column col_1
.
df_2 = df %>%
group_by(col_1) %>%
slice_min(col_2, n = 3)
Here is how this works:
slice_max()
with slice_min()
.prop
argument of slice_min()
instead of the n
argument. See “Proportion” under “Largest Values” above.Alternatively:
df_2 = df %>%
group_by(col_1) %>%
filter(dense_rank(col_2) <= 3))
Here is how this works:
dense_rank(col_2)
instead of dense_rank(desc(col_2))
because we are
looking for the smallest values so ranking in ascending order is appropriate.For each group, we wish to return the rows where a particular column takes a value whose rank falls in a given range of value ranks.
In this example, we wish to return rows that take the second through to the fifth ranked possible
values of the column col_2
. The groups are defined by the column col_1.
df_2 = df %>%
group_by(col_1) %>%
filter(between(dense_rank(desc(col_2)), 2, 5))
Here is how this works:
filter()
when called on a grouped data frame, is applied to each group separately.dense_rank()
returns a ranking of the rows by values of a given column. It is different from
regular ranking in that it leaves no gaps between ranks.
See Ranking.between(dense_rank(desc(col_2)), 2, 5)
evaluates to TRUE
for rows where the
rank of the value of col_2
is between 2 and 5 (inclusive) for the group.desc(col_2)
so that we get a ranking that starts at the largest value i.e. rank 1 is the
largest not the smallest value.For each group, we wish to return the rows where a particular column takes a value whose rank is one of a set of value ranks.
In this example, for each group, we wish to return rows where the column col_2
takes values that
are ranked first, third or fifth among the possible values of the column col_2
. The groups are
defined by the column col_1.
df_2 = df %>%
group_by(col_1) %>%
filter(dense_rank(desc(col_2)) %in% c(1, 3, 5))
Here is how this works:
dense_rank(desc(col_2)) %in% c(1, 3, 5)
evaluates to TRUE
for rows where the
rank of the value of col_2
is between 2 and 5 (inclusive) for the group.