For each group in a grouped data frame, we wish to filter rows by their position (row number) in the group.
This section is organized to cover the most common "filtering groups by row position" scenarios as follows:
We wish to return the first row from each group.
In this example, we wish to return the first row of each group where the groups are defined by the column col_1
and the rows are sorted by the column col_2
.
df_2 = df %>%
group_by(col_1) %>%
arrange(col_2) %>%
slice(1)
Here is how this works:
slice(1)
when called on a grouped data frame, returns the first row of each group.slice()
is a versatile and powerful function for sub-setting rows by their position. See Inspecting Segments for multiple scenarios of using slice to inspect a data frame.arrange(col_2)
which sorts the rows of each group in ascending order of the values of the column col_2
(see Sorting). Should the existing sorting be suitable for the task at hand, we can omit the call to arrange()
.Alternatively,
df_2 = df %>%
group_by(col_1) %>%
arrange(col_2) %>%
filter(row_number() == 1)
Here is how this works:
row_number()
returns the index of the current row starting at 1 for the top most row.filter()
on a grouped data frame, row_number()
returns the index of the current row for the current group. Therefore, row_number() == 1
is TRUE
for the first row of each group.We wish to return the last row from each group.
In this example, we wish to return the last row of each group where the groups are defined by the column col_1
and the rows are sorted by the column col_2
.
df_2 = df %>%
group_by(col_7) %>%
arrange(col_5) %>%
slice(n())
Here is how this works:
slice(1)
with slice(n())
n()
, when called inside slice()
on a grouped data frame, returns the number of rows in the current group.slice(n())
is equivalent to filter(row_number() == n())
. when called inside filter()
on a grouped data frame, row_number() == n()
is TRUE
for the last row of each group.We wish to return the nth row from each group.
From Top
In this example, we wish to return the second row of each group where the groups are defined by the column col_1
and the rows are sorted by the column col_2
.
df_2 = df %>%
group_by(col_1) %>%
arrange(col_2) %>%
drop_na(col_2) %>%
slice(2)
Here is how this works:
slice(1)
with slice(2)
(or whatever the row position of the row we are interested in is).slice(2)
is equivalent to filter(row_number() == 2)
.drop_na(col_2)
to exclude rows where the sorting column col_2
is NA
. Should this not be a requirement, this expression can be omitted.From Bottom
In this example, we wish to return the row before the last of each group where the groups are defined by the column col_1
and the rows are sorted by the column col_2
.
df_2 = df %>%
group_by(col_1) %>%
slice(n()-1)
Here is how this works:
slice()
.n()
which returns the position of the last row of the group. We can subtract from n()
to refer to rows relative to the bottom of the group. In this code above, n() - 1
to refer to the second last row of the group.arrange()
and drop_na()
function calls, but they can, of course, be added if needed.We wish to return the top n rows from each group (which are often referred to as the head).
Count
We wish to return a specific number of rows from the top of each group.
In this example, we wish to return the top two rows from each group where the groups are defined by the column col_1
and the rows are sorted by the column col_2
.
df_2 = df %>%
group_by(col_7) %>%
arrange(col_5) %>%
slice_head(n = 2)
Here is how this works:
slice_head()
instead of slice()
.slice_head()
, when called on a grouped data frame, operates on the individual groups to return the top rows from each group.slice_head()
has an argument n
which allows us to specify the number of rows to return from the top. In this example, we set n=2
to return the top two rows from each group.Proportion
We wish to return a proportion (percent) of the total number of rows from each group taken from the top of the group.
In this example, we wish to return the top 20% of the rows of each group where the groups are defined by the column col_1
and the rows are sorted by the column col_2
.
df_2 = df %>%
group_by(col_7) %>%
arrange(col_5) %>%
slice_head(prop=0.2)
Here is how this works:
slice_head()
, when called on a grouped data frame, operates on the individual groups to return the top rows from each group.n
which we used in the “Count” scenario above, slice_head()
has another argument prop
which allows us to specify the proportion of rows to return from the top relative to the total number of rows in each group. In this example, we set prop=0.2
to return the top 20% of the rows from each group.We wish to return the top n rows from each group (which are often referred to as the head).
df_2 = df %>%
group_by(col_7) %>%
arrange(col_5) %>%
slice_tail(n = 2)
Here is how this works:
slice_head()
with slice_tail()
.slice_tail()
too can take a prop
value instead of n
to return a proportion (say 20%) of the rows of each group taken from the bottom.We wish to return a range of rows (also known as a slice), between a given start and end row positions, from each group.
In this example, we wish to return the second through to the second last rows (i.e. drop the first and the last) from each group where the groups are defined by the column col_1
and the rows are sorted by the column col_2
.
df_2 = df %>%
group_by(col_1) %>%
arrange(col_2, .by_group = TRUE) %>%
slice(2:(n()-1))
Here is how this works:
slice()
a range slice(2:9)
instead of a single row position slice(1)
.slice(start:end)
, we get the rows at the start
position and at the end
position as well as all the rows in between..by_group=TRUE
to sort within groups so that the output slices are ordered by group (See Grouped Sorting).We wish to obtain specific rows, given their row numbers, from each group.
In this example, we wish to return the first, second, second last, and last rows of each group of the data frame df
where the groups are defined by the column col_1
and the rows are sorted by the column col_2
.
df_2 = df %>%
group_by(col_1) %>%
arrange(col_2, .by_group = TRUE) %>%
slice(1, 2, n() - 1 , n())
Here is how this works:
slice()
a comma separated list of row positions instead of a range.n()
which returns the position of the last row of the group. We can subtract from n()
to refer to rows relative to the bottom of the group e.g. n() - 1
to refer to the second last row of the group. See “From Bottom” under the “nth” scenario above.arrange(col_2)
. The arrange()
function sorts a data frame in ascending order of the column(s) passed to it (See Sorting)..by_group=TRUE
to sort within groups so that the output slices are ordered by group (See Grouped Sorting).Alternatively:
df_2 = df %>%
group_by(col_1) %>%
arrange(col_2, .by_group = TRUE) %>%
filter(row_number() %in% c(1, 2, n() - 1 , n()))
Here is how this works:
row_number()
returns the index of the current row starting at 1 for the top most row.filter()
on a grouped data frame, row_number()
returns the index of the current row for the current group.%in%
operator to check for membership of each row_number()
in the group in the vector holding the row numbers of interest c(1, 2, n() - 1 , n())
.We wish to return a set of rows taken at random from each group.
Count
We wish to return a specific number of rows taken at random positions from each group.
In this example, we wish to return 10 randomly selected rows from each group where the groups are defined by the column col_1
.
set.seed(1234)
df_2 = df %>%
group_by(col_7) %>%
slice_sample(n=10)
Here is how this works:
slice_sample()
which returns a random set of rows from a data frame.slice_sample()
, when called on a grouped data frame, operates on the individual groups to return a random set of rows from each group.slice_sample(n=10)
returns a random sample of 10 rows since we set n=10
. If n
is not set, slice_sample()
returns 1 row by default.1234
to set.seed()
(we could pass any other number).Proportion
We wish to return a proportion (percent) of the total number of rows from each group taken at random positions.
In this example, we wish to return 20% of the rows of each group taken at random positions where the groups are defined by the column col_1
.
set.seed(1234)
df_2 = df %>%
group_by(col_7) %>%
slice_sample(prop=0.2)
Here is how this works:
prop
argument of slice_sample()
instead of the n
argument.prop=0.2
returns a randomly selected set of 20% of the rows of each group.