Filter Groups By Row Position

For each group in a grouped data frame, we wish to filter rows by their position (row number) in the group.

This section is organized to cover the most common "filtering groups by row position" scenarios as follows:

  • First where we cover how to return the first row from each group.
  • Last where we cover how to return the last row from each group.
  • nth where we cover how to return the nth row from each group.
  • Head where we cover how to return the top most rows, defined by a number of rows or a proportion of rows, from each group.
  • Tail where we cover how to return the bottom most rows, defined by a number of rows or a proportion of rows, from each group.
  • Range where we cover how to return a range of rows, defined by a start and an end position, from each group.
  • List where we cover how to return specific rows from each group.
  • Random where we cover how to return some random rows from each group.

First

We wish to return the first row from each group.

In this example, we wish to return the first row of each group where the groups are defined by the column col_1 and the rows are sorted by the column col_2.

df_2 = df %>% 
  group_by(col_1) %>% 
  arrange(col_2) %>% 
  slice(1)

Here is how this works:

  • slice(1) when called on a grouped data frame, returns the first row of each group.
  • slice() is a versatile and powerful function for sub-setting rows by their position. See Inspecting Segments for multiple scenarios of using slice to inspect a data frame.
  • The notion of “first” is relative to whatever the current sorting of the rows is. Typically, we need to sort the rows in a particular way before we pull the first row. In the code above, we do that via arrange(col_2) which sorts the rows of each group in ascending order of the values of the column col_2 (see Sorting). Should the existing sorting be suitable for the task at hand, we can omit the call to arrange().

Alternatively,

df_2 = df %>% 
  group_by(col_1) %>% 
  arrange(col_2) %>% 
  filter(row_number() == 1)

Here is how this works:

  • row_number() returns the index of the current row starting at 1 for the top most row.
  • When used inside filter() on a grouped data frame, row_number() returns the index of the current row for the current group. Therefore, row_number() == 1 is TRUE for the first row of each group.

Last

We wish to return the last row from each group.

In this example, we wish to return the last row of each group where the groups are defined by the column col_1 and the rows are sorted by the column col_2.

df_2 = df %>% 
  group_by(col_7) %>% 
  arrange(col_5) %>% 
  slice(n())

Here is how this works:

  • The code works similarly to the code in “First” scenario above except that we replace slice(1) with slice(n())
  • n(), when called inside slice() on a grouped data frame, returns the number of rows in the current group.
  • slice(n()) is equivalent to filter(row_number() == n()). when called inside filter() on a grouped data frame, row_number() == n() is TRUE for the last row of each group.

nth

We wish to return the nth row from each group.

From Top

In this example, we wish to return the second row of each group where the groups are defined by the column col_1 and the rows are sorted by the column col_2.

df_2 = df %>% 
  group_by(col_1) %>% 
  arrange(col_2) %>% 
  drop_na(col_2) %>%
  slice(2)

Here is how this works:

  • The code works similarly to the code in “First” scenario above except that we replace slice(1) with slice(2) (or whatever the row position of the row we are interested in is).
  • slice(2) is equivalent to filter(row_number() == 2).
  • Often times we wish to exclude rows where the sorting column is NA before extracting the nth column. To do that, we use drop_na(col_2) to exclude rows where the sorting column col_2 is NA . Should this not be a requirement, this expression can be omitted.

From Bottom

In this example, we wish to return the row before the last of each group where the groups are defined by the column col_1 and the rows are sorted by the column col_2.

df_2 = df %>% 
  group_by(col_1) %>% 
  slice(n()-1)

Here is how this works:

  • This code works similarly to the “From Top” scenario above except for how we refer to the row number inside slice().
  • To refer to a row relative to the bottom of each group we use n() which returns the position of the last row of the group. We can subtract from n() to refer to rows relative to the bottom of the group. In this code above, n() - 1 to refer to the second last row of the group.
  • For simplicity, we dropped the arrange() and drop_na() function calls, but they can, of course, be added if needed.

We wish to return the top n rows from each group (which are often referred to as the head).

Count

We wish to return a specific number of rows from the top of each group.

In this example, we wish to return the top two rows from each group where the groups are defined by the column col_1 and the rows are sorted by the column col_2.

df_2 = df %>% 
  group_by(col_7) %>% 
  arrange(col_5) %>% 
  slice_head(n = 2)

Here is how this works:

  • This works similarly to the “First” scenario above except that we use slice_head() instead of slice().
  • slice_head(), when called on a grouped data frame, operates on the individual groups to return the top rows from each group.
  • slice_head() has an argument n which allows us to specify the number of rows to return from the top. In this example, we set n=2 to return the top two rows from each group.

Proportion

We wish to return a proportion (percent) of the total number of rows from each group taken from the top of the group.

In this example, we wish to return the top 20% of the rows of each group where the groups are defined by the column col_1 and the rows are sorted by the column col_2.

df_2 = df %>% 
  group_by(col_7) %>% 
  arrange(col_5) %>% 
  slice_head(prop=0.2)

Here is how this works:

  • slice_head(), when called on a grouped data frame, operates on the individual groups to return the top rows from each group.
  • Besides the argument n which we used in the “Count” scenario above, slice_head() has another argument prop which allows us to specify the proportion of rows to return from the top relative to the total number of rows in each group. In this example, we set prop=0.2 to return the top 20% of the rows from each group.

Tail

We wish to return the top n rows from each group (which are often referred to as the head).

df_2 = df %>% 
  group_by(col_7) %>% 
  arrange(col_5) %>% 
  slice_tail(n = 2)

Here is how this works:

  • The code works similarly to the code in “Head” scenario above except that we replace slice_head() with slice_tail().
  • Like we covered in the “Proportion” subsection of the “Head” section above, slice_tail() too can take a prop value instead of n to return a proportion (say 20%) of the rows of each group taken from the bottom.

Range

We wish to return a range of rows (also known as a slice), between a given start and end row positions, from each group.

In this example, we wish to return the second through to the second last rows (i.e. drop the first and the last) from each group where the groups are defined by the column col_1 and the rows are sorted by the column col_2.

df_2 = df %>% 
  group_by(col_1) %>% 
  arrange(col_2, .by_group = TRUE) %>% 
  slice(2:(n()-1))

Here is how this works:

  • This works similarly to the “First” scenario above except that we pass to slice() a range slice(2:9) instead of a single row position slice(1).
  • When a range is passed to slice(start:end), we get the rows at the start position and at the end position as well as all the rows in between.
  • We set .by_group=TRUE to sort within groups so that the output slices are ordered by group (See Grouped Sorting).

List

We wish to obtain specific rows, given their row numbers, from each group.

In this example, we wish to return the first, second, second last, and last rows of each group of the data frame df where the groups are defined by the column col_1 and the rows are sorted by the column col_2.

df_2 = df %>% 
  group_by(col_1) %>% 
  arrange(col_2, .by_group = TRUE) %>% 
  slice(1, 2, n() - 1 , n())

Here is how this works:

  • This works similarly to the “Range” scenario above except that we pass to slice() a comma separated list of row positions instead of a range.
  • To refer to rows relative to the bottom of each group we use n() which returns the position of the last row of the group. We can subtract from n() to refer to rows relative to the bottom of the group e.g. n() - 1 to refer to the second last row of the group. See “From Bottom” under the “nth” scenario above.
  • Often times when extracting rows by position, we need to first sort the rows in a particular way. We do so here via arrange(col_2). The arrange() function sorts a data frame in ascending order of the column(s) passed to it (See Sorting).
  • We set .by_group=TRUE to sort within groups so that the output slices are ordered by group (See Grouped Sorting).

Alternatively:

df_2 = df %>% 
  group_by(col_1) %>% 
  arrange(col_2, .by_group = TRUE) %>% 
  filter(row_number() %in% c(1, 2, n() - 1 , n()))

Here is how this works:

  • row_number() returns the index of the current row starting at 1 for the top most row.
  • When used inside filter() on a grouped data frame, row_number() returns the index of the current row for the current group.
  • We use the %in% operator to check for membership of each row_number() in the group in the vector holding the row numbers of interest c(1, 2, n() - 1 , n()).

Random

We wish to return a set of rows taken at random from each group.

Count

We wish to return a specific number of rows taken at random positions from each group.

In this example, we wish to return 10 randomly selected rows from each group where the groups are defined by the column col_1.

set.seed(1234)
df_2 = df %>% 
  group_by(col_7) %>% 
  slice_sample(n=10)

Here is how this works:

  • We use the function slice_sample() which returns a random set of rows from a data frame.
  • slice_sample(), when called on a grouped data frame, operates on the individual groups to return a random set of rows from each group.
  • slice_sample(n=10) returns a random sample of 10 rows since we set n=10. If n is not set, slice_sample() returns 1 row by default.
  • It is good practice to always set a seed value before generating random numbers (or using operations that generate random numbers) to ensure reproducibility of results (i.e. that we would get the same results when we run the same code later). We do that here by passing a determined seed of 1234 to set.seed() (we could pass any other number).

Proportion

We wish to return a proportion (percent) of the total number of rows from each group taken at random positions.

In this example, we wish to return 20% of the rows of each group taken at random positions where the groups are defined by the column col_1.

set.seed(1234)
df_2 = df %>% 
  group_by(col_7) %>% 
  slice_sample(prop=0.2)

Here is how this works:

  • This works similarly to the “Count” scenario except that we use the prop argument of slice_sample() instead of the n argument.
  • Setting prop=0.2 returns a randomly selected set of 20% of the rows of each group.
R
I/O