We wish to specify one or more data aggregation functions to apply to each of a selected set of columns without spelling out each data aggregation expression explicitly.
In this section, we cover the following function specification scenarios:
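- Applying a named function, e.g. sum(), to each of a selected set of columns.
- Applying an anonymous function to each of a selected set of columns.
- Applying multiple functions to each of a selected set of columns.
- Applying different functions to different sets of columns.
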
We wish to apply the same named data aggregation function, e.g. sum(), to each of a set of selected columns.

In this example, we have a data frame df and we wish to compute the sum of values of the columns col_2 and col_4 for each group, where the groups are defined by the values of the column col_1.

library(dplyr)

df_2 = df %>%
  group_by(col_1) %>%
  # Sum col_2 and col_4 within each group
  summarise(across(c(col_2, col_4), sum))
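
As a quick check of the snippet above, here is a minimal sketch on a small data frame. The values are hypothetical and chosen only so the group sums are easy to verify by hand.

```r
library(dplyr)

# Hypothetical sample data (not from the original example)
df <- tibble(
  col_1 = c("a", "a", "b"),
  col_2 = c(1, 2, 4),
  col_3 = c("x", "y", "z"),
  col_4 = c(10, 20, 40)
)

# Sum col_2 and col_4 within each group of col_1
df_2 <- df %>%
  group_by(col_1) %>%
  summarise(across(c(col_2, col_4), sum))

print(df_2)
```

With this data, group "a" should give col_2 = 3 and col_4 = 30, and group "b" should give col_2 = 4 and col_4 = 40. The unselected column col_3 does not appear in the output.
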

Here is how this works:
- The group_by() function groups the rows of the data frame df by the values in the column col_1. This allows the following summarise() function to apply a summary operation to each group of rows. See Summary Table.
- In summarise(), we use across() to describe the implicit data aggregation that we wish to carry out as follows:
  - We pass as the first argument to across() the selection of columns we wish to aggregate, which is col_2 and col_4.
  - We pass as the second argument to across() the data aggregation expression that we wish to apply to each column selected in the first argument. In this example, the data aggregation we wish to apply is the function sum().
- The output is a new data frame df_2, with one row for each unique value in the col_1 column, and a column for each selected column, i.e. col_2 and col_4, containing the sum of the values in that column for the corresponding group of rows.

Extension: Passing Function Arguments

df_2 = df %>%
  group_by(col_1) %>%
  summarise(across(c(col_2, col_4), sum, na.rm=TRUE))
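
Note that as of dplyr 1.1.0, passing extra arguments through across() in this way is deprecated and produces a warning. A minimal sketch of the equivalent form, which sets the argument inside an anonymous function, might look like the following. The data frame here is hypothetical.

```r
library(dplyr)

# Hypothetical sample data containing missing values
df <- tibble(
  col_1 = c("a", "a", "b"),
  col_2 = c(1, NA, 4),
  col_4 = c(10, 20, NA)
)

# Set na.rm inside the function instead of passing it through across()
df_2 <- df %>%
  group_by(col_1) %>%
  summarise(across(c(col_2, col_4), function(x) sum(x, na.rm = TRUE)))

print(df_2)
```

Here, group "b" has only a missing value in col_4, so its sum with na.rm = TRUE is 0.
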
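
Here is how this works:
- We can pass arguments to the function being applied by passing them to across() after the function. across() will then pass those arguments along to the function.
- In this example, we pass na.rm=TRUE to across() which then passes it along to sum().
- An alternative is to use an anonymous function, i.e. ~sum(., na.rm = TRUE). See Anonymous Function below.

Anonymous Function

We wish to apply the same data aggregating anonymous function to each of a set of selected columns.
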
In this example, we have a data frame df and we wish to calculate the ratio of missing values to all values in each column for each group, where the groups are defined by the values of the column col_1.

df_2 = df %>%
  group_by(col_1) %>%
  summarise(across(everything(), ~ sum(is.na(.)) / n()))
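
To see the missing-value ratio on concrete numbers, here is a minimal sketch with a hypothetical data frame in which each group has two rows.

```r
library(dplyr)

# Hypothetical sample data: group "a" has missing values, group "b" has none
df <- tibble(
  col_1 = c("a", "a", "b", "b"),
  col_2 = c(1, NA, 3, 4),
  col_3 = c(NA, NA, "x", "y")
)

# Ratio of missing values to all values, per column and per group
df_2 <- df %>%
  group_by(col_1) %>%
  summarise(across(everything(), ~ sum(is.na(.)) / n()))

print(df_2)
```

For group "a" this should give 0.5 for col_2 and 1 for col_3, and 0 for both columns in group "b". Because the mean of a logical vector is the share of TRUE values, ~ mean(is.na(.)) is a shorter way to compute the same ratio.
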

Here is how this works:
- The group_by() function groups the rows of the data frame df by the values in the column col_1. This allows the following summarise() function to apply a summary operation to each group of rows. See Summary Table.
- In summarise(), we use across() to describe the implicit data aggregation that we wish to carry out as follows:
  - We pass as the first argument to across() the selection of columns we wish to aggregate, which is all columns, selected via everything(). Inside a grouped summarise(), this selection does not include the grouping column col_1.
  - We pass as the second argument to across() the data aggregation expression that we wish to apply to each column selected in the first argument. In this example, the data aggregation we wish to apply is an anonymous function.
- The anonymous function ~ sum(is.na(.)) / n() calculates the number of NA values in a column by applying the is.na() function to each value in the column and summing the resulting logical vector. It then divides that count by the number of rows in the group, which n() returns.
- The output is a new data frame df_2, with one row for each unique value in the col_1 column, and a column for each of the other columns of df, containing the ratio of missing values to all values in that column for the corresponding group of rows.

We wish to perform multiple data aggregation operations on each of a set of selected columns of a data frame individually without having to spell out each data aggregation explicitly.

In this example, we have a data frame df and we wish to count the number of unique values and compute the sum of values (ignoring NAs) of the columns col_2 and col_4 for each group, where the groups are defined by the values of the column col_1.

df_2 = df %>%
  group_by(col_1) %>%
  summarise(across(
    c(col_2, col_4),
    list(n_distinct, ~sum(., na.rm = TRUE))))
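
The following minimal sketch runs the same pipeline on a hypothetical data frame so that the shape and default names of the output can be inspected directly.

```r
library(dplyr)

# Hypothetical sample data with one missing value in col_4
df <- tibble(
  col_1 = c("a", "a", "b"),
  col_2 = c(1, 1, 5),
  col_4 = c(10, NA, 30)
)

# Apply two functions to each selected column
df_2 <- df %>%
  group_by(col_1) %>%
  summarise(across(
    c(col_2, col_4),
    list(n_distinct, ~ sum(., na.rm = TRUE))
  ))

# Each input column yields two output columns, numbered by function position
print(names(df_2))
print(df_2)
```

With this data, the sum columns should be 2 and 5 for col_2, and 10 and 30 for col_4, for groups "a" and "b" respectively.
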

Here is how this works:
- The group_by() function groups the rows of the data frame df by the values in the column col_1. This allows the following summarise() function to apply a summary operation to each group of rows. See Summary Table.
- In summarise(), we use across() to describe the implicit data aggregation that we wish to carry out as follows:
  - We pass as the first argument to across() the selection of columns we wish to aggregate, which is c(col_2, col_4).
  - We pass as the second argument to across() a list of the data aggregation expressions that we wish to apply to each column selected in the first argument.
- n_distinct() returns the number of unique values in the input vector.
- When we pass a list of functions to across(), one way to pass arguments to one of the functions is to use an anonymous function. In this example, the anonymous function ~sum(., na.rm = TRUE) allows us to calculate the sum of values of the input vector while setting na.rm = TRUE to ignore NAs.
- The output is a new data frame df_2, with one row for each unique value in the column col_1, and two columns for each of the original columns col_2 and col_4 in the original df. The first of these two new columns will contain the number of unique values in each column, and the second column will contain the sum of values (ignoring NAs) in each column for the corresponding group of rows.
- By default, the output columns are named col_2_1, col_2_2, col_4_1, and col_4_2. See Output Naming for how to specify the names of the output columns.

Alternative: Wrapper Function

nasum <- function(x) {
  sum(x, na.rm = TRUE)
}

df_2 = df %>%
  group_by(col_1) %>%
  summarise(across(
    c(col_2, col_4),
    list(n_distinct, nasum)))

Here is how this works:
- An alternative to using an anonymous function is to create a wrapper around the sum() function.
- We create a custom function nasum() that wraps sum() with the desired argument values set; i.e. na.rm = TRUE. We then call the wrapper function from inside across().

We wish to have different sets of one or more functions applied to different sets of one or more columns without having to explicitly spell out each aggregation operation.

In this example, we have a data frame df that we wish to summarise over groups defined by the value of the column col_1. We wish to compute: (1) the number of unique values of the columns col_2 and col_4 in each group, (2) the sum of values of the columns col_3 and col_5 for each group and (3) the number of rows for each group.

df_2 = df %>%
  group_by(col_1) %>%
  summarise(
    across(c(col_2, col_4), n_distinct),
    across(c(col_3, col_5), sum),
    n = n(),
  )

Here is how this works:
- The group_by() function groups the rows of the data frame df by the values in the column col_1. This allows the following summarise() function to apply a summary operation to each group of rows. See Summary Table.
- We can call across() more than once from within summarise().
- We pass to summarise() the specifications of three data aggregation operations:
  - In across(c(col_2, col_4), n_distinct), we specify an implicit data aggregation operation where we apply the function n_distinct() to count the number of unique values of the columns col_2 and col_4.
  - In across(c(col_3, col_5), sum), we specify an implicit data aggregation operation where we apply the function sum() to compute the sum of values of the columns col_3 and col_5.
  - In n = n(), we specify an explicit data aggregation operation to count the number of rows in each group. We include this to show that explicit data aggregation operations can be included in the same call to summarise().
- The output is a new data frame df_2, with one row for each unique value in the column col_1, and five summary columns alongside col_1. The first two summary columns will contain the number of unique values in the columns col_2 and col_4 respectively for the corresponding group of rows. The next two columns will contain the sum of the values in the columns col_3 and col_5 respectively for the corresponding group of rows. The fifth column will contain the total number of rows in each group.
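
The column layout described above can be checked with a minimal sketch. The data frame below is hypothetical and kept small so the results can be verified by hand.

```r
library(dplyr)

# Hypothetical sample data: two text columns and two numeric columns
df <- tibble(
  col_1 = c("a", "a", "b"),
  col_2 = c("x", "x", "y"),
  col_3 = c(1, 2, 4),
  col_4 = c("p", "q", "r"),
  col_5 = c(10, 20, 40)
)

df_2 <- df %>%
  group_by(col_1) %>%
  summarise(
    across(c(col_2, col_4), n_distinct),  # unique counts for col_2, col_4
    across(c(col_3, col_5), sum),         # sums for col_3, col_5
    n = n()                               # rows per group
  )

print(df_2)
```

The output columns follow the order of the expressions in summarise(), so here they should come out as col_1, col_2, col_4, col_3, col_5 and n.
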