Function Specification

We wish to specify one or more data aggregation functions to apply to each of a selected set of columns without spelling out each data aggregation expression explicitly.

In this section, we cover the following function specification scenarios:

  • Named Function where we cover how to apply a built-in function or a custom function to each of a selected set of columns.
  • Anonymous Function where we cover how to apply an anonymous function, i.e. a one-sided formula defined via ~, to each of a selected set of columns.
  • Multiple Functions where we cover how to apply each of a set of functions separately to each of a selected set of columns.
  • Multiple Function Sets where we cover how to have different sets of one or more functions applied to different sets of one or more columns.

This section is complemented by:

  • Column Selection where we cover how to select the columns to each of which we will apply data aggregation logic.
  • Output Naming where we cover how to specify the name(s) of output column(s) created by the implicit data aggregation operations.

Named Function

We wish to apply the same named data aggregation function, e.g. sum(), to each of a set of selected columns.

In this example, we have a data frame df and we wish to compute the sum of values of the columns col_2 and col_4 for each group, where the groups are defined by the values of the column col_1.

library(dplyr)

df_2 = df %>%
  group_by(col_1) %>%
  summarise(across(c(col_2, col_4), sum))

Here is how this works:

  • The group_by() function groups the rows of the data frame df by the values in the column col_1. This allows the following summarise() function to apply a summary operation to each group of rows. See Summary Table.
  • Inside summarise() we use across() to describe the implicit data aggregation that we wish to carry out as follows:
    • We pass to the first argument of across() the columns we wish to aggregate, which here are col_2 and col_4.
    • We pass to the second argument of across() the data aggregation expression that we wish to apply to each column selected in the first argument. In this example, the data aggregation we wish to apply is the function sum().
  • The output is a new data frame df_2, with one row for each unique value in the col_1 column, and a column for each selected column, i.e. col_2 and col_4, containing the sum of the values in that column for the corresponding group of rows.
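To make the steps above concrete, here is a minimal sketch that runs the same pipeline on a small, made-up data frame. The sample values are an assumption for illustration; only the column names come from the text.

```r
library(dplyr)

# Hypothetical sample data; only the column names follow the example above.
df <- tibble(
  col_1 = c("a", "a", "b"),
  col_2 = c(1, 2, 3),
  col_3 = c("x", "y", "z"),
  col_4 = c(10, 20, 30)
)

# Sum col_2 and col_4 within each group of col_1.
df_2 <- df %>%
  group_by(col_1) %>%
  summarise(across(c(col_2, col_4), sum))

print(df_2)
```

With this data, group "a" gets col_2 = 3 and col_4 = 30, and group "b" gets col_2 = 3 and col_4 = 30. The column col_3 does not appear in the output because it was not selected.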

Extension: Passing Function Arguments

df_2 = df %>%
  group_by(col_1) %>%
  summarise(across(c(col_2, col_4), sum, na.rm = TRUE))

Here is how this works:

  • In order to pass arguments to the named function, we can pass those arguments to across() after the function. across() will then pass those arguments along to the function.
  • In this example, we pass na.rm=TRUE to across() which then passes it along to sum().
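  • Alternatively, we can use an anonymous function to pass arguments to the named function of interest. In this example, the anonymous function solution would be ~sum(., na.rm = TRUE). As of dplyr 1.1.0, passing extra arguments through across() is deprecated, so the anonymous function form is the recommended one. See Anonymous Function below.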
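As an illustration of the anonymous function alternative, the following sketch passes na.rm = TRUE to sum() on made-up data that contains missing values. The sample values are hypothetical.

```r
library(dplyr)

# Hypothetical sample data with missing values.
df <- tibble(
  col_1 = c("a", "a", "b"),
  col_2 = c(1, NA, 3),
  col_4 = c(10, 20, NA)
)

# The one-sided formula wraps sum() so that na.rm = TRUE is set on every call.
df_2 <- df %>%
  group_by(col_1) %>%
  summarise(across(c(col_2, col_4), ~ sum(.x, na.rm = TRUE)))

print(df_2)
```

Note that a group whose values are all missing, such as col_4 in group "b" here, sums to 0 rather than NA when na.rm = TRUE.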

Anonymous Function

We wish to apply the same data aggregating anonymous function to each of a set of selected columns.

In this example, we have a data frame df and we wish to calculate the ratio of missing values to all values in each column for each group, where the groups are defined by the values of the column col_1.

df_2 = df %>%
  group_by(col_1) %>%
  summarise(across(everything(), ~ sum(is.na(.)) / n()))

Here is how this works:

  • The group_by() function groups the rows of the data frame df by the values in the column col_1. This allows the following summarise() function to apply a summary operation to each group of rows. See Summary Table.
  • Inside summarise() we use across() to describe the implicit data aggregation that we wish to carry out as follows:
    • We pass to the first argument of across() the columns we wish to aggregate, which here are all columns, selected via everything(). Inside a grouped summarise(), across() does not select the grouping column col_1.
    • We pass to the second argument of across() the data aggregation expression that we wish to apply to each column selected in the first argument. In this example, the data aggregation we wish to apply is an anonymous function.
  • The anonymous function ~ sum(is.na(.)) / n() calculates the ratio of missing values in a column. It applies is.na() to each value in the column, sums the resulting logical vector to count the NA values, and divides that count by n(), the number of rows in the current group.
  • The output is a new data frame df_2, with one row for each unique value in the col_1 column, and a column for each non-grouping column of df, containing the ratio of missing values in that column for the corresponding group of rows.
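The following sketch applies the missing-value ratio to a small, made-up data frame. The sample values are hypothetical, and the second pipeline using mean(is.na(.x)) is added here only as an equivalent formulation, since the mean of a logical vector is the share of TRUE values.

```r
library(dplyr)

# Hypothetical sample data with missing values in both columns.
df <- tibble(
  col_1 = c("a", "a", "b", "b"),
  col_2 = c(1, NA, 3, 4),
  col_3 = c(NA, NA, "x", "y")
)

# Ratio of missing values per column and group: NA count over group size.
df_2 <- df %>%
  group_by(col_1) %>%
  summarise(across(everything(), ~ sum(is.na(.x)) / n()))

# Equivalent formulation: the mean of a logical vector is the share of TRUEs.
df_3 <- df %>%
  group_by(col_1) %>%
  summarise(across(everything(), ~ mean(is.na(.x))))

print(df_2)
```

With this data, group "a" has a ratio of 0.5 for col_2 and 1 for col_3, while group "b" has 0 for both.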

Multiple Functions

We wish to perform multiple data aggregation operations on each of a set of selected columns of a data frame individually without having to spell out each data aggregation explicitly.

In this example, we have a data frame df and we wish to count the number of unique values and compute the sum of values (ignoring NAs) of the columns col_2 and col_4 for each group, where the groups are defined by the values of the column col_1.

df_2 = df %>%
  group_by(col_1) %>%
  summarise(across(
    c(col_2, col_4),
    list(n_distinct, ~sum(., na.rm = TRUE))))

Here is how this works:

  • The group_by() function groups the rows of the data frame df by the values in the column col_1. This allows the following summarise() function to apply a summary operation to each group of rows. See Summary Table.
  • Inside summarise() we use across() to describe the implicit data aggregation that we wish to carry out as follows:
    • We pass to the first argument of across() the selection of columns we wish to aggregate which is c(col_2, col_4).
    • We pass to the second argument of across() a list of the data aggregation expressions that we wish to apply to each column selected in the first argument.
  • The function n_distinct() returns the number of unique values in the input vector.
  • When passing a list of functions to across(), one way to pass arguments to one of the functions is to use an anonymous function. In this example, the anonymous function ~sum(., na.rm = TRUE) allows us to calculate the sum of values of the input vector while setting na.rm = TRUE to ignore NAs.
  • The output is a new data frame df_2, with one row for each unique value in the column col_1, and two columns for each of the original columns col_2 and col_4 in the original df. The first of these two new columns will contain the number of unique values in each column, and the second column will contain the sum of values (ignoring NAs) in each column for the corresponding group of rows.
  • The default names of the new columns will be col_2_1, col_2_2, col_4_1, and col_4_2. See Output Naming for how to specify the names of the output columns.
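As a quick check of the default naming described above, this sketch runs the same pipeline on made-up data and inspects the resulting column names. The sample values are hypothetical.

```r
library(dplyr)

# Hypothetical sample data; col_2 has one missing value.
df <- tibble(
  col_1 = c("a", "a", "b"),
  col_2 = c(1, 1, NA),
  col_4 = c(5, 6, 7)
)

# An unnamed list of functions: output columns are suffixed by list position.
df_2 <- df %>%
  group_by(col_1) %>%
  summarise(across(
    c(col_2, col_4),
    list(n_distinct, ~ sum(.x, na.rm = TRUE))))

print(names(df_2))
print(df_2)
```

The suffix _1 corresponds to n_distinct() and _2 to the sum. By default n_distinct() counts NA as a distinct value, so col_2_1 is 1 for group "b" even though its only value is missing.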

Alternative: Wrapper Function

nasum <- function(x) {
  sum(x, na.rm = TRUE)
}

df_2 = df %>%
  group_by(col_1) %>%
  summarise(across(
    c(col_2, col_4),
    list(n_distinct, nasum)))

Here is how this works:

  • This code performs the same data aggregation operation as the primary solution above. However, we use a wrapper function instead of an anonymous function to pass parameters to the sum() function.
  • Our wrapper function here is a regular function that calls the function sum() with the desired argument values set; i.e. na.rm = TRUE. We then pass the wrapper function to across() as one of the functions in the list.
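To check that the wrapper function and the anonymous function are interchangeable, the sketch below runs both variants on made-up data and compares the results. The sample values and the names via_wrapper and via_formula are hypothetical.

```r
library(dplyr)

# Hypothetical sample data with missing values.
df <- tibble(
  col_1 = c("a", "a", "b"),
  col_2 = c(1, NA, 3),
  col_4 = c(NA, 5, 6)
)

# Wrapper around sum() that always ignores missing values.
nasum <- function(x) {
  sum(x, na.rm = TRUE)
}

via_wrapper <- df %>%
  group_by(col_1) %>%
  summarise(across(c(col_2, col_4), list(n_distinct, nasum)))

via_formula <- df %>%
  group_by(col_1) %>%
  summarise(across(c(col_2, col_4), list(n_distinct, ~ sum(.x, na.rm = TRUE))))

# Both variants should yield the same columns and values.
stopifnot(isTRUE(all.equal(as.data.frame(via_wrapper), as.data.frame(via_formula))))
```

A named wrapper is mostly a readability choice: it pays off when the same argument settings are reused in several places.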

Multiple Function Sets

We wish to have different sets of one or more functions applied to different sets of one or more columns without having to explicitly spell out each aggregation operation.

In this example, we have a data frame df that we wish to summarise over groups defined by the value of the column col_1. We wish to compute: (1) the number of unique values of the columns col_2 and col_4 in each group, (2) the sum of values of the columns col_3 and col_5 for each group and (3) the number of rows for each group.

df_2 = df %>%
  group_by(col_1) %>%
  summarise(
    across(c(col_2, col_4), n_distinct),
    across(c(col_3, col_5), sum),
    n = n(),
  )

Here is how this works:

  • The group_by() function groups the rows of the data frame df by the values in the column col_1. This allows the following summarise() function to apply a summary operation to each group of rows. See Summary Table.
  • In order to have different sets of functions applied to different sets of columns, we can make multiple calls to across() from within summarise().
  • In this example, we pass to summarise() the specifications of three data aggregation operations:
    • In across(c(col_2, col_4), n_distinct), we specify an implicit data aggregation operation where we apply the function n_distinct() to count the number of unique values of the columns col_2 and col_4.
    • In across(c(col_3, col_5), sum), we specify an implicit data aggregation operation where we apply the function sum() to compute the sum of the values of the columns col_3 and col_5.
    • In n = n(), we specify an explicit data aggregation operation to count the number of rows in each group. We include this to show that explicit data aggregation operations can be included in the same call to summarise().
  • The output is a new data frame df_2, with one row for each unique value in the column col_1, and, in addition to the grouping column col_1, five columns. The first two, col_2 and col_4, will contain the number of unique values in the respective column for the corresponding group of rows. The next two, col_3 and col_5, will contain the sum of the values in the respective column for the corresponding group of rows. The fifth column, n, will contain the total number of rows in each group.
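The following sketch runs the three aggregation specifications together on a small, made-up data frame, so that the order and content of the output columns can be inspected. The sample values are hypothetical.

```r
library(dplyr)

# Hypothetical sample data; only the column names follow the example above.
df <- tibble(
  col_1 = c("a", "a", "b"),
  col_2 = c(1, 1, 2),
  col_3 = c(1, 2, 3),
  col_4 = c("x", "y", "z"),
  col_5 = c(10, 20, 30)
)

df_2 <- df %>%
  group_by(col_1) %>%
  summarise(
    across(c(col_2, col_4), n_distinct),  # distinct counts for col_2, col_4
    across(c(col_3, col_5), sum),         # sums for col_3, col_5
    n = n()                               # number of rows per group
  )

print(df_2)
```

The output columns follow the order of the expressions passed to summarise(): col_1, col_2, col_4, col_3, col_5 and n.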