In the implicit data aggregation scenarios we covered in Function Specification, the output columns either kept the names of the input columns (when we applied one function) or were given standardized names (when we applied multiple functions). In this section, we cover how to override the default behavior and specify output column names. This is helpful because we often need output column names that are more appropriate for the domain context.
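This section is organized as follows:
- Named Function
- Anonymous Function
- Multiple Functions

and for each scenario we cover two approaches:
- Naming the functions in the list passed to across(), so that the name is used as a suffix, e.g. the sum of col_1 is output as col_1_sum.
- Naming Template: a template such as "{.col}_{.fn}" that is a function of the name of the input column {.col} and the function applied {.fn}.

For a deeper coverage of column naming, see Renaming.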
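As a quick orientation, the following minimal sketch contrasts the default naming with the two approaches above. The toy data frame and the template "total_{.col}" are made up for illustration and are not part of the original examples.

```r
library(dplyr)

# Hypothetical toy data, for illustration only.
df <- tibble(
  col_1 = c("a", "a", "b"),
  col_2 = c(1, 2, 3),
  col_4 = c(10, 20, 30)
)

# Default: applying one function keeps the input column names.
default_names <- df %>%
  group_by(col_1) %>%
  summarise(across(c(col_2, col_4), sum)) %>%
  names()

# Named list: the list name is appended as a suffix.
suffix_names <- df %>%
  group_by(col_1) %>%
  summarise(across(c(col_2, col_4), list(sum = sum))) %>%
  names()

# Naming template: .names controls the full output name.
template_names <- df %>%
  group_by(col_1) %>%
  summarise(across(c(col_2, col_4), sum, .names = "total_{.col}")) %>%
  names()

print(default_names)   # col_1, col_2, col_4
print(suffix_names)    # col_1, col_2_sum, col_4_sum
print(template_names)  # col_1, total_col_2, total_col_4
```

The grouping column col_1 always keeps its name; only the columns selected in across() are affected.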
Named Function
We wish to specify the names of the columns resulting from implicitly applying one function to a set of columns, instead of the default behavior of using the names of the original columns.
In this example, we wish to compute the sum of the values of the columns col_2 and col_4 for each group and to name the output columns col_2_sum and col_4_sum. We are summarizing a data frame df grouped by the values of the column col_1.
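df_2 = df %>%
  group_by(col_1) %>%
  # summarise() (rather than mutate()) returns one row per group
  summarise(across(
    c(col_2, col_4),
    # the list name "sum" becomes the suffix: col_2_sum, col_4_sum
    list(sum = sum)))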
Here is how this works:
- across() allows us to pass a list of named functions or anonymous functions, where the names are used as a suffix to the input column name (with an underscore _ in between).
- We pass to across() a list containing one named function, list(sum = sum).

Extension: Naming Template
We wish to name the output columns by applying a template that is a function of the names of the input columns.
In this example, we wish to compute the sum of the values of the columns col_2 and col_4 for each group and to name the output columns total_col_2_v2 and total_col_4_v2. We are summarizing a data frame df grouped by the values of the column col_1.
df_2 = df %>%
  group_by(col_1) %>%
  summarise(across(
    c(col_2, col_4),
    sum,
    # {.col} is replaced by the input column name
    .names = "total_{.col}_v2"))
Here is how this works:
- across() accepts a .names argument, which takes a template of the form "{.col}_{.fn}" that we can use to specify how the output columns are named as a function of the names of the input columns {.col} and the functions applied {.fn}.
- In this example, we use the template "total_{.col}_v2" for generating output column names. To generate the name of the output column, {.col} will be replaced by the name of the input column.

Anonymous Function
We wish to specify the names of the columns resulting from implicitly applying an anonymous function to a set of columns, instead of the default behavior of using the same names as the input columns.
In this example, we have a data frame df and we wish to calculate the ratio of missing values to all values in each column for each group, where the groups are defined by the values of the column col_1. We wish to have the output columns be named with the name of the input column followed by “_na_rate”.
df_2 = df %>%
  group_by(col_1) %>%
  summarise(across(
    everything(),
    # share of NA values within the group; the name "na_rate" becomes the suffix
    list(na_rate = ~ sum(is.na(.)) / n())))
Here is how this works:
- across() accepts a list of named functions, and those list keys (names) will be used when naming the output columns. See Named Function above.
- We pass to across() a named list containing one anonymous function whose name we set as na_rate.
- The name of each output column is therefore the name of the input column followed by “_na_rate”.
- The anonymous function ~ sum(is.na(.)) / n() counts the NA values in a column by applying the is.na() function to each value in the column and summing the resulting logical vector, then divides that count by n(), the number of rows in the current group.

Extension: Naming Template
We wish to name the output columns by applying a template that is a function of the names of the input columns and the functions applied.
In this example, we have a data frame df and we wish to calculate the ratio of missing values to all values in each column for each group, where the groups are defined by the values of the column col_1. We wish to have the output columns be named according to the template “v2_<col>_na_rate”, where “<col>” is the input column name.
df_2 = df %>%
  group_by(col_1) %>%
  summarise(across(
    everything(),
    ~ sum(is.na(.)) / n(),
    # {.col} is replaced by the input column name
    .names = "v2_{.col}_na_rate"))
Here is how this works:
- across() accepts a naming template through its .names argument, as described in Named Function above.
- In this example, we use the template "v2_{.col}_na_rate" for generating output column names. To generate the name of the output column, {.col} will be replaced by the name of the input column.

Multiple Functions
We wish to specify the names of the columns resulting from implicitly applying multiple functions to a set of selected columns, instead of the default behavior of naming the new columns with the name of the original column concatenated with _i, where i is the index of the function in the input list of functions, i.e. (1, 2, ..).
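To make the default behavior concrete, here is a minimal sketch that passes an unnamed list of two functions to across(). The toy data frame is made up for illustration and is not part of the original example.

```r
library(dplyr)

# Hypothetical toy data, for illustration only.
df <- tibble(
  col_1 = c("a", "a", "b"),
  col_2 = c(1, NA, 3),
  col_4 = c(10, 20, 20)
)

# The functions in the list are NOT named, so across() falls back to
# using each function's position in the list as the suffix.
unnamed <- df %>%
  group_by(col_1) %>%
  summarise(across(
    c(col_2, col_4),
    list(n_distinct, ~ sum(.x, na.rm = TRUE))))

print(names(unnamed))  # col_1, col_2_1, col_2_2, col_4_1, col_4_2
```

Positional suffixes like these say nothing about which function produced a column, which is the motivation for naming the functions or supplying a template.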
In this example, we have a data frame df and we wish to count the number of unique values and compute the sum of values (ignoring NAs) of the columns col_2 and col_4 for each group, where the groups are defined by the values of the column col_1. We wish to have the output columns be named col_2_vals, col_2_sum, col_4_vals, and col_4_sum respectively.
df_2 = df %>%
  group_by(col_1) %>%
  summarise(across(
    c(col_2, col_4),
    # the list names "vals" and "sum" become the suffixes
    list(vals = n_distinct, sum = ~ sum(., na.rm = TRUE))))
Here is how this works:
- We can pass to across() a list of named functions as described in Named Function above.
- In this example, we pass to across() a list of two named functions. The columns generated by the first will have the suffix “_vals”, i.e. col_2_vals and col_4_vals, and those created by the second will have the suffix “_sum”, i.e. col_2_sum and col_4_sum.
- Had we not named the functions, the output columns would have been named col_2_1, col_2_2, col_4_1, and col_4_2.

Extension: Naming Template
We wish to name the output columns by applying a template that is a function of the names of the input columns and the functions applied.
In this example, we have a data frame df and we wish to count the number of unique values and compute the sum of values (ignoring NAs) of the columns col_2 and col_4 for each group, where the groups are defined by the values of the column col_1. We wish to have the output columns be named v2_col_2_vals, v2_col_2_sum, v2_col_4_vals, and v2_col_4_sum respectively.
df_2 = df %>%
  group_by(col_1) %>%
  summarise(across(
    c(col_2, col_4),
    list(vals = n_distinct, sum = ~ sum(., na.rm = TRUE)),
    # {.col} is the input column name, {.fn} is the name given in the list
    .names = "v2_{.col}_{.fn}"))
Here is how this works:
- across() accepts a naming template through its .names argument, as described in Named Function above.
- In this example, we use the template "v2_{.col}_{.fn}" for generating output column names. To generate the name of the output column, {.col} will be replaced by the name of the input column and {.fn} will be replaced by the name of the function.
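The template approach can be tried end to end with the sketch below. The toy data frame and the second template, "{.fn}_of_{.col}", are assumptions made for illustration. The second template only shows that the two placeholders may appear in any order.

```r
library(dplyr)

# Hypothetical toy data, for illustration only.
df <- tibble(
  col_1 = c("a", "a", "b"),
  col_2 = c(1, NA, 3),
  col_4 = c(10, 20, 20)
)

fns <- list(vals = n_distinct, sum = ~ sum(.x, na.rm = TRUE))

# Template from the example above: prefix, column name, function name.
v2 <- df %>%
  group_by(col_1) %>%
  summarise(across(c(col_2, col_4), fns, .names = "v2_{.col}_{.fn}"))

# Illustrative alternative: function name first, then column name.
alt <- df %>%
  group_by(col_1) %>%
  summarise(across(c(col_2, col_4), fns, .names = "{.fn}_of_{.col}"))

print(names(v2))   # col_1, v2_col_2_vals, v2_col_2_sum, v2_col_4_vals, v2_col_4_sum
print(names(alt))  # col_1, vals_of_col_2, sum_of_col_2, vals_of_col_4, sum_of_col_4
```

Output columns are ordered by input column first and by function second, regardless of how the template arranges the placeholders.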