Output Naming

In the implicit data aggregation scenarios we covered in Function Specification, the output columns either kept the names of the input columns (when we applied one function) or were created as multiple new columns with standardized names (when we applied multiple functions). In this section, we cover how to override this default behavior and specify output column names. This is helpful because we often need output column names that are more appropriate for the domain context.

This section is organized as follows:

  • Named Function where we cover how to specify the names of the columns resulting from implicitly applying one aggregating function to a set of columns.
  • Anonymous Function where we cover how to specify the names of the columns resulting from implicitly applying an anonymous aggregating function to a set of columns.
  • Multiple Functions where we cover how to specify the names of the columns resulting from implicitly applying multiple aggregating functions to a set of columns.

and for each scenario we cover two approaches:

  • Suffix Specification where we specify the string to append to the original column name e.g. naming the output column of summing the values of the column col_1 as col_1_sum.
  • Naming Template where we have more flexibility in naming the output columns via a naming template of the form "{.col}_{.fn}" that is a function of the name of the input column {.col} and the name of the function applied {.fn}.

This section is complemented by:

  • Column Selection where we cover how to select the column(s) on each of which we will apply aggregation operations.
  • Function Specification where we cover how to specify the data aggregation expressions to apply to each of the selected columns.

For a deeper coverage of column naming, see Renaming.

Named Function

We wish to specify the names of the columns resulting from implicitly applying one function to a set of columns instead of the default behavior of using the names of the original columns.

In this example, we wish to compute the sum of values of the columns col_2 and col_4 for each group and to name the output columns col_2_sum and col_4_sum. We are summarizing a data frame df grouped by the values of the column col_1.

df_2 = df %>%
    group_by(col_1) %>%
    summarise(across(
        c(col_2, col_4),
        list(sum = sum)))

Here is how this works:

  • across() allows us to pass a list of named functions or anonymous functions where the names would be used as a suffix to the input column name (with an underscore _ between).
  • In this case we pass to across() a list containing one named function list(sum = sum).

Extension: Naming Template

We wish to name the output columns by applying a template that is a function of the names of the input columns.

In this example, we wish to compute the sum of values of the columns col_2 and col_4 for each group and to name the output columns total_col_2_v2 and total_col_4_v2. We are summarizing a data frame df grouped by the values of the column col_1.

df_2 = df %>%
  group_by(col_1) %>%
  summarise(across(
    c(col_2, col_4),
    sum,
    .names = "total_{.col}_v2"))

Here is how this works:

  • across() has a .names argument that takes a template of the form "{.col}_{.fn}", which we can use to specify how the output columns are named as a function of the names of the input columns {.col} and the functions applied {.fn}.
  • We specify the template "total_{.col}_v2" for generating output column names. To generate the name of the output column, {.col} will be replaced by the name of the input column.
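As a quick check of how a {.col}-only template expands, the following sketch applies the template "total_{.col}_v2" to a toy data frame (hypothetical values) and inspects the resulting names.

```r
library(dplyr)

# Toy data (hypothetical values, for illustration only)
df <- tibble(
  col_1 = c("a", "a", "b"),
  col_2 = c(1, 2, 3),
  col_4 = c(10, 20, 30)
)

# {.col} is replaced by each selected column name in turn
df_2 <- df %>%
  group_by(col_1) %>%
  summarise(across(
    c(col_2, col_4),
    sum,
    .names = "total_{.col}_v2"))

names(df_2)  # "col_1" "total_col_2_v2" "total_col_4_v2"
```

Because the template wraps the column name on both sides, it can express names that a plain suffix cannot.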

Anonymous Function

We wish to specify the names of the columns resulting from implicitly applying an anonymous function to a set of columns instead of the default behavior of using the same names as the input columns.

In this example, we have a data frame df and we wish to calculate the ratio of missing values to all values in each column for each group, where the groups are defined by the values of the column col_1. We wish to have the output columns be named with the name of the input column followed by “_na_rate”.

df_2 = df %>% 
    group_by(col_1) %>%
    summarise(across(
        everything(), 
        list(na_rate = ~(sum(is.na(.))/n()))))

Here is how this works:

  • across() accepts a list of named functions and those list keys (names) will be used when naming the output columns. See Named Function above.
  • In this case we pass to across() a named list containing one anonymous function whose name we set as na_rate.
  • The output columns will have the name of the input column followed by “_na_rate”.
  • Had we not passed a list of named functions, the output columns would have had the same names as the input columns.
  • The anonymous function ~ sum(is.na(.)) / n() calculates the ratio of missing values to all values in a column: is.na() flags each missing value as TRUE, sum() counts the flagged values, and n() gives the number of rows in the current group.
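The missing-value rate can be checked on a small example. This is a minimal sketch with hypothetical toy data; it also shows mean(is.na(.x)) as an equivalent way to write the same ratio, which is an alternative formulation rather than the one used above.

```r
library(dplyr)

# Toy data (hypothetical values, for illustration only)
df <- tibble(
  col_1 = c("a", "a", "b", "b"),
  col_2 = c(1, NA, 3, 4),
  col_3 = c(NA, NA, "x", "y")
)

# Count of NAs divided by the group size, suffixed with "_na_rate"
df_2 <- df %>%
  group_by(col_1) %>%
  summarise(across(
    everything(),
    list(na_rate = ~ sum(is.na(.x)) / n())))

# Equivalent formulation: the mean of a logical vector is the share of TRUEs
df_alt <- df %>%
  group_by(col_1) %>%
  summarise(across(
    everything(),
    list(na_rate = ~ mean(is.na(.x)))))

names(df_2)  # "col_1" "col_2_na_rate" "col_3_na_rate"
```

Note that everything() inside across() skips the grouping column col_1, so no col_1_na_rate column is created.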

Extension: Naming Template

We wish to name the output columns by applying a template that is a function of the names of the input columns and the functions applied.

In this example, we have a data frame df and we wish to calculate the ratio of missing values to all values in each column for each group, where the groups are defined by the values of the column col_1. We wish to have the output columns be named according to the template “v2_<col>_na_rate” where “<col>” is the input column name.

df_2 = df %>% 
  group_by(col_1) %>%
  summarise(across(
    everything(), 
    ~sum(is.na(.))/n(),
    .names = "v2_{.col}_na_rate"))

Here is how this works:

  • across() accepts a naming template to its .names argument as described in Named Function above.
  • We specify the template "v2_{.col}_na_rate" for generating output column names. To generate the name of the output column, {.col} will be replaced by the name of the input column.
  • Had we not specified .names, the output columns would have had the same names as the input columns.

Multiple Functions

We wish to specify the names of the columns resulting from implicitly applying multiple functions to a set of selected columns instead of the default behavior of naming the new columns with the name of the original column concatenated with _i where i is the index of the function in the input list of functions, i.e. (1, 2, ..).

In this example, we have a data frame df and we wish to count the number of unique values and compute the sum of values (ignoring NAs) of the columns col_2 and col_4 for each group, where the groups are defined by the values of the column col_1. We wish to have the output columns be named col_2_vals, col_2_sum, col_4_vals, and col_4_sum respectively.

df_2 = df %>%
  group_by(col_1) %>%
  summarise(across(
    c(col_2, col_4),
    list(vals = n_distinct, sum = ~sum(., na.rm=TRUE))))

Here is how this works:

  • To specify the suffix to add to output column names, we can pass to across() a list of named functions as described in Named Function above.
  • In this case we pass to across() a list of two named functions. The columns generated by the first will have the suffix “_vals” i.e. col_2_vals and col_4_vals, and those created by the second will have the suffix “_sum”, i.e. col_2_sum and col_4_sum.
  • Had we not passed a list of named functions, the output column names would have been col_2_1, col_2_2, col_4_1, and col_4_2.
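The difference between an unnamed and a named list of functions can be seen side by side. The sketch below uses hypothetical toy data and only compares the generated column names.

```r
library(dplyr)

# Toy data (hypothetical values, for illustration only)
df <- tibble(
  col_1 = c("a", "a", "b"),
  col_2 = c(1, 1, 3),
  col_4 = c(10, NA, 30)
)

# Unnamed list: the position of each function is used as the suffix
by_position <- df %>%
  group_by(col_1) %>%
  summarise(across(
    c(col_2, col_4),
    list(n_distinct, ~ sum(.x, na.rm = TRUE))))

# Named list: the list names are used as the suffix
by_name <- df %>%
  group_by(col_1) %>%
  summarise(across(
    c(col_2, col_4),
    list(vals = n_distinct, sum = ~ sum(.x, na.rm = TRUE))))

names(by_position)  # "col_1" "col_2_1" "col_2_2" "col_4_1" "col_4_2"
names(by_name)      # "col_1" "col_2_vals" "col_2_sum" "col_4_vals" "col_4_sum"
```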

Extension: Naming Template

We wish to name the output columns by applying a template that is a function of the names of the input columns and the functions applied.

In this example, we have a data frame df and we wish to count the number of unique values and compute the sum of values of the columns col_2 and col_4 for each group, where the groups are defined by the values of the column col_1. We wish to have the output columns be named v2_col_2_vals, v2_col_2_sum, v2_col_4_vals, and v2_col_4_sum respectively.

df_2 = df %>%
  group_by(col_1) %>%
  summarise(across(
    c(col_2, col_4),
    list(vals = n_distinct, sum = ~sum(., na.rm=TRUE)),
    .names = "v2_{.col}_{.fn}"))

Here is how this works:

  • across() accepts a naming template to its .names argument as described in Named Function above.
  • We specify the template "v2_{.col}_{.fn}" for generating output column names. To generate the name of the output column, {.col} will be replaced by the name of the input column and {.fn} will be replaced by the name of the function.
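To illustrate how {.col} and {.fn} combine, the following sketch (hypothetical toy data) applies the template above and then a second, made-up template "{.fn}_of_{.col}" that places the function name first.

```r
library(dplyr)

# Toy data (hypothetical values, for illustration only)
df <- tibble(
  col_1 = c("a", "a", "b"),
  col_2 = c(1, 1, 3),
  col_4 = c(10, NA, 30)
)

fns <- list(vals = n_distinct, sum = ~ sum(.x, na.rm = TRUE))

# Template from the example above: prefix, column name, function name
prefixed <- df %>%
  group_by(col_1) %>%
  summarise(across(c(col_2, col_4), fns, .names = "v2_{.col}_{.fn}"))

# Hypothetical alternative template with the function name first
fn_first <- df %>%
  group_by(col_1) %>%
  summarise(across(c(col_2, col_4), fns, .names = "{.fn}_of_{.col}"))

names(prefixed)  # "col_1" "v2_col_2_vals" "v2_col_2_sum" "v2_col_4_vals" "v2_col_4_sum"
names(fn_first)  # "col_1" "vals_of_col_2" "sum_of_col_2" "vals_of_col_4" "sum_of_col_4"
```

The output columns are ordered by input column first and by function second, regardless of where {.fn} appears in the template.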