Columns Selection

In this page, we show how to identify the columns on each of which we will apply data aggregation logic.

We will cover the following scenarios:

  • All Columns where we cover how to apply a data aggregation operation to each column in a data frame.
  • Explicit Selection where we cover how to apply a data aggregation operation to each of a set of explicitly selected columns of a data frame e.g. selecting columns by spelling out their names.
  • Implicit Selection where we cover how to apply a data aggregation operation to each of a set of implicitly selected columns of a data frame e.g. selecting columns whose names contain a certain substring.
  • Exclude Columns where we cover how to apply a data aggregation operation to each column of a data frame except for a set of excluded columns.

This section is complemented by

  • Function Specification where we cover how to specify one or more data aggregation operations to apply to the selected set of columns
  • Output Naming where we cover how to specify the name(s) of output column(s) created by the implicit data aggregation operations.

Note: In this page, we demonstrate implicit data aggregation in a context where a data frame is grouped by one column. We can drop the call to group_by() to aggregate over the entire data frame (without grouping). We can also group by multiple columns. See Grouping for a coverage of common grouping scenarios.

All Columns

We wish to apply a data aggregation operation to each column of a data frame.

In this example, we have a data frame df and we wish to count the number of unique values in each column for each group where the groups are defined by the values of the column col_1.

df_2 = df %>% 
  group_by(col_1) %>% 
  summarise(across(everything(), n_distinct))

Here is how this works:

  • The group_by() function groups the rows of the data frame df by the values in the column col_1. This allows the following summarise() function to apply a summary operation to each group of rows. See Summary Table.
  • Inside summarise() we use across() to describe the implicit data aggregation that we wish to carry out which is to apply the function n_distinct() to each column for each group. We do so as follows:
    • We pass to the first argument of across() a selection of columns. In this example, we use everything() because we wish to select and apply the data aggregation to all columns.
    • We pass to the second argument of across() the data aggregation expression that we wish to apply to each column selected in the first argument. In this case the data aggregation we wish to apply is the function n_distinct().
  • The output is a new data frame df_2 with one row for each unique value of the column col_1, and a column for each column in the original df data frame containing the number of unique values in that column for the corresponding group of rows.
  • The output columns have the names of the original columns. See Output Naming for how to append new columns instead.

Explicit Selection

We wish to apply a data aggregation operation to each of a set of explicitly selected columns.

In this example, we have a data frame df and we wish to count the number of unique values of the columns col_2, and col_4 for each group, where the groups are defined by the values of the column col_1.

df_2 = df %>% 
  group_by(col_1) %>%
  summarise(across(c(col_2, col_4), n_distinct))

Here is how this works:

  • The group_by() function groups the rows of the data frame df by the values in the column col_1. This allows the following summarise() function to apply a summary operation to each group of rows. See Summary Table.
  • Inside summarise() we use across() to describe the implicit data aggregation that we wish to carry out which is to apply the function n_distinct() to the column col_2 and col_4 for each group. We do so as follows:
    • We pass to the first argument of across() a selection of columns. In this example, we use c(col_2, col_4) , to identify the columns we wish to select by name. See Basic Selection for a coverage of explicit column selection scenarios, all of which can be used to select columns for implicit aggregation.
    • We pass to the second argument of across() the data aggregation expression that we wish to apply to each column selected in the first argument. In this example, the data aggregation we wish to apply is the function n_distinct().
  • The output is a new data frame df_2 with one row for each unique value of the column col_1, and two columns col_2 and col_4 containing the number of unique values in that column for the corresponding group of rows.

Implicit Selection

We wish to apply a data aggregation operation to each column in a set of implicitly selected columns. Implicit column selection is when we do not spell out the column names or positions explicitly but rather identify the columns via a property of their name or their data.

In this example, we have a data frame df and we wish to compute the sum of values of all numerical columns for each group, where the groups are defined by the values of the column col_1.

df_2 = df %>% 
  group_by(col_1) %>%
  summarise(across(where(is.numeric), sum))

Here is how this works:

  • The group_by() function groups the rows of the data frame df by the values in the column col_1. This allows the following summarise() function to apply a summary operation to each group of rows. See Summary Table.
  • Inside summarise() we use across() to describe the implicit data aggregation that we wish to carry out; which is to apply the function sum() to each numeric column for each group. We do so as follows:
    • We pass to the first argument of across() a selection of columns. In this example, we use where(is.double) to select all columns whose data type is double. See Implicit Selection for a coverage of the most common scenarios of implicit column selection including by name pattern, data type, and Criteria satisfied by the column’s data.
    • We pass to the second argument of across() the data aggregation expression that we wish to apply to each column selected in the first argument. In this example, the data aggregation we wish to apply is the function sum().
  • The output is a new data frame df_2, with one row for each unique value in the col_1 column, and a column for each numeric column in the original df data frame containing the sum of the values in that column for the corresponding group of rows.

Exclude Columns

We wish to apply a data aggregation operation to all but a set of excluded columns.

In this example, we have a data frame df and for each group, we wish to count the number of unique values of all columns except col_2 and col_4, where the groups are defined by the values of the column col_1.

df_2 = df %>% 
  group_by(col_1) %>%
  summarise(across(!c(col_2, col_4), n_distinct))

Here is how this works:

  • The group_by() function groups the rows of the data frame df by the values in the column col_1. This allows the following summarise() function to apply a summary operation to each group of rows. See Summary Table.
  • Inside summarise() we use across() to describe the implicit data aggregation that we wish to carry out; which is to apply the function n_distinct() to all column for each group except the columns col_2 and col_4. We do so as follows:
    • We pass to the first argument of across() a selection of columns. In this example, we use !c(col_1, col_2) to identify the columns we wish to exclude by name. See Exclude Columns for a coverage of column exclusion scenarios, all of which can be used for implicit aggregation.
    • We pass to the second argument of across() the data aggregation expression that we wish to apply to each column selected in the first argument. In this example, the data aggregation we wish to apply is the function n_distinct().
  • The output is a new data frame df_2, with one row for each unique value in the column col_1, and a column for each column in the original data frame df excluding col_2 and col_4 (and col_1) containing the number of unique values in that column for the corresponding group of rows.
R
I/O