In this page, we show how to identify the columns on each of which we will apply data aggregation logic.
We will cover the following scenarios:
This section is complemented by
Note: In this page, we demonstrate implicit data aggregation in a context where a data frame is grouped by one column. We can drop the call to group_by()
to aggregate over the entire data frame (without grouping). We can also group by multiple columns. See Grouping for a coverage of common grouping scenarios.
We wish to apply a data aggregation operation to each column of a data frame.
In this example, we have a data frame df
and we wish to count the number of unique values in each column for each group where the groups are defined by the values of the column col_1
.
df_2 = df %>%
group_by(col_1) %>%
summarise(across(everything(), n_distinct))
Here is how this works:
group_by()
function groups the rows of the data frame df
by the values in the column col_1
. This allows the following summarise()
function to apply a summary operation to each group of rows. See Summary Table.summarise()
we use across()
to describe the implicit data aggregation that we wish to carry out which is to apply the function n_distinct()
to each column for each group. We do so as follows:across()
a selection of columns. In this example, we use everything()
because we wish to select and apply the data aggregation to all columns.across()
the data aggregation expression that we wish to apply to each column selected in the first argument. In this case the data aggregation we wish to apply is the function n_distinct()
.df_2
with one row for each unique value of the column col_1
, and a column for each column in the original df
data frame containing the number of unique values in that column for the corresponding group of rows.We wish to apply a data aggregation operation to each of a set of explicitly selected columns.
In this example, we have a data frame df
and we wish to count the number of unique values of the columns col_2
, and col_4
for each group, where the groups are defined by the values of the column col_1
.
df_2 = df %>%
group_by(col_1) %>%
summarise(across(c(col_2, col_4), n_distinct))
Here is how this works:
group_by()
function groups the rows of the data frame df
by the values in the column col_1
. This allows the following summarise()
function to apply a summary operation to each group of rows. See Summary Table.summarise()
we use across()
to describe the implicit data aggregation that we wish to carry out which is to apply the function n_distinct()
to the column col_2
and col_4
for each group. We do so as follows:across()
a selection of columns. In this example, we use c(col_2, col_4)
, to identify the columns we wish to select by name. See Basic Selection for a coverage of explicit column selection scenarios, all of which can be used to select columns for implicit aggregation.across()
the data aggregation expression that we wish to apply to each column selected in the first argument. In this example, the data aggregation we wish to apply is the function n_distinct()
.df_2
with one row for each unique value of the column col_1
, and two columns col_2
and col_4
containing the number of unique values in that column for the corresponding group of rows.We wish to apply a data aggregation operation to each column in a set of implicitly selected columns. Implicit column selection is when we do not spell out the column names or positions explicitly but rather identify the columns via a property of their name or their data.
In this example, we have a data frame df
and we wish to compute the sum of values of all numerical columns for each group, where the groups are defined by the values of the column col_1
.
df_2 = df %>%
group_by(col_1) %>%
summarise(across(where(is.numeric), sum))
Here is how this works:
group_by()
function groups the rows of the data frame df
by the values in the column col_1
. This allows the following summarise()
function to apply a summary operation to each group of rows. See Summary Table.summarise()
we use across()
to describe the implicit data aggregation that we wish to carry out; which is to apply the function sum()
to each numeric column for each group. We do so as follows:across()
a selection of columns. In this example, we use where(is.double)
to select all columns whose data type is double
. See Implicit Selection for a coverage of the most common scenarios of implicit column selection including by name pattern, data type, and Criteria satisfied by the column’s data.across()
the data aggregation expression that we wish to apply to each column selected in the first argument. In this example, the data aggregation we wish to apply is the function sum()
.df_2
, with one row for each unique value in the col_1
column, and a column for each numeric column in the original df
data frame containing the sum of the values in that column for the corresponding group of rows.We wish to apply a data aggregation operation to all but a set of excluded columns.
In this example, we have a data frame df
and for each group, we wish to count the number of unique values of all columns except col_2
and col_4
, where the groups are defined by the values of the column col_1
.
df_2 = df %>%
group_by(col_1) %>%
summarise(across(!c(col_2, col_4), n_distinct))
Here is how this works:
group_by()
function groups the rows of the data frame df
by the values in the column col_1
. This allows the following summarise()
function to apply a summary operation to each group of rows. See Summary Table.summarise()
we use across()
to describe the implicit data aggregation that we wish to carry out; which is to apply the function n_distinct()
to all column for each group except the columns col_2
and col_4
. We do so as follows:across()
a selection of columns. In this example, we use !c(col_1, col_2)
to identify the columns we wish to exclude by name. See Exclude Columns for a coverage of column exclusion scenarios, all of which can be used for implicit aggregation.across()
the data aggregation expression that we wish to apply to each column selected in the first argument. In this example, the data aggregation we wish to apply is the function n_distinct()
.df_2
, with one row for each unique value in the column col_1
, and a column for each column in the original data frame df
excluding col_2
and col_4
(and col_1
) containing the number of unique values in that column for the corresponding group of rows.