We wish to carry out an individual data aggregation operation e.g. we wish to obtain the sum of the values of a particular numeric column. One common situation where quick individual data aggregation is needed is during interactive inspection of data
We will cover two scenarios:
This section is complemented by:
We wish to perform one data aggregation operation on a data frame that is not grouped.
One Column Function
We wish to perform one data aggregation operation on a column of a data frame (that is not grouped) reducing that one column to a single scalar value.
In this example we wish to compute the sum of the column col_2
of the data frame df
.
df %>% summarize(col_2_sum = sum(col_2, na.rm=T))
Here is how this works:
summarize()
to perform an aggregation operation. In the tidyverse
, summarise()
is the primary approach to perform aggregation.summarize()
expects a data frame, which in this case is df
, and at least one data aggregation expression, which in this case is col_2_sum = sum(col_2, na.rm=T)
.col_2_sum
is the name of the output column.=
is the assignment operator assigning the output of the expression on the right side to the variable on the left side.sum(col_2, na.rm=T)
is the data aggregation to be carried out.summarise()
, we can refer to any column directly by its name like we do here in referring to col_2
.col_2_sum
and one row holding the sum of the values of the column col_2
of the input data frame df
.sum()
to compute the sum of values of col_2
. We set the argument na.rm = TRUE
so sum()
would ignore NA
values and return the sum of the rest. See Common Aggregation Operations for a coverage of the most common data aggregation operations.Alternatively:
sum(df$col_2, na.rm=T)
Here is how this works:
df$col_2
, we use the selection operator $
(from base R) to extract the column col_2
from the data frame df
.df$col_2
to sum()
to perform the summation aggregation operation.df %>% pull(col_2) %>% sum(na.rm=T)
Here is how this works:
df %>% pull(col_2)
, we extract the column we wish to summarize from the data frame df
.pull()
is similar to $
except that it looks better in a chain with %>%
.%>%
passes the values of col_2
to the first argument of sum()
. We can pass arguments to the aggregation function like we do here in passing na.rm=T
to ignore missing values.Multiple Column Function
We wish to perform an aggregation operation on an entire data frame (that is not grouped) reducing multiple columns to a single scalar value.
In this example, we wish to compute the weighted mean of the column col_2
where the weights are given by the column col_3
of the data frame df
.
df %>% summarize(w_mean_2_3 =
weighted.mean(col_2, col_3, na.rm=T))
Here is how this works:
summarise()
we can compose data aggregation expressions that involve any number of columns.weighted.mean()
to compute the weighted mean. We pass to weighted mean the column whose weighted mean we wish to calculate col_2
and the column that holds the weights col_3
. See Common Aggregation Operations for a coverage of the most common data aggregation operations.Alternatively:
weighted.mean(df$col_2, df$col_3, na.rm=T)
Here is how this works:
df$col_2
and df$col_3
, we use the selection operator $
(from base R) to extract the columns col_2
and col_3
as vectors.df$col_2
and df$col_3
to weighted.mean()
to perform the weighted mean aggregation operation.We wish to perform one data aggregation operation on a grouped data frame.
One Column Function
We wish to perform an aggregation on a grouped data frame. The aggregation operation we wish to perform is a function of one column.
In this example, we wish to obtain the sum of the value of the column col_2
for each group where the groups are defined by the column col_1
.
df %>%
group_by(col_1) %>%
summarize(col_2_sum = sum(col_2, na.rm=T))
Here is how this works:
summarize()
executes one or more data aggregation operations on a data frame. See On Data Frame above.group_by()
prior to calling summarize()
, the data aggregation is carried out on each group separately.group_by()
accepts one or more columns and then assigns the rows of the data frame to a group according to the values of the grouping column(s). See Grouping for a coverage of common grouping scenarios.sum()
to compute the sum of the values of the column col_2
(for each group). See Common Aggregation Operations for a coverage of the most common aggregation operations.col_2_sum
and as many rows as there are groups i.e. as many rows as the number of unique values of the grouping column col_1
. Each row of the output data frame holds the sum of the values of the column col_2
of the input data frame df
for the corresponding group.Multiple Column Function
We wish to perform an aggregation on a grouped data frame. The aggregation operation we wish to perform is a function of multiple columns.
In this example, we wish to compute the weighted mean of the column col_2
where the weights are given by the column col_3
. We wish to perform the aggregation for each group where the groups are defined by the column col_1
.
df %>%
group_by(col_1) %>%
summarize(col_2_3_w_mean =
weighted.mean(col_2, col_3, na.rm=T))
Here is how this works:
group_by()
prior to calling summarize()
to have the data aggregation operation carried out on each group separately as described in “One Column Function” above.summarise()
we can compose data aggregation expressions that involve any number of columns.weighted.mean()
to compute the weighted mean. We pass to weighted mean the column whose weighted mean we wish to calculate col_2
and the column that holds the weights col_3
. See Common Aggregation Operations for a coverage of the most common data aggregation operations.