We wish to carry out multiple data aggregations on an input data frame and return an output summary data frame. The output has one column for each data aggregation operation, in addition to one column for each grouping variable (if any). It has one row for each combination of values of the grouping variables, or just one row if the data frame being summarized is not grouped.
In this example, we wish to produce a grouped summary of a data frame df. The grouping variables are the columns col_1 and col_2.
library(dplyr)

df_2 = df %>%
  group_by(col_1, col_2) %>%
  summarize(
    col_3_min = min(col_3),
    col_3_max = max(col_3),
    col_4_sum = sum(col_4),
    col_4_median = median(col_4),
    # Mean of col_3 weighted by col_4
    col_3_4_w_mean =
      weighted.mean(col_3, col_4, na.rm = TRUE))
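Here is how this works:
- summarize() executes one or more data aggregation operations on a data frame. See Individual Aggregation.
- When group_by() is called prior to calling summarize(), the data aggregation is carried out on each group separately.
- With group_by(col_1, col_2), we group the data frame df by two columns, col_1 and col_2.
- To summarize the ungrouped data frame df, we would simply drop the call to group_by().
- The output data frame has one column for each grouping variable, col_1 and col_2, plus one column for each aggregation: col_3_min, col_3_max, col_4_sum, col_4_median, and col_3_4_w_mean.
- The output data frame has one row for each combination of values of col_1 and col_2.
- Many aggregation functions take a single column as input, such as max() and sum(). See Common Aggregation Operations for coverage of the most common data aggregation operations. weighted.mean() takes two columns as input: col_3 as the values and col_4 as the weights.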
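To make the shape of the output concrete, the following is a minimal sketch that runs the same kind of summary on a small made-up data frame. The data values, the names df_grouped and df_overall, and the use of .groups = "drop" (which assumes dplyr 1.0.0 or later) are assumptions for illustration and are not part of the original example.

```r
library(dplyr)

# Hypothetical toy data; the values are made up for illustration only.
df <- tibble(
  col_1 = c("a", "a", "b", "b"),
  col_2 = c("x", "x", "y", "y"),
  col_3 = c(1, 3, 10, 20),
  col_4 = c(1, 3, 2, 2)
)

# Grouped summary: one row per combination of col_1 and col_2.
df_grouped <- df %>%
  group_by(col_1, col_2) %>%
  summarize(
    col_3_min = min(col_3),
    col_3_max = max(col_3),
    col_4_sum = sum(col_4),
    col_4_median = median(col_4),
    # Mean of col_3 weighted by col_4
    col_3_4_w_mean = weighted.mean(col_3, col_4, na.rm = TRUE),
    .groups = "drop"  # return an ungrouped data frame (assumes dplyr >= 1.0.0)
  )

# Ungrouped summary: without group_by(), the result has a single row.
df_overall <- df %>%
  summarize(
    col_3_min = min(col_3),
    col_4_sum = sum(col_4)
  )

print(df_grouped)
print(df_overall)
```

With this toy data, df_grouped has two rows (one per group) and seven columns (two grouping columns plus five aggregations), while df_overall has a single row.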