Summary Table

We wish to carry out multiple data aggregations on an input data frame and return the results as a summary data frame. The summary data frame has one column for each data aggregation operation, plus one column for each grouping variable (if any). It has one row for each combination of values of the grouping variables that occurs in the data, or just one row if the data frame being summarized is not grouped.

In this example, we wish to produce a grouped summary of a data frame df. The grouping variables are the columns col_1 and col_2.

library(dplyr)

df_2 = df %>% 
  group_by(col_1, col_2) %>%
  summarize(
    col_3_min = min(col_3),
    col_3_max = max(col_3),
    col_4_sum = sum(col_4),
    col_4_median = median(col_4),
    # mean of col_3 weighted by col_4
    col_3_4_w_mean = 
      weighted.mean(col_3, col_4, na.rm=TRUE))

Here is how this works:

  • summarize() executes one or more data aggregation operations on a data frame. See Individual Aggregation.
  • When we run group_by() prior to calling summarize(), the data aggregation is carried out on each group separately.
  • In group_by(col_1, col_2), we group the data frame df by two columns col_1 and col_2.
    • In general, we could group by any number of columns. See Grouping.
    • If we wish to summarize the original ungrouped data frame df, we would simply drop the call to group_by().
  • The output summary data frame has
    • the following columns:
      • The grouping columns col_1 and col_2
      • The aggregation outputs col_3_min, col_3_max, col_4_sum, col_4_median, and col_3_4_w_mean.
    • one row for each combination of values of the grouping variables col_1 and col_2 that occurs in df.
  • In this example, we carried out a few basic data aggregation operations such as max() and sum(). See Common Aggregation Operations for coverage of the most common data aggregation operations.
  • The data aggregation operations that we pass to summarize() can involve one or more columns. For instance, in this example, weighted.mean() takes two columns as input: col_3 as the values and col_4 as the weights.
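
To make the shape of the output concrete, here is a minimal sketch that runs a grouped and an ungrouped summary on a small data frame. The data values and the names df_grouped and df_overall are made up for illustration and are not part of the original example. The sketch also assumes dplyr 1.0.0 or later, where summarize() accepts the .groups argument.

```r
library(dplyr)

# Hypothetical toy data; the values are invented for illustration only.
df = tibble(
  col_1 = c("a", "a", "a", "b", "b"),
  col_2 = c("x", "x", "y", "x", "x"),
  col_3 = c(1, 2, 3, 4, 5),
  col_4 = c(10, 20, 30, 40, 50)
)

# Grouped summary: one row per (col_1, col_2) combination present in df.
df_grouped = df %>%
  group_by(col_1, col_2) %>%
  summarize(
    col_3_min = min(col_3),
    col_4_sum = sum(col_4),
    col_3_4_w_mean = weighted.mean(col_3, col_4, na.rm = TRUE),
    .groups = "drop"  # return an ungrouped data frame
  )

# Ungrouped summary: without group_by(), the result has a single row.
df_overall = df %>%
  summarize(
    col_3_min = min(col_3),
    col_4_sum = sum(col_4)
  )

print(df_grouped)
print(df_overall)
```

With this toy data there are three combinations of col_1 and col_2, so df_grouped has three rows and five columns (two grouping columns plus three aggregation outputs), while df_overall has one row. Passing .groups = "drop" is optional; without it, summarize() keeps the result grouped by col_1 and prints a message saying so.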