Basic Aggregation

In its simplest form, a data aggregation operation involves three components: (1) one or more grouping columns, (2) one or more aggregated columns, and (3) one or more aggregating functions. Data aggregation in R is most commonly carried out by calling group_by() on the input data frame, followed by summarize(), like so:

library(dplyr)

df_2 = df %>%
  group_by(col_1) %>%
  summarize(
    col_2_sum = sum(col_2),
    col_3_mean = mean(col_3)
  )
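
To make the three components concrete, here is a minimal runnable sketch. The toy data frame and its values are hypothetical and exist only for illustration; the column names mirror the snippet above.

```r
library(dplyr)

# Hypothetical toy data; the values are made up for illustration.
df <- tibble(
  col_1 = c("a", "a", "b"),
  col_2 = c(1, 2, 3),
  col_3 = c(10, 20, 30)
)

df_2 <- df %>%
  group_by(col_1) %>%            # grouping column: col_1
  summarize(
    col_2_sum  = sum(col_2),     # aggregating function sum() applied to col_2
    col_3_mean = mean(col_3)     # aggregating function mean() applied to col_3
  )

print(df_2)
```

The result should have one row per distinct value of col_1 and one column per aggregation, in addition to the grouping column.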

At a high level, there are two situations where we need to perform data aggregation:

  1. When we wish to perform an individual aggregation operation, often in an interactive inspection setting, e.g., to obtain the sum of the values of one particular numeric column.
  2. When we wish to reduce an input data frame into a summary data frame, i.e., a data frame whose columns correspond to summary operations performed on the input data frame.
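
The two situations above can be sketched as follows. This is an illustrative example that assumes dplyr; the `sales` data frame and its column names are hypothetical.

```r
library(dplyr)

# Hypothetical data for illustration only.
sales <- tibble(
  region = c("east", "east", "west"),
  units  = c(5, 7, 4),
  price  = c(2.0, 3.0, 2.5)
)

# 1. Individual aggregation: extract one column and reduce it to one value.
total_units <- sales %>%
  pull(units) %>%
  sum()

# 2. Summary data frame: each output column is one summary operation.
sales_summary <- sales %>%
  summarize(
    units_sum  = sum(units),
    price_mean = mean(price),
    n_rows     = n()
  )

print(total_units)
print(sales_summary)
```

The first form suits quick interactive checks; the second keeps several summaries together in a data frame that can be passed on to further steps.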

Moreover, a data aggregation operation is performed on either:

  1. An entire data frame, typically generating a single scalar value.
  2. A grouped data frame, typically generating a vector containing one summary value per group.
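
As a rough illustration of this distinction, the sketch below applies the same aggregation first to an entire data frame and then to a grouped one. It assumes dplyr; the `scores` data frame is hypothetical.

```r
library(dplyr)

# Hypothetical data for illustration only.
scores <- tibble(
  team  = c("x", "x", "y", "y"),
  score = c(1, 3, 10, 20)
)

# 1. Entire data frame: the aggregation yields a single value.
overall_mean <- scores %>%
  summarize(m = mean(score)) %>%
  pull(m)

# 2. Grouped data frame: the aggregation yields one value per group.
team_means <- scores %>%
  group_by(team) %>%
  summarize(m = mean(score)) %>%
  pull(m)

print(overall_mean)
print(team_means)
```

Here pull() is used only to extract the result as a plain vector; without it, both pipelines return a data frame with one row (ungrouped) or one row per group (grouped).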

Following the above distinctions, this section is organized as follows:

  • Individual Aggregation, where we cover how to carry out one aggregation operation on an entire data frame or on a grouped data frame.
  • Summary Table, where we cover how to carry out multiple data aggregation operations on a data frame, which is typically grouped by one or more variables, and return a summary data frame.
  • Common Operations, where we cover some of the most common data aggregation operations for each data type, e.g., sum and mean for numeric columns, and string concatenation for string columns.