Individual Aggregation

We wish to carry out an individual data aggregation operation e.g. we wish to obtain the sum of the values of a particular numeric column. One common situation where quick individual data aggregation is needed is during interactive inspection of data

We will cover two scenarios:

  1. Performing a data aggregation operation on an entire data frame (not grouped) reducing one or more columns to a single scalar value.
  2. Performing a data aggregation operation on a grouped data frame reducing one or more columns into a set of values one for each group.

This section is complemented by:

Data Frame

We wish to perform one data aggregation operation on a data frame that is not grouped.

One Column Function

We wish to perform one data aggregation operation on a column of a data frame (that is not grouped) reducing that one column to a single scalar value.

In this example we wish to compute the sum of the column col_2 of the data frame df.

df %>% summarize(col_2_sum = sum(col_2, na.rm=T))

Here is how this works:

  • We use the function summarize() to perform an aggregation operation. In the tidyverse, summarise() is the primary approach to perform aggregation.
  • summarize() expects a data frame, which in this case is df, and at least one data aggregation expression, which in this case is col_2_sum = sum(col_2, na.rm=T).
  • Let’s take a closer look at the data aggregation expression:
    • col_2_sum is the name of the output column.
    • = is the assignment operator assigning the output of the expression on the right side to the variable on the left side.
    • sum(col_2, na.rm=T) is the data aggregation to be carried out.
  • Inside summarise(), we can refer to any column directly by its name like we do here in referring to col_2.
  • The output is a data frame with one value. The data frame has one column called col_2_sum and one row holding the sum of the values of the column col_2 of the input data frame df.
  • We then use the built-in function sum() to compute the sum of values of col_2. We set the argument na.rm = TRUE so sum() would ignore NA values and return the sum of the rest. See Common Aggregation Operations for a coverage of the most common data aggregation operations.

Alternatively:

sum(df$col_2, na.rm=T)

Here is how this works:

  • In df$col_2, we use the selection operator $ (from base R) to extract the column col_2 from the data frame df.
  • We pass the values of df$col_2 to sum() to perform the summation aggregation operation.
  • This approach is perhaps the one that involves the fewest keystrokes when we wish to aggregate over an entire (not grouped) data frame.
df %>% pull(col_2) %>% sum(na.rm=T)

Here is how this works:

  • In df %>% pull(col_2), we extract the column we wish to summarize from the data frame df.
  • pull() is similar to $ except that it looks better in a chain with %>%.
  • The pipe %>% passes the values of col_2 to the first argument of sum(). We can pass arguments to the aggregation function like we do here in passing na.rm=T to ignore missing values.

Multiple Column Function

We wish to perform an aggregation operation on an entire data frame (that is not grouped) reducing multiple columns to a single scalar value.

In this example, we wish to compute the weighted mean of the column col_2 where the weights are given by the column col_3 of the data frame df.

df %>% summarize(w_mean_2_3 = 
            weighted.mean(col_2, col_3, na.rm=T))

Here is how this works:

  • Inside summarise() we can compose data aggregation expressions that involve any number of columns.
  • In this example, we use the function weighted.mean() to compute the weighted mean. We pass to weighted mean the column whose weighted mean we wish to calculate col_2 and the column that holds the weights col_3. See Common Aggregation Operations for a coverage of the most common data aggregation operations.
  • The rest of the code works as described in “One Column Function” above.

Alternatively:

weighted.mean(df$col_2, df$col_3, na.rm=T)

Here is how this works:

  • In df$col_2 and df$col_3, we use the selection operator $ (from base R) to extract the columns col_2 and col_3 as vectors.
  • We pass the values of df$col_2 and df$col_3 to weighted.mean() to perform the weighted mean aggregation operation.

Grouped Data Frame

We wish to perform one data aggregation operation on a grouped data frame.

One Column Function

We wish to perform an aggregation on a grouped data frame. The aggregation operation we wish to perform is a function of one column.

In this example, we wish to obtain the sum of the value of the column col_2 for each group where the groups are defined by the column col_1.

df %>% 
  group_by(col_1) %>% 
  summarize(col_2_sum = sum(col_2, na.rm=T))

Here is how this works:

  • summarize() executes one or more data aggregation operations on a data frame. See On Data Frame above.
  • When we run group_by() prior to calling summarize(), the data aggregation is carried out on each group separately.
  • group_by() accepts one or more columns and then assigns the rows of the data frame to a group according to the values of the grouping column(s). See Grouping for a coverage of common grouping scenarios.
  • We use function sum() to compute the sum of the values of the column col_2 (for each group). See Common Aggregation Operations for a coverage of the most common aggregation operations.
  • The output is a data frame with one column called col_2_sum and as many rows as there are groups i.e. as many rows as the number of unique values of the grouping column col_1. Each row of the output data frame holds the sum of the values of the column col_2 of the input data frame df for the corresponding group.

Multiple Column Function

We wish to perform an aggregation on a grouped data frame. The aggregation operation we wish to perform is a function of multiple columns.

In this example, we wish to compute the weighted mean of the column col_2 where the weights are given by the column col_3. We wish to perform the aggregation for each group where the groups are defined by the column col_1.

df %>% 
  group_by(col_1) %>% 
  summarize(col_2_3_w_mean = 
              weighted.mean(col_2, col_3, na.rm=T))

Here is how this works:

  • We call group_by() prior to calling summarize() to have the data aggregation operation carried out on each group separately as described in “One Column Function” above.
  • Inside summarise() we can compose data aggregation expressions that involve any number of columns.
  • In this example, we use the function weighted.mean() to compute the weighted mean. We pass to weighted mean the column whose weighted mean we wish to calculate col_2 and the column that holds the weights col_3. See Common Aggregation Operations for a coverage of the most common data aggregation operations.
R
I/O