Individual Aggregation

We wish to carry out an individual data aggregation operation e.g. we wish to obtain the sum of the values of a particular numeric column. One common situation where quick individual data aggregation is needed is during interactive inspection of data

We will cover two scenarios:

Performing a data aggregation operation on an entire data frame (not grouped) reducing one or more columns to a single scalar value.
Performing a data aggregation operation on a grouped data frame reducing one or more columns into a set of values one for each group.

This section is complemented by:

Summary Table where we cover how to carry out multiple aggregations on and return a summary data frame.
Common Aggregation Operations where cover of the most common data aggregation operations.

Data Frame

We wish to perform one data aggregation operation on a data frame that is not grouped.

One Column Function

We wish to perform one data aggregation operation on a column of a data frame (that is not grouped) reducing that one column to a single scalar value.

In this example we wish to compute the sum of the column col_2 of the data frame df.

df %>% summarize(col_2_sum = sum(col_2, na.rm=T))

Here is how this works:

We use the function summarize() to perform an aggregation operation. In the tidyverse, summarise() is the primary approach to perform aggregation.
summarize() expects a data frame, which in this case is df, and at least one data aggregation expression, which in this case is col_2_sum = sum(col_2, na.rm=T).
Let’s take a closer look at the data aggregation expression:
- col_2_sum is the name of the output column.
- = is the assignment operator assigning the output of the expression on the right side to the variable on the left side.
- sum(col_2, na.rm=T) is the data aggregation to be carried out.
Inside summarise(), we can refer to any column directly by its name like we do here in referring to col_2.
The output is a data frame with one value. The data frame has one column called col_2_sum and one row holding the sum of the values of the column col_2 of the input data frame df.
We then use the built-in function sum() to compute the sum of values of col_2. We set the argument na.rm = TRUE so sum() would ignore NA values and return the sum of the rest. See Common Aggregation Operations for a coverage of the most common data aggregation operations.

Alternatively:

sum(df$col_2, na.rm=T)

Here is how this works:

In df$col_2, we use the selection operator $ (from base R) to extract the column col_2 from the data frame df.
We pass the values of df$col_2 to sum() to perform the summation aggregation operation.
This approach is perhaps the one that involves the fewest keystrokes when we wish to aggregate over an entire (not grouped) data frame.

df %>% pull(col_2) %>% sum(na.rm=T)

Here is how this works:

In df %>% pull(col_2), we extract the column we wish to summarize from the data frame df.
pull() is similar to $ except that it looks better in a chain with %>%.
The pipe %>% passes the values of col_2 to the first argument of sum(). We can pass arguments to the aggregation function like we do here in passing na.rm=T to ignore missing values.

Multiple Column Function

We wish to perform an aggregation operation on an entire data frame (that is not grouped) reducing multiple columns to a single scalar value.

In this example, we wish to compute the weighted mean of the column col_2 where the weights are given by the column col_3 of the data frame df.

df %>% summarize(w_mean_2_3 = 
            weighted.mean(col_2, col_3, na.rm=T))

Here is how this works:

Inside summarise() we can compose data aggregation expressions that involve any number of columns.
In this example, we use the function weighted.mean() to compute the weighted mean. We pass to weighted mean the column whose weighted mean we wish to calculate col_2 and the column that holds the weights col_3. See Common Aggregation Operations for a coverage of the most common data aggregation operations.
The rest of the code works as described in “One Column Function” above.

Alternatively:

weighted.mean(df$col_2, df$col_3, na.rm=T)

Here is how this works:

In df$col_2 and df$col_3, we use the selection operator $ (from base R) to extract the columns col_2 and col_3 as vectors.
We pass the values of df$col_2 and df$col_3 to weighted.mean() to perform the weighted mean aggregation operation.

Grouped Data Frame

We wish to perform one data aggregation operation on a grouped data frame.

One Column Function

We wish to perform an aggregation on a grouped data frame. The aggregation operation we wish to perform is a function of one column.

In this example, we wish to obtain the sum of the value of the column col_2 for each group where the groups are defined by the column col_1.

df %>% 
  group_by(col_1) %>% 
  summarize(col_2_sum = sum(col_2, na.rm=T))

Here is how this works:

summarize() executes one or more data aggregation operations on a data frame. See On Data Frame above.
When we run group_by() prior to calling summarize(), the data aggregation is carried out on each group separately.
group_by() accepts one or more columns and then assigns the rows of the data frame to a group according to the values of the grouping column(s). See Grouping for a coverage of common grouping scenarios.
We use function sum() to compute the sum of the values of the column col_2 (for each group). See Common Aggregation Operations for a coverage of the most common aggregation operations.
The output is a data frame with one column called col_2_sum and as many rows as there are groups i.e. as many rows as the number of unique values of the grouping column col_1. Each row of the output data frame holds the sum of the values of the column col_2 of the input data frame df for the corresponding group.

Multiple Column Function

We wish to perform an aggregation on a grouped data frame. The aggregation operation we wish to perform is a function of multiple columns.

In this example, we wish to compute the weighted mean of the column col_2 where the weights are given by the column col_3. We wish to perform the aggregation for each group where the groups are defined by the column col_1.

df %>% 
  group_by(col_1) %>% 
  summarize(col_2_3_w_mean = 
              weighted.mean(col_2, col_3, na.rm=T))

Here is how this works:

We call group_by() prior to calling summarize() to have the data aggregation operation carried out on each group separately as described in “One Column Function” above.
Inside summarise() we can compose data aggregation expressions that involve any number of columns.
In this example, we use the function weighted.mean() to compute the weighted mean. We pass to weighted mean the column whose weighted mean we wish to calculate col_2 and the column that holds the weights col_3. See Common Aggregation Operations for a coverage of the most common data aggregation operations.

Optima.io Reference beta

Individual Aggregation

Data Frame

Grouped Data Frame