Multi-Value Aggregation

Typically, a data aggregation operation accepts one or more vector(s) of values and reduces that to a single summary value e.g. computing the sum of values of a column for a group; see Basic Aggregation. Occasionally though, we need to perform a data summarization operation where the output summary is composed of more than one value.

In this section, we will cover three scenarios of multi-value aggregation:

  • One Column where we reduce each group to one column with any number of rows. Typically, we add a second column that describes those values.
  • One Row where we reduce each group into one row and any number of columns. The column names describe the values.
  • Arbitrary Dimensions where we reduce each group into a data frame of any number of rows and columns.

One Column

We wish to reduce each group to one column with any number of rows.

In this example, we wish to produce a multi-value summary of the column col_2 which is to compute the 25th, 50th and 75th percentile values for each group where the groups are defined by the column col_1. We wish to organize the summary values as a column.

df_2 = df %>% 
  group_by(col_1) %>%   
  summarise(
    q_val = quantile(col_2, c(0.25, 0.5, 0.75)), 
    q_prob = c(0.25, 0.5, 0.75))

Here is how this works:

  • We call group_by(col_1), to group the rows of the data frame df by the values in the column col_1. This way the subsequent call to summarise() will have the data aggregation operations be applied to each group separately. See Summary Table.
  • To reduce a group to multiple values structured as a column, we do the following:
    • We call a function that returns a vector of values from within summarise(). In this case we call quantile(col_2, c(0.25, 0.5, 0.75)) which returns a vector of three values that are the 25th, 50th and 75th percentile values of the column col_2.
    • We assign the output of the function to a label that will be the column name in the output data frame. In this case the column name will be q_val.
  • Typically, we need a way to identify what each value holds, we can do that by adding a second column that holds the quantile values q_prob = c(0.25, 0.5, 0.75). Two notes:
    • This too is a multi-value aggregation since we are generating a multivalued output inside summarize().
    • The number of values must be the same as that of the previously created column (q_val, i.e. 3) or 1 which summarise() “broadcasts” to the size of the previously created summary column i.e. 3.
  • The output data frame df_2 will have three columns; the grouping column col_1, q_val holding three percentile values for each group, and q_prob holding the corresponding percentile labels (0.25, 0.5, or 0.75) for each group. The number of rows will be the product of the number of groups (i.e. the number of unique values of the column col_1) and the number of summary rows for each group (which in this case is 3).
  • Note: The ability to return multiple values inside summarise() was introduced in dplyr 1.0.0 so make sure your dplyr is up-to-date.

Alternative: Via Function

my_quantile <- function(x, probs) {
  tibble(q_val = quantile(x, probs), q_prob = probs)
}

df_2 = df %>%
  group_by(col_1) %>%  
  summarise(my_quantile(col_2, c(0.25, 0.5, 0.75)))

Here is how this works:

  • This code works similarly to the primary solution above except that we isolated the summarization of each group out into a separate custom function my_quantile().
  • The custom function my_quantile() accepts a vector of values and returns a data frame of two columns where the first column q_val holds the percentile values and the second column q_prob holds the corresponding percentile labels.
  • summarize() passes to my_quantile() the values of the column col_2 for each group one by one.
  • summarize() then combines the dataframes returned from my_quantile() for each group by row binding them i.e. stacking them one on top of the other while aligning the columns according to the column names (which are the same for all the data frames). Those column names also become the column names of the output data frame.

One Row

We wish to reduce each group into one row and any number of columns.

In this example, we wish to summarize each group into the coefficient, standard error, and p-value of the linear relationship between the column col_3 (outcome) on col_2 (predictor). The groups are defined by the values of the column col_1.

library(broom)

regression_summary <- function(y, x) {
  s_df = lm_fit = lm(y ~ x) %>% 
    tidy() %>%
    filter(term == 'x') %>% 
    select(estimate, std_error = std.error, p_value = p.value)
  return(s_df)
}

df_2 = df %>% 
  group_by(col_1) %>% 
  summarise(regression_summary(col_3, col_2))

Here is how this works:

  • To reduce a group into a single row of any number of columns, we need to call from within summarize() a function that returns a data frame of one row.
  • In this case, we create a custom function regression_summary() which accepts two vectors x and y, computes the linear regression relationship between them, and returns a data frame of one row and three columns holding the three desired quantities: the regression slope coefficient estimate, the standard error std_error, and the p_value.
  • summarize() passes to regression_summary() the values of the columns col_2 and col_3 for each group one by one.
  • Note that we do not assign the output of regression_summary() to any column name because we wish for its output to generate multiple columns. If we do assign the output of regression_summary() to a label, summarize() will create a nested column with that label as its name in which the data frame returned by regression_summary() will be nested which is not the behavior we want. See Nesting.
  • summarize() then combines the dataframes returned from regression_summary() for each group by row binding them i.e. stacking them one on top of the other while aligning the columns according to the column names (which are the same for all the data frames).
  • The output data frame df_2 will have as many rows as there are unique values in the column col_1 and three columns estimate, std_error, and p_value holding the coefficient slope value, the standard error and the p-value respectively.
  • The regression_summary() function operates as follows:
    • It takes two arguments, y and x, which are the outcome variable and predictor variable, respectively, for the linear regression model we wish to fit.
    • The function then fits a linear regression model using the function lm() from base R
    • We use the very useful function tidy() from the package broom to convert the model attributes into a data frame.
    • We use filter() to extract the row corresponding to the slope.
    • Finally, we use select() to extract the model parameters that we care about.

Arbitrary Dimensions

We wish to reduce each group into a data frame of any number of rows and columns.

In this example, we wish to summarize each group by reducing it to a smaller number of rows (3 in this case) that are representative of the group. We wish to use the notion of centroids as computed by k-means to capture those representative points. The groups are defined by the values of the column col_1.

library(broom)

centroid_summary <- function(p_df, p_centers) {
  s_df = p_df %>% 
    kmeans(centers = p_centers) %>% 
    tidy() %>% 
    select(any_of(colnames(p_df)))
  return(s_df)
}

df_2 = df %>% 
  group_by(col_1) %>% 
  summarise(centroid_summary(cur_data(), 3))

Here is how this works:

  • To reduce a group into a data frame of any number of rows and columns, we need to call from within summarize() a function that returns a data frame of the desired structure.
  • In this case, we create a custom function centroid_summary() which:
    • Accepts a data frame p_df and computes the “centroids” where the rows are treated as observations and the columns as the dimensions.
    • The function accepts an integer value specifying p_centers the number of centroids to compute.
    • The function returns a data frame that has as many columns as the input and as many rows as that specified by the number of centroids p_centers.
  • summarize() passes to centroid_summary() the values of all columns for the current group one by one. We do so via the helper function cur_data(). See Dynamic Aggregation.
  • Note that we do not assign the output of centroid_summary() to any column name because we wish for its output to generate multiple columns. If we do assign the output of centroid_summary() to a label, summarize() will create a nested column with that label as its name in which the data frame returned by centroid_summary() will be nested which is not the behavior we want. See Nesting.
  • summarize() then combines the data frames returned from centroid_summary() for each group by row binding them i.e. stacking them one on top of the other while aligning the columns according to the column names (which are the same for all the data frames).
  • The output data frame df_2 will have as many columns as the input data frame df and as many rows as the product of unique values in the column col_1 and the number of centers p_centers which in this case is 3.
  • The centroid_summary() function operates as follows:
    • It takes two arguments: p_df the data frame to be summarized and p_centers the number of centroids to have as the summary.
    • The function applies k-means clustering to p_df with p_centers as the number of clusters via kmeans() from base R.
    • We use the very useful function tidy() from the package broom to convert the model attributes into a data frame.
    • Finally, we use select() to return just the coordinates of the coordinates along the input columns (without the other model for parameters).
R
I/O