Typically, a data aggregation operation accepts one or more vector(s) of values and reduces that to a single summary value e.g. computing the sum of values of a column for a group; see Basic Aggregation. Occasionally though, we need to perform a data summarization operation where the output summary is composed of more than one value.
In this section, we will cover three scenarios of multi-value aggregation:
We wish to reduce each group to one column with any number of rows.
In this example, we wish to produce a multi-value summary of the column col_2
which is to compute the 25th, 50th and 75th percentile values for each group where the groups are defined by the column col_1
. We wish to organize the summary values as a column.
df_2 = df %>%
group_by(col_1) %>%
summarise(
q_val = quantile(col_2, c(0.25, 0.5, 0.75)),
q_prob = c(0.25, 0.5, 0.75))
Here is how this works:
group_by(col_1)
, to group the rows of the data frame df
by the values in the column col_1
. This way the subsequent call to summarise()
will have the data aggregation operations be applied to each group separately. See Summary Table.summarise()
. In this case we call quantile(col_2, c(0.25, 0.5, 0.75))
which returns a vector of three values that are the 25th, 50th and 75th percentile values of the column col_2
.q_val
.q_prob = c(0.25, 0.5, 0.75)
. Two notes:summarize()
.q_val
, i.e. 3) or 1 which summarise()
“broadcasts” to the size of the previously created summary column i.e. 3.df_2
will have three columns; the grouping column col_1
, q_val
holding three percentile values for each group, and q_prob
holding the corresponding percentile labels (0.25, 0.5, or 0.75) for each group. The number of rows will be the product of the number of groups (i.e. the number of unique values of the column col_1
) and the number of summary rows for each group (which in this case is 3).dplyr 1.0.0
so make sure your dplyr is up-to-date.Alternative: Via Function
my_quantile <- function(x, probs) {
tibble(q_val = quantile(x, probs), q_prob = probs)
}
df_2 = df %>%
group_by(col_1) %>%
summarise(my_quantile(col_2, c(0.25, 0.5, 0.75)))
Here is how this works:
my_quantile()
.my_quantile()
accepts a vector of values and returns a data frame of two columns where the first column q_val
holds the percentile values and the second column q_prob
holds the corresponding percentile labels.summarize()
passes to my_quantile()
the values of the column col_2
for each group one by one.summarize()
then combines the dataframes returned from my_quantile()
for each group by row binding them i.e. stacking them one on top of the other while aligning the columns according to the column names (which are the same for all the data frames). Those column names also become the column names of the output data frame.We wish to reduce each group into one row and any number of columns.
In this example, we wish to summarize each group into the coefficient, standard error, and p-value of the linear relationship between the column col_3
(outcome) on col_2
(predictor). The groups are defined by the values of the column col_1
.
library(broom)
regression_summary <- function(y, x) {
s_df = lm_fit = lm(y ~ x) %>%
tidy() %>%
filter(term == 'x') %>%
select(estimate, std_error = std.error, p_value = p.value)
return(s_df)
}
df_2 = df %>%
group_by(col_1) %>%
summarise(regression_summary(col_3, col_2))
Here is how this works:
summarize()
a function that returns a data frame of one row.regression_summary()
which accepts two vectors x
and y
, computes the linear regression relationship between them, and returns a data frame of one row and three columns holding the three desired quantities: the regression slope coefficient estimate
, the standard error std_error
, and the p_value
.summarize()
passes to regression_summary()
the values of the columns col_2
and col_3
for each group one by one.regression_summary()
to any column name because we wish for its output to generate multiple columns. If we do assign the output of regression_summary()
to a label, summarize()
will create a nested column with that label as its name in which the data frame returned by regression_summary()
will be nested which is not the behavior we want. See Nesting.summarize()
then combines the dataframes returned from regression_summary()
for each group by row binding them i.e. stacking them one on top of the other while aligning the columns according to the column names (which are the same for all the data frames).df_2
will have as many rows as there are unique values in the column col_1
and three columns estimate
, std_error
, and p_value
holding the coefficient slope value, the standard error and the p-value respectively.regression_summary()
function operates as follows:y
and x
, which are the outcome variable and predictor variable, respectively, for the linear regression model we wish to fit.lm()
from base Rtidy()
from the package broom to convert the model attributes into a data frame.filter()
to extract the row corresponding to the slope.select()
to extract the model parameters that we care about.We wish to reduce each group into a data frame of any number of rows and columns.
In this example, we wish to summarize each group by reducing it to a smaller number of rows (3 in this case) that are representative of the group. We wish to use the notion of centroids as computed by k-means to capture those representative points. The groups are defined by the values of the column col_1
.
library(broom)
centroid_summary <- function(p_df, p_centers) {
s_df = p_df %>%
kmeans(centers = p_centers) %>%
tidy() %>%
select(any_of(colnames(p_df)))
return(s_df)
}
df_2 = df %>%
group_by(col_1) %>%
summarise(centroid_summary(cur_data(), 3))
Here is how this works:
summarize()
a function that returns a data frame of the desired structure.centroid_summary()
which:p_df
and computes the “centroids” where the rows are treated as observations and the columns as the dimensions.p_centers
the number of centroids to compute.p_centers
.summarize()
passes to centroid_summary()
the values of all columns for the current group one by one. We do so via the helper function cur_data()
. See Dynamic Aggregation.centroid_summary()
to any column name because we wish for its output to generate multiple columns. If we do assign the output of centroid_summary()
to a label, summarize()
will create a nested column with that label as its name in which the data frame returned by centroid_summary()
will be nested which is not the behavior we want. See Nesting.summarize()
then combines the data frames returned from centroid_summary()
for each group by row binding them i.e. stacking them one on top of the other while aligning the columns according to the column names (which are the same for all the data frames).df_2
will have as many columns as the input data frame df
and as many rows as the product of unique values in the column col_1
and the number of centers p_centers
which in this case is 3.centroid_summary()
function operates as follows:p_df
the data frame to be summarized and p_centers
the number of centroids to have as the summary.p_df
with p_centers
as the number of clusters via kmeans()
from base R.tidy()
from the package broom
to convert the model attributes into a data frame.select()
to return just the coordinates of the coordinates along the input columns (without the other model for parameters).