We wish to aggregate data in a way that can only be produced by carrying out multiple aggregation operations in sequence.
An example is computing the monthly average of number of daily events. To do that we would need to carry out to aggregations in sequence:
We will cover two common scenarios of sequential aggregation:
We wish to aggregate data in a way that can only be produced by carrying out multiple aggregation operations in sequence.
In this example, for each value of the column col_1
in the data frame df
, we wish to obtain the average number of rows and average sum of values of col_3
for groups defined by the combinations of col_1
and col_2
.
df_2 = df %>%
group_by(col_1, col_2) %>%
summarize(
count = n(),
value = sum(col_3)) %>%
group_by(col_1) %>%
summarize(
avg_count = mean(count),
avg_value = mean(value))
Here is how this works:
col_1
and col_2
df
by the values in the col_1
and col_2
columns using the group_by()
function.col_3
column via the first call to summarize()
.col_1
and col_2
, and two columns count
and value
.col_1
col_1
.summarize()
.col_1
and two columns avg_count
and avg_value
.group_by()
and get the same results. We kept it here for clarity. See Grouping.We wish to compute a lower resolution aggregation of a higher resolution aggregation of data.
In this example, we wish to aggregate the data frame df
over a date time column col_1
such that for each day, we obtain the average hourly number of rows and average hourly sum of values of the column col_2
.
df_2 = df %>%
group_by(col_1_hour = round_date(col_1, unit = 'hour')) %>%
summarize(
hour_count = n(),
hour_value = sum(col_2)) %>%
group_by(round_date(col_1_hour, unit = 'day')) %>%
summarize(
avg_count = mean(hour_count),
avg_value = mean(hour_value))
Here is how this works:
df
by the values in the column col_1
, rounded to the nearest hour using the round_date()
function from the lubridate
package.col_1_hour
. We discuss creating new grouping columns inside group_by()
in Grouping.col_2
via the first call to summarize()
.col_1
column, and the values in the hour_count
and hour_value
columns are the count and the sum of values of the column col_2
for each hour, respectively.summarize()
.avg_count
and avg_value
.