We wish to carry out data aggregation operations conditionally. In other words, the data aggregation operation we wish to carry out is not simply applied to all rows, rather rows or groups are treated differently based on one or more conditions.
In this section, we will cover the following conditional aggregation scenarios:
We wish to include only a subset of a column’s values in the aggregation operation
In this example, we wish to summarize the data frame df
grouped by the values of the column col_1
by computing two data aggregations: (1) The sum of positive values of col_2
for each group and (2) the sum of values of col_2
for the rows in each group where the values in the col_3
column are equal to "b"
.
df_2 = df %>%
group_by(col_1) %>%
summarize(
col_2_s1 = col_2[col_2 > 0] %>% sum(),
col_2_s2 = col_2[col_3 == 'b'] %>% sum())
Here is how this works:
group_by()
prior to calling summarize()
to have the data aggregation operation carried out on each group separately. See Summary Table.col_2[col_2 > 0]
, we select only the positive values of column col_2
.col_2[col_3 == 'b']
, we select only the values of column col_2
that have corresponding values of b
in the column col_3
.col_2_s1
and col_2_s2
respectively.Alternatively:
df_2 = df %>%
group_by(col_1) %>%
summarize(
col_2_s1 = sum(col_2 * (col_2 > 0)),
col_2_s2 = sum(col_2 * (col_3 == 'b')))
Here is how this works:
col_2
for each group.col_2
for each group:(col_2 > 0)
and (col_3 == 'b')
returns a logical vector i.e. a vector composed of TRUE
or FALSE
values.col_2
for the group with the respective logical vector to yield a vector of the same length but where values where the logical vector is FALSE
are set to zero.col_2 * (col_2 > 0)
multiplies each value in column col_2
by 1 if the value is greater than 0 and by 0 otherwise, resulting in a new vector with only the positive values of column col_2
and the rest set to zero.col_2 * (col_3 == 'b')
multiplies each value in column col_2
by 1 if the corresponding value in column col_3
is equal to b
and by 0 otherwise, resulting in a new vector with only the values of column col_2
that have corresponding values of b
in column col_3
and the rest set to zero.False
is treated as zero and True
is treated as 1.We wish carry out data aggregation operations on a subset of the rows of each group.
In this example, we wish to summarize the data frame df
grouped by the values of the column col_1
by computing two data aggregations: (1) The sum of the values of the column col_3
for the rows where the column col_2
takes its three largest values and (2) the mean of the values of the column col_4
also for the rows where the column col_2
takes its three largest values.
df_2 = df %>%
group_by(col_1) %>%
slice_max(col_2, n = 3) %>%
summarize(
col_3_sum = sum(col_3),
col_4_mean = mean(col_4))
Here is how this works:
col_2
has its 3 largest values. We call group_by()
followed by slice_max()
to execute the filtering for each group. See Filtering Groups by Rank.summarize()
.col_3_sum
are computed by using the sum()
function to compute the sum of the values in column col_3
for each group.col_4_mean
are computed by using the mean()
function to compute the mean of the values in column col_4
for each group.df_2
.We wish to use a pick the aggregation function to use for a group based on a condition and return the outputs for the groups as a single summary column.
In this example, we wish to summarize the data frame df
grouped by the values of the column col_1
by computing a data aggregation operation that depends on the size of the group. If the group size is less than 3 we return 0.5 times the group size but if the group size is greater than 3 we return the sum of 0.1 times the values of the column col_2
.
df_2 = df %>%
group_by(col_1) %>%
summarize(col_2_s = if_else(n() < 3,
0.5 * n(),
sum(0.1 * col_2)))
Here is how this works:
group_by()
prior to calling summarize()
to have the data aggregation operations carried out on each group separately. See Summary Table.if_else()
function to decide on the appropriate aggregation function to execute for each group according the value of the condition as follows:if_else()
function returns 0.5 * n()
, where n()
is the number of rows in the group.if_else()
function returns sum(0.1 * col_2)
, which is the sum of the values in the col_2
column multiplied by 0.1.if_else()
and other constructs for conditional transformation in Conditional Transformation.col_2_s
in the output dataframe df_2
.We wish to conditionally alter the data and then carry out an aggregation operation on the altered data.
In this example, we wish to summarize the data frame df
grouped by the values of the column col_1
by computing the weighted sum of the values of the column col_2
where we wish to have values smaller than 10 be given a weight of 2.
df_2 = df %>%
group_by(col_1) %>%
summarize(col_2_s = if_else(col_2 < 10,
col_2 * 2,
col_2) %>% sum())
Here is how this works:
group_by()
prior to calling summarize()
to have the data aggregation operations carried out on each group separately. See Summary Table.if_else()
to check if each value is less than 10. If it is, the value is multiplied by 2. Otherwise, the original value is used. We cover this in detail in Conditional Transformation.sum()
method is called to sum up all the values in the group (after they have been multiplied by 2 or not), and this sum is returned as the value for that group.df_2
, will have one row for each group in df
, with the col_1
column containing the group value, and the col_2_s
column containing the sum of the values in col_2
for that group, after applying the if_else()
function to each value.