We wish to carry out data aggregation operations conditionally. In other words, the data aggregation operation we wish to carry out is not simply applied to all rows, rather rows or groups are treated differently based on one or more conditions.

In this section, we will cover the following conditional aggregation scenarios:

**Subset Column**where we show how to include only a subset of a column’s values in the aggregation operation e.g. aggregate a numeric column by computing the sum of the positive values only.**Filter Group**where we show how to combine filtering with aggregation to first filter a group and then carry out aggregation operations on the remaining rows; e.g. compute aggregations for each group only for the n rows where a particular column has its largest values.- Conditional Choice of
**Aggregation Function**where we cover how to pick the aggregation function to use for a group based on a condition and return the outputs for the groups as a single summary column; e.g. use a different aggregation function based on the size of the group. **Conditional Transformation**where we cover how to combine conditional transformation and aggregation to conditionally alter data and then carry out the aggregation operation on the altered data; e.g. apply a multiplier that depends on the value before taking the sum.

We wish to include only a subset of a column’s values in the aggregation operation

In this example, we wish to summarize the data frame `df`

grouped by the values of the column `col_1`

by computing two data aggregations: (1) The sum of positive values of `col_2`

for each group and (2) the sum of values of `col_2`

for the rows in each group where the values in the `col_3`

column are equal to `"b"`

.

```
df_2 = df %>%
group_by(col_1) %>%
summarize(
col_2_s1 = col_2[col_2 > 0] %>% sum(),
col_2_s2 = col_2[col_3 == 'b'] %>% sum())
```

Here is how this works:

- We call
`group_by()`

prior to calling`summarize()`

to have the data aggregation operation carried out on each group separately. See Summary Table. - We use the bracket operator to subset the values of the column being aggregated i.e. to select only the values that meet a particular condition:
- In
`col_2[col_2 > 0]`

, we select only the positive values of column`col_2`

. - In
`col_2[col_3 == 'b']`

, we select only the values of column`col_2`

that have corresponding values of`b`

in the column`col_3`

.

- In
- We then sum the selected values selected by each of the above expressions to produce the new columns
`col_2_s1`

and`col_2_s2`

respectively.

*Alternatively:*

```
df_2 = df %>%
group_by(col_1) %>%
summarize(
col_2_s1 = sum(col_2 * (col_2 > 0)),
col_2_s2 = sum(col_2 * (col_3 == 'b')))
```

Here is how this works:

- The code is the same as the primary solution above except for the way we extract the positive values of the column
`col_2`

for each group. - We use element-wise multiplication with a logical condition to select the specific values of the column
`col_2`

for each group:- Each of
`(col_2 > 0)`

and`(col_3 == 'b')`

returns a logical vector i.e. a vector composed of`TRUE`

or`FALSE`

values. - We multiply the values of the column
`col_2`

for the group with the respective logical vector to yield a vector of the same length but where values where the logical vector is`FALSE`

are set to zero. - In other words:
`col_2 * (col_2 > 0)`

multiplies each value in column`col_2`

by 1 if the value is greater than 0 and by 0 otherwise, resulting in a new vector with only the positive values of column`col_2`

and the rest set to zero.- Similarly,
`col_2 * (col_3 == 'b')`

multiplies each value in column`col_2`

by 1 if the corresponding value in column`col_3`

is equal to`b`

and by 0 otherwise, resulting in a new vector with only the values of column`col_2`

that have corresponding values of`b`

in column`col_3`

and the rest set to zero.

- We then take the sum.

- Each of
- This works because since our aim is to sum values, we can take advantage of two facts
- The sum of any number of zeros is zero.
- We can multiply a numerical value by a logical value in which case
`False`

is treated as zero and`True`

is treated as 1.

We wish carry out data aggregation operations on a subset of the rows of each group.

In this example, we wish to summarize the data frame `df`

grouped by the values of the column `col_1`

by computing two data aggregations: (1) The sum of the values of the column `col_3`

for the rows where the column `col_2`

takes its three largest values and (2) the mean of the values of the column `col_4`

also for the rows where the column `col_2`

takes its three largest values.

```
df_2 = df %>%
group_by(col_1) %>%
slice_max(col_2, n = 3) %>%
summarize(
col_3_sum = sum(col_3),
col_4_mean = mean(col_4))
```

Here is how this works:

- We start by filtering the rows within each group where the column
`col_2`

has its 3 largest values. We call`group_by()`

followed by`slice_max()`

to execute the filtering for each group. See Filtering Groups by Rank. - We then carry out the desired aggregation operations via
`summarize()`

.- The values of
`col_3_sum`

are computed by using the`sum()`

function to compute the sum of the values in column`col_3`

for each group. - The values of
`col_4_mean`

are computed by using the`mean()`

function to compute the mean of the values in column`col_4`

for each group.

- The values of
- The resulting dataframe is stored in
`df_2`

.

We wish to use a pick the aggregation function to use for a group based on a condition and return the outputs for the groups as a single summary column.

In this example, we wish to summarize the data frame `df`

grouped by the values of the column `col_1`

by computing a data aggregation operation that depends on the size of the group. If the group size is less than 3 we return 0.5 times the group size but if the group size is greater than 3 we return the sum of 0.1 times the values of the column `col_2`

.

```
df_2 = df %>%
group_by(col_1) %>%
summarize(col_2_s = if_else(n() < 3,
0.5 * n(),
sum(0.1 * col_2)))
```

Here is how this works:

- We call
`group_by()`

prior to calling`summarize()`

to have the data aggregation operations carried out on each group separately. See Summary Table. - The aggregation we wish to perform for each group depends on the size of the group. We use the
`if_else()`

function to decide on the appropriate aggregation function to execute for each group according the value of the condition as follows:- If the number of rows in a group is less than 3, the
`if_else()`

function returns`0.5 * n()`

, where`n()`

is the number of rows in the group. - If the number of rows is greater than or equal to 3, the
`if_else()`

function returns`sum(0.1 * col_2)`

, which is the sum of the values in the`col_2`

column multiplied by 0.1. - We cover the user
`if_else()`

and other constructs for conditional transformation in Conditional Transformation.

- If the number of rows in a group is less than 3, the
- The resulting values from all aggregations are combined in the column
`col_2_s`

in the output dataframe`df_2`

.

We wish to conditionally alter the data and then carry out an aggregation operation on the altered data.

In this example, we wish to summarize the data frame `df`

grouped by the values of the column `col_1`

by computing the weighted sum of the values of the column `col_2`

where we wish to have values smaller than 10 be given a weight of 2.

```
df_2 = df %>%
group_by(col_1) %>%
summarize(col_2_s = if_else(col_2 < 10,
col_2 * 2,
col_2) %>% sum())
```

Here is how this works:

- We call
`group_by()`

prior to calling`summarize()`

to have the data aggregation operations carried out on each group separately. See Summary Table. - We wish to modify the data based on a condition and then perform the data aggregation operation on the modified data.
- We use
`if_else()`

to check if each value is less than 10. If it is, the value is multiplied by 2. Otherwise, the original value is used. We cover this in detail in Conditional Transformation. - Then, the
`sum()`

method is called to sum up all the values in the group (after they have been multiplied by 2 or not), and this sum is returned as the value for that group.

- We use
- The resulting dataframe,
`df_2`

, will have one row for each group in`df`

, with the`col_1`

column containing the group value, and the`col_2_s`

column containing the sum of the values in`col_2`

for that group, after applying the`if_else()`

function to each value.

R