We wish to carry out data aggregation operations conditionally. In other words, the data aggregation operation we wish to carry out is not simply applied to all rows, rather rows or groups are treated differently based on one or more conditions.

In this section, we will cover the following conditional aggregation scenarios:

**Subset Column**where we show how to include only a subset of a column’s values in the aggregation operation e.g. aggregate a numeric column by computing the sum of the positive values only.**Filter Group**where we show how to combine filtering with aggregation to first filter a group and then carry out aggregation operations on the remaining rows; e.g. compute aggregations for each group only for the n rows where a particular column has its largest values.- Conditional Choice of
**Aggregation Function**where we cover how to pick the aggregation function to use for a group based on a condition and return the outputs for the groups as a single summary column; e.g. use a different aggregation function based on the size of the group. **Conditional Transformation**where we cover how to combine conditional transformation and aggregation to conditionally alter data and then carry out the aggregation operation on the altered data; e.g. apply a multiplier that depends on the value before taking the sum.

On this page, we demonstrate conditional aggregation to summarize a data frame grouped by one column. See Grouping where we cover multiple grouping scenarios.

**Based on Self**

We wish to include only a subset of a column’s values in the aggregation operation based on a condition on the values of the column itself.

In this example, we wish to summarize the data frame `df`

grouped by the values of the column `col_1`

by computing the sum of positive values of `col_2`

.

```
df_2 = (df
.groupby('col_1', as_index=False)
.agg(col_2_s = ('col_2', lambda x: x[x > 0].sum())))
```

Here is how this works:

- We call
`agg()`

on the output of`groupby()`

to have the data aggregation operation be carried out for each group. See Summary Table.- The data frame
`df`

is grouped by the values of the column`col_1`

. - For each group, the lambda function passed to the
`agg()`

method is executed on the values of the column`col_2`

for that group. - The output is saved in a new column called
`col_2_s`

.

- The data frame
- In
`x[x > 0].sum()`

, we filter the values of the passed column`col_2`

for the group to only include values greater than 0, and then sum the remaining values.

*Variant 1: Multiply by Logical*

```
df_2 = (df
.groupby('col_1', as_index=False)
.agg(col_2_s = ('col_2', lambda x: (x * (x > 0)).sum())))
```

Here is how this works:

- The code is the same as the primary solution above except for the way we extract the positive values of the column
`col_2`

for each group. - In
`(x * (x > 0)).sum()`

, we use element-wise multiplication with a logical condition to select the specific values of the column`col_2`

for each group (`x`

refers to the values of`col_2`

for the current group):`(x > 0)`

returns a logical vector i.e. a vector composed of`True`

or`False`

values.- In
`x * (x > 0)`

we multiply the values of the input`x`

with the logical vector to yield a vector of the same length as`x`

but where any zero or negative values of`x`

are set to zero. - We then take the sum.

- This works because since our aim is to sum values, we can take advantage of two facts
- The sum of any number of zeros is zero.
- We can multiply a numerical value by a logical value in which case
`False`

is treated as zero and`True`

is treated as 1.

*Variant 2: Individual Aggregation*

```
(df
.groupby('col_1')['col_2']
.agg(lambda x: x[x > 0].sum())))
```

Here is how this works:

- In the primary solution above we carried out the conditional aggregation operation in the context of summarizing a data frame into a summary data frame. See Table Summary for a description of the pattern.
- In this variant solution, we carried out the conditional aggregation operation as an individual aggregation. See Individual Aggregation for a description of the pattern.

**Based on Another Column**

We wish to include only a subset of a column’s values in the aggregation operation based on a condition on the values of another column.

In this example, we wish to summarize the data frame `df`

grouped by the values of the column `col_1`

by computing the sum of values of `col_2`

for the rows in each group where the values in the `col_3`

column are equal to `"b"`

.

```
df_2 = (df
.groupby('col_1', as_index=False)
.apply(lambda x: pd.Series({
'col_2_s': (x['col_2'].loc[x['col_3'] == 'b'].sum())
})))
```

Here is how this works:

- We create a new data frame
`df_2`

that is a summary of the original data frame`df`

. See detailed description in Summary Table.- We call
`apply()`

on the output of`groupby()`

to have the data aggregation operations passed to`apply()`

be carried out for each group. - We use
`apply()`

because we wish to perform data aggregation operations that involve multiple columns. - We have
`apply()`

return a`Series`

for each group, so we may assign a name to the created column and more generally to create multiple columns.

- We call
- The
`lambda`

function passed to`apply()`

takes a dataframe`x`

as input and returns the sum of the values in the column`col_2`

for the rows in`x`

where the values in the`col_3`

column are equal to`"b"`

. This is done by:- first selecting only the rows in
`x`

where the values in`col_3`

are equal to "b" using the`.loc[]`

method - and then applying the
`sum()`

method to the values in the`col_2`

column for these rows.

- first selecting only the rows in

*Variant 1: Multiply by Logical*

```
df_2 = (df
.groupby('col_1', as_index=False)
.apply(lambda x: (x['col_2']
.prod(x['col_3'] == 'b')
.sum())))
```

Here is how this works:

- The code is the same as the primary solution above except for the way we extract the positive values of the column
`col_2`

for each group. - The
`lambda`

function passed to`apply()`

takes a dataframe`x`

holding the current group as input and returns the sum of the values in the column`col_2`

for the rows in`x`

where the values in the`col_3`

column are equal to`"b"`

. This is done by:`x['col_3'] == 'b'`

returns a logical vector i.e. a vector composed of`True`

or`False`

values.- In
`x['col_2'].prod(x['col_3'] == 'b')`

we use the`prod()`

method to multiply the values of the column`col_2`

with the logical vector to yield a vector of the same length but where any row where the value of`col_3`

is not equal`‘b’`

is set to zero. We could simply use`*`

instead of`prod()`

but the method allows us to chain commands and is arguably more readable. - We then take the sum.

- This works because since our aim is to sum values, we can take advantage of two facts
- The sum of any number of zeros is zero.
- We can multiply a numerical value by a logical value in which case
`False`

is treated as zero and`True`

is treated as 1.

*Variant 2: Individual Aggregation*

```
(df
.groupby('col_1')
.apply(lambda x: (x['col_2']
.loc[x['col_3'] == 'b']
.sum())))
```

Here is how this works:

- In the primary solution above we carried out the conditional aggregation operation in the context of summarizing a data frame into a summary data frame. See Table Summary for a description of the pattern.
- In this variant solution, we carried out the conditional aggregation operation as an individual aggregation. See Individual Aggregation for a description of the pattern.

We wish carry out data aggregation operations on a subset of the rows of each group.

In this example, we wish to summarize the data frame `df`

grouped by the values of the column `col_1`

by computing two data aggregations: (1) The sum of the values of the column `col_3`

for the rows where the column `col_2`

takes its three largest values and (2) the mean of the values of the column `col_4`

also for the rows where the column `col_2`

takes its three largest values.

```
df_2 = (df
.groupby('col_1', as_index=False)
.apply(lambda x: x.nlargest(3, 'col_2'))
.groupby('col_1', as_index=False)
.agg(
col_3_sum = ('col_3', 'sum'),
col_4_mean = ('col_4', 'mean')))
```

Here is how this works:

- We start by filtering the rows within each group where the column
`col_2`

has its 3 largest values:- We call
`groupby()`

followed by`apply()`

to execute the filtering for each group. - We pass to
`apply()`

a lambda function that uses`nlargest()`

to extract the rows where the column`col_2`

has its 3 largest values. See Filtering Groups by Rank.

- We call
- We then carry out the desired data aggregation operations:
- We group the selected rows by the values in the
`col_1`

column again because the output of the first call to apply returns an ungrouped data frame. - We use
`agg()`

because the data aggregations operations we are carrying out are functions of one column. If we wish to carry out data aggregations that involve multiple columns we would need to use`apply()`

. See Summary Table. - The aggregation functions for each group take the values in the
`col_3`

and`col_4`

columns and compute their sum and mean, respectively.

- We group the selected rows by the values in the

*Alternatively:*

```
df_2 = (df
.groupby('col_1', as_index=False)
.apply(lambda x: pd.Series({
'col_3_sum': x.nlargest(3, 'col_2')['col_3'].sum(),
'col_4_mean': x.nlargest(3, 'col_2')['col_4'].mean()
})))
```

Here is how this works:

- This code works similarly to the primary solution above except that we combine filtering and aggregation in the same lambda function.
- The
`lambda`

function accepts a data frame`x`

holding the current group and- first filters the rows where
`col_2`

has its 3 largest values using`nlargest()`

- then carries out the desired data aggregation operations;
`sum()`

and`mean()`

respectively.

- first filters the rows where
- This code is probably easier to write but involves doing the filtering operation multiple times which is okay for small datasets but may not be okay for large datasets.

We wish to use a pick the aggregation function to use for a group based on a condition and return the outputs for the groups as a single summary column.

In this example, we wish to summarize the data frame `df`

grouped by the values of the column `col_1`

by computing a data aggregation operation that depends on the size of the group. If the group size is less than 3 we return 0.5 times the group size but if the group size is greater than 3 we return the sum of 0.1 times the values of the column `col_2`

.

```
df_2 = (df
.groupby('col_1')
.agg(col_2_s = ('col_2',
lambda x: 0.5 * len(x) if len(x) < 3 else 0.1 * x.sum())))
```

Here is how this works:

- We call
`agg()`

on the output of`groupby()`

to have the data aggregation operation be carried out for each group. See Summary Table.- The data frame
`df`

is grouped by the values of the column`col_1`

. - For each group, the lambda function passed to the
`agg()`

method is executed on the values of the column`col_2`

for that group. - The output is saved in a new column called
`col_2_s`

.

- The data frame
- The aggregation we wish to perform for each group depends on the size of the group, therefore the
`lambda`

function receives the values of the column col_2 for the group as input`x`

and uses an`if-else`

structure to check the condition then execute the appropriate aggregation operation. In this example, this works as follows:`len(x) < 3`

: check if the length of the group is less than 3 and if so- return
`0.5 * len(x)`

i.e. 0.5 * the length of the group (or a flat 0.5 per row) - else it returns
`0.1 * x.sum()`

i.e. 0.1 times the sum of the values in the group (or 10% of the value of`col_2`

for each row)

We wish to conditionally alter the data and then carry out an aggregation operation on the altered data.

In this example, we wish to summarize the data frame `df`

grouped by the values of the column `col_1`

by computing the weighted sum of the values of the column `col_2`

where we wish to have values smaller than 10 be given a weight of 2.

```
df_2 = (df
.groupby('col_1')
.agg(col_2_s = ('col_2',
lambda x: np.where(x < 10, x * 2, x).sum())))
```

Here is how this works:

- We call
`agg()`

on the output of`groupby()`

to have the data aggregation operation be carried out for each group separately. See Summary Table.- The data frame
`df`

is grouped by the values of the column`col_1`

. - For each group, the lambda function passed to the
`agg()`

method is executed on the values of the column`col_2`

for that group. - The output is saved in a new column called
`col_2_s`

.

- The data frame
- We wish to modify the data based on a condition and then perform the data aggregation operation on the modified data.
- We use the
`np.where()`

function to check if each value is less than 10. If it is, the value is multiplied by 2. Otherwise, the original value is used. We cover this in detail in Conditional Transformation. - Then, the
`sum()`

method is called to sum up all the values in the group (after they have been multiplied by 2 or not), and this sum is returned as the value for that group.

- We use the

PYTHON