We wish to carry out data aggregation operations conditionally. In other words, the data aggregation operation we wish to carry out is not simply applied to all rows, rather rows or groups are treated differently based on one or more conditions.
In this section, we will cover the following conditional aggregation scenarios:
On this page, we demonstrate conditional aggregation to summarize a data frame grouped by one column. See Grouping where we cover multiple grouping scenarios.
Based on Self
We wish to include only a subset of a column’s values in the aggregation operation based on a condition on the values of the column itself.
In this example, we wish to summarize the data frame df
grouped by the values of the column col_1
by computing the sum of positive values of col_2
.
df_2 = (df
.groupby('col_1', as_index=False)
.agg(col_2_s = ('col_2', lambda x: x[x > 0].sum())))
Here is how this works:
agg()
on the output of groupby()
to have the data aggregation operation be carried out for each group. See Summary Table.df
is grouped by the values of the column col_1
.agg()
method is executed on the values of the column col_2
for that group.col_2_s
.x[x > 0].sum()
, we filter the values of the passed column col_2
for the group to only include values greater than 0, and then sum the remaining values.Variant 1: Multiply by Logical
df_2 = (df
.groupby('col_1', as_index=False)
.agg(col_2_s = ('col_2', lambda x: (x * (x > 0)).sum())))
Here is how this works:
col_2
for each group.(x * (x > 0)).sum()
, we use element-wise multiplication with a logical condition to select the specific values of the column col_2
for each group (x
refers to the values of col_2
for the current group):(x > 0)
returns a logical vector i.e. a vector composed of True
or False
values.x * (x > 0)
we multiply the values of the input x
with the logical vector to yield a vector of the same length as x
but where any zero or negative values of x
are set to zero.False
is treated as zero and True
is treated as 1.Variant 2: Individual Aggregation
(df
.groupby('col_1')['col_2']
.agg(lambda x: x[x > 0].sum())))
Here is how this works:
Based on Another Column
We wish to include only a subset of a column’s values in the aggregation operation based on a condition on the values of another column.
In this example, we wish to summarize the data frame df
grouped by the values of the column col_1
by computing the sum of values of col_2
for the rows in each group where the values in the col_3
column are equal to "b"
.
df_2 = (df
.groupby('col_1', as_index=False)
.apply(lambda x: pd.Series({
'col_2_s': (x['col_2'].loc[x['col_3'] == 'b'].sum())
})))
Here is how this works:
df_2
that is a summary of the original data frame df
. See detailed description in Summary Table.apply()
on the output of groupby()
to have the data aggregation operations passed to apply()
be carried out for each group.apply()
because we wish to perform data aggregation operations that involve multiple columns.apply()
return a Series
for each group, so we may assign a name to the created column and more generally to create multiple columns.lambda
function passed to apply()
takes a dataframe x
as input and returns the sum of the values in the column col_2
for the rows in x
where the values in the col_3
column are equal to "b"
. This is done by:x
where the values in col_3
are equal to "b" using the .loc[]
methodsum()
method to the values in the col_2
column for these rows.Variant 1: Multiply by Logical
df_2 = (df
.groupby('col_1', as_index=False)
.apply(lambda x: (x['col_2']
.prod(x['col_3'] == 'b')
.sum())))
Here is how this works:
col_2
for each group.lambda
function passed to apply()
takes a dataframe x
holding the current group as input and returns the sum of the values in the column col_2
for the rows in x
where the values in the col_3
column are equal to "b"
. This is done by:x['col_3'] == 'b'
returns a logical vector i.e. a vector composed of True
or False
values.x['col_2'].prod(x['col_3'] == 'b')
we use the prod()
method to multiply the values of the column col_2
with the logical vector to yield a vector of the same length but where any row where the value of col_3
is not equal ‘b’
is set to zero. We could simply use *
instead of prod()
but the method allows us to chain commands and is arguably more readable.False
is treated as zero and True
is treated as 1.Variant 2: Individual Aggregation
(df
.groupby('col_1')
.apply(lambda x: (x['col_2']
.loc[x['col_3'] == 'b']
.sum())))
Here is how this works:
We wish carry out data aggregation operations on a subset of the rows of each group.
In this example, we wish to summarize the data frame df
grouped by the values of the column col_1
by computing two data aggregations: (1) The sum of the values of the column col_3
for the rows where the column col_2
takes its three largest values and (2) the mean of the values of the column col_4
also for the rows where the column col_2
takes its three largest values.
df_2 = (df
.groupby('col_1', as_index=False)
.apply(lambda x: x.nlargest(3, 'col_2'))
.groupby('col_1', as_index=False)
.agg(
col_3_sum = ('col_3', 'sum'),
col_4_mean = ('col_4', 'mean')))
Here is how this works:
col_2
has its 3 largest values:groupby()
followed by apply()
to execute the filtering for each group.apply()
a lambda function that uses nlargest()
to extract the rows where the column col_2
has its 3 largest values. See Filtering Groups by Rank.col_1
column again because the output of the first call to apply returns an ungrouped data frame.agg()
because the data aggregations operations we are carrying out are functions of one column. If we wish to carry out data aggregations that involve multiple columns we would need to use apply()
. See Summary Table.col_3
and col_4
columns and compute their sum and mean, respectively.Alternatively:
df_2 = (df
.groupby('col_1', as_index=False)
.apply(lambda x: pd.Series({
'col_3_sum': x.nlargest(3, 'col_2')['col_3'].sum(),
'col_4_mean': x.nlargest(3, 'col_2')['col_4'].mean()
})))
Here is how this works:
lambda
function accepts a data frame x
holding the current group andcol_2
has its 3 largest values using nlargest()
sum()
and mean()
respectively.We wish to use a pick the aggregation function to use for a group based on a condition and return the outputs for the groups as a single summary column.
In this example, we wish to summarize the data frame df
grouped by the values of the column col_1
by computing a data aggregation operation that depends on the size of the group. If the group size is less than 3 we return 0.5 times the group size but if the group size is greater than 3 we return the sum of 0.1 times the values of the column col_2
.
df_2 = (df
.groupby('col_1')
.agg(col_2_s = ('col_2',
lambda x: 0.5 * len(x) if len(x) < 3 else 0.1 * x.sum())))
Here is how this works:
agg()
on the output of groupby()
to have the data aggregation operation be carried out for each group. See Summary Table.df
is grouped by the values of the column col_1
.agg()
method is executed on the values of the column col_2
for that group.col_2_s
.lambda
function receives the values of the column col_2 for the group as input x
and uses an if-else
structure to check the condition then execute the appropriate aggregation operation. In this example, this works as follows:len(x) < 3
: check if the length of the group is less than 3 and if so0.5 * len(x)
i.e. 0.5 * the length of the group (or a flat 0.5 per row)0.1 * x.sum()
i.e. 0.1 times the sum of the values in the group (or 10% of the value of col_2
for each row)We wish to conditionally alter the data and then carry out an aggregation operation on the altered data.
In this example, we wish to summarize the data frame df
grouped by the values of the column col_1
by computing the weighted sum of the values of the column col_2
where we wish to have values smaller than 10 be given a weight of 2.
df_2 = (df
.groupby('col_1')
.agg(col_2_s = ('col_2',
lambda x: np.where(x < 10, x * 2, x).sum())))
Here is how this works:
agg()
on the output of groupby()
to have the data aggregation operation be carried out for each group separately. See Summary Table.df
is grouped by the values of the column col_1
.agg()
method is executed on the values of the column col_2
for that group.col_2_s
.np.where()
function to check if each value is less than 10. If it is, the value is multiplied by 2. Otherwise, the original value is used. We cover this in detail in Conditional Transformation.sum()
method is called to sum up all the values in the group (after they have been multiplied by 2 or not), and this sum is returned as the value for that group.