We wish to specify one or more data aggregation functions to apply to each of a selected set of columns without spelling out each data aggregation expression explicitly.
In this section, we cover the following function specification scenarios:
- Named Function: we wish to apply the same named data aggregation function to each of a selected set of columns.
- Lambda Function: we wish to apply the same lambda function to each of a selected set of columns.
- Multiple Functions: we wish to apply multiple data aggregation functions to each of a selected set of columns.
- Different Functions: we wish to apply different sets of functions to different columns.

Named Function
We wish to apply the same named data aggregation function, e.g. sum(), to each of a set of selected columns.
In this example, we have a data frame df and we wish to compute the sum of values of the columns col_2 and col_4 for each group, where the groups are defined by the values of the column col_1.
df_2 = (df
        .groupby('col_1')[['col_2', 'col_4']]
        .agg('sum'))
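To make this concrete, here is a minimal, self-contained sketch; the frame below is invented sample data that reuses the column names from the text:

```python
import numpy as np
import pandas as pd

# Invented sample data; col_1 defines the groups
df = pd.DataFrame({
    'col_1': ['a', 'a', 'b', 'b'],
    'col_2': [1, 2, 3, 4],
    'col_4': [5.0, np.nan, 7.0, 9.0],
})

df_2 = (df
        .groupby('col_1')[['col_2', 'col_4']]
        .agg('sum'))

# Group 'a': col_2 = 1 + 2 = 3, col_4 = 5.0 (sum skips NaN by default)
# Group 'b': col_2 = 3 + 4 = 7, col_4 = 7.0 + 9.0 = 16.0
print(df_2)
```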
Here is how this works:
- The groupby() method groups the rows of the data frame df by the values in the column col_1. The data aggregation operations applied by the subsequent call to agg() are, therefore, applied to each group separately. See Summary Table.
- agg() applies the function sum() to each of the selected columns, col_2 and col_4, for each group.
- Pandas allows us to refer to the data aggregation methods of a GroupBy object by passing to agg() a string holding their name, like we do here for 'sum'. See Basic Aggregation.
- The output is a new data frame df_2 with one row for each unique value of the column col_1, and two columns col_2 and col_4 containing the sum of values of that column for the corresponding group of rows.

Alternative: Via Pandas Convenience Function
df_2 = df.groupby('col_1')[['col_2', 'col_4']].sum()
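A quick sketch (with invented sample data) confirming that the convenience method produces the same result as agg('sum'):

```python
import numpy as np
import pandas as pd

# Invented sample data
df = pd.DataFrame({
    'col_1': ['a', 'a', 'b', 'b'],
    'col_2': [1, 2, 3, 4],
    'col_4': [5.0, np.nan, 7.0, 9.0],
})

via_agg = df.groupby('col_1')[['col_2', 'col_4']].agg('sum')
via_method = df.groupby('col_1')[['col_2', 'col_4']].sum()

# Both forms dispatch to the same aggregation behind the scenes
assert via_method.equals(via_agg)
```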
Here is how this works:
- In this alternative solution, we call the convenience method sum() directly instead of agg().
- The object returned by groupby() (a DataFrameGroupBy object) has a set of convenience methods that carry out the most common data aggregation operations, which can be used instead of passing a function reference to agg(). Pandas applies the method to each column for each group behind the scenes. One such method is sum(), which we use in this example. See Common Aggregation Operations for a coverage of most Pandas convenience data aggregation functions.

Extension: Passing Function Arguments
We wish to apply a data aggregation function that requires one or more parameter arguments to each of a set of selected columns.
df_2 = (df
        .groupby('col_1')[['col_2', 'col_4']]
        .agg('quantile', q=0.9))
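A minimal runnable sketch (invented sample data) of passing an argument through agg():

```python
import numpy as np
import pandas as pd

# Invented sample data
df = pd.DataFrame({
    'col_1': ['a', 'a', 'b', 'b'],
    'col_2': [1, 2, 3, 4],
    'col_4': [5.0, np.nan, 7.0, 9.0],
})

# agg() forwards q=0.9 to the quantile() method
df_2 = (df
        .groupby('col_1')[['col_2', 'col_4']]
        .agg('quantile', q=0.9))

# With linear interpolation, group 'b' col_4 = 7.0 + 0.9 * (9.0 - 7.0) = 8.8
print(df_2)
```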
Here is how this works:
- We can pass arguments to the data aggregation function by listing them in the call to agg() after the function. agg() will then pass those arguments along to the function.
- In this example, we pass the argument q=0.9 to agg(), which then passes it along to quantile().
- Alternatively, we can use a lambda function to pass parameters to the named function of interest. In this example, the anonymous function solution would be lambda x: x.quantile(q=0.9). See Anonymous Function below.

Lambda Function
We wish to apply the same data aggregating lambda function to each of a set of selected columns.
In this example, we have a data frame df and we wish to calculate the ratio of missing values to all values in each column for each group, where the groups are defined by the values of the column col_1.
df_2 = (df
        .groupby('col_1')[['col_2', 'col_4']]
        .agg(lambda x: x.isna().sum() / len(x)))
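Here is a self-contained sketch with invented sample data, where group 'a' has one missing value in col_4:

```python
import numpy as np
import pandas as pd

# Invented sample data; group 'a' has one NaN in col_4
df = pd.DataFrame({
    'col_1': ['a', 'a', 'b', 'b'],
    'col_2': [1, 2, 3, 4],
    'col_4': [5.0, np.nan, 7.0, 9.0],
})

df_2 = (df
        .groupby('col_1')[['col_2', 'col_4']]
        .agg(lambda x: x.isna().sum() / len(x)))

# Group 'a': col_4 has 1 missing value out of 2, so its ratio is 0.5;
# every other (group, column) pair has ratio 0.0
print(df_2)
```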
Here is how this works:
- The groupby() method groups the rows of the data frame df by the values in the column col_1. The data aggregation operations applied by the subsequent call to agg() are, therefore, applied to each group separately. See Summary Table.
- agg() applies the lambda function to each of the selected columns (col_2 and col_4) for each group.
- The lambda function lambda x: x.isna().sum() / len(x) calculates the number of NA values in a column by applying the isna() method to each value in the column and summing the resulting logical vector. It then divides by the number of values of the current column for the current group to obtain the ratio of missing values.
- The output is a new data frame df_2, with one row for each unique value in the col_1 column, and a column for each selected column, i.e. col_2 and col_4, containing the ratio of missing values for that column for the corresponding group of rows.

Multiple Functions
We wish to perform multiple data aggregation operations on each of a set of selected columns of a data frame individually without having to spell out each data aggregation explicitly.
In this example, we have a data frame df and we wish to count the number of unique values and compute the 90th percentile value of the columns col_2 and col_4 for each group, where the groups are defined by the values of the column col_1.
df_2 = (df
        .groupby('col_1')[['col_2', 'col_4']]
        .agg(['nunique', lambda x: x.quantile(0.9)]))
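A runnable sketch (invented sample data) that also shows the hierarchical column labels this produces:

```python
import numpy as np
import pandas as pd

# Invented sample data
df = pd.DataFrame({
    'col_1': ['a', 'a', 'b', 'b'],
    'col_2': [1, 2, 3, 4],
    'col_4': [5.0, np.nan, 7.0, 9.0],
})

df_2 = (df
        .groupby('col_1')[['col_2', 'col_4']]
        .agg(['nunique', lambda x: x.quantile(0.9)]))

# The columns form a two-level MultiIndex, e.g. ('col_2', 'nunique'),
# ('col_2', '<lambda_0>'), ('col_4', 'nunique'), ('col_4', '<lambda_0>')
print(df_2.columns.tolist())
```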
Here is how this works:
- The groupby() method groups the rows of the data frame df by the values in the column col_1. The data aggregation operations applied by the subsequent call to agg() are, therefore, applied to each group separately. See Summary Table.
- Via ['col_2', 'col_4'] we select the columns that we wish to aggregate over.
- We pass to agg() a list of the aggregating functions or lambda functions that we wish to carry out.
- nunique() returns the number of unique values in the input vector.
- When passing a list of functions to agg(), one way to pass arguments to one of the functions is to use a lambda function. In this example, the lambda function lambda x: x.quantile(0.9) allows us to calculate the 90th percentile of values of the input vector, which requires us to set the argument q to 0.9.
- The output of agg() with a list of functions on a grouped data frame is a data frame with a hierarchical column MultiIndex where the first level holds the names of the selected columns, col_2 and col_4, and the second level holds the names of the applied functions, nunique and <lambda_0>. In general, columns resulting from lambda functions are called <lambda_0>, <lambda_1>, etc. See MultiIndex.
- The output is a new data frame df_2, with one row for each unique value in the column col_1, and two columns for each of the columns col_2 and col_4 of the original df. The first of these two new columns will contain the number of unique values, and the second will contain the 90th percentile of the values, in the corresponding column for the corresponding group of rows.

Alternative: Wrapper Function
def quantile_90(x):
    return x.quantile(0.9)

df_2 = (df
        .groupby('col_1')[['col_2', 'col_4']]
        .agg(['nunique', quantile_90]))
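A self-contained sketch (invented sample data) showing that the wrapper function also yields a readable column label:

```python
import numpy as np
import pandas as pd

# Invented sample data
df = pd.DataFrame({
    'col_1': ['a', 'a', 'b', 'b'],
    'col_2': [1, 2, 3, 4],
    'col_4': [5.0, np.nan, 7.0, 9.0],
})

def quantile_90(x):
    return x.quantile(0.9)

df_2 = (df
        .groupby('col_1')[['col_2', 'col_4']]
        .agg(['nunique', quantile_90]))

# The second column level now shows the readable name 'quantile_90'
# instead of '<lambda_0>'
print(df_2.columns.tolist())
```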
Here is how this works:
- We define quantile_90() as a named wrapper function around the quantile() method.
- The wrapper function calls quantile() with the desired argument values set; i.e. q=0.9. We then call the wrapper function from inside agg(), which labels the resulting columns with the wrapper's name.

Different Functions
We wish to have different sets of one or more functions applied to different columns.
In this example, we have a data frame df that we wish to summarise over groups defined by the value of the column col_1. We wish to compute: (1) the sum of values of the columns col_2 and col_3 for each group, and (2) the mean and standard deviation of values of the column col_4 for each group.
df_2 = (df
        .groupby('col_1')
        .agg({'col_2': 'sum',
              'col_3': 'sum',
              'col_4': ['mean', 'std']}))
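A runnable sketch with invented sample data:

```python
import numpy as np
import pandas as pd

# Invented sample data
df = pd.DataFrame({
    'col_1': ['a', 'a', 'b', 'b'],
    'col_2': [1, 2, 3, 4],
    'col_3': [10, 20, 30, 40],
    'col_4': [5.0, np.nan, 7.0, 9.0],
})

df_2 = (df
        .groupby('col_1')
        .agg({'col_2': 'sum',
              'col_3': 'sum',
              'col_4': ['mean', 'std']}))

# Because col_4 maps to a list of functions, the columns form a MultiIndex:
# ('col_2', 'sum'), ('col_3', 'sum'), ('col_4', 'mean'), ('col_4', 'std')
print(df_2)
```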
Here is how this works:
- The groupby() method groups the rows of the data frame df by the values in the column col_1. The data aggregation operations applied by the subsequent call to agg() are, therefore, applied to each group separately. See Summary Table.
- We pass a dictionary to agg() where the keys are the column names and the values are either a single function or a list of functions. This option to pass a dictionary to agg() allows us to generate extensive summaries with few keystrokes.
- In this example, we pass to agg() a dictionary with three key-value pairs as follows: 'col_2': 'sum' specifies that we wish to compute the sum of values of the column col_2 for each group; similarly for 'col_3': 'sum'; and 'col_4': ['mean', 'std'] specifies that we wish to compute the mean and standard deviation of values of the column col_4 for each group.
- The output is a new data frame df_2, with one row for each unique value in the column col_1, and four columns. The first two columns will contain the sum of values in the columns col_2 and col_3, respectively, for the corresponding group of rows. The next two columns will contain the mean and standard deviation of values of the column col_4 for the corresponding group of rows.

Alternative: Join Summary Tables
df_a = (df
        .groupby('col_1')[['col_2', 'col_3']]
        .agg('sum'))

df_b = (df
        .groupby('col_1')['col_4']
        .agg(['mean', 'std']))

df_2 = pd.concat([df_a, df_b], axis=1)
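A self-contained sketch of this alternative, again with invented sample data:

```python
import numpy as np
import pandas as pd

# Invented sample data
df = pd.DataFrame({
    'col_1': ['a', 'a', 'b', 'b'],
    'col_2': [1, 2, 3, 4],
    'col_3': [10, 20, 30, 40],
    'col_4': [5.0, np.nan, 7.0, 9.0],
})

df_a = df.groupby('col_1')[['col_2', 'col_3']].agg('sum')
df_b = df.groupby('col_1')['col_4'].agg(['mean', 'std'])

# Both frames share the group index, so concat aligns them row by row
df_2 = pd.concat([df_a, df_b], axis=1)
print(df_2)
```

Note that, unlike the dictionary solution, the result here has flat column labels ('col_2', 'col_3', 'mean', 'std') rather than a MultiIndex.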
Here is how this works:
- In this alternative solution, instead of passing a dictionary to agg(), we generate multiple summary data frames and then join them (column bind them) via concat().
- Both summary data frames are grouped by the values of the column col_1.
- df_a holds the sum of values of the columns col_2 and col_3 for each group.
- df_b holds the mean and standard deviation of values of the column col_4 for each group.
- Via pd.concat([df_a, df_b], axis=1) we specify that we wish to concatenate the columns of df_b next to the columns of df_a (hence axis=1), matching rows via the index (which holds the groups). See Reshaping.