We wish to carry out an individual data aggregation operation e.g. we wish to obtain the sum of the values of a particular numeric column. One common situation where quick individual data aggregation is needed is during interactive inspection of data
We will cover two scenarios:
This section is complemented by:
We wish to perform one data aggregation operation on a data frame that is not grouped.
One Column Function
We wish to perform one data aggregation operation on a column of a data frame (that is not grouped) reducing that one column to a single scalar value.
In this example we wish to compute the sum of the column ‘col_2'
of the data frame df
.
df['col_2'].sum()
Here is how this works:
df['col_2']
, we select the column that we wish to summarize via the bracket operator returning a Series
. See Selecting.sum()
of Pandas Series
to compute the sum of values of the column col_2
returning a single Scalar value.Alternatively,
from scipy.stats import gmean
gmean(df['col_2'])
Here is how this works:
Series
resulting from df['col_2']
to the aggregation function that we wish to apply.Series
convenience function such as the geometric mean function gmean()
from scipy.stats
in this example.Multiple Column Function
We wish to perform an aggregation operation on an entire data frame (that is not grouped) reducing multiple columns to a single scalar value.
In this example, we wish to compute the weighted average of the column ‘col_2'
where the weights are given by the column ‘col_3'
of the data frame df
.
np.average(df['col_2'], weights=df['col_3'])
Here is how this works:
average()
function from NumPy
package.df['col_2']
to the first argument of average()
and the column to use as weights to the weights argument in weights=df['col_3']
.Alternatively:
df.pipe(lambda x: np.average(x['col_2'], weights=x['col_3']))
Here is how this works:
pipe()
method to pass the data frame df to a lambda
function that computes the weighted average via the average(
) function from NumPy
.We wish to perform one data aggregation operation on a grouped data frame.
One Column Function
We wish to perform an aggregation on a grouped data frame. The aggregation operation we wish to perform is a function of one column.
In this example, we wish to perform two individual aggregations on a data frame df
grouped by the column ‘col_1’
. Those aggregations are: (1) Obtain the sum of the value of the column ‘col_2'
for each group. (2) Obtain the 90th percentile value of the column ‘col_2'
for each group.
df.groupby('col_1')['col_2'].agg('sum')
df.groupby('col_1')['col_2'].agg('quantile', 0.9)
Here is how this works:
df.groupby('col_1')
, we group the data frame df
by the values of the column ‘col_1’
.'col_2'
returning a grouped Series
(more precisely a SeriesGroupBy
object).agg()
method of grouped Series
(SeriesGroupBy
objects) to compute the aggregate value of the Series
for each group.agg()
.np.mean()
or any custom function by its name. We can refer to Pandas convenience functions via a string of their name e.g. ‘quantile’
instead of its qualified name SeriesGroupBy.quantile()
. We cover many examples of this in Common Aggregation Operations.Series
.agg()
like we do in agg(’quantile’, 0.9)
.Series
where each element is the output of aggregation for each group and where the Index
is the group identifier.sqrt()
an exception is raised ValueError: function does not reduce
.Alternatively:
df.groupby('col_1')['col_2'].sum()
df.groupby('col_1')['col_2'].quantile(0.9)
Here is how this works:
Pandas
provides convenience methods for the most common aggregation functions such as sum()
and quantile()
. We cover most of those convenience methods in Common Aggregation Operations.SeriesGroupBy
object) like we do here.agg()
i.e. a Series
where each element is the output of aggregation for each group and where the Index
is the group identifier.Multiple Column Function
We wish to perform an aggregation on a grouped data frame. The aggregation operation we wish to perform is a function of multiple columns.
In this example, we wish to compute the weighted mean of the column ‘col_2'
where the weights are given by the column ‘col_3'
. We wish to perform the aggregation for each group where the groups are defined by the column ‘col_1'
.
df.groupby(col_1).apply(lambda x: np.average(x['col_2'], weights=x['col_3']))
Here is how this works:
agg()
which we used for one column aggregation can’t be used to execute data aggregation operations that involve multiple columns. The reason being that agg()
acts on one column at a time.apply()
.DataFrameGroupBy
object), apply()
receives each sub data frame (for each group) as a data frame allowing access to all columns for that group.df.groupby(col_1)
, we group the data frame df by the values of the column col_1
.lambda x: np.average(x['col_2'], weights=x['col_3'])
, we built a lambda
function that passes the values of the columns ‘col_2’
and ‘col_3’
to the values and weights arguments of the weighted average function average()
from NumPy
.