Individual Aggregation

We wish to carry out an individual data aggregation operation e.g. we wish to obtain the sum of the values of a particular numeric column. One common situation where quick individual data aggregation is needed is during interactive inspection of data

We will cover two scenarios:

  1. Performing a data aggregation operation on an entire data frame (not grouped) reducing one or more columns to a single scalar value.
  2. Performing a data aggregation operation on a grouped data frame reducing one or more columns into a set of values one for each group.

This section is complemented by:

Data Frame

We wish to perform one data aggregation operation on a data frame that is not grouped.

One Column Function

We wish to perform one data aggregation operation on a column of a data frame (that is not grouped) reducing that one column to a single scalar value.

In this example we wish to compute the sum of the column ‘col_2' of the data frame df.

df['col_2'].sum()

Here is how this works:

  • In df['col_2'], we select the column that we wish to summarize via the bracket operator returning a Series. See Selecting.
  • We use the convenience method sum() of Pandas Series to compute the sum of values of the column col_2 returning a single Scalar value.
  • See Common Aggregation Operations for a coverage of the most common data aggregation operations.

Alternatively,

from scipy.stats import gmean
gmean(df['col_2'])

Here is how this works:

  • We pass the Series resulting from df['col_2'] to the aggregation function that we wish to apply.
  • The aggregation function in question should accept a Series or an array like object and return a summary value.
  • This approach is appropriate when the function we wish to apply doesn't exist as a Pandas Series convenience function such as the geometric mean function gmean() from scipy.stats in this example.

Multiple Column Function

We wish to perform an aggregation operation on an entire data frame (that is not grouped) reducing multiple columns to a single scalar value.

In this example, we wish to compute the weighted average of the column ‘col_2' where the weights are given by the column ‘col_3' of the data frame df.

np.average(df['col_2'], weights=df['col_3'])

Here is how this works:

  • There are no convenience Pandas data frame methods that operate on multiple columns.
  • To compute the weighted average, we use the average() function from NumPy package.
  • We pass the column whose average we wish to compute df['col_2'] to the first argument of average() and the column to use as weights to the weights argument in weights=df['col_3'].

Alternatively:

df.pipe(lambda x: np.average(x['col_2'], weights=x['col_3']))

Here is how this works:

  • We use the pipe() method to pass the data frame df to a lambda function that computes the weighted average via the average() function from NumPy.
  • This approach may be appropriate if we wish to compute the data aggregation operation as part of a larger chain of data transformation operations.

Grouped Data Frame

We wish to perform one data aggregation operation on a grouped data frame.

One Column Function

We wish to perform an aggregation on a grouped data frame. The aggregation operation we wish to perform is a function of one column.

In this example, we wish to perform two individual aggregations on a data frame df grouped by the column ‘col_1’. Those aggregations are: (1) Obtain the sum of the value of the column ‘col_2' for each group. (2) Obtain the 90th percentile value of the column ‘col_2' for each group.

df.groupby('col_1')['col_2'].agg('sum')
df.groupby('col_1')['col_2'].agg('quantile', 0.9)

Here is how this works:

  • In df.groupby('col_1'), we group the data frame df by the values of the column ‘col_1’.
  • We then select the column 'col_2' returning a grouped Series (more precisely a SeriesGroupBy object).
  • We use the agg() method of grouped Series (SeriesGroupBy objects) to compute the aggregate value of the Series for each group.
  • Any aggregating function, i.e. one that takes multiple individual values and returns a single summary value, can be passed to agg().
  • We can pass a qualified function reference, be it a built-in function e.g. np.mean() or any custom function by its name. We can refer to Pandas convenience functions via a string of their name e.g. ‘quantile’ instead of its qualified name SeriesGroupBy.quantile(). We cover many examples of this in Common Aggregation Operations.
  • Pandas implicitly passes the values of the aggregated column for each group to the aggregating function as a Series.
  • We can pass any number of positional arguments or named arguments to the function by simply passing them to agg() like we do in agg(’quantile’, 0.9).
  • The output is a Series where each element is the output of aggregation for each group and where the Index is the group identifier.
  • If we pass a non aggregating function i.e. the output per group has more than one scalar value e.g. sqrt() an exception is raised ValueError: function does not reduce.

Alternatively:

df.groupby('col_1')['col_2'].sum()
df.groupby('col_1')['col_2'].quantile(0.9)

Here is how this works:

  • Pandas provides convenience methods for the most common aggregation functions such as sum() and quantile(). We cover most of those convenience methods in Common Aggregation Operations.
  • We can call those convenience aggregation methods directly on a grouped Series (SeriesGroupBy object) like we do here.
  • The output is exactly the same as using agg() i.e. a Series where each element is the output of aggregation for each group and where the Index is the group identifier.

Multiple Column Function

We wish to perform an aggregation on a grouped data frame. The aggregation operation we wish to perform is a function of multiple columns.

In this example, we wish to compute the weighted mean of the column ‘col_2' where the weights are given by the column ‘col_3'. We wish to perform the aggregation for each group where the groups are defined by the column ‘col_1'.

df.groupby(col_1).apply(lambda x: np.average(x['col_2'], weights=x['col_3']))

Here is how this works:

  • agg() which we used for one column aggregation can’t be used to execute data aggregation operations that involve multiple columns. The reason being that agg() acts on one column at a time.
  • The way to go when we wish to perform a grouped aggregation that involves multiple columns is to use apply().
  • When called on a grouped data frame (a DataFrameGroupBy object), apply() receives each sub data frame (for each group) as a data frame allowing access to all columns for that group.
  • In df.groupby(col_1), we group the data frame df by the values of the column col_1.
  • In lambda x: np.average(x['col_2'], weights=x['col_3']), we built a lambda function that passes the values of the columns ‘col_2’ and ‘col_3’ to the values and weights arguments of the weighted average function average() from NumPy.
  • Instead of a lambda function, we can create a custom function and call it from within apply(). We show an example of that in Common Aggregation Operations.
PYTHON
I/O