Basic Aggregation

At a high level, there are two situations where we need to perform data aggregation:

  1. When we wish to perform an individual aggregation operation, often in an interactive inspection setting, e.g. to obtain the sum of the values of one particular numeric column.
  2. When we wish to reduce an input data frame into a summary data frame, i.e. a data frame whose columns hold the results of certain summary operations, e.g. sum or mean.
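
As a minimal sketch of these two situations, assuming a small made-up data frame (the column names store, units, and price are illustrative and not part of the original text):

```python
import pandas as pd

# Hypothetical example data; the column names are illustrative only.
df = pd.DataFrame({
    "store": ["A", "A", "B", "B"],
    "units": [3, 5, 2, 6],
    "price": [10.0, 12.0, 8.0, 9.0],
})

# Situation 1: one individual aggregation, e.g. during interactive inspection.
total_units = df["units"].sum()

# Situation 2: reduce the data frame to a summary data frame whose columns
# are summary operations (here via named aggregation on a grouped data frame).
summary = df.groupby("store").agg(
    total_units=("units", "sum"),
    mean_price=("price", "mean"),
)
```

Here total_units is a single number, whereas summary is itself a data frame with one row per store and one column per summary operation.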

Moreover, a data aggregation operation is either performed on:

  1. An entire data frame, typically generating a single scalar value.
  2. A grouped data frame, typically generating one summary value per group.
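
The difference in output shape can be illustrated with a short sketch. The data frame and its team and score columns are assumptions made up for this example:

```python
import pandas as pd

# Hypothetical example data; the column names are illustrative only.
df = pd.DataFrame({
    "team": ["x", "x", "y"],
    "score": [1, 2, 4],
})

# Aggregating over the entire data frame (here one column of it)
# yields a single scalar value.
overall_mean = df["score"].mean()

# Aggregating a grouped data frame yields one value per group,
# returned as a Series indexed by the grouping key.
mean_per_team = df.groupby("team")["score"].mean()
```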

Following the above distinctions, this section is organized as follows:

  • Individual Aggregation, where we cover how to carry out one aggregation operation on an Entire Data Frame or on a Grouped Data Frame.
  • Summary Table, where we cover how to carry out multiple data aggregation operations on a data frame that is typically grouped by one or more variables, and return a summary data frame.
  • Common Operations, where we cover some of the most common data aggregation operations for each data type, e.g. sum and mean for numeric columns, and string concatenation for string columns.

Generally speaking, in Pandas there are two modes of computing data aggregation operations:

  1. Using the agg() or apply() methods (which are referred to as the aggregation API).
    • The advantages are general applicability and, hence, consistency, as well as the ability to apply multiple aggregation functions (which we cover in Implicit Aggregation).
    • The drawback is that it can be too verbose, especially in a data inspection context.
  2. Using a set of convenience methods provided by Pandas that can be applied directly to a data frame or to a column of a data frame (i.e. a Series), as well as to a grouped data frame or a grouped Series. These cover the most common aggregation operations, such as count, sum, mean, and median. See Common Operations.
    • The advantage is brevity which makes this a great option for data inspection.
    • The drawback is limited applicability: although the code is succinct and convenient, only the methods that the Pandas authors implemented are available, so if we need an operation that is not covered, we must fall back on the aggregation API.
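
The two modes can be compared side by side in a brief sketch. The data frame, its column names, and the range operation used as an example of an operation without a convenience method are all assumptions chosen for illustration:

```python
import pandas as pd

# Hypothetical example data; the column names are illustrative only.
df = pd.DataFrame({
    "group": ["a", "a", "b"],
    "value": [1.0, 3.0, 5.0],
})

# Mode 1: the aggregation API.
sum_via_agg = df["value"].agg("sum")
grouped_sum_via_agg = df.groupby("group")["value"].agg("sum")

# The aggregation API also accepts several functions at once ...
several = df["value"].agg(["sum", "mean"])

# ... and arbitrary callables, for operations that have no convenience
# method (here the range, max minus min, of each group).
value_range = df.groupby("group")["value"].agg(lambda s: s.max() - s.min())

# Mode 2: convenience methods, which are shorter for common operations.
sum_via_method = df["value"].sum()
grouped_sum_via_method = df.groupby("group")["value"].sum()
```

For operations that both modes support, such as the sum here, the results are the same and the choice is mostly one of brevity versus generality.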

In this section we will show solutions that implement both approaches.
