Aggregating

One of the most common data manipulation tasks is data aggregation where, in its simplest form, we carry out operations that reduce multiple values of a column into a single value e.g. compute the sum of values of a numeric column. Typically, we perform data aggregation on mutually exclusive groups of rows then combine the results vertically into a new column or Data Frame.

In its simplest form, a typical data aggregation operation in Pandas looks like so:

df_2 = (df
  .groupby('col_1')
  .agg(
    col_2_min = ('col_2', 'min'),
    col_3_75th = ('col_3', 'sum')))

where we use groupby() to create mutually exclusive groups of rows, and we use agg() to carry out data aggregation operations on each group.

This section is organized as follows:

  • Basic Aggregation where we cover how to perform basic data aggregation operations in two contexts: (1) Individual aggregation operations typically as part of interactive data inspection and (2) Generating a Summary Table by carrying out multiple data aggregation operations on a data frame.
  • Conditional Aggregation where we cover how to carry out data aggregation operations that are not simply applied to all rows, rather rows or groups are treated differently based on one or more conditions. For example, compute the sum of values of a column where another column has a particular value.
  • Sequential Aggregation where we cover how to aggregate data in a way that can only be produced by carrying out multiple aggregation operations in sequence. For example, compute the monthly average of number of daily events.
  • Implicit Aggregation where we cover how to succinctly apply one or more data aggregation expressions to a selected set of columns without having to spell out each data aggregation operation explicitly. For example, compute the sum, mean, and standard deviation of all numeric columns.
  • Multi-Value Aggregation where we cover data aggregation operations that return multiple values (as opposed to the common scenario of a data aggregation operation returning a single scalar value).
  • Pivot Aggregation where we cover generating data summaries in the form of pivot tables. Oftentimes, a pivot table organization makes it easier to understand and draw insights from a data summary.
  • Dynamic Aggregation where we cover how to specify aspects of data aggregation dynamically i.e. through environment variables or function arguments. For instance, we wish to pass to a function the names of columns to which a set of data aggregations would be applied. This is especially useful when creating reusable functions of packages.
  • Grouping where we cover aspects of groupings and working with groups including grouping by a function of existing columns, order of groups, and iterating over groups.
PYTHON
I/O