Summary Table

We wish to carry out multiple data aggregations on an input data frame and return an output summary data frame. The output summary data frame has one column for each data aggregation operation in addition to one column for each grouping variable (if any). As for rows, the output summary data frame has one row for each combination of values of the grouping variables, or just one row if we are summarizing a data frame that is not grouped.

This section is complemented by:

  • Individual Aggregations where we cover how to perform an individual aggregation operation e.g. we wish to obtain the sum of the values of one particular numeric column.
  • Common Aggregation Operations where we cover some of the most common data aggregation operations.

One Column Functions

We wish to summarize a data frame over groups where the aggregation functions we wish to perform are all functions of one column i.e. functions that act on a single column.

In this example, we wish to produce a grouped summary of a data frame df. The grouping variables are the columns col_1 and col_2. The aggregation operations we wish to perform are: (1) Get the minimum value of col_3 per group. (2) Get the 75th percentile value of col_3 per group. (3) Get the geometric mean of col_4 per group.

import pandas as pd
import numpy as np
import scipy as sp
import scipy.stats  # makes sp.stats available

df_2 = (df
        .groupby(['col_1', 'col_2'], as_index=False)
        .agg(col_3_min=('col_3', 'min'),
             col_3_75th=('col_3', lambda x: np.percentile(x, 75)),
             col_4_gmean=('col_4', sp.stats.gmean)))

Here is how this works:

  • In df.groupby(['col_1', 'col_2'], as_index=False), we group the data frame df by the values of the columns 'col_1' and 'col_2'. We set as_index=False to keep the grouping columns as regular columns of the output rather than the default behavior of moving them into the index. See Grouping.
  • We use the agg() method of grouped data frames (DataFrameGroupBy objects) to compute the aggregations we need to produce the summary data frame.
  • agg() implicitly passes the values of the aggregated column for each group to the aggregating function as a Series.
  • Inside agg(), each aggregation is specified as an expression of the form output_column_name=('input_column_name', function_reference):
    • output_column_name is the name of the output column that will be created, specified without quotes (it is a keyword argument name).
    • 'input_column_name' is the name of the input column, specified as a string.
    • function_reference is a reference to the aggregation function.
  • A few points to note about functions passed to agg():
    • Any aggregating function, i.e. one that takes multiple individual values and returns a single summary value, can be passed to agg().
    • We can pass a qualified function reference, be it a library function e.g. np.mean or any custom function, by its name. We can also refer to Pandas convenience functions via a string holding their name, e.g. 'mean' instead of pd.Series.mean(). We cover many examples of this in Common Aggregation Operations.
    • We use a lambda function when we need to pass parameters to the aggregation function (see the short sketch after this list). If we try to pass the parameters inside the tuple, we get the error TypeError: Must provide 'func' or tuples of '(column, aggfunc)'.
    • If we pass a non-aggregating function, i.e. one whose output per group is more than a single scalar value, e.g. np.sqrt(), an exception is raised: ValueError: function does not reduce.
  • In col_3_min=('col_3', 'min'), we create a summary column 'col_3_min' where each value is the minimum of the values of the column 'col_3' for the current group.
  • In col_3_75th=('col_3', lambda x: np.percentile(x, 75)), we use a lambda function to pass arguments to the percentile() function from NumPy.
  • In col_4_gmean=('col_4', sp.stats.gmean), we refer to a non-Pandas function by its qualified name.
  • The output is a data frame where each row represents a group and each column holds the output of an aggregation function for each group.
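
For illustration, here is a minimal sketch of the named-aggregation pattern on a small hypothetical data frame (df_toy, its columns, and its values are made up for this example):

import pandas as pd
import numpy as np

df_toy = pd.DataFrame({'col_1': ['a', 'a', 'b', 'b'],
                       'col_3': [1.0, 3.0, 2.0, 6.0]})

df_toy_2 = (df_toy
            .groupby('col_1', as_index=False)
            .agg(col_3_min=('col_3', 'min'),                             # convenience function by name
                 col_3_75th=('col_3', lambda x: np.percentile(x, 75)),   # lambda to pass a parameter
                 col_3_mean=('col_3', 'mean')))                          # another convenience function by name

# df_toy_2 has one row per value of col_1 ('a' and 'b') and one column per aggregation,
# e.g. for group 'a': col_3_min = 1.0, col_3_75th = 2.5, col_3_mean = 2.0.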

Alternatively:

df_2 = pd.DataFrame()
df_g = df.groupby(['col_1', 'col_2'])  # grouping columns become the index of each aggregated Series
df_2['col_3_min'] = df_g['col_3'].min()
df_2['col_3_75th'] = df_g['col_3'].agg(np.percentile, 75)  # extra positional args are passed on to np.percentile
df_2['col_4_gmean'] = df_g['col_4'].agg(sp.stats.gmean)
df_2 = df_2.reset_index()  # move the grouping columns back into regular columns

Here is how this works:

  • We can follow this solution if chaining is not a concern, and we would rather carry out the aggregations individually.
  • When passing a function reference to agg() (rather than a named-aggregation tuple like we used above), we can skip the lambda function and pass the function's parameters directly to agg() as extra arguments, as shown in the short sketch after this list.
  • In df.groupby(['col_1', 'col_2']), we allow grouping to move the grouping columns into the index, so the Series we work with, e.g. df_g['col_3'], all have the grouping represented in their index and can be consistently aggregated and then brought back into the output summary data frame df_2.
  • The output is the same as above; a data frame where each row represents a group and each column holds the output of an aggregation function for each group.
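
To make the parameter passing concrete, here is a minimal sketch (assuming df_g = df.groupby(['col_1', 'col_2']) as above) of the two equivalent ways to supply the percentile value:

# Extra positional arguments to agg() are forwarded to the aggregation function.
q75_direct = df_g['col_3'].agg(np.percentile, 75)
# The same result via a lambda, as in the chained solution.
q75_lambda = df_g['col_3'].agg(lambda x: np.percentile(x, 75))
# Both return a Series, indexed by the group keys, holding the 75th percentile of col_3 per group.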

Multiple Column Functions

We wish to summarize a data frame where some aggregation functions we wish to perform are functions of more than one column i.e. functions that act on multiple columns at once.

In this example, we wish to produce a grouped summary of a data frame df. The grouping variables are the columns col_1 and col_2. The aggregation operations we wish to perform are: (1) Get the minimum value of col_3 per group. (2) Get the 75th percentile value of col_3 per group. (3) Get the geometric mean of col_4 per group. (4) Get the weighted average of col_3 where the weights are given by the values of col_4 per group.

df_2 = (df
        .groupby(['col_1', 'col_2'], as_index=False)
        .apply(lambda x: pd.Series({
            'col_3_min': x['col_3'].min(),
            'col_3_75th': np.percentile(x['col_3'], 75),
            'col_4_gmean': sp.stats.gmean(x['col_4']),
            'col_3_4_wavg': np.average(x['col_3'], weights=x['col_4'])}))
        )

Here is how this works:

  • While agg(), as used in One Column Functions above, provides a good solution to data aggregation, it cannot apply aggregation functions that involve more than one input column, because agg() acts on one column at a time.
  • If we need to apply data aggregation functions that involve the interaction of multiple columns, we use apply(), which, when called on a grouped data frame (a DataFrameGroupBy object), passes each group's sub data frame to the given function, allowing access to all of that group's columns.
  • To use apply() to create multiple columns:
    • We must return a Series for each group and
    • the Index of the Series should hold the output names of the aggregations.
  • Those Series (one for each group) are combined to form a data frame where the column names of the output data frame are inherited from the Index of the Series.
  • In the code above, column name labels go through three steps: dictionary keys → Series Index → Data Frame columns names.
  • The first three data aggregations are one column aggregations just like we had in One Column Functions above.
  • The fourth aggregation np.average(x['col_3'], weights=x['col_4']) involves two columns and could not be carried out via agg(). We use the weighted average function average() from NumPy. A small worked sketch of this pattern follows this list.
  • While this approach of using apply() after groupby() to perform data aggregation gives maximum flexibility, it should be used with care because:
    • The data types of the output may be cast in unexpected ways requiring a further step of data type setting. See Type Casting.
    • The apply() function is relatively slower compared to agg() for the same operations.
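
Here is a minimal worked sketch, on a small hypothetical data frame, of how the per-group Series returned from apply() are combined into the summary data frame (the dictionary keys become the column names):

import pandas as pd
import numpy as np

df_toy = pd.DataFrame({'col_1': ['a', 'a', 'b', 'b'],
                       'col_3': [1.0, 3.0, 2.0, 6.0],
                       'col_4': [2.0, 8.0, 3.0, 27.0]})

df_toy_2 = (df_toy
            .groupby('col_1', as_index=False)
            .apply(lambda x: pd.Series({
                'col_3_4_wavg': np.average(x['col_3'], weights=x['col_4'])})))

# For group 'a', the weighted average is (1.0*2.0 + 3.0*8.0) / (2.0 + 8.0) = 2.6.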

Alternative 1:

df_2 = pd.DataFrame()
df_g = df.groupby(['col_1', 'col_2'])
df_2['col_3_min'] = df_g['col_3'].min()
df_2['col_3_75th'] = df_g['col_3'].agg(lambda x: np.percentile(x, 75))
df_2['col_4_gmean'] = df_g['col_4'].agg(sp.stats.gmean)
df_2['col_3_4_wavg'] = df_g.apply(lambda x: np.average(x['col_3'], weights=x['col_4']))
df_2 = df_2.reset_index()

Here is how this works:

  • We can follow this solution if chaining is not a concern, and we would rather carry out the aggregations individually.
  • Since we are working with the grouped data frame (DataFrameGroupBy object) df_g, we have full access to all columns of each group.
  • We describe this code in more detail above under the alternative solution in One Column Functions. We also describe the weighted average aggregation (col_3_4_wavg) under Multiple Column Functions in Individual Aggregations.

Alternative 2:

def summarize_df(g_df):
    col_3_min = g_df['col_3'].min()
    col_3_75th = np.percentile(g_df['col_3'], 75)
    col_4_gmean = sp.stats.gmean(g_df['col_4'])
    col_3_4_wavg = np.average(g_df['col_3'], weights=g_df['col_4'])
    summary_s = pd.Series({'col_3_min': col_3_min,
                           'col_3_75th': col_3_75th,
                           'col_4_gmean': col_4_gmean,
                           'col_3_4_wavg': col_3_4_wavg})
    return summary_s

df_2 = (df
        .groupby(['col_1', 'col_2'], as_index=False)
        .apply(summarize_df))

Here is how this works:

  • This is the same solution as the primary solution above except that we moved the aggregation operations out of apply() and into a custom function summarize_df() that returns a Series with the values of all aggregations for the given group.
  • This approach is preferable when:
    • the same set of aggregations needs to be performed more than once in a project (see the reuse sketch after this list);
    • we wish to simplify the primary data manipulation chain by moving the aggregation logic out into a separate function;
    • the individual aggregations are complex and we are prone to making mistakes when composing elaborate data aggregation logic inside a dictionary definition.
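
As an illustration of the reuse point, once the aggregation logic lives in summarize_df(), it can be applied under different groupings without repeating the aggregation code (the groupings below are hypothetical examples):

df_by_col_1 = df.groupby('col_1', as_index=False).apply(summarize_df)
df_by_both = df.groupby(['col_1', 'col_2'], as_index=False).apply(summarize_df)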

Data Frame (Not Grouped)

We wish to summarize a data frame that is not grouped into a one row summary data frame.

In this example, we wish to produce a one row data frame df_2 that is a summary of a data frame df. The aggregation operations we wish to perform are: (1) Get the minimum value of col_3. (2) Get the 75th percentile value of col_3. (3) Get the sum of the values of col_4. (4) Get the weighted average of col_3 where the weights are given by the values of col_4.

df_2 = (df
        .pipe(lambda x: pd.DataFrame({
            'col_3_min': [x['col_3'].min()],
            'col_3_75th': [np.percentile(x['col_3'], 75)],
            'col_4_sum': [x['col_4'].sum()],
            'col_3_4_wavg': [np.average(x['col_3'], weights=x['col_4'])]})))

Here is how this works:

  • We pipe() the data frame df to a lambda function that produces the one row summary data frame.
  • Inside the lambda function, we construct a data frame with one single-value column for the outcome of each data aggregation operation.
  • When creating a data frame from a dictionary, Pandas expects the dictionary values to be lists (or other array-likes), therefore we wrap the output of each operation in [] (see the short sketch after this list).
  • While a data frame has an agg() method, we found it to be at times unreliable (try df.agg(np.size)) and it cannot perform operations that involve more than one column. The apply() method of data frames is likewise limited to acting on one column at a time, hence our preference for this approach.
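
To make the wrapping point concrete, here is a minimal sketch (the value is made up):

# A dictionary of bare scalars raises "ValueError: If using all scalar values, you must pass an index":
# pd.DataFrame({'col_3_min': 2.0})
# Wrapping each value in a one-element list yields the desired one-row data frame:
df_one_row = pd.DataFrame({'col_3_min': [2.0]})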

Alternatively:

df_2 = pd.DataFrame()
df_2['col_3_min'] = [df['col_3'].min()]
df_2['col_3_75th'] = [np.percentile(df['col_3'], 75)]
df_2['col_4_sum'] = [df['col_4'].sum()]
df_2['col_3_4_wavg'] = [np.average(df['col_3'], weights=df['col_4'])]

Here is how this works:

  • We can follow this solution if chaining is not a concern, and we would rather carry out the aggregations individually.
  • See Individual Aggregations for coverage of how to carry out individual data aggregation operations on a data frame that is not grouped.