Multi-Value Aggregation

Typically a data aggregation operation accepts one or more vector(s) of values and reduces that to a single summary value e.g. computing the sum of values of a column for a group; see Basic Aggregation. Occasionally though, we need to perform a data summarization operation where the output summary is composed of more than one value.

In this section, we will cover three scenarios of multi-value aggregation:

  • One Row where we reduce each group into one row and any number of columns. Typically, column names describe the values.
  • Two Column where we reduce each group into two columns with any number of rows. Typically, one column holds summary values and the other column describes those values.
  • Arbitrary Dimensions where we reduce each group into a data frame of any number of rows and columns.

One Row

We wish to reduce each group into one row and any number of columns.

In this example, we wish to obtain the min, max, and median values for the column col_2 for each group. The groups are defined by the values of the column col_1.

def my_summary(x):
    summary_s = pd.Series({'min': x.min(),
                      'max': x.max(),
                      'median': x.median()})
    return summary_s

df_2 = (df.groupby('col_1')
        .apply(lambda x: my_summary(x['col_2']))
        .reset_index())

Here is how this works:

  • We call groupby('col_1'), to group the rows of the data frame df by the values in the column col_1. This way the subsequent call to apply() will have the data aggregation operations be applied to each group separately. See Summary Table.
  • To reduce a group into multiple values organized as a row, we do the following:
    • We use apply() to preform the data aggregation and not agg().
    • From within apply() we call a function that returns the desired summary values as a Series.
    • apply() will take care of combining the individual Series outputs for each group into a single data frame.
    • The columns of the output summary data frame will match the index of the Series returned by the applied function.
  • In the function my_summary(x):
    • We create a dictionary of key value pairs where the keys are the desired summary column names and the values are the summary values.
    • We then convert that dictionary to a Series via pd.Series(). The Series Index will be the dictionary keys which will eventually become the column names of the resulting data frame.
  • The output data frame df_2 will have as many rows as the number of unique values of the column col_1 and will have four columns col_1, and three columns named min, max, and median holding the min, max, and median values of the column col_2 for each group.

Alternative: Via Pandas Convenience Method

df_2 = df.groupby('col_1')['col_2'].describe()

Here is how this works:

  • For the particular case of computing summary statistics like in the primary solution above, we can use the built-in Pandas method describe().
  • By default describe(), returns the following: count, mean, standard deviation, min, 25th percentile, 50th percentile, 75th percentile, and max.
  • describe() is defined for Pandas GroupBy objects, we can therefore call it directly as in the code above which is quite convenient.
  • The output is a data frame df_2 with as many rows as the number of unique values of the column col_1 and one column for each of the summary stats returned by describe() for the corresponding group.

One Column

We wish to reduce each group to one column with any number of rows.

In this example, we wish to produce a multi-value summary of the column col_2 which is to compute the 25th, 50th and 75th percentile values for each group where the groups are defined by the column col_1. We wish to organize the summary values as a column.

def my_quantile(x, probs):
    o_df = pd.DataFrame({
        'q_val': np.percentile(x, probs),
        'q_prob': probs})
    return o_df

df_2 = (df
        .groupby('col_1')
        .apply(lambda x: my_quantile(x['col_2'],
                                     [0.25, 0.5, 0.75]))
        .reset_index(level=0))

Here is how this works:

  • We call groupby('col_1'), to group the rows of the data frame df by the values in the column col_1. This way the subsequent call to apply() will have the data aggregation operations be applied to each group separately. See Summary Table.
  • To reduce a group into multiple values organized as a column, we do the following:
    • We use apply() to preform the data aggregation and not agg().
    • From within apply() we call a function that returns the desired summary values as a data frame. If the function returns a Series, apply() structures that as a row in the output df. Therefore, in order to construct the output as a column, we return a data frame.
    • apply() will take care of combining the individual data frame outputs for each group into a single data frame. The columns will be aligned to each other.
    • The columns of the output summary data frame will match the columns of the group summary data frames returned by the applied function.
  • Typically, we need a way to identify what each value holds, we can do that by adding a second column that holds the quantile values q_prob = c(0.25, 0.5, 0.75).
  • The output data frame df_2 will have three columns; the grouping column col_1, q_val holding three percentile values for each group, and q_prob holding the corresponding percentile labels (0.25, 0.5, or 0.75) for each group. The number of rows will be the product of the number of groups (i.e. the number of unique values of the column col_1) and the number of summary rows for each group (which in this case is 3).

Alternative: Via Pandas Convenience Method

df.groupby('col_1')['col_2'].quantile([0.25, 0.5, 0.75])

Here is how this works:

  • For the particular case of computing percentile values, we can use the built-in Pandas method quantile().
  • quantile() is defined for Pandas GroupBy objects, we can therefore call it directly as in the code above which is quite convenient.
  • The output is a Series with a MultiIndex where the first level of the index holds the group identifiers which are the unique values of the column col_1 and the second level holds the percentile labels (0.25, 0.5, or 0.75).
  • To return a data frame, we would need to chain a call to reset_index() and optionally a call to rename() to set appropriate column names such as q_val and q_prob.

Arbitrary Dimensions

We wish to reduce each group into a data frame of any number of rows and columns.

In this example, we wish to summarize each group by reducing it to a smaller number of rows (3 in this case) that are representative of the group. We wish to use the notion of centroids as computed by k-means to capture those representative points. The groups are defined by the values of the column col_1.

from sklearn.cluster import KMeans

def centroid_summary(p_df, p_centers):
    kmeans = (KMeans(n_clusters=p_centers,
                     random_state=0,
                     n_init="auto")
              .fit(p_df))
    o_df = (pd.DataFrame(kmeans.cluster_centers_)
            .set_axis(p_df.columns, axis=1))
    return o_df

df_2 = (df
        .groupby('col_1')
        [df.columns.difference(['col_1'])]
        .apply(centroid_summary, p_centers=3)
        .reset_index(level=0))

Here is how this works:

  • To reduce a group into a data frame of any number of rows and columns, we need to call from within apply() a function that returns a data frame of the desired structure.
  • In this case, we create a custom function centroid_summary() which:
    • Accepts a data frame p_df and computes the “centroids” where the rows are treated as observations and the columns as the dimensions.
    • The function accepts an integer value specifying p_centers the number of centroids to compute.
    • The function returns a data frame that has as many columns as the input and as many rows as that specified by the number of centroids p_centers.
  • apply() passes to centroid_summary():
    • The values of all columns for the current group one by one. Note that prior to calling apply we excluded the grouping column via [df.columns.difference(['col_1'])] so it is not passed to centroid_summary().
    • The desired number of centroids p_centers=3 since apply() passes any arguments passed to it to the function being applied.
  • apply() then combines the data frames returned from centroid_summary() for each group by row binding them i.e. stacking them one on top of the other while aligning the columns according to the column names (which are the same for all the data frames).
  • The output data frame df_2 will have as many columns as the input data frame df and as many rows as the product of unique values in the column col_1 and the number of centers p_centers which in this case is 3.
  • The centroid_summary() function operates as follows:
    • It takes two arguments: p_df the data frame to be summarized and p_centers the number of centroids to have as the summary.
    • The function applies k-means clustering to p_df with p_centers as the number of clusters using KMeans from scikit-learn.
    • We extract the centroids via kmeans.cluster_centers_ and convert that to a data frame via pd.DataFrame().
    • Finally, we use set_axis() to set the names of the output columns to be the same as the names of the input columns. See Renaming.
PYTHON
I/O