Typically a data aggregation operation accepts one or more vector(s) of values and reduces that to a single summary value e.g. computing the sum of values of a column for a group; see Basic Aggregation. Occasionally though, we need to perform a data summarization operation where the output summary is composed of more than one value.

In this section, we will cover three scenarios of multi-value aggregation:

**One Row**where we reduce each group into one row and any number of columns. Typically, column names describe the values.**Two Column**where we reduce each group into two columns with any number of rows. Typically, one column holds summary values and the other column describes those values.**Arbitrary Dimensions**where we reduce each group into a data frame of any number of rows and columns.

We wish to reduce each group into one row and any number of columns.

In this example, we wish to obtain the min, max, and median values for the column `col_2`

for each group. The groups are defined by the values of the column `col_1`

.

```
def my_summary(x):
summary_s = pd.Series({'min': x.min(),
'max': x.max(),
'median': x.median()})
return summary_s
df_2 = (df.groupby('col_1')
.apply(lambda x: my_summary(x['col_2']))
.reset_index())
```

Here is how this works:

- We call
`groupby('col_1')`

, to group the rows of the data frame`df`

by the values in the column`col_1`

. This way the subsequent call to`apply()`

will have the data aggregation operations be applied to each group separately. See Summary Table. - To reduce a group into multiple values organized as a row, we do the following:
- We use
`apply()`

to preform the data aggregation and not`agg()`

. - From within
`apply()`

we call a function that returns the desired summary values as a`Series`

. `apply()`

will take care of combining the individual Series outputs for each group into a single data frame.- The columns of the output summary data frame will match the index of the
`Series`

returned by the applied function.

- We use
- In the function
`my_summary(x)`

:- We create a dictionary of key value pairs where the keys are the desired summary column names and the values are the summary values.
- We then convert that dictionary to a Series via
`pd.Series()`

. The`Series`

`Index`

will be the dictionary keys which will eventually become the column names of the resulting data frame.

- The output data frame
`df_2`

will have as many rows as the number of unique values of the column`col_1`

and will have four columns`col_1`

, and three columns named`min`

,`max`

, and`median`

holding the min, max, and median values of the column`col_2`

for each group.

Alternative: Via Pandas Convenience Method

```
df_2 = df.groupby('col_1')['col_2'].describe()
```

Here is how this works:

- For the particular case of computing summary statistics like in the primary solution above, we can use the built-in Pandas method
`describe()`

. - By default
`describe()`

, returns the following: count, mean, standard deviation, min, 25th percentile, 50th percentile, 75th percentile, and max. `describe()`

is defined for Pandas GroupBy objects, we can therefore call it directly as in the code above which is quite convenient.- The output is a data frame
`df_2`

with as many rows as the number of unique values of the column`col_1`

and one column for each of the summary stats returned by`describe()`

for the corresponding group.

We wish to reduce each group to one column with any number of rows.

In this example, we wish to produce a multi-value summary of the column `col_2`

which is to compute the 25th, 50th and 75th percentile values for each group where the groups are defined by the column `col_1`

. We wish to organize the summary values as a column.

```
def my_quantile(x, probs):
o_df = pd.DataFrame({
'q_val': np.percentile(x, probs),
'q_prob': probs})
return o_df
df_2 = (df
.groupby('col_1')
.apply(lambda x: my_quantile(x['col_2'],
[0.25, 0.5, 0.75]))
.reset_index(level=0))
```

Here is how this works:

- We call
`groupby('col_1')`

, to group the rows of the data frame`df`

by the values in the column`col_1`

. This way the subsequent call to`apply()`

will have the data aggregation operations be applied to each group separately. See Summary Table. - To reduce a group into multiple values organized as a column, we do the following:
- We use
`apply()`

to preform the data aggregation and not`agg()`

. - From within
`apply()`

we call a function that returns the desired summary values as a data frame. If the function returns a`Series`

,`apply()`

structures that as a row in the output`df`

. Therefore, in order to construct the output as a column, we return a data frame. `apply()`

will take care of combining the individual data frame outputs for each group into a single data frame. The columns will be aligned to each other.- The columns of the output summary data frame will match the columns of the group summary data frames returned by the applied function.

- We use
- Typically, we need a way to identify what each value holds, we can do that by adding a second column that holds the quantile values
`q_prob = c(0.25, 0.5, 0.75)`

. - The output data frame
`df_2`

will have three columns; the grouping column`col_1`

,`q_val`

holding three percentile values for each group, and`q_prob`

holding the corresponding percentile labels (0.25, 0.5, or 0.75) for each group. The number of rows will be the product of the number of groups (i.e. the number of unique values of the column`col_1`

) and the number of summary rows for each group (which in this case is 3).

Alternative: Via Pandas Convenience Method

```
df.groupby('col_1')['col_2'].quantile([0.25, 0.5, 0.75])
```

Here is how this works:

- For the particular case of computing percentile values, we can use the built-in Pandas method
`quantile()`

. `quantile()`

is defined for Pandas GroupBy objects, we can therefore call it directly as in the code above which is quite convenient.- The output is a
`Series`

with a`MultiIndex`

where the first level of the index holds the group identifiers which are the unique values of the column col_1 and the second level holds the percentile labels (0.25, 0.5, or 0.75). - To return a data frame, we would need to chain a call to
`reset_index()`

and optionally a call to`rename()`

to set appropriate column names such as`q_val`

and`q_prob`

.

We wish to reduce each group into a data frame of any number of rows and columns.

In this example, we wish to summarize each group by reducing it to a smaller number of rows (3 in this case) that are representative of the group. We wish to use the notion of centroids as computed by k-means to capture those representative points. The groups are defined by the values of the column `col_1`

.

```
from sklearn.cluster import KMeans
def centroid_summary(p_df, p_centers):
kmeans = (KMeans(n_clusters=p_centers,
random_state=0,
n_init="auto")
.fit(p_df))
o_df = (pd.DataFrame(kmeans.cluster_centers_)
.set_axis(p_df.columns, axis=1))
return o_df
df_2 = (df
.groupby('col_1')
[df.columns.difference(['col_1'])]
.apply(centroid_summary, p_centers=3)
.reset_index(level=0))
```

Here is how this works:

- To reduce a group into a data frame of any number of rows and columns, we need to call from within
`apply()`

a function that returns a data frame of the desired structure. - In this case, we create a custom function
`centroid_summary()`

which:- Accepts a data frame
`p_df`

and computes the “centroids” where the rows are treated as observations and the columns as the dimensions. - The function accepts an integer value specifying
`p_centers`

the number of centroids to compute. - The function returns a data frame that has as many columns as the input and as many rows as that specified by the number of centroids
`p_centers`

.

- Accepts a data frame
`apply()`

passes to`centroid_summary()`

:- The values of all columns for the current group one by one. Note that prior to calling apply we excluded the grouping column via
`[df.columns.difference(['col_1'])]`

so it is not passed to`centroid_summary()`

. - The desired number of centroids
`p_centers=3`

since`apply()`

passes any arguments passed to it to the function being applied.

- The values of all columns for the current group one by one. Note that prior to calling apply we excluded the grouping column via
`apply()`

then combines the data frames returned from`centroid_summary()`

for each group by row binding them i.e. stacking them one on top of the other while aligning the columns according to the column names (which are the same for all the data frames).- The output data frame
`df_2`

will have as many columns as the input data frame`df`

and as many rows as the product of unique values in the column`col_1`

and the number of centers`p_centers`

which in this case is 3. - The
`centroid_summary()`

function operates as follows:- It takes two arguments:
`p_df`

the data frame to be summarized and`p_centers`

the number of centroids to have as the summary. - The function applies k-means clustering to
`p_df`

with`p_centers`

as the number of clusters using`KMeans`

from`scikit-learn`

. - We extract the centroids via
`kmeans.cluster_centers_`

and convert that to a data frame via`pd.DataFrame()`

. - Finally, we use
`set_axis()`

to set the names of the output columns to be the same as the names of the input columns. See Renaming.

- It takes two arguments:

PYTHON