Typically a data aggregation operation accepts one or more vector(s) of values and reduces that to a single summary value e.g. computing the sum of values of a column for a group; see Basic Aggregation. Occasionally though, we need to perform a data summarization operation where the output summary is composed of more than one value.
In this section, we will cover three scenarios of multi-value aggregation:
We wish to reduce each group into one row and any number of columns.
In this example, we wish to obtain the min, max, and median values for the column col_2
for each group. The groups are defined by the values of the column col_1
.
def my_summary(x):
summary_s = pd.Series({'min': x.min(),
'max': x.max(),
'median': x.median()})
return summary_s
df_2 = (df.groupby('col_1')
.apply(lambda x: my_summary(x['col_2']))
.reset_index())
Here is how this works:
groupby('col_1')
, to group the rows of the data frame df
by the values in the column col_1
. This way the subsequent call to apply()
will have the data aggregation operations be applied to each group separately. See Summary Table.apply()
to preform the data aggregation and not agg()
.apply()
we call a function that returns the desired summary values as a Series
.apply()
will take care of combining the individual Series outputs for each group into a single data frame.Series
returned by the applied function.my_summary(x)
:pd.Series()
. The Series
Index
will be the dictionary keys which will eventually become the column names of the resulting data frame.df_2
will have as many rows as the number of unique values of the column col_1
and will have four columns col_1
, and three columns named min
, max
, and median
holding the min, max, and median values of the column col_2
for each group.Alternative: Via Pandas Convenience Method
df_2 = df.groupby('col_1')['col_2'].describe()
Here is how this works:
describe()
.describe()
, returns the following: count, mean, standard deviation, min, 25th percentile, 50th percentile, 75th percentile, and max.describe()
is defined for Pandas GroupBy objects, we can therefore call it directly as in the code above which is quite convenient.df_2
with as many rows as the number of unique values of the column col_1
and one column for each of the summary stats returned by describe()
for the corresponding group.We wish to reduce each group to one column with any number of rows.
In this example, we wish to produce a multi-value summary of the column col_2
which is to compute the 25th, 50th and 75th percentile values for each group where the groups are defined by the column col_1
. We wish to organize the summary values as a column.
def my_quantile(x, probs):
o_df = pd.DataFrame({
'q_val': np.percentile(x, probs),
'q_prob': probs})
return o_df
df_2 = (df
.groupby('col_1')
.apply(lambda x: my_quantile(x['col_2'],
[0.25, 0.5, 0.75]))
.reset_index(level=0))
Here is how this works:
groupby('col_1')
, to group the rows of the data frame df
by the values in the column col_1
. This way the subsequent call to apply()
will have the data aggregation operations be applied to each group separately. See Summary Table.apply()
to preform the data aggregation and not agg()
.apply()
we call a function that returns the desired summary values as a data frame. If the function returns a Series
, apply()
structures that as a row in the output df
. Therefore, in order to construct the output as a column, we return a data frame.apply()
will take care of combining the individual data frame outputs for each group into a single data frame. The columns will be aligned to each other.q_prob = c(0.25, 0.5, 0.75)
.df_2
will have three columns; the grouping column col_1
, q_val
holding three percentile values for each group, and q_prob
holding the corresponding percentile labels (0.25, 0.5, or 0.75) for each group. The number of rows will be the product of the number of groups (i.e. the number of unique values of the column col_1
) and the number of summary rows for each group (which in this case is 3).Alternative: Via Pandas Convenience Method
df.groupby('col_1')['col_2'].quantile([0.25, 0.5, 0.75])
Here is how this works:
quantile()
.quantile()
is defined for Pandas GroupBy objects, we can therefore call it directly as in the code above which is quite convenient.Series
with a MultiIndex
where the first level of the index holds the group identifiers which are the unique values of the column col_1 and the second level holds the percentile labels (0.25, 0.5, or 0.75).reset_index()
and optionally a call to rename()
to set appropriate column names such as q_val
and q_prob
.We wish to reduce each group into a data frame of any number of rows and columns.
In this example, we wish to summarize each group by reducing it to a smaller number of rows (3 in this case) that are representative of the group. We wish to use the notion of centroids as computed by k-means to capture those representative points. The groups are defined by the values of the column col_1
.
from sklearn.cluster import KMeans
def centroid_summary(p_df, p_centers):
kmeans = (KMeans(n_clusters=p_centers,
random_state=0,
n_init="auto")
.fit(p_df))
o_df = (pd.DataFrame(kmeans.cluster_centers_)
.set_axis(p_df.columns, axis=1))
return o_df
df_2 = (df
.groupby('col_1')
[df.columns.difference(['col_1'])]
.apply(centroid_summary, p_centers=3)
.reset_index(level=0))
Here is how this works:
apply()
a function that returns a data frame of the desired structure.centroid_summary()
which:p_df
and computes the “centroids” where the rows are treated as observations and the columns as the dimensions.p_centers
the number of centroids to compute.p_centers
.apply()
passes to centroid_summary()
:[df.columns.difference(['col_1'])]
so it is not passed to centroid_summary()
.p_centers=3
since apply()
passes any arguments passed to it to the function being applied.apply()
then combines the data frames returned from centroid_summary()
for each group by row binding them i.e. stacking them one on top of the other while aligning the columns according to the column names (which are the same for all the data frames).df_2
will have as many columns as the input data frame df
and as many rows as the product of unique values in the column col_1
and the number of centers p_centers
which in this case is 3.centroid_summary()
function operates as follows:p_df
the data frame to be summarized and p_centers
the number of centroids to have as the summary.p_df
with p_centers
as the number of clusters using KMeans
from scikit-learn
.kmeans.cluster_centers_
and convert that to a data frame via pd.DataFrame()
.set_axis()
to set the names of the output columns to be the same as the names of the input columns. See Renaming.