We wish to carry out multiple data aggregations on an input data frame and return an output summary data frame. The output summary data frame would have one column for each data aggregation operation in addition to one column for each grouping variable (if any). As for rows, the output summary data frame will have one row for each combination of values of grouping variables or just one row if we are summarizing a data frame that is not grouped.
This section is complemented by:
We wish to summarize a data frame over groups where the aggregation functions we wish to perform are all functions of one column i.e. functions that act on a single column.
In this example, we wish to produce a grouped summary of a data frame df
. The grouping variables are the columns col_1
and col_2
. The aggregation operations we wish to perform are: (1) Get the minimum value of col_3
per group. (2) Get the 75th percentile value of col_3
per group and (3) Get the geometric mean of col_4
per group.
df_2 = (df
.groupby(['col_1', 'col_2'], as_index=False)
.agg(col_3_min = ('col_3', 'min'),
col_3_75th = ('col_3', lambda x: np.percentile(x, 75)),
col_4_gmean=('col_4', sp.stats.gmean)))
Here is how this works:
df.groupby(['col_1', 'col_2'], as_index=False)
, we group the data frame df
by the values of the column ‘col_1’
and 'col_2'.
We set as_index=False
to retain the original index of the data frame and keep the grouping columns as regular columns rather than the default behavior of moving them into the Index. See Grouping.agg()
method of grouped data frames (DataFrameGroupBy
objects) to compute the aggregations we need to produce the summary data frame.agg()
implicitly passes the values of the aggregated column for each group to the aggregating function as a Series
.agg()
, each aggregation is specified as an expression of the form output_column_name = (’input_column_name’, function_refernce)
:output_column_name
is the name of the column that will be created specified in an unquoted manner.’input_column_name’
is the name of the input column specified as a string.function_refernce
is a reference to the aggregation function.agg()
:agg()
.np.mean()
or any custom function by its name. We can refer to Pandas convenience functions via a string holding their name i.e. ‘mean’
instead of pd.Series.mean()
. We cover many examples of this in Common Aggregation Operationslambda
function to pass function params. If we try to pass into the tuple we get error TypeError: Must provide 'func' or tuples of '(column, aggfunc)
.sqrt()
an exception is raised ValueError: function does not reduce
.col_3_min = ('col_3', 'min')
, we create a summary column ‘col_3_min’
where each value is the minimum of the values of the column ‘col_3’
for the current group.col_3_75th = ('col_3', lambda x: np.percentile(x, 75))
, we use a lambda function to pass arguments to the function percentile
(from NumPy
.)col_4_gmean=('col_4', sp.stats.gmean)
, we refer to a non Pandas convenience function by its qualified name.Alternatively:
df_2 = pd.DataFrame()
df_g = df.groupby(['col_1', 'col_2'])
df_2['col_3_min'] = df_g['col_3'].min()
df_2['col_3_75th'] = df_g['col_3'].agg(np.percentile, 75)
df_2['col_4_gmean'] = df_g['col_4'].agg(sp.stats.gmean)
df_2 = df_2.reset_index()
Here is how this works:
lambda
function and pass function params directly to agg()
.df.groupby(['col_1', 'col_2'])
, we allow grouping to move the grouping column into the index so the Series
we work with ,e.g. df_g['col_3']
, would all have the grouping represented in their Index and can be consistently aggregated then brought back into the output summary data frame df_2
.We wish to summarize a data frame where some aggregation functions we wish to perform are functions of more than one column i.e. functions that act on multiple columns at once.
In this example, we wish to produce a grouped summary of a data frame df
. The grouping variables are the columns col_1
and col_2
. The aggregation operations we wish to perform are: (1) Get the minimum value of col_3
per group. (2) Get the 75th percentile value of col_3
per group. (3) Get the geometric mean of col_4
per group. (4) Get the weighted average of col_3 where the weights are given by the values of col_4 per group.
df_2 = (df
.groupby(['col_1', 'col_2'], as_index=False)
.apply(lambda x: pd.Series({
'col_3_min': x['col_3'].min(),
'col_3_75th': np.percentile(x['col_3'], 75),
'col_4_gmean': sp.stats.gmean(x['col_4']),
'col_3_4_wavg': np.average(x['col_3'], weights=x['col_4'])}))
)
Here is how this works:
agg()
, as used in One Column Functions above provides a good solution to data aggregation, it is not possible to apply aggregation functions that involve more than one input column. The reason being that agg()
acts on one column at a time.apply()
which when called on a grouped data frame (a DataFrameGroupBy
object), receives each sub data frame (for each group) as a data frame allowing access to all columns for that group.apply()
to create multiple columns:Series
for each group andIndex
of the Series
should hold the output names of the aggregations.Series
(one for each group) are combined to form a data frame where the column names of the output data frame are inherited from the Index of the Series.np.average(x['col_3'], weights=x['col_4'])
involves two columns and would not be possible to carry out via agg()
. We use the weighted average function average()
from NumPy
.apply()
after groupby()
to perform data aggregation gives maximum flexibility, it should be used with care because:apply()
function is relatively slower compared to agg()
for the same operations.Alternative 1:
df_2 = pd.DataFrame()
df_g = df.groupby(['col_1', 'col_2'])
df_2['col_3_min'] = df_g['col_3'].min()
df_2['col_3_75th'] = df_g['col_3'].agg(lambda x: np.percentile(x, 75))
df_2['col_4_gmean'] = df_g['col_4'].agg(sp.stats.gmean)
df_2['col_3_4_wavg'] = df_g.apply(lambda x: np.average(x['col_3'], weights=x['col_4']))
df_2 = df_2.reset_index()
Here is how this works:
DataFrameGroupBy
object) df_g
, we have full axis to all columns.col_3_4_wavg
) under the “Multiple Column Function” in Individual Aggregation.Alternative 2:
def summarize_df(g_df):
col_3_min = g_df['col_3'].min()
col_3_75th = np.percentile(g_df['col_3'], 75)
col_4_gmean = sp.stats.gmean(g_df['col_4'])
col_3_4_wavg = np.average(g_df['col_3'], weights=g_df['col_4'])
summary_s = pd.Series({'col_3_min': col_3_min,
'col_3_75th': col_3_75th,
'col_4_gmean': col_4_gmean,
'col_3_4_wavg': col_3_4_wavg})
return summary_s
df_5 = (df
.groupby(['col_1', 'col_2'], as_index=False)
.apply(summarize_df))
Here is how this works:
apply()
and into a custom function summarize_df()
that returns a Series with the values of all aggregations for the given group.We wish to summarize a data frame that is not grouped into a one row summary data frame.
In this example, we wish to produce a one row data frame df_2
that is a summary of a data frame df
. The aggregation operations we wish to perform are: (1) Get the minimum value of ‘col_3'
. (2) Get the 75th percentile value of ‘col_3'
. (3) Get the geometric mean of ‘col_4’
. (4) Get the weighted average of ‘col_3’
where the weights are given by the values of ‘col_4’
per group
df_2 = df\
.pipe(lambda x: pd.DataFrame({
'col_3_min': [x['col_3'].min()],
'col_3_75th': np.percentile(x['col_3'], 75),
'col_4_sum': [x['col_4'].sum()],
'col_3_4_wavg': [np.average(x['col_3'], weights=x['col_4'])]
}))
Here is how this works:
pipe()
the data frame df
to a lambda
function that produces the one row summary data frame.Pandas
expects a dictionary of lists, therefore we wrap each operation into []
.agg()
function, we found it be at times unreliable (try df.agg(np.size)
) and it can’t perform operations that involve more than one column. The apply()
method of data frames too is limited to acting on one column at a time, hence our preference for this approach.Alternatively:
df_2 = pd.DataFrame()
df_2['col_3_min'] = [df['col_3'].min()]
df_2['col_3_75th'] = [np.percentile(df['col_3'], 75)]
df_2['col_4_sum'] = [df['col_4'].sum()]
df_2['col_3_4_wavg'] = [np.average(df['col_3'], weights=df['col_4'])]
Here is how this works: