Quite often we need to apply data transformation operations to sub data frames, or groups, of a data frame individually. We call this Grouped Transformation. A common grouped transformation scenario is replacing missing values with the mean or the median for the group. Another common scenario is scaling the data by subtracting the group’s mean and dividing by the group’s standard deviation.
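To make the snippets below concrete, assume a small sample data frame with the column names used throughout this section; the data values are hypothetical:

import pandas as pd

# Hypothetical sample data; 'col_1' is the grouping column used throughout.
df = pd.DataFrame({
    'col_1': ['a', 'a', 'b', 'b'],
    'col_2': [1.0, 2.0, 3.0, 4.0],
    'col_3': [10.0, 20.0, 30.0, 40.0],
})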
df_2 = (
    df
    .groupby('col_1', group_keys=False)
    .apply(lambda x: x.assign(
        col_4=(x['col_2'] - x['col_2'].mean()) / x['col_2'].std(),
        col_5=x['col_3'].max(),
        col_6=x['col_2'] - x['col_3'].mean(),
        col_7=lambda y: max(y['col_4']) - max(y['col_5'])))
)
Here is how this works:
- We wish to use assign() as described in Basic Transformation.
- We cannot call assign() directly to execute data transformations on a grouped data frame (a DataFrameGroupBy object). We get an AttributeError: 'DataFrameGroupBy' object has no attribute 'assign'.
- The way to use assign() to perform grouped data transformation, and so benefit from its flexibility and maintain a consistent approach to data transformation, is to call assign() through a lambda function inside apply().
- When apply() is applied to a grouped data frame (a DataFrameGroupBy object), all columns for each group are passed to apply() as a data frame. Note that this is in contrast to applying apply() to an ungrouped data frame (a DataFrame object), in which case the columns or the rows are passed one at a time (as a Series).
- We use group_keys=False to retain the original index of the data frame because our aim is to perform a transformation and not an aggregation operation.
- Inside assign(), we can carry out all the typical data transformation scenarios, such as those we cover in Basic Transformation. In particular:
  - When the output of a transformation is a scalar, e.g. col_5=x['col_3'].max(), the scalar value will be replicated as many times as the number of rows in the group.
  - To refer to a column created within the same call to assign(), we need to employ a second lambda function like we do in col_7=lambda y: max(y['col_4']) - max(y['col_5']), as shown in the sketch below.
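Here is a minimal sketch of the second-lambda rule in isolation, assuming the hypothetical sample df above. assign() evaluates its keyword arguments in order, and a callable receives the intermediate data frame in which earlier columns already exist:

demo = df.assign(
    col_4=df['col_2'] * 2,
    # The lambda receives the intermediate frame, so col_4 is visible here
    # even though df itself has no col_4 column:
    col_7=lambda y: max(y['col_4']) - max(y['col_2']))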
Alternatively: Should we not wish to use assign() for any reason, the approach that we need to follow to perform grouped data transformation operations in Pandas depends on the nature of the input and the output of the operation, as follows:
- whether the input is a single column or multiple columns; and
- whether the output is a vector (a Series) of the same size as the input (same number of rows as the group) or a scalar.

These alternative solutions are, therefore, organized as follows:
- One Column to Vector: the input is one column and the output is a Series of the same size as the input.
- One Column to Scalar: the input is one column and the output is a scalar.
- Multiple Columns to Vector: the input is multiple columns and the output is a Series of the same size as the input.
- Multiple Columns to Scalar: the input is multiple columns and the output is a scalar.
One Column to Vector

We wish to perform a grouped data transformation operation where the input is one column and the output is a Series of the same size as the input.
In this example, we wish to scale the values of the column 'col_2' by subtracting the mean for the group and dividing by the group’s standard deviation, where the groups are defined by the values of the column 'col_1'.
df['col_2_scaled'] = (
    df
    .groupby('col_1')['col_2']
    .transform(lambda x: (x - x.mean()) / x.std())
)
Here is how this works:
- We group by 'col_1' then select 'col_2' to be transformed.
- For each group, each selected column is passed separately to transform(). In this case the only selected column is col_2.
- transform() applies the function passed to it (in this case a lambda function) to each column individually.
- The output of the transform() method has the same number of rows and the same index as the input. Therefore, we can assign the output of transform() to create a new column (or overwrite an existing column) in the input data frame, as the sketch below confirms.
- The transform() method is, therefore, a useful approach when we need to perform operations in a grouped context and then add the result back to the original data frame.
- We could use df.groupby('col_1').transform(fun)['col_2'] but that is inefficient. We would be transforming all columns then discarding all that output except for the transformed col_2.
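As a quick check of the index-alignment point, this sketch (hypothetical sample df assumed) inspects the output of transform() before assigning it back:

# transform() preserves the length, order, and index of the input,
# which is what makes the column assignment above safe.
scaled = df.groupby('col_1')['col_2'].transform(lambda x: (x - x.mean()) / x.std())
print(scaled.index.equals(df.index))  # True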
One Column to Scalar

We wish to perform a grouped data transformation operation where the input is one column and the output is a scalar value which is replicated to the size of the group.
In this example, we wish to create a new column 'col_2_max' that holds the max value of the column 'col_2' for each group, where the groups are defined by the column 'col_1'.
df['col_2_max'] = (
    df
    .groupby('col_1')['col_2']
    .transform(max)
)
Here is how this works:
- We group by 'col_1' then select 'col_2' to be transformed. In this case, only col_2 is passed to transform().
- When the function passed to transform() returns a scalar, transform() automatically "broadcasts" the scalar value to the size of the group. In other words, transform() repeats the scalar to match the number of rows in the group (see the sketch below).
- Here we transform just the one selected column col_2. More generally: for each group, each selected column is passed separately to transform(). See Implicit Grouped Transformation for applying the same grouped transformation operation to multiple columns.
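A minimal sketch of the broadcasting behavior, assuming the hypothetical sample df above (the string alias 'max' is equivalent to passing the max builtin here):

# Each row receives its group's maximum, so the output has one value
# per row of df rather than one value per group.
out = df.groupby('col_1')['col_2'].transform('max')
print(len(out) == len(df))  # True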
Multiple Columns to Vector

We wish to perform a grouped data transformation operation where the input is multiple columns and the output is a Series of the same size as the input.
In this example, we wish to create a new column col_4 that is the value of the column col_2 minus the mean of the values of the column col_3 for the group, where the groups are defined by the column col_1.
df['col_4'] = (
    df
    .groupby('col_1', group_keys=False)
    .apply(lambda x: x['col_2'] - x['col_3'].mean())
)
Here is how this works:
transform()
can only act on one column at a time. Therefore, we didn't use transform()
because the transformation
we wish to carry out involves the interaction of more than one column. Instead we use apply()
.apply()
is applied to a grouped data frame, all columns for each group are passed to apply()
at once as a
data frame. Note that this is contrast to applying apply()
to an ungrouped data frame in which case the columns or
the rows are passed one at a time (as a Series
).group_keys=Flase
to return a single index corresponding to the column being transformed without an index
layer corresponding to the group so we may assign the output to df['col_4']
(and to avoid
a TypeError: incompatible index of inserted column with frame index
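As a sketch of the group_keys point (hypothetical sample df assumed), compare the index of the output when the group keys are kept:

# With group_keys=True (the default), apply() prepends the group label as an
# extra index level, which is what breaks the assignment to df['col_4'].
out = df.groupby('col_1', group_keys=True).apply(lambda x: x['col_2'] - x['col_3'].mean())
print(out.index.nlevels)  # 2: the 'col_1' level plus the original row index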
Multiple Columns to Scalar

We wish to perform a grouped data transformation operation where the input is multiple columns and the output is a single scalar value which we wish to "broadcast" to the size of the group.
In this example, we wish to create a new column 'col_7' that is the difference between the max values of the columns 'col_2' and 'col_3' for each group, where the groups are defined by the values of the column 'col_1'.
df_t2 = (
    df
    .groupby('col_1', as_index=False)
    .apply(lambda x: pd.Series({
        'col_7': max(x['col_2']) - max(x['col_3'])
    }))
)
df_3 = pd.merge(df, df_t2, on='col_1', how='left')
Here is how this works:
- We can't use transform() because the transformation we wish to carry out involves more than one column. Instead we use apply(). See the description of "Multiple Columns to Vector" above.
- While the function passed to transform() can return a scalar, which transform() broadcasts to the size of the group, apply() doesn't broadcast the scalar output to the size of the group. Instead, apply() returns an aggregate data frame (see Aggregation).
- The function passed to apply() returns a Series per group (in this case, a series of one element).
- The columns of the output data frame are the elements of the Series returned by the applied function, which in this case is 'col_7'.
- We use as_index=False so the grouping variable is returned as a regular column, not as an index. We will use that as the join column.
- We use pd.merge() to join the tables, with on='col_1' specifying the join column and how='left' a left join (see Reshaping). An alternative that avoids the merge is sketched below.
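An alternative sketch that avoids the merge, assuming the hypothetical sample df above: compute the per-group scalar as a Series indexed by the group key, then broadcast it back with map():

# apply() with the default as_index=True returns a Series indexed by the
# 'col_1' values; map() then broadcasts each group's scalar to its rows.
s = df.groupby('col_1').apply(lambda x: max(x['col_2']) - max(x['col_3']))
df['col_7'] = df['col_1'].map(s)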
Usually we have a long list of grouped transformation operations that do not necessarily have the same input types (single column vs multiple columns) or output types (vector vs scalar). In that case, we can encapsulate all the operations in one custom function and apply it via pipe():
def custom_transform(df):
    df_c = df.copy()
    df_c['col_4'] = (
        df_c
        .groupby('col_1', group_keys=False)['col_2']
        .transform(lambda x: (x - x.mean()) / x.std())
    )
    df_c['col_5'] = (
        df_c
        .groupby('col_1', group_keys=False)['col_2']
        .transform(max)
    )
    df_c['col_6'] = (
        df_c
        .groupby('col_1', group_keys=False)
        .apply(lambda x: x['col_2'] - x['col_3'].mean())
    )
    df_t = (
        df_c
        .groupby('col_1', as_index=False)
        .apply(lambda x: pd.Series({
            'col_7': x['col_4'].max() - x['col_5'].max(),
            'col_8': (x['col_4'] * x['col_5']).sum()}))
    )
    df_c = pd.merge(df_c, df_t, on='col_1', how='left')
    return df_c

df_2 = df.pipe(custom_transform)
Here is how this works:
- We encapsulate all the data transformation operations in one function, custom_transform().
- We use the pipe() method to pass the data frame df to the custom function custom_transform().
- The custom_transform() function starts by creating a deep copy of the passed data frame via df_c = df.copy().
- The operations inside custom_transform() are the different examples we covered earlier in this section. A usage sketch follows below.
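Because pipe() returns a data frame, custom_transform() also slots into a larger method chain; here is a sketch with hypothetical follow-up steps:

df_2 = (
    df
    .pipe(custom_transform)
    # hypothetical further steps in the same chain:
    .sort_values('col_1')
    .reset_index(drop=True))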