In this section, we cover implicit grouped transformation, i.e. applying one or more data transformation expressions to one or more columns of a grouped data frame succinctly, without having to spell out each data transformation explicitly.
We will cover two scenarios:
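- One function: applying a single function to multiple columns (of a DataFrameGroupBy object).
- Multiple functions: applying several functions to multiple columns (of a DataFrameGroupBy object).

# Assumes pandas is imported as pd and df is an existing data frame.
df_n = (df
        .groupby('col_3')
        [df.columns[(df.dtypes == 'float64')]]
        .transform(lambda x: (x - x.mean()) / x.std())
        .add_suffix('_' + 'scaled'))
df_2 = pd.concat([df, df_n], axis=1)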
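To try the snippet above on concrete data, here is a minimal, self-contained sketch. The toy data frame and its values are assumptions made up for illustration; only the column name col_3 and the float64 selection come from the original code.

```python
import pandas as pd

# Toy data (made up for illustration): two float columns and one grouping column.
df = pd.DataFrame({
    'col_1': [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
    'col_2': [2.0, 4.0, 6.0, 5.0, 5.5, 6.0],
    'col_3': ['a', 'a', 'a', 'b', 'b', 'b'],
})

# Pick the float columns, then standardise each one within its group.
float_cols = df.columns[df.dtypes == 'float64']
df_n = (df
        .groupby('col_3')[float_cols]
        .transform(lambda x: (x - x.mean()) / x.std())
        .add_suffix('_scaled'))

# Column-bind the new columns to the original data frame.
df_2 = pd.concat([df, df_n], axis=1)
print(df_2)
```

With these values, each scaled column comes out as -1, 0, 1 within each group, because every group is standardised against its own mean and standard deviation.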
Here is how this works:
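- We group the data frame by 'col_3' in groupby('col_3').
- We select the columns to transform with [df.columns[(df.dtypes == 'float64')]]. All implicit column selection approaches described in Columns Selection for Implicit Transformation can be applied.
- We call transform(), which applies the given function to each of the selected columns within each group.
- If only the transformed columns are needed, we can skip the concat() step and possibly the add_suffix() step too.

def m_transform(df_g, apply_fns):
    # Apply each function with its own transform() call, suffix the
    # resulting columns with the function's name, then column-bind the results.
    df_n = pd.concat([
        df_g
        .transform(fn)
        .add_suffix('_' + fn_name)
        for fn_name, fn in apply_fns.items()], axis=1)
    return df_n

df_n = (df.groupby('col_3')
        [df.columns[(df.dtypes == 'float64')]]
        .pipe(m_transform,
              {'max': max,
               'scaled': lambda x: (x - x.mean()) / x.std()}))
df_2 = pd.concat([df, df_n], axis=1)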
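As a rough check of the helper above, the following sketch runs m_transform() on a made-up data frame. The toy values are assumptions, and the string 'max' (pandas' own group-wise maximum) is swapped in for Python's built-in max so that the sketch does not rely on how pandas handles built-ins.

```python
import pandas as pd

def m_transform(df_g, apply_fns):
    # One transform() call per function; suffix each result with its name.
    return pd.concat(
        [df_g.transform(fn).add_suffix('_' + fn_name)
         for fn_name, fn in apply_fns.items()],
        axis=1)

# Toy data (made up for illustration).
df = pd.DataFrame({
    'col_1': [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
    'col_2': [2.0, 4.0, 6.0, 5.0, 5.5, 6.0],
    'col_3': ['a', 'a', 'a', 'b', 'b', 'b'],
})

float_cols = df.columns[df.dtypes == 'float64']
df_n = (df.groupby('col_3')[float_cols]
        .pipe(m_transform,
              {'max': 'max',  # assumption: string name instead of built-in max
               'scaled': lambda x: (x - x.mean()) / x.std()}))
df_2 = pd.concat([df, df_n], axis=1)

# Two selected columns times two functions gives four new columns.
print(df_n.columns.tolist())
```

The new data frame has the same number of rows as df and one column per pair of column and function, named by joining the column name and the dictionary key.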
Here is how this works:
- The transform() method of DataFrameGroupBy objects accepts only one function; it does not accept a list of functions.
- We therefore define a helper m_transform(), which takes a grouped data frame df_g whose columns are the columns we wish to transform, and a dictionary apply_fns where the values hold the functions we wish to apply and the keys hold the names we wish to attach to the column names for the corresponding transformation.
- m_transform() returns a data frame with the same number of rows as the input and one column for each pair of input column and function; i.e. if we have two selected columns and two transformations to apply, we expect the output to have 4 columns.
- We use pipe() to call m_transform() and store the output in a new data frame df_n.
- We use the pd.concat() function to column bind (hence axis=1) the data frame of new columns df_n returned by m_transform() to the original data frame df.
- m_transform() works as follows:
  - A single function fn is applied to the grouped data frame via df_g.transform(fn).
  - A list comprehension calls df_g.transform(fn) for each function in apply_fns.
  - We use the add_suffix() function to rename the output columns by adding the function name to the column name, giving col_fn.
  - We use the pd.concat() function to column bind the data frames resulting from all transformations.
- lambda functions have no meaningful name (their __name__ is '<lambda>'). For named functions, we could pass a simple list of functions instead of a dictionary and use fn.__name__ to extract each function's name as the argument to add_suffix(). For lambda functions, we need to pass the names explicitly, which is why apply_fns is a dictionary.

Alternatively:
df_g = df.groupby('col_3')
selected_columns = df.columns[(df.dtypes == 'float64')]
df_n_1 = (df_g[selected_columns]
          .transform(max)
          .add_suffix('_' + 'max'))
df_n_2 = (df_g[selected_columns]
          .transform(lambda x: (x - x.mean()) / x.std())
          .add_suffix('_' + 'scaled'))
df_2 = pd.concat([df, df_n_1, df_n_2], axis=1)
Here is how this works:
- transform(), when applied to a grouped data frame, can only take a single function at a time. Therefore, we need to call transform() separately for each transformation we wish to carry out.
- We start by creating the DataFrameGroupBy object df_g and selecting the target columns selected_columns.
- We use the pd.concat() function to column bind (hence axis=1) the data frames df_n_1 and df_n_2 to the original data frame df.
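Returning to the remark on function names: when every function is a named function, the dictionary can be replaced by a plain list, as sketched below. The names m_transform_named(), scaled() and centered(), as well as the toy data, are hypothetical and introduced here only for illustration; they are not part of the original code.

```python
import pandas as pd

def m_transform_named(df_g, fns):
    # Hypothetical variant of m_transform(): takes a list of named functions
    # and uses each function's __name__ as the column suffix.
    return pd.concat(
        [df_g.transform(fn).add_suffix('_' + fn.__name__) for fn in fns],
        axis=1)

def scaled(x):
    # Standardise within the group.
    return (x - x.mean()) / x.std()

def centered(x):
    # Subtract the group mean.
    return x - x.mean()

# Toy data (made up for illustration).
df = pd.DataFrame({
    'col_1': [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
    'col_2': [2.0, 4.0, 6.0, 5.0, 5.5, 6.0],
    'col_3': ['a', 'a', 'a', 'b', 'b', 'b'],
})

float_cols = df.columns[df.dtypes == 'float64']
df_n = df.groupby('col_3')[float_cols].pipe(m_transform_named, [scaled, centered])
print(df_n.columns.tolist())

# A lambda only reports a generic name, so it could not label its columns.
lambda_name = (lambda x: x).__name__
print(lambda_name)
```

This variant saves typing the names twice, at the cost of not supporting lambda functions, since every lambda would produce the same uninformative suffix.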