Implicit Grouped Transformation

In this section, we cover implicit grouped transformation, i.e. applying one or more data transformation expressions to one or more columns of a grouped data frame succinctly, without spelling out each transformation explicitly.

We will cover two scenarios:

  • Single Function, where we cover how to apply one function or lambda function to multiple columns of a grouped data frame (a DataFrameGroupBy object).
  • Multiple Functions, where we cover how to apply multiple functions or lambda functions to multiple columns of a grouped data frame (a DataFrameGroupBy object).

Single Function

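# Standardize every float column within each 'col_3' group.
df_n = (df
        .groupby('col_3')
        [df.columns[(df.dtypes == 'float64')]]
        .transform(lambda x: (x - x.mean()) / x.std())
        .add_suffix('_' + 'scaled'))
# Column bind the new columns to the original data frame.
df_2 = pd.concat([df, df_n], axis=1)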

Here is how this works:

  • We group by 'col_3' in groupby('col_3').
  • We then select all columns of a double data type via [df.columns[(df.dtypes == 'float64')]]. All implicit column selection approaches described in Columns Selection for Implicit Transformation can be applied.
  • The selected columns are then passed to transform(), which applies the given function to each selected column within each group and returns a result with the same number of rows as the original data frame.
  • If we do not wish to append the new columns to the original data frame, we can skip the concat() step and possibly the add_suffix() step too.
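To make the steps above concrete, here is a minimal self-contained sketch. The toy data frame and its values are hypothetical and chosen only for illustration; the column names mirror the ones used above.

```python
import pandas as pd

# Hypothetical toy data: two float columns and one grouping column.
df = pd.DataFrame({
    'col_1': [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
    'col_2': [5.0, 5.5, 6.0, 1.0, 3.0, 5.0],
    'col_3': ['a', 'a', 'a', 'b', 'b', 'b'],
})

# Standardize each float column within its 'col_3' group.
df_n = (df
        .groupby('col_3')
        [df.columns[(df.dtypes == 'float64')]]
        .transform(lambda x: (x - x.mean()) / x.std())
        .add_suffix('_scaled'))

# Column bind the scaled columns to the original data frame.
df_2 = pd.concat([df, df_n], axis=1)
print(df_2)
```

Each value is scaled using the mean and standard deviation of its own group rather than those of the whole column.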

Multiple Functions

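def m_transform(df_g, apply_fns):
    # Apply each function to the grouped columns, suffix the output
    # columns with the function's name, then column bind the results.
    df_n = pd.concat([
        df_g
        .transform(fn)
        .add_suffix('_' + fn_name)
        for fn_name, fn in apply_fns.items()], axis=1)
    return df_n

# Keys are the suffixes to attach; values are the functions to apply.
df_n = (df.groupby('col_3')
        [df.columns[(df.dtypes == 'float64')]]
        .pipe(m_transform,
              {'max': max,
               'scaled': lambda x: (x - x.mean()) / x.std()}))
df_2 = pd.concat([df, df_n], axis=1)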

Here is how this works:

  • The transform() method of DataFrameGroupBy objects only accepts one function. It doesn't accept a list of functions.
  • We define a custom function m_transform() which takes two arguments: a grouped data frame df_g whose columns are the columns we wish to transform, and a dictionary apply_fns whose values are the functions we wish to apply and whose keys are the names we wish to append to the column names for the corresponding transformation.
  • The custom function m_transform() returns a data frame with the same number of rows as the input and one column for each (input column, function) pair; e.g. if we select two columns and apply two transformations, we expect the output to have 4 columns.
  • We pipe the grouped data frame and the function dictionary to m_transform() and store the output in a new data frame df_n.
  • We use the Pandas pd.concat() function to column bind (hence axis=1) the data frame of new columns df_n returned by m_transform() to the original data frame df.
  • m_transform() works as follows:
    • A list comprehension is used to iterate over the items of the input function dictionary and apply each function to the grouped data frame via df_g.transform(fn).
    • The output is a list of data frames corresponding to applying df_g.transform(fn) for each function in apply_fns.
    • We use the add_suffix() function to rename the output columns by appending the function name to each column name, giving names of the form col_fn.
    • We use the Pandas pd.concat() function to column bind the data frames resulting from all transformations.
  • Note that lambda functions have no usable name: fn.__name__ is '<lambda>' for every lambda function. If we only had named functions, we could pass a simple list of functions instead of a dictionary and use fn.__name__ as the argument to add_suffix() to rename the output columns. Because we also apply a lambda function, we need to pass the names explicitly, hence the dictionary.
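The list-based variant mentioned in the last point could look like the sketch below. The helper m_transform_named(), the functions centered() and scaled(), and the toy data are hypothetical names and values introduced only for this illustration.

```python
import pandas as pd

def m_transform_named(df_g, fns):
    # Hypothetical variant of m_transform(): takes a list of named
    # functions and derives each suffix from fn.__name__.
    return pd.concat(
        [df_g.transform(fn).add_suffix('_' + fn.__name__) for fn in fns],
        axis=1)

def centered(x):
    # Subtract the group mean.
    return x - x.mean()

def scaled(x):
    # Standardize within the group.
    return (x - x.mean()) / x.std()

# Hypothetical toy data.
df = pd.DataFrame({
    'col_1': [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
    'col_2': [5.0, 5.5, 6.0, 1.0, 3.0, 5.0],
    'col_3': ['a', 'a', 'a', 'b', 'b', 'b'],
})

df_n = (df.groupby('col_3')
        [['col_1', 'col_2']]
        .pipe(m_transform_named, [centered, scaled]))
print(df_n.columns.tolist())
```

This only works when every function has a distinct, meaningful __name__; two lambda functions would both produce the suffix '_<lambda>' and yield duplicate column names.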

Alternatively:

# Create the grouped object and the column selection once, then reuse them.
df_g = df.groupby('col_3')
selected_columns = df.columns[(df.dtypes == 'float64')]
# One transform() call per transformation.
df_n_1 = (df_g[selected_columns]
          .transform(max)
          .add_suffix('_' + 'max'))
df_n_2 = (df_g[selected_columns]
          .transform(lambda x: (x - x.mean()) / x.std())
          .add_suffix('_' + 'scaled'))
df_2 = pd.concat([df, df_n_1, df_n_2], axis=1)

Here is how this works:

  • transform() when applied to a grouped data frame can only take a single function at a time. Therefore, we need to call transform() separately for each transformation we wish to carry out.
  • To avoid redundancy, we start by creating a DataFrameGroupBy object df_g and selecting the target columns selected_columns.
  • We use the Pandas pd.concat() function to column bind (hence axis=1) the data frames df_n_1 and df_n_2 to the original data frame df.
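The column binding step relies on the output of transform() carrying the same row index as the original data frame, so pd.concat() with axis=1 lines the rows up correctly. The sketch below illustrates this on a hypothetical toy data frame with a non-default index; it passes the string 'max' to transform(), which pandas also accepts in place of a function.

```python
import pandas as pd

# Hypothetical toy data with a non-default row index.
df = pd.DataFrame(
    {'col_1': [1.0, 4.0, 2.0, 8.0],
     'col_3': ['a', 'b', 'a', 'b']},
    index=[10, 20, 30, 40])

# Group-wise maximum, broadcast back to every row of its group.
df_n = (df.groupby('col_3')
        [['col_1']]
        .transform('max')
        .add_suffix('_max'))

# The result keeps the original index, so the rows align on concat.
print(df_n.index.equals(df.index))
df_2 = pd.concat([df, df_n], axis=1)
print(df_2)
```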