Grouped Transformation

Quite often we need to apply data transformation operations to sub data frames or groups of a data frame individually. We call this Grouped Transformation. A common grouped transformation scenario is replacing missing values with the mean or the median for the group. Another common grouped transformation scenario is scaling the data by subtracting the group’s mean and dividing by the group’s standard deviation.

General Solution

df_2 = df
    .groupby('col_1', group_keys=False)
    .apply(lambda x:
           x.assign(
               col_4=(x['col_2'] - x['col_2'].mean()) / x['col_2'].std(),
               col_5=x['col_3'].max(),
               col_6=x['col_2'] - x['col_3'].mean(),
               col_7=lambda y: max(y['col_4']) - max(y['col_5'])))

Here is how this works:

  • The preferred approach to data transformation is via assign() as described in Basic Transformation.
  • However, we can't use assign() to execute data transformations on a grouped data frame (a DataFrameGroupBy object) . We get an AttributeError: 'DataFrameGroupBy' object has no attribute 'assign'.
  • A solution that allows us to still use assign() o perform grouped data transformation, and benefit from its flexibility and maintain a consistent approach to data transformation, is to call assign() through a lambda function inside apply().
  • When apply() is applied to a grouped data frame (a DataFrameGroupBy object), all columns for each group are passed to apply() as a data frame. Note that this is contrast to applying apply() to an ungrouped data frame ( a DataFrame object) in which case the columns or the rows are passed one at a time (as a Series).
  • We set group_keys=False to retain the original index of the data frame because our aim is to perform a transformation and not an aggregation operation.
  • Inside the call to assign(), we can carry out all the typical data transformation scenarios, such as those we cover in Basic Transformation. In particular:
    • We can use any one or more columns as inputs to a data transformation operation.
    • If an operation returns a single scalar value e.g. col_5 = x['col_3'].max(), the scalar value will be replicated as many times as the number of rows in the group.
    • If we wish to use columns that were created in the same call to assign(), we need to employ a second lambda function like we do in col_7 = lambda y: max(y['col_4']) - max(y['col_5']))).

Alternatively:

Should we not wish to use assign() for any reason, the approach that we need to follow to perform grouped data transformation operations in Pandas depends on the nature of the input and the output of the operation as follows:

  • Input i.e. whether the grouped transformation operation acts on:
    • One input column
    • More than one input column
  • Output i.e. whether the output of the grouped data transformation is:
    • A vector (Series) of the same size as the input (same number of rows as the group)
    • A single scalar value which we wish to replicate as many times as there are rows in the group often referred to in Pandas lingo as: "broadcast" to the size of the group.

This alternative solutions are, therefore, organized as follows:

  1. One Column to Vector where we cover how to perform grouped data transformation operations where the input is one column and the output is a Series of the same size as the input.
  2. One Column to Scalar where we cover how to perform grouped data transformation operations where the input is one column and the output is a single scalar value which we wish to “broadcast” to the size of the group.
  3. Multiple Columns to Vector where we cover how to perform grouped data transformation operations where the input is multiple columns and the output is a Series of the same size as the input.
  4. Multiple Columns to Scalar where we cover how to perform grouped data transformation operations where the input is multiple columns and the output is a single scalar value which we wish to "broadcast" to the size of the group.
  5. Multiple Grouped Transformations where we show how to carry out multiple grouped data transformations in a chained and more succinct form.

One Column to Vector

We wish to perform a grouped data transformation operation where the input is one column and the output is a Series of the same size as the input.

In this example, we wish to scale the values of the column ‘col_2' by subtracting the mean for the group and dividing by the group’s standard deviation where the groups are defined by the values of the column ‘col_1'.

df['col_2_scaled'] = df
    .groupby('col_1')['col_2']
    .transform(lambda x: (x - x.mean()) / x.std())

Here is how this works:

  • We group by 'col_1' then select 'col_2' to be transformed.
  • For each group, each selected column is passed separately to transform(). In this case the only selected column is col_2.
  • transform() applies the function passed to it (in this case a lambda function) to each column individually.
  • The output of the transform() method has the the same number of rows and the same index as the input. Therefore, we can assign the output of transform() to create a new column (or overwrite an existing column) in the input data frame.
  • The transform() method is, therefore, a useful approach when we need to perform operations in a grouped context and then add that back to the original data frame.
  • We could achieve the same result via df.groupby('col_1').transform(fun)['col_2'] but that is inefficient. We would be transforming all columns then discarding all that output except for the transformed col_2.

One Column to Scalar

We wish to perform a grouped data transformation operation where the input is one column and the output is a scalar value which is replicated to the size as the group.

In this example, we wish to create a new column ‘col_2_max' that holds the max value of the column ‘col_2' for each group where the groups are defined by the column ‘col_1'.

df['col_2_max'] = df
    .groupby('col_1')['col_2']
    .transform(max)

Here is how this works:

  • We group by 'col_1' then select 'col_2' to be transformed.
  • For each group, the selected column col_2 is passed to transform().
  • If the result is a scaler, as in this case, transform() automatically "broadcasts" the scaler value to the size of the group. In other words, transform repeats the scaler to match the number of rows in the group.
  • In this case we only have one selected column which is col_2. More generally: For each group, each selected column is passed separately to transform(). See Implicit Grouped Transformation for applying the same grouped transformation operation to multiple columns.
  • See “One Column to Vector” above for more on transform().

Multiple Columns to Vector

We wish to perform a grouped data transformation operation where the input is multiple columns and the output is a Series of the same size as the input.

In this example, we wish to create a new column col_4 that is the value of the column col_2 minus the mean of the values of the column col_3 for the group where the groups are defined by the column col_1.

df['col_4'] = df
    .groupby('col_1', group_keys=False)
    .apply(lambda x: x['col_2'] - x['col_3'].mean())

Here is how this works:

  • transform() can only act on one column at a time. Therefore, we didn't use transform() because the transformation we wish to carry out involves the interaction of more than one column. Instead we use apply().
  • When apply() is applied to a grouped data frame, all columns for each group are passed to apply() at once as a data frame. Note that this is contrast to applying apply() to an ungrouped data frame in which case the columns or the rows are passed one at a time (as a Series).
  • We set group_keys=Flase to return a single index corresponding to the column being transformed without an index layer corresponding to the group so we may assign the output to df['col_4'] (and to avoid a TypeError: incompatible index of inserted column with frame index error).

Multiple Columns to Scalar

We wish to perform a grouped data transformation operation where the input is multiple columns and the output is a single scalar value which we wish to "broadcast" to the size of the group.

In this example, we wish to create a new column ‘col_7’ that is the difference between the max values for the columns ‘col_2’ and ‘col_3’ for each group where the groups are defined by the value of the column ‘col_1’.

df_t2 = df
    .groupby('col_1', as_index=False)
    .apply(lambda x:
           pd.Series({
               'col_7': max(x['col_2']) - max(x['col_3'])
           }))
df_3 = pd.merge(df, df_t2, on='col_1', how='left')

Here is how this words:

  • We didn't use transform() because the transformation we wish to carry out involves more than one column. Instead we use apply(). See the description of “Multiple Columns to Vector “ above.
  • While a function passed to transform() can return a scalar which transform() broadcasts to the size of the group, apply() doesn't broadcast the scalar output to the size of the group. Instead, apply() returns an aggregate data frame (see Aggregation).
  • An acceptable approach when we need to perform a grouped transformation that involves multiple columns and returns a scaler for each group is to generate an aggregate data frame that we then left join with the original data frame.
  • We have the lambda function return one Series per group (in this case, a series of one element).
    • Pandas row binds these series into a data frame.
    • The column names of the resulting data frame match the index of the Series returned by the applied function, which in this case is ‘col_7’.
  • We set as_index=False so the grouping variable is returned as a regular column not as an index. We will use that as the join column.
  • We use pd.merge() to join the tables on='col_1' the join column and how='left' for a left join ( See Reshaping).

Multiple Transformations

Usually we have a long list of grouped transformation operations that do not necessarily have the same input types ( single column vs multiple columns) or output types (vector vs scalar).

def custom_transform(df):
    df_c = df.copy()

    df_c['col_4'] = df_c
        .groupby('col_1', group_keys=False)['col_2']
        .transform(lambda x: (x - x.mean()) / x.std())

    df_c['col_5'] = df_c
        .groupby('col_1', group_keys=False)['col_2']
        .transform(max)

    df_c['col_6'] = df_c
        .groupby('col_1', group_keys=False)
        .apply(lambda x: x['col_2'] - x['col_3'].mean())

    df_t = df_c
        .groupby('col_1', group_keys=False)
        .apply(lambda x: pd.Series({
        'col_7': x['col_4'].max() - x['col_5'].max(),
        'col_8': (x['col_4'] * x['col_5']).sum()}))
    df_c = pd.merge(df_c, df_t, on='col_1', how='left')

    return df_c


df_2 = df
    .pipe(custom_transform)

Here is how this works:

  • We isolate the grouped transformation operations in a custom function custom_transform().
  • We use the pipe() method to pass the data frame df to the custom function custom_transform().
  • Standard practice for chaining is that the function would not alter the passed data frame but would return a new data frame with the operations executed. Therefore, our custom_transform() function starts by creating a deep copy of the passed data frame via df_c = df.copy().
  • The grouped data transformation operations inside custom_transform() are the different examples we covered earlier in this section.
  • In the sub section “Multiple Columns to Scalar“ we showed how to create one new column as an output of a grouped data transformation operation involving multiple input columns and returning a scalar output (replicated for each row of the group on joining). In this code snippet, we see how to extend that scenario to creating multiple columns (each of which returning a scalar that is then replicated for the group on joining).
PYTHON
I/O