Grouped Transformation

We often need to apply data transformation operations to each group (sub data frame) of a data frame individually. We call this grouped transformation. One common scenario is replacing missing values with the mean or median of the group they belong to. Another is scaling the data by subtracting the group’s mean and dividing by the group’s standard deviation.
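As a minimal sketch of the first scenario, the snippet below fills missing values with the mean of their group. The toy data frame and the name df_imputed are hypothetical and chosen only for illustration; ifelse() and mean(..., na.rm = TRUE) are one way to express the replacement, not the only one.

```r
library(dplyr)

# Hypothetical toy data: each group has one missing value in col_2.
df = data.frame(
  col_1 = c("a", "a", "a", "b", "b", "b"),
  col_2 = c(1, NA, 3, 10, 20, NA)
)

df_imputed = df %>%
  group_by(col_1) %>%
  # Within each group, replace NA with the mean of the group's non-missing values.
  mutate(col_2 = ifelse(is.na(col_2), mean(col_2, na.rm = TRUE), col_2)) %>%
  ungroup()
```

The missing value in group "a" becomes 2 (the mean of 1 and 3) and the one in group "b" becomes 15 (the mean of 10 and 20), rather than both receiving the overall mean.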

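library(dplyr)

df_2 = df %>%
  group_by(col_1) %>%
  mutate(
    # Standardize col_2 within each group
    col_4 = (col_2 - mean(col_2)) / sd(col_2),
    # Group maximum of col_3, repeated on every row of the group
    col_5 = max(col_3),
    # Deviation of col_2 from the group mean of col_3
    col_6 = col_2 - mean(col_3),
    # Uses col_4 and col_5 created earlier in this mutate()
    col_7 = max(col_4) - max(col_5)
  )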

Here is how this works:

  • To perform grouped data transformation operations, we simply call group_by() before mutate().
  • When mutate() is called after group_by(), the data transformations passed to it are executed for each group individually.
  • The result for each group is a data frame with the same number of rows as the group, with the newly created columns added on the right (or existing columns overwritten).
  • Inside the call to mutate(), we can carry out all the typical data transformation scenarios, such as those we cover in Basic Transformation. In particular:
    • We can use any one or more columns as inputs to a data transformation operation.
    • If an operation returns a single value, e.g. col_5 = max(col_3), that value is replicated as many times as there are rows in the group.
    • We can use columns created earlier in the same mutate() statement as inputs to data transformation expressions.
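The points above can be checked on a small example. The following is a sketch on a hypothetical toy data frame (the values are made up for illustration); it shows a single group value being replicated, two columns used as inputs to one expression, and a column created earlier in the same mutate() being reused.

```r
library(dplyr)

# Hypothetical toy data: group "a" has 2 rows, group "b" has 3 rows.
df = data.frame(
  col_1 = c("a", "a", "b", "b", "b"),
  col_2 = c(1, 3, 10, 20, 30),
  col_3 = c(5, 7, 2, 4, 6)
)

df_2 = df %>%
  group_by(col_1) %>%
  mutate(
    col_5 = max(col_3),           # one value per group, replicated on each row
    col_6 = col_2 - mean(col_3),  # two columns as inputs
    col_7 = col_5 - min(col_3)    # reuses col_5 created just above
  ) %>%
  ungroup()
```

The result has the same 5 rows as the input. col_5 is 7 on both rows of group "a" and 6 on all three rows of group "b", and col_7 is 2 for group "a" and 4 for group "b". The trailing ungroup() is optional; it only drops the grouping so that later operations are not grouped by accident.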