Function Specification for Implicit Transformation

We wish to specify one or more data transformation functions to apply to each of the selected columns without spelling out each data transformation expression explicitly.

In this section we cover how to apply each of the following to one or more selected columns:

  • One Named Function
  • Multiple Named Functions
  • One Lambda Function
  • Multiple Lambda Functions
  • One Aggregating Function
  • Multiple Aggregating Functions

In General Solution, we present a generic and chainable custom solution for implicit data transformation operations.

This section is complemented by

  • Column Selection where we cover how to select the columns to each of which we will apply data transformation logic.
  • Output Naming where we cover how to specify the name(s) of output column(s) created by the implicit data transformation operations.

One Named Function

We wish to transform a set of columns of a data frame by applying the same data transformation function to each column.

In this example, we wish to apply the function round() to round the values of each of the columns ‘col_1', ‘col_2', and ‘col_4' to the nearest integer.

selected_cols = ['col_1', 'col_2', 'col_4']
df_n = df[selected_cols].apply(round)
df_2 = df.copy()
df_2[selected_cols] = df_n

Here is how this works:

  • assign(), which we used in Basic Transformation, can create only one column at a time.
  • In order to apply one data transformation function to multiple columns: We simply call apply() on a data frame of the selected columns and pass the desired function to apply() which we do in df[selected_cols].apply(round).
  • In df_2 = df.copy(), we create a deep copy of the data frame df via df.copy().
  • In df_2[selected_columns] = df_n, we overwrite the original columns with the transformed columns while keeping other columns in df_2 unaffected.
  • In this solution the new columns overwrite the original columns in df_2 which is a copy of the input data frame df.
    • We can skip the last two steps if the desired output is a data frame containing only the transformed columns which in this solution is df_n.
    • See Output Naming if we want to add the new columns as additional columns with new names to the input data frame.
  • See “General Solution” below for a chainable solution.

Multiple Named Functions

We wish to perform multiple data transformation operations to each of a set of columns of a data frame without having to spell out each data transformation explicitly.

In this example, we wish to apply two data transformations to each of the columns ‘col_1' and ‘col_2' of the data frame df. The two transformations are: round() to round the values to the nearest integer and abs() to obtain the absolute value

df_n = df[['col_1', 'col_2']].apply([round, abs])
df_2 = pd.concat([df, df_n], axis=1)

Here is how this works:

  • In order to apply multiple data transformation functions to one or more columns: We call apply() on a data frame of the selected columns and pass to apply() a list of the functions which we wish to apply which we do in df[['col_1', 'col_2']].apply([round, abs]).
  • The transformation df.loc[:, ['col_1', 'col_2']].apply([round, abs]) returns four columns with a Pandas MultiIndex where level 0 is the column names and level 1 is the function names.
  • In pd.concat([df, df_n], axis=1), we column bind (left join) the new columns from df_n to the original data frame df. See Reshaping.
  • On joining the two data frames, the index of df_n is converted from a MultiIndex to a single Index with columns labels: ('col_1', 'round'), (col_1, 'abs'), ('col_2', 'round'), ('col_2', 'abs'). See Output Naming for how to name the output columns.
  • See “General Solution” below for a chainable solution.

One Lambda Function

We wish to transform a set of columns of a data frame by applying the same data transformation lambda function to each column.

In this example, we wish to apply a lambda function that applies an arithmatic operation to the values of each of the columns ‘col_1', ‘col_2', and ‘col_4'.

selected_cols = ['col_1', 'col_2', 'col_4']
df_n = df[selected_cols].apply(lambda x: x * 22 / 7)
df_2 = df.copy()
df_2[selected_cols] = df_n    

Here is how this works:

  • In order to apply one data transformation lambda function to multiple columns: We simply call apply() on a data frame of the selected columns and pass the desired lambda function to apply() which we do in df[selected_cols].apply(lambda x: x * 22 / 7).
  • The rest of the code works similarly to the “Named Function” scenario described above.
  • See “General Solution” below for a chainable solution.

Multiple Lambda Functions

We wish to apply multiple data transformation lambda functions to each of a set of columns of a data frame without having to spell out each data transformation explicitly.

selected_cols = ['col_1', 'col_2']
df_n = df[selected_cols].apply([lambda x: round(x, 2),
                                lambda x: x - x.shift(1)])
df_2 = pd.concat([df, df_n], axis=1)

Here is how this works:

  • We get duplicated output columns names which in this case are: ('col_1', '<lambda>'), ('col_1', '<lambda>'), ('col_2', '<lambda>'), ('col_2', '<lambda>'). The duplication is because all lambda functions are given the name lambda by default). See Output Naming for how to deal with this.
  • pd.concat() allows us to work with columns of duplicate names.
  • Assignment via e.g. df_2[selected_cols] = df_n won’t work. We would get “ValueError: Columns must be same length as key”.

One Aggregating Function

We wish to add to a data frame columns of a constant value produced by applying one aggregating function to one or more selected columns.

In this example, we wish to add to the data frame df constant value columns each holding the max value of one of a selected set of columns.

df_2 = df.copy()
s_n = df[['col_1', 'col_2']].apply('max')
df_2[s_n.index] = s_n

Here is how this works:

  • When given an aggregating function such as max(), apply() returns a Series with as many values as the columns (or rows if axis=1).
  • The index labels of the output Series are the column labels of the input columns.
  • The output of df[['col_1', 'col_2']].apply('max') is, therefore, a Series of two elements; the max value of col_1 and the max value of col_2.
  • In df_2[s_n.index] we select the columns of df_2 whose names correspond to the index of the Series which selects the columns col_1 and col_2.
  • In df_2[s_n.index] = s_n, we overwrite the original columns with the values in s_n. On assignment, the values in s_n are replicated to all rows of df_2.

Multiple Aggregating Functions

We wish to add to a data frame columns of a constant value produced by applying multiple aggregating functions to one or more selected columns.

In this example, we wish to add to the data frame df two constant value columns for each of a selected set of columns; the first holding the min value of the column and the second holding the max value.

df_2 = df.copy()
s_n = df[['col_1', 'col_2']].apply(['max', 'min']).stack()
df_2[s_n.index] = s_n

Here is how this works:

  • The output of apply() when given multiple aggregating functions is a data frame where the columns are the column names of the input data frame and the rows are the names of the functions passed to apply().
  • We use stack() to convert the data frame to a Series with a MultiIndex where level 0 holds the function names and level 1 holds the column names.
  • On assigning df_2[df_n.index] = df_n, two things happen:
    • New columns are created whose names correspond to the default flattening of the MultiIndex and in this case are ('max', 'col_1'), ('max', 'col_2'), ('min', 'col_1'), ('min', 'col_2'). See Output Naming for how to assign the output column names.
    • The values in s_n are replicated to all rows of df_2.

Generic Solution

We present a generic and chainable custom solution for implicit data transformation operations.

def m_assign(df, select_fn, apply_fn):
    df_c = df.copy()
    selected_cols = select_fn(df_c)
    df_n = df_c[selected_cols].apply(apply_fn)
    df_c[df_n.columns] = df_n
    return df_c

df_2 = df\
    .pipe(m_assign,
          lambda x: ['col_1', 'col_2'],
          [round, abs])

Here is how this works:

  • We isolate the implicit data transformation logic in a custom function m_assign() which can be called in a chained manner via pipe().
  • m_assign() expects three inputs:
    • df: a data frame passed to it here via the pipe() method.
    • select_fn: a lambda function that given a data frame, returns the names of the columns to be selected.
    • apply_fn: the function or list of function to be applied to the selected columns.
  • In selected_cols = select_fn(df_c) we apply the lambda function passed to select_fn to the data frame df_c and expect to get back a list of the column names to be transformed.
  • In df_c[selected_cols].apply(apply_fn), we apply the function or list of function passed to apply_fn on the selected columns.
  • In df_c[df_n.columns] = df_n, we assign df_n to df_c[df_n.columns] which
    • overwrites the original columns with the transformed columns if m_assign() gets a single function passed to apply_fn
    • creates as many new columns in df_c as the columns of df_n if m_assign() gets a list of functions passed to apply_fn
  • See Output Naming for an extension to m_assign() that allows us to assign better names to output columns.
PYTHON
I/O