Implicit Transformation Output Naming

In the implicit data transformation scenarios covered in Function Specification, the output columns either kept the names of the input columns (overwriting them) or were created as multiple new columns with standardized names. In this section, we cover how to override that default behavior and specify the output column names ourselves, which is often necessary to give columns names that suit the domain context.

Specifically, we cover how to name the columns resulting from applying each of the following to one or more selected columns:

  • One Named Function
  • Multiple Named Functions
  • One Lambda Function
  • Multiple Lambda Functions
  • One Aggregating Function
  • Multiple Aggregating Functions

We then present a generic and chainable custom solution to naming the columns resulting from implicit data transformation operations.

This section is complemented by:

  • Column Selection, where we cover how to select the columns to which the data transformation logic is applied.
  • Function Specification, where we cover how to specify one or more data transformation operations to apply to the selected set of columns.

One Named Function

We wish to specify the names of the columns resulting from applying one named function to one or more selected columns.

In this example, we wish to apply the function round() to columns 'col_1' and 'col_2' and to name the output columns 'col_1_rnd' and 'col_2_rnd'. The output should be a copy of the original data frame df with the new columns added.

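import pandas as pd

selected_columns = ['col_1', 'col_2']
# Round the selected columns, then rename the results with a suffix
df_n = (df
        .loc[:, selected_columns]
        .apply(round)
        .add_suffix('_rnd'))
# Append the renamed columns to the original data frame
df_2 = pd.concat([df, df_n], axis=1)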

Here is how this works:

  • We use the add_suffix() method of data frames to add the suffix ‘_rnd’ to the name of each of the new columns resulting from apply(). See Renaming.
  • In df_2 = pd.concat([df, df_n], axis=1), we use pd.concat() to append the new columns df_n to the original data frame df returning a new data frame df_2. We set axis=1 to append columns not rows. See Reshaping.
  • See “Generic Solution” below for a generic and chainable solution.
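As a quick check of the steps above, the sketch below runs the same chain on a small data frame. The sample values and the extra 'label' column are made up for illustration and are not part of the original example.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame({'col_1': [1.2, 2.7, 3.4],
                   'col_2': [10.4, 20.6, 30.1],
                   'label': ['a', 'b', 'c']})

selected_columns = ['col_1', 'col_2']
df_n = (df
        .loc[:, selected_columns]
        .apply(round)          # round() is called on each selected column
        .add_suffix('_rnd'))   # 'col_1' -> 'col_1_rnd', 'col_2' -> 'col_2_rnd'
df_2 = pd.concat([df, df_n], axis=1)

print(list(df_2.columns))
# ['col_1', 'col_2', 'label', 'col_1_rnd', 'col_2_rnd']
```

The original columns are left untouched, and the unselected 'label' column is carried over as is.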

Alternatively:

df_2 = df.copy()
selected_cols = ['col_1', 'col_2']
renamed_cols = ['{col}_rnd'.format(col=col) for col in selected_cols]
df_2[renamed_cols] = df[selected_cols].apply(round)

Here is how this works:

  • We use a list comprehension to generate the new column names renamed_cols from the original selected column names selected_cols.
  • To generate the new column names, we use the function format() to insert the names of the original columns in the template '{col}_rnd'. See String Operations.
  • We assign the results of the data transformation operation to df_2[renamed_cols], which creates new columns in df_2 with the names given in renamed_cols and assigns the transformed values to them.
  • For this assignment to work, the number of columns being assigned must equal the number of elements in renamed_cols, which is the case here.
  • We can drop df.copy() and assign to df directly should we wish to modify the original data frame.
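The name generation and the length requirement described above can be sketched as follows. The sample frame is hypothetical, the f-string is an equivalent alternative to format() (Python 3.6+), and the explicit length check is an optional guard added here for illustration.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame({'col_1': [1.2, 2.7], 'col_2': [10.4, 20.6]})

selected_cols = ['col_1', 'col_2']
# Same result as '{col}_rnd'.format(col=col)
renamed_cols = [f'{col}_rnd' for col in selected_cols]
print(renamed_cols)  # ['col_1_rnd', 'col_2_rnd']

transformed = df[selected_cols].apply(round)
# Optional guard: one new name per transformed column
assert transformed.shape[1] == len(renamed_cols)

df_2 = df.copy()
df_2[renamed_cols] = transformed
print(list(df_2.columns))
# ['col_1', 'col_2', 'col_1_rnd', 'col_2_rnd']
```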

Multiple Named Functions

We wish to specify the names of the columns resulting from applying multiple named functions to one or more selected columns.

In this example, we wish to apply the functions abs() to yield the absolute value and round() to yield the rounded value of the columns 'col_1' and 'col_2' and to name the output columns 'col_1_mag', 'col_1_rnd', 'col_2_mag', and 'col_2_rnd'.

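df_n = (df
        .loc[:, ['col_1', 'col_2']]
        .apply([abs, round])
        # Replace the function names in level 1 of the column MultiIndex
        .rename(columns={'round': 'rnd', 'abs': 'mag'},
                level=1))
# Flatten (column, function) label pairs into names like 'col_1_mag'
df_n.columns = ['_'.join(col) for col in df_n.columns.values]
df_2 = pd.concat([df, df_n], axis=1)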

Here is how this works:

  • The output of apply([abs, round]) has a MultiIndex of two levels: level 0 is the input column names and level 1 is the applied function names.
  • We use the rename() method to set the names that we wish to use instead of the function names. See Renaming.
  • We pass two arguments to rename():
    • a dictionary mapping current names to desired names {'round': 'rnd', 'abs': 'mag'}.
    • level=1, because the function names we wish to change sit in level 1 of the MultiIndex.
  • In df_n.columns = ['_'.join(col) for col in df_n.columns.values] we use a list comprehension to flatten the labels of the MultiIndex into labels (names) of the form {col}_{fun} where {col} is the original column name (level 0) and {fun} is the function name (level 1). This returns: 'col_1_mag', 'col_1_rnd', 'col_2_mag', and 'col_2_rnd'.
  • In df_2 = pd.concat([df, df_n], axis=1), we use pd.concat() to append the new columns df_n to the original data frame df returning a new data frame df_2. We set axis=1 to append columns not rows. See Reshaping.
  • Three variations to this solution are worth mentioning:
    • Should we wish to use the original function names we can drop the call to rename(). That would return column names 'col_1_abs', 'col_1_round', 'col_2_abs', and 'col_2_round'.
    • Should we not wish to merge the new columns back into the original data frame, and rather return a data frame of the newly created columns with a MultiIndex, we can skip the last two steps.
    • For a chainable solution, see “Generic Solution” below.
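To make the two-level labels and the flattening step concrete, here is a minimal sketch on a made-up data frame. The sample values are assumptions for illustration; the intermediate print shows the (column, function) label pairs before they are renamed and joined.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame({'col_1': [-1.2, 2.7], 'col_2': [10.4, -20.6]})

df_n = df[['col_1', 'col_2']].apply([abs, round])
# Column labels are pairs: (input column name, function name)
print(df_n.columns.tolist())

# Rename only the function-name level, then join each pair with '_'
df_n = df_n.rename(columns={'abs': 'mag', 'round': 'rnd'}, level=1)
df_n.columns = ['_'.join(col) for col in df_n.columns.values]
print(df_n.columns.tolist())

df_2 = pd.concat([df, df_n], axis=1)
```

After flattening, df_n has an ordinary single-level column index, so it can be concatenated with df without producing mixed label types.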

One Lambda Function

We wish to specify the names of the columns resulting from applying one lambda function to one or more selected columns.

In this example, we wish to apply a lambda function that rounds to two decimal places to columns 'col_1' and 'col_2' and to name the output columns 'col_1_rnd' and 'col_2_rnd'. The output should be a copy of the original data frame df with the new columns added.

selected_columns = ['col_1', 'col_2']
df_n = (df
        .loc[:, selected_columns]
        .apply(lambda x: round(x, 2))
        .add_suffix('_rnd'))
df_2 = pd.concat([df, df_n], axis=1)

Here is how this works:

  • This works in the same way as the “One Named Function” solution above.
  • See “Generic Solution” below for a generic and chainable solution.

Multiple Lambda Functions

We wish to specify the names of the columns resulting from applying multiple lambda functions to one or more selected columns.

In this example, we wish to apply two anonymous functions to the columns 'col_1' and 'col_2' and to name the output columns 'col_1_rnd', 'col_1_dlt', 'col_2_rnd', and 'col_2_dlt' where 'rnd' represents the first anonymous function and 'dlt' represents the second.

selected_cols = ['col_1', 'col_2']
df_n = (df
        .loc[:, selected_cols]
        .apply([lambda x: round(x, 2),
                lambda x: x - x.shift(1)]))
transformation_names = ['rnd', 'dlt']
col_names = ['{col}_{fn}'.format(col=col, fn=fn)
             for col in selected_cols
             for fn in transformation_names]
df_n.columns = col_names
df_2 = pd.concat([df, df_n], axis=1)

Here is how this works:

  • We pass to apply() the two lambda functions that we wish to apply to the selected columns as a list.
  • The output of apply() is a data frame df_n with four columns and a MultiIndex of two levels:
    • level 0 has the column names of the input columns ‘col_1’ and ‘col_2’
    • level 1 has the names of the functions which in this case are all set to ‘<lambda>’ by default (because a lambda function is anonymous i.e. has no assigned name).
  • rename(), which we used above when dealing with multiple named functions, does not work here because the labels are not unique.
  • We use a list comprehension with two loops to create new column names of the format {col}_{fun} where {col} is the original column name from selected_cols and {fun} is the name we wish to use for a lambda function from transformation_names. This returns: 'col_1_rnd', 'col_1_dlt', 'col_2_rnd', and 'col_2_dlt'.
  • In df_n.columns = col_names we overwrite the MultiIndex with a single Index containing the column names we created in col_names.
  • We need to make sure that the order of the lambda functions passed to apply() matches the order of the names in transformation_names.
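One way to reduce the risk of the names and the lambda functions getting out of step is to keep them together in a single dictionary. The sketch below is a variation for illustration, not the approach shown above; the dictionary name transformations and the sample values are assumptions.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame({'col_1': [1.0, 2.5, 4.0], 'col_2': [10.0, 7.5, 9.0]})
selected_cols = ['col_1', 'col_2']

# Each output name sits next to the function it describes
transformations = {
    'rnd': lambda x: round(x, 2),
    'dlt': lambda x: x - x.shift(1),
}

df_n = df[selected_cols].apply(list(transformations.values()))
# Dictionaries preserve insertion order, so names and functions stay aligned
df_n.columns = [f'{col}_{fn}'
                for col in selected_cols
                for fn in transformations]
df_2 = pd.concat([df, df_n], axis=1)
print(list(df_2.columns))
```

Adding or reordering a transformation then requires a change in one place only.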

Alternatively:

Should we wish for the output to be a data frame of the newly created columns with a MultiIndex, we can do the following:

selected_cols = ['col_1', 'col_2']
transformation_names = ['rnd', 'dlt']
df_n = (df
        .loc[:, selected_cols]
        .apply([lambda x: round(x, 2),
                lambda x: x - x.shift(1)]))
df_n.columns = pd.MultiIndex.from_product(
    [selected_cols, transformation_names])

Here is how this works:

  • rename(), which we used above when dealing with multiple named functions, does not work here because the labels are not unique.
  • We create and assign a new MultiIndex from the selected column names and transformation names.
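To see what the new MultiIndex looks like and how its two levels can be used for selection, consider this small sketch. The one-row data frame is a hypothetical stand-in for the four-column output of apply().

```python
import pandas as pd

selected_cols = ['col_1', 'col_2']
transformation_names = ['rnd', 'dlt']

# Every column name is paired with every transformation name, in order
columns = pd.MultiIndex.from_product([selected_cols, transformation_names])
print(columns.tolist())
# [('col_1', 'rnd'), ('col_1', 'dlt'), ('col_2', 'rnd'), ('col_2', 'dlt')]

# Hypothetical stand-in for the four-column result of apply()
df_n = pd.DataFrame([[1.0, 0.5, 10.0, 2.0]], columns=columns)

print(df_n['col_1'].columns.tolist())   # ['rnd', 'dlt']
print(df_n[('col_2', 'dlt')].tolist())  # [2.0]
```

As with the flat names, the order of transformation_names has to match the order of the functions passed to apply(), because the labels are assigned by position.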

One Aggregating Function

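We wish to name the columns resulting from applying an aggregating function to one or more selected columns.

In this example, we wish to compute the maximum of each of the columns 'col_1' and 'col_2' and to name the output columns 'col_1_max' and 'col_2_max'. The output should be a copy of the original data frame df with the new columns added.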

df_2 = df.copy()
s_n = df[['col_1', 'col_2']].apply('max').add_suffix('_max')
df_2[s_n.index] = s_n

Here is how this works:

  • See description of “Aggregating Function” under Function Specification.
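  • We use add_suffix() to add the suffix ‘_max’ to the labels of the Series resulting from df[['col_1', 'col_2']].apply('max').
  • In df_2[s_n.index] = s_n, we assign the Series to new columns of df_2 named after its labels, so that each new column holds the corresponding aggregate value.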
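The sketch below shows the renamed Series on a made-up data frame. To attach the values it uses assign() with the Series unpacked as keyword arguments, which broadcasts each scalar explicitly; this is an alternative chosen for this illustration, not the assignment shown above.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame({'col_1': [1.5, 3.0, 2.0], 'col_2': [10.0, 7.5, 9.0]})

# One aggregate per selected column; the Series is labeled by column name
s_n = df[['col_1', 'col_2']].apply('max').add_suffix('_max')
print(s_n.index.tolist())  # ['col_1_max', 'col_2_max']

# Alternative way to attach the values: each scalar becomes a constant column
df_2 = df.assign(**s_n.to_dict())
print(list(df_2.columns))
# ['col_1', 'col_2', 'col_1_max', 'col_2_max']
```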

Multiple Aggregating Functions

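We wish to name the columns resulting from applying multiple aggregating functions to one or more selected columns.

In this example, we wish to compute the maximum and the minimum of each of the columns 'col_1' and 'col_2' and to name the output columns 'col_1_max', 'col_1_min', 'col_2_max', and 'col_2_min'.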

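selected_cols = ['col_1', 'col_2']
df_2 = df.copy()
# unstack() yields a Series labeled by (column name, function name)
s_n = df[selected_cols].apply(['max', 'min']).unstack()
s_n.index = ['_'.join(col) for col in s_n.index.values]
df_2[s_n.index] = s_n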

Here is how this works:

  • See description of “Aggregating Function” under Function Specification.
  • In s_n.index = ['_'.join(col) for col in s_n.index.values] we use a list comprehension to flatten the labels of the MultiIndex of the Series s_n into labels (names) of the form {col}_{fun} where {col} is the original column name (level 0) and {fun} is the function name (level 1). This returns: 'col_1_max', 'col_1_min', 'col_2_max', and 'col_2_min'.
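The following sketch walks through the label flattening on a made-up data frame. It uses unstack() to turn the aggregated frame into a Series whose labels put the column name at level 0 and the function name at level 1, and assign() to attach the values; both the sample data and the use of assign() are assumptions of this illustration.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame({'col_1': [1.5, 3.0, 2.0], 'col_2': [10.0, 7.5, 9.0]})
selected_cols = ['col_1', 'col_2']

# Rows are the function names, columns are the input columns
agg = df[selected_cols].apply(['max', 'min'])

# Series labeled by (column name, function name)
s_n = agg.unstack()
s_n.index = ['_'.join(label) for label in s_n.index.values]
print(s_n.index.tolist())
# ['col_1_max', 'col_1_min', 'col_2_max', 'col_2_min']

# Attach each aggregate as a constant column
df_2 = df.assign(**s_n.to_dict())
```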

Generic Solution

We present a generic and chainable custom solution to naming the columns resulting from implicit data transformation operations.

def m_assign(p_df, p_select_fn, p_apply_fn, p_names='{col}_{fn}'):
    df_c = p_df.copy()
    selected_col = p_select_fn(df_c)
    # Split p_apply_fn into the function names and the functions to apply
    if isinstance(p_apply_fn, dict):
        apply_fn_name = list(p_apply_fn.keys())
        apply_fn = list(p_apply_fn.values())
    elif isinstance(p_apply_fn, list):
        apply_fn_name = [fn.__name__ for fn in p_apply_fn]
        apply_fn = p_apply_fn
    else:
        apply_fn_name = [p_apply_fn.__name__]
        apply_fn = p_apply_fn
    df_n = df_c[selected_col].apply(apply_fn)
    # A single function keeps the input names (overwriting the originals);
    # otherwise build one flat name per (column, function) pair
    if not callable(p_apply_fn):
        df_n.columns = [p_names.format(col=col, fn=fn)
                        for col in selected_col
                        for fn in apply_fn_name]
    df_c[df_n.columns] = df_n
    return df_c

df_2 = df \
    .pipe(m_assign,
          lambda x: x.columns[x.dtypes == 'float64'],
          {'rnd': round, 'dlt': lambda x: x - x.shift(1)},
          '{fn}_{col}')

Here is how this works:

  • We extend the function m_assign() that we introduced in Function Specification to handle renaming the new columns resulting from data transformation operations.
  • The output of m_assign() is a copy of the input data frame with one or more data transformations carried out as specified by the input parameters.
  • The function has been extended so that p_apply_fn can take one of three things:
    • a single function, in which case the transformed columns will overwrite the original columns.
    • a list of functions, in which case the transformed columns will be named following the default template '{col}_{fn}'.
    • a dictionary, in which case the keys are used instead of the function names. This is especially useful for lambda functions.
  • In addition, the function can take an optional p_names argument of the form '{col}_{fn}' where {col} refers to the original column names and {fn} refers to the original function names or the names provided as keys of the dictionary.
  • We use the __name__ attribute of a function object to obtain its name.
  • We use a list comprehension that loops through column names and function names to generate the new column names following the template passed in p_names.
  • All the scenarios in this section can be carried out via this solution.
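The three input forms described above can be exercised as follows. This is a minimal sketch: the function body repeats m_assign() with the dictionary keys and values converted to lists, and the sample data and the helper float_cols are assumptions for illustration.

```python
import pandas as pd


def m_assign(p_df, p_select_fn, p_apply_fn, p_names='{col}_{fn}'):
    df_c = p_df.copy()
    selected_col = p_select_fn(df_c)
    if isinstance(p_apply_fn, dict):
        apply_fn_name = list(p_apply_fn.keys())
        apply_fn = list(p_apply_fn.values())
    elif isinstance(p_apply_fn, list):
        apply_fn_name = [fn.__name__ for fn in p_apply_fn]
        apply_fn = p_apply_fn
    else:
        apply_fn_name = [p_apply_fn.__name__]
        apply_fn = p_apply_fn
    df_n = df_c[selected_col].apply(apply_fn)
    if not callable(p_apply_fn):
        df_n.columns = [p_names.format(col=col, fn=fn)
                        for col in selected_col
                        for fn in apply_fn_name]
    df_c[df_n.columns] = df_n
    return df_c


def float_cols(d):
    # Hypothetical selector: names of the float64 columns
    return d.columns[d.dtypes == 'float64']


# Hypothetical sample data, used only for illustration.
df = pd.DataFrame({'col_1': [1.0, 2.5, 4.0],
                   'col_2': [-10.0, 7.5, -9.0],
                   'label': ['a', 'b', 'c']})

# 1. Single function: the selected columns are overwritten
single = df.pipe(m_assign, float_cols, abs)

# 2. List of functions: default '{col}_{fn}' names from __name__
named = df.pipe(m_assign, float_cols, [abs, round])

# 3. Dictionary: keys name the lambdas, with a custom template
custom = df.pipe(m_assign, float_cols,
                 {'rnd': lambda x: round(x, 2),
                  'dlt': lambda x: x - x.shift(1)},
                 '{fn}_{col}')

for result in (single, named, custom):
    print(list(result.columns))
```

In all three calls the input data frame df is left unchanged, because m_assign() works on a copy.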