Columns Selection for Implicit Transformation

We wish to identify the columns on each of which we will apply data transformation logic.

We will cover the following scenarios:

  • All Columns where we cover how to apply a data transformation operation to all columns of a data frame.
  • Explicit Selection where we cover how to apply a data transformation operation to each of a set of explicitly selected columns of a data frame e.g. by spelling out their names.
  • Implicit Selection where we cover how to apply a data transformation operation to each of a set of implicitly selected columns of a data frame e.g. by selecting columns whose names contain a certain substring.
  • Exclude Columns where we cover how to apply a data transformation operation to each column of a data frame except for a set of excluded columns.

In General Solution, we present a generic and chainable custom solution for implicit data transformation operations.

This section is complemented by

  • Function Specification where we cover how to specify one or more data transformation operations to apply to the selected set of columns
  • Output Naming where we cover how to specify the name(s) of output column(s) created by the implicit data transformation operations.

All Columns

We wish to apply a data transformation operation to each column of a data frame.

In this example, we have a data frame df of numeric columns, and we wish to round all columns.

df_2 = df.apply(round)

Here is how this works:

  • In df.apply(round), apply() executes the function round() on each column of the data frame df.
  • apply() executes the data transformation operation(s) passed to it to each column of the data frame on which it is called one column at a time (since axis=0 by default).
  • The output columns overwrite the original columns. See Output Naming for how to append new columns instead.
  • assign(), which we used in Basic Transformation, can only operate in an explicit manner creating one explicitly specified column at a time. Therefore, to execute a data transformation on multiple columns without having to explicitly spell out each transformation, we use apply().

Explicit Selection

We wish to apply a data transformation operation to each of a set of explicitly selected columns.

In this example, we wish to apply the round() function to columns ‘col_1', ‘col_2', and ‘col_4' of a data frame df.

selected_cols = ['col_1', 'col_2', 'col_4']
df_n = df[selected_cols].apply(round)
df_2 = df.copy()
df_2[selected_cols] = df_n

Here is how this works:

  • In df[selected_cols], we select the columns that we wish to transform by name. See Basic Selection for a coverage of explicit column selection scenarios, all of which can be used to select columns for implicit transformation.
  • In df[selected_cols].apply(round), we call apply() on the data frame of selected columns while passing to apply() the function round() which apply() executes on each of the selected columns.
  • In df_2 = df.copy(), we create a deep copy of the data frame df via df.copy().
  • In df_2[selected_cols] = df_n, we overwrite the original columns with the transformed columns while keeping other columns in df_2 unaffected.
  • We can skip the last two steps if the desired output is a data frame containing only the transformed columns which in this example is df_n.

Implicit Selection

We wish to apply a data transformation operation to each of a set of implicitly selected columns. Implicit column selection is when we do not spell out the column names or positions explicitly but rather identify the columns via a property of their name or their data.

In this example, we wish to apply the round() function to each column of the data frame df of a double data type.

selected_cols = df.select_dtypes('float').columns
df_n = df[selected_cols].apply(round)
df_2 = df.copy()
df_2[selected_cols] = df_n

Here is how this works:

  • In df.select_dtypes('float').columns, we obtain the names of the columns whose data type is float. See Implicit Selection for a coverage of the most common scenarios of implicit column selection including by name pattern, data type, and Criteria satisfied by the column’s data.
  • The rest of the code works similarly to the “Explicit Selection” scenario described above.

Exclude Columns

We wish to apply a data transformation operation to all but a set of columns.

In this example, we wish to apply the round() function to each column of the data frame df except the columns ‘col_3’ and ‘col_4'.

selected_cols = df.columns.difference(['col_3', 'col_4'])
df_n = df[selected_cols].apply(round)
df_2 = df.copy()
df_2[selected_cols] = df_n

Here is how this works:

  • In df.columns.difference(['col_3', 'col_4']) we identify the columns we wish to exclude by name. See Exclude Columns for a coverage of column exclusion scenarios, all of which can be used for implicit filtering.
  • The rest of the code works similarly to the “Explicit Selection” scenario described above.

General Solution

We present a generic and chainable custom solution for implicit data transformation operations.

def m_assign(df, select_fn, apply_fn):
    df_c = df.copy()
    selected_cols = select_fn(df_c)
    df_n = df_c[selected_cols].apply(apply_fn)
    df_c[df_n.columns] = df_n
    return df_c

df_2 = df \
    .pipe(m_assign,
          lambda x: x.select_dtypes('float').columns,
          round)

Here is how this works:

  • m_assign() carries out a data transformation operation on selected columns and can be called in a chained manner via pipe().
  • m_assign() expects three inputs:
    • df: a data frame passed to it here via the pipe() method.
    • select_fn: a lambda function that given a data frame, returns the names of the columns to be selected.
    • apply_fn: the function or list of function to be applied to the selected columns.
  • The column selection lambda function which we pass to the select_fn argument of m_assign(), accepts a data frame and returns the names of the columns whose data types is float.
  • Each of the scenarios described above can be carried out via m_assign().
PYTHON
I/O