Columns Selection for Implicit Transformation

We wish to identify the columns on each of which we will apply data transformation logic.

We will cover the following scenarios:

All Columns where we cover how to apply a data transformation operation to all columns of a data frame.
Explicit Selection where we cover how to apply a data transformation operation to each of a set of explicitly selected columns of a data frame e.g. by spelling out their names.
Implicit Selection where we cover how to apply a data transformation operation to each of a set of implicitly selected columns of a data frame e.g. by selecting columns whose names contain a certain substring.
Exclude Columns where we cover how to apply a data transformation operation to each column of a data frame except for a set of excluded columns.

In General Solution, we present a generic and chainable custom solution for implicit data transformation operations.

This section is complemented by

Function Specification where we cover how to specify one or more data transformation operations to apply to the selected set of columns
Output Naming where we cover how to specify the name(s) of output column(s) created by the implicit data transformation operations.

All Columns

We wish to apply a data transformation operation to each column of a data frame.

In this example, we have a data frame df of numeric columns, and we wish to round all columns.

df_2 = df.apply(round)

Here is how this works:

In df.apply(round), apply() executes the function round() on each column of the data frame df.
apply() executes the data transformation operation(s) passed to it to each column of the data frame on which it is called one column at a time (since axis=0 by default).
The output columns overwrite the original columns. See Output Naming for how to append new columns instead.
assign(), which we used in Basic Transformation, can only operate in an explicit manner creating one explicitly specified column at a time. Therefore, to execute a data transformation on multiple columns without having to explicitly spell out each transformation, we use apply().

Explicit Selection

We wish to apply a data transformation operation to each of a set of explicitly selected columns.

In this example, we wish to apply the round() function to columns ‘col_1', ‘col_2', and ‘col_4' of a data frame df.

selected_cols = ['col_1', 'col_2', 'col_4']
df_n = df[selected_cols].apply(round)
df_2 = df.copy()
df_2[selected_cols] = df_n

Here is how this works:

In df[selected_cols], we select the columns that we wish to transform by name. See Basic Selection for a coverage of explicit column selection scenarios, all of which can be used to select columns for implicit transformation.
In df[selected_cols].apply(round), we call apply() on the data frame of selected columns while passing to apply() the function round() which apply() executes on each of the selected columns.
In df_2 = df.copy(), we create a deep copy of the data frame df via df.copy().
In df_2[selected_cols] = df_n, we overwrite the original columns with the transformed columns while keeping other columns in df_2 unaffected.
We can skip the last two steps if the desired output is a data frame containing only the transformed columns which in this example is df_n.

We wish to apply a data transformation operation to each of a set of implicitly selected columns. Implicit column selection is when we do not spell out the column names or positions explicitly but rather identify the columns via a property of their name or their data.

In this example, we wish to apply the round() function to each column of the data frame df of a double data type.

selected_cols = df.select_dtypes('float').columns
df_n = df[selected_cols].apply(round)
df_2 = df.copy()
df_2[selected_cols] = df_n

Here is how this works:

In df.select_dtypes('float').columns, we obtain the names of the columns whose data type is float. See Implicit Selection for a coverage of the most common scenarios of implicit column selection including by name pattern, data type, and Criteria satisfied by the column’s data.
The rest of the code works similarly to the “Explicit Selection” scenario described above.

Exclude Columns

We wish to apply a data transformation operation to all but a set of columns.

In this example, we wish to apply the round() function to each column of the data frame df except the columns ‘col_3’ and ‘col_4'.

selected_cols = df.columns.difference(['col_3', 'col_4'])
df_n = df[selected_cols].apply(round)
df_2 = df.copy()
df_2[selected_cols] = df_n

Here is how this works:

In df.columns.difference(['col_3', 'col_4']) we identify the columns we wish to exclude by name. See Exclude Columns for a coverage of column exclusion scenarios, all of which can be used for implicit filtering.
The rest of the code works similarly to the “Explicit Selection” scenario described above.

General Solution

We present a generic and chainable custom solution for implicit data transformation operations.

def m_assign(df, select_fn, apply_fn):
    df_c = df.copy()
    selected_cols = select_fn(df_c)
    df_n = df_c[selected_cols].apply(apply_fn)
    df_c[df_n.columns] = df_n
    return df_c

df_2 = df \
    .pipe(m_assign,
          lambda x: x.select_dtypes('float').columns,
          round)

Here is how this works:

m_assign() carries out a data transformation operation on selected columns and can be called in a chained manner via pipe().
m_assign() expects three inputs:
- df: a data frame passed to it here via the pipe() method.
- select_fn: a lambda function that given a data frame, returns the names of the columns to be selected.
- apply_fn: the function or list of function to be applied to the selected columns.
The column selection lambda function which we pass to the select_fn argument of m_assign(), accepts a data frame and returns the names of the columns whose data types is float.
Each of the scenarios described above can be carried out via m_assign().

Optima.io Reference beta

Columns Selection for Implicit Transformation

All Columns

Explicit Selection

Implicit Selection

Exclude Columns

General Solution