We wish to identify the columns on each of which we will apply data transformation logic.
We will cover the following scenarios:
In General Solution, we present a generic and chainable custom solution for implicit data transformation operations.
This section is complemented by
We wish to apply a data transformation operation to each column of a data frame.
In this example, we have a data frame df
of numeric columns, and we wish to round all columns.
df_2 = df.apply(round)
Here is how this works:
df.apply(round)
, apply()
executes the function round()
on each column of the data frame df
.apply()
executes the data transformation operation(s) passed to it to each column of the data frame on which it is called one column at a time (since axis=0
by default).assign()
, which we used in Basic Transformation, can only operate in an explicit manner creating one explicitly specified column at a time. Therefore, to execute a data transformation on multiple columns without having to explicitly spell out each transformation, we use apply()
.We wish to apply a data transformation operation to each of a set of explicitly selected columns.
In this example, we wish to apply the round()
function to columns ‘col_1'
, ‘col_2'
, and ‘col_4'
of a data frame df
.
selected_cols = ['col_1', 'col_2', 'col_4']
df_n = df[selected_cols].apply(round)
df_2 = df.copy()
df_2[selected_cols] = df_n
Here is how this works:
df[selected_cols]
, we select the columns that we wish to transform by name. See Basic Selection for a coverage of explicit column selection scenarios, all of which can be used to select columns for implicit transformation.df[selected_cols].apply(round)
, we call apply()
on the data frame of selected columns while passing to apply()
the function round()
which apply()
executes on each of the selected columns.df_2 = df.copy()
, we create a deep copy of the data frame df
via df.copy()
.df_2[selected_cols] = df_n
, we overwrite the original columns with the transformed columns while keeping other columns in df_2
unaffected.df_n
.We wish to apply a data transformation operation to each of a set of implicitly selected columns. Implicit column selection is when we do not spell out the column names or positions explicitly but rather identify the columns via a property of their name or their data.
In this example, we wish to apply the round()
function to each column of the data frame df
of a double data type.
selected_cols = df.select_dtypes('float').columns
df_n = df[selected_cols].apply(round)
df_2 = df.copy()
df_2[selected_cols] = df_n
Here is how this works:
df.select_dtypes('float').columns
, we obtain the names of the columns whose data type is float
. See Implicit Selection for a coverage of the most common scenarios of implicit column selection including by name pattern, data type, and Criteria satisfied by the column’s data.We wish to apply a data transformation operation to all but a set of columns.
In this example, we wish to apply the round()
function to each column of the data frame df
except the columns ‘col_3’
and ‘col_4'
.
selected_cols = df.columns.difference(['col_3', 'col_4'])
df_n = df[selected_cols].apply(round)
df_2 = df.copy()
df_2[selected_cols] = df_n
Here is how this works:
df.columns.difference(['col_3', 'col_4'])
we identify the columns we wish to exclude by name. See Exclude Columns for a coverage of column exclusion scenarios, all of which can be used for implicit filtering.We present a generic and chainable custom solution for implicit data transformation operations.
def m_assign(df, select_fn, apply_fn):
df_c = df.copy()
selected_cols = select_fn(df_c)
df_n = df_c[selected_cols].apply(apply_fn)
df_c[df_n.columns] = df_n
return df_c
df_2 = df \
.pipe(m_assign,
lambda x: x.select_dtypes('float').columns,
round)
Here is how this works:
m_assign()
carries out a data transformation operation on selected columns and can be called in a chained manner via pipe()
.m_assign()
expects three inputs:df
: a data frame passed to it here via the pipe()
method.select_fn
: a lambda function that given a data frame, returns the names of the columns to be selected.apply_fn
: the function or list of function to be applied to the selected columns.lambda
function which we pass to the select_fn
argument of m_assign()
, accepts a data frame and returns the names of the columns whose data types is float
.m_assign()
.