We wish to specify one or more data transformation functions to apply to each of the selected columns without spelling out each data transformation expression explicitly.
In this section we cover how to apply each of the following to one or more selected columns:
In General Solution, we present a generic and chainable custom solution for implicit data transformation operations.
This section is complemented by
We wish to transform a set of columns of a data frame by applying the same data transformation function to each column.
In this example, we wish to apply the function round()
to round the values of each of the columns ‘col_1'
, ‘col_2'
, and ‘col_4'
to the nearest integer.
selected_cols = ['col_1', 'col_2', 'col_4']
df_n = df[selected_cols].apply(round)
df_2 = df.copy()
df_2[selected_cols] = df_n
Here is how this works:
assign()
, which we used in Basic Transformation, can create only one column at a time.apply()
on a data frame of the selected columns and pass the desired function to apply()
which we do in df[selected_cols].apply(round)
.df_2 = df.copy()
, we create a deep copy of the data frame df
via df.copy()
.df_2[selected_columns] = df_n
, we overwrite the original columns with the transformed columns while keeping other columns in df_2
unaffected.df_2
which is a copy of the input data frame df
.df_n
.We wish to perform multiple data transformation operations to each of a set of columns of a data frame without having to spell out each data transformation explicitly.
In this example, we wish to apply two data transformations to each of the columns ‘col_1'
and ‘col_2'
of the data frame df
. The two transformations are: round()
to round the values to the nearest integer and abs()
to obtain the absolute value
df_n = df[['col_1', 'col_2']].apply([round, abs])
df_2 = pd.concat([df, df_n], axis=1)
Here is how this works:
apply()
on a data frame of the selected columns and pass to apply()
a list of the functions which we wish to apply which we do in df[['col_1', 'col_2']].apply([round, abs])
.df.loc[:, ['col_1', 'col_2']].apply([round, abs])
returns four columns with a Pandas MultiIndex
where level 0 is the column names and level 1 is the function names.pd.concat([df, df_n], axis=1)
, we column bind (left join) the new columns from df_n
to the original data frame df
. See Reshaping.df_n
is converted from a MultiIndex
to a single Index
with columns labels: ('col_1', 'round')
, (col_1, 'abs')
, ('col_2', 'round')
, ('col_2', 'abs')
. See Output Naming for how to name the output columns.We wish to transform a set of columns of a data frame by applying the same data transformation lambda function to each column.
In this example, we wish to apply a lambda function that applies an arithmatic operation to the values of each of the columns ‘col_1'
, ‘col_2'
, and ‘col_4'
.
selected_cols = ['col_1', 'col_2', 'col_4']
df_n = df[selected_cols].apply(lambda x: x * 22 / 7)
df_2 = df.copy()
df_2[selected_cols] = df_n
Here is how this works:
apply()
on a data frame of the selected columns and pass the desired lambda function to apply()
which we do in df[selected_cols].apply(lambda x: x * 22 / 7)
.We wish to apply multiple data transformation lambda
functions to each of a set of columns of a data frame without having to spell out each data transformation explicitly.
selected_cols = ['col_1', 'col_2']
df_n = df[selected_cols].apply([lambda x: round(x, 2),
lambda x: x - x.shift(1)])
df_2 = pd.concat([df, df_n], axis=1)
Here is how this works:
('col_1', '<lambda>')
, ('col_1', '<lambda>')
, ('col_2', '<lambda>')
, ('col_2', '<lambda>')
. The duplication is because all lambda functions are given the name lambda
by default). See Output Naming for how to deal with this.pd.concat()
allows us to work with columns of duplicate names.df_2[selected_cols] = df_n
won’t work. We would get “ValueError: Columns must be same length as key”
.We wish to add to a data frame columns of a constant value produced by applying one aggregating function to one or more selected columns.
In this example, we wish to add to the data frame df
constant value columns each holding the max value of one of a selected set of columns.
df_2 = df.copy()
s_n = df[['col_1', 'col_2']].apply('max')
df_2[s_n.index] = s_n
Here is how this works:
max()
, apply()
returns a Series
with as many values as the columns (or rows if axis=1
).Series
are the column labels of the input columns.df[['col_1', 'col_2']].apply('max')
is, therefore, a Series
of two elements; the max value of col_1
and the max value of col_2
.df_2[s_n.index]
we select the columns of df_2
whose names correspond to the index of the Series
which selects the columns col_1
and col_2
.df_2[s_n.index] = s_n
, we overwrite the original columns with the values in s_n
. On assignment, the values in s_n
are replicated to all rows of df_2
.We wish to add to a data frame columns of a constant value produced by applying multiple aggregating functions to one or more selected columns.
In this example, we wish to add to the data frame df
two constant value columns for each of a selected set of columns; the first holding the min value of the column and the second holding the max value.
df_2 = df.copy()
s_n = df[['col_1', 'col_2']].apply(['max', 'min']).stack()
df_2[s_n.index] = s_n
Here is how this works:
apply()
when given multiple aggregating functions is a data frame where the columns are the column names of the input data frame and the rows are the names of the functions passed to apply()
.stack()
to convert the data frame to a Series
with a MultiIndex
where level 0 holds the function names and level 1 holds the column names.df_2[df_n.index] = df_n
, two things happen:MultiIndex
and in this case are ('max', 'col_1')
, ('max', 'col_2')
, ('min', 'col_1')
, ('min', 'col_2')
. See Output Naming
for how to assign the output column names.s_n
are replicated to all rows of df_2
.We present a generic and chainable custom solution for implicit data transformation operations.
def m_assign(df, select_fn, apply_fn):
df_c = df.copy()
selected_cols = select_fn(df_c)
df_n = df_c[selected_cols].apply(apply_fn)
df_c[df_n.columns] = df_n
return df_c
df_2 = df\
.pipe(m_assign,
lambda x: ['col_1', 'col_2'],
[round, abs])
Here is how this works:
m_assign()
which can be called in a chained manner via pipe()
.m_assign()
expects three inputs:df
: a data frame passed to it here via the pipe()
method.select_fn
: a lambda function that given a data frame, returns the names of the columns to be selected.apply_fn
: the function or list of function to be applied to the selected columns.selected_cols = select_fn(df_c)
we apply the lambda
function passed to select_fn
to the data frame df_c
and expect to get back a list of the column names to be transformed.df_c[selected_cols].apply(apply_fn)
, we apply the function or list of function passed to apply_fn
on the selected columns.df_c[df_n.columns] = df_n
, we assign df_n
to df_c[df_n.columns]
whichm_assign()
gets a single function passed to apply_fn
df_c
as the columns of df_n
if m_assign()
gets a list of functions passed to apply_fn
m_assign()
that allows us to assign better names to output columns.