In the implicit data transformation scenarios we covered in Function Specification, the output columns either kept the names of the input columns (overwriting them) or were created as multiple new columns with standardized names. In this section, we cover how to override that default behavior and specify the output column names ourselves, which is often necessary when the names should reflect the domain context.
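In this section we cover how to name the columns resulting from applying each of the following to one or more selected columns: one named function, multiple named functions, one lambda function, multiple lambda functions, one aggregating function, and multiple aggregating functions.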
We then present a generic and chainable custom solution to naming the columns resulting from implicit data transformation operations.
This section is complemented by
We wish to specify the names of the columns resulting from applying one named function to one or more selected columns.
In this example, we wish to apply the function round() to columns 'col_1' and 'col_2' and to name the output columns 'col_1_rnd' and 'col_2_rnd'. The output should be a copy of the original data frame df with the new columns added.
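import pandas as pd

selected_columns = ['col_1', 'col_2']
# Round the selected columns and suffix the resulting column names.
df_n = (df
        .loc[:, selected_columns]
        .apply(round)
        .add_suffix('_rnd'))
# Append the new columns to the original data frame.
df_2 = pd.concat([df, df_n], axis=1)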
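As a quick check of the pattern above, the following sketch runs it on a small, made-up data frame and lists the resulting column names. The values and the extra column 'col_3' are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data; the values and 'col_3' are made up for illustration.
df = pd.DataFrame({'col_1': [1.4, 2.6],
                   'col_2': [3.2, 4.9],
                   'col_3': ['a', 'b']})

selected_columns = ['col_1', 'col_2']
df_n = (df
        .loc[:, selected_columns]
        .apply(round)
        .add_suffix('_rnd'))
df_2 = pd.concat([df, df_n], axis=1)

new_names = list(df_n.columns)   # names of the newly created columns
all_names = list(df_2.columns)   # original columns followed by the new ones
print(new_names)
print(all_names)
```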
Here is how this works:
We use the add_suffix() method of data frames to add the suffix '_rnd' to the name of each of the new columns resulting from apply(). See Renaming.
In df_2 = pd.concat([df, df_n], axis=1), we use pd.concat() to append the new columns df_n to the original data frame df, returning a new data frame df_2. We set axis=1 to append columns, not rows. See Reshaping.

Alternatively:
df_2 = df.copy()
selected_cols = ['col_1', 'col_2']
renamed_cols = ['{col}_rnd'.format(col=col) for col in selected_cols]
df_2[renamed_cols] = df[selected_cols].apply(round)
Here is how this works:
We create the new column names renamed_cols from the original selected column names selected_cols.
We use .format() to insert the names of the original columns in the template '{col}_rnd'. See String Operations.
We assign the output of apply() to df_2[renamed_cols], which has the effect of creating new columns in df_2 with the names given in renamed_cols and assigning the output columns to them.
This relies on the output of apply() having one column for each name in renamed_cols, in the same order, which is true in this case.
We can skip df.copy() should we wish to overwrite the original data frame.

We wish to specify the names of the columns resulting from applying multiple named functions to one or more selected columns.
In this example, we wish to apply the functions abs() to yield the absolute value and round() to yield the rounded value of the columns 'col_1' and 'col_2' and to name the output columns 'col_1_mag', 'col_1_rnd', 'col_2_mag', and 'col_2_rnd'.
df_n = (df
        .loc[:, ['col_1', 'col_2']]
        .apply([abs, round])
        .rename(columns={'round': 'rnd', 'abs': 'mag'},
                level=1))
# Flatten the two-level column labels into '{col}_{fun}' names.
df_n.columns = ['_'.join(col) for col in df_n.columns.values]
df_2 = pd.concat([df, df_n], axis=1)
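To isolate the renaming and flattening steps, here is a minimal sketch. The data frame is built by hand to mimic two-level column labels (input column names on level 0, function names on level 1), and its values are placeholders, so it does not depend on how apply() labels its output.

```python
import pandas as pd

# Hand-built stand-in for two-level column labels; the values are placeholders.
columns = pd.MultiIndex.from_product([['col_1', 'col_2'], ['abs', 'round']])
df_n = pd.DataFrame([[1.5, 2.0, 0.25, 0.0]], columns=columns)

# Rename only the level-1 labels (the function names).
df_n = df_n.rename(columns={'round': 'rnd', 'abs': 'mag'}, level=1)
renamed = list(df_n.columns)

# Flatten each (col, fun) tuple into a single '{col}_{fun}' label.
df_n.columns = ['_'.join(col) for col in df_n.columns.values]
flat = list(df_n.columns)
print(flat)
```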
Here is how this works:
The output of apply([abs, round]) has a MultiIndex of two levels: level 0 is the input column names and level 1 is the applied function names.
We use the rename() method to set the names that we wish to use instead of the function names. See Renaming.
We pass two arguments to rename(): the mapping from function names to the names we wish to use, {'round': 'rnd', 'abs': 'mag'}, and the level of the MultiIndex that the mapping applies to, level=1.
In df_n.columns = ['_'.join(col) for col in df_n.columns.values], we use a list comprehension to flatten the labels of the MultiIndex into labels (names) of the form {col}_{fun}, where {col} is the original column name (level 0) and {fun} is the function name (level 1). This returns: 'col_1_mag', 'col_1_rnd', 'col_2_mag', and 'col_2_rnd'.
In df_2 = pd.concat([df, df_n], axis=1), we use pd.concat() to append the new columns df_n to the original data frame df, returning a new data frame df_2. We set axis=1 to append columns, not rows. See Reshaping.
If the function names are acceptable as they are, we can skip rename(). That would return column names 'col_1_abs', 'col_1_round', 'col_2_abs', and 'col_2_round'.
If we wish to keep the MultiIndex, we can skip the last two steps.

We wish to specify the names of the columns resulting from applying one lambda function to one or more selected columns.
In this example, we wish to apply a lambda function to columns 'col_1' and 'col_2' and to name the output columns 'col_1_rnd' and 'col_2_rnd'. The output should be a copy of the original data frame df with the new columns added.
selected_columns = ['col_1', 'col_2']
df_n = (df
        .loc[:, selected_columns]
        .apply(lambda x: round(x, 2))
        .add_suffix('_rnd'))
df_2 = pd.concat([df, df_n], axis=1)
Here is how this works:
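A minimal sketch of the snippet above on made-up data: apply() runs the lambda function on each selected column, add_suffix() renames the resulting columns, and pd.concat() appends them to the original data frame. The values are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration.
df = pd.DataFrame({'col_1': [1.234, 5.678],
                   'col_2': [9.871, 0.126]})

selected_columns = ['col_1', 'col_2']
df_n = (df
        .loc[:, selected_columns]
        .apply(lambda x: round(x, 2))   # x is one column (a Series)
        .add_suffix('_rnd'))
df_2 = pd.concat([df, df_n], axis=1)
print(df_2)
```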
We wish to specify the names of the columns resulting from applying multiple lambda functions to one or more selected columns.
In this example, we wish to apply two anonymous functions to the columns 'col_1' and 'col_2' and to name the output columns 'col_1_rnd', 'col_1_dlt', 'col_2_rnd', and 'col_2_dlt', where 'rnd' represents the first anonymous function and 'dlt' represents the second.
selected_cols = ['col_1', 'col_2']
df_n = (df
        .loc[:, selected_cols]
        .apply([lambda x: round(x, 2),
                lambda x: x - x.shift(1)]))
transformation_names = ['rnd', 'dlt']
col_names = ['{col}_{fn}'.format(col=col, fn=fn)
             for col in selected_cols
             for fn in transformation_names]
df_n.columns = col_names
df_2 = pd.concat([df, df_n], axis=1)
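The naming step can be checked in isolation with plain Python. The sketch below shows that lambda functions carry no usable name of their own and that the nested comprehension iterates over columns first and functions second. The second lambda is a simplified stand-in and the variable names are assumptions for illustration.

```python
from itertools import product

selected_cols = ['col_1', 'col_2']
transformation_names = ['rnd', 'dlt']

# Stand-in lambda functions; every lambda reports the same default name.
fns = [lambda x: round(x, 2), lambda x: x - 1]
default_names = [fn.__name__ for fn in fns]

# Outer loop over columns, inner loop over functions.
col_names = ['{col}_{fn}'.format(col=col, fn=fn)
             for col in selected_cols
             for fn in transformation_names]

# The same ordering expressed with itertools.product.
col_names_product = ['_'.join(pair)
                     for pair in product(selected_cols, transformation_names)]
print(default_names)
print(col_names)
```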
Here is how this works:
We pass to apply() the two lambda functions that we wish to apply to the selected columns as a list.
The output of apply() is a data frame df_n with four columns and a MultiIndex of two levels: level 0 holds the selected column names 'col_1' and 'col_2', and level 1 holds the function names, which are all '<lambda>' by default (because a lambda function is anonymous, i.e. has no assigned name).
rename(), which we used above when dealing with multiple named functions, does not work here because the labels are not unique.
We use a list comprehension to create names of the form {col}_{fun}, where {col} is the original column name from selected_cols and {fun} is the name we wish to use for a lambda function from transformation_names. This returns: 'col_1_rnd', 'col_1_dlt', 'col_2_rnd', and 'col_2_dlt'.
In df_n.columns = col_names, we overwrite the MultiIndex with a single Index containing the column names we created in col_names.
We must make sure that the order of the lambda functions passed to apply() matches the order of names in transformation_names.

Alternatively:
Should we wish for the output to be a data frame of the newly created columns with a MultiIndex:
selected_cols = ['col_1', 'col_2']
transformation_names = ['rnd', 'dlt']
df_n = (df
        .loc[:, selected_cols]
        .apply([lambda x: round(x, 2),
                lambda x: x - x.shift(1)]))
df_n.columns = pd.MultiIndex.from_product(
    [selected_cols, transformation_names])
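To see what labels pd.MultiIndex.from_product() generates, here is a small sketch. The four placeholder values stand in for the output of apply() and are assumptions for illustration.

```python
import pandas as pd

selected_cols = ['col_1', 'col_2']
transformation_names = ['rnd', 'dlt']

# Placeholder values standing in for the four output columns of apply().
df_n = pd.DataFrame([[1.23, 0.5, 4.57, -0.5]])
df_n.columns = pd.MultiIndex.from_product(
    [selected_cols, transformation_names])

labels = list(df_n.columns)

# Selecting on level 0 returns the columns derived from one input column.
col_1_part = df_n['col_1']
print(labels)
print(list(col_1_part.columns))
```

Keeping the MultiIndex makes it possible to pull out everything derived from one input column with a single key, at the cost of tuple-valued labels.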
Here is how this works:
rename(), which we used above when dealing with multiple named functions, does not work here because the labels are not unique.
We use pd.MultiIndex.from_product() to create a MultiIndex from the selected column names and transformation names.

We wish to name the columns resulting from applying an aggregating function to one or more selected columns.
df_2 = df.copy()
s_n = df[['col_1', 'col_2']].apply('max').add_suffix('_max')
df_2[s_n.index] = s_n
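A sketch on made-up data of what the aggregated Series looks like after add_suffix(). To attach each aggregate as a constant column, this sketch uses assign() with the Series converted to a dictionary; that is a swapped-in alternative chosen for illustration, not the assignment used in the snippet above.

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration.
df = pd.DataFrame({'col_1': [1, 5, 3], 'col_2': [4, 2, 6]})

# One aggregate per selected column, labelled '<column>_max'.
s_n = df[['col_1', 'col_2']].apply('max').add_suffix('_max')
labels = list(s_n.index)

# Each scalar is broadcast to every row of the new column.
df_2 = df.assign(**s_n.to_dict())
print(df_2)
```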
Here is how this works:
We use add_suffix() to add the suffix '_max' to the labels of the Series resulting from df[['col_1', 'col_2']].apply('max').

We wish to name the columns resulting from applying multiple aggregating functions to one or more selected columns.
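df_2 = df.copy()
selected_cols = ['col_1', 'col_2']
# unstack() yields a Series whose MultiIndex is (column name, function name).
s_n = df[selected_cols].apply(['max', 'min']).unstack()
s_n.index = ['_'.join(col) for col in s_n.index.values]
df_2[s_n.index] = s_n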
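The order of the two index levels decides whether the flattened labels read '{col}_{fun}' or '{fun}_{col}'. The sketch below, on made-up data, compares unstack(), which puts the column name on level 0, with stack(), which puts the function name on level 0.

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration.
df = pd.DataFrame({'col_1': [1, 5, 3], 'col_2': [4, 2, 6]})
selected_cols = ['col_1', 'col_2']

# Rows are the function names, columns are the selected columns.
agg = df[selected_cols].apply(['max', 'min'])

by_column = agg.unstack()    # index: (column name, function name)
by_function = agg.stack()    # index: (function name, column name)

col_first = ['_'.join(label) for label in by_column.index.values]
fn_first = ['_'.join(label) for label in by_function.index.values]
print(col_first)
print(fn_first)
```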
Here is how this works:
In s_n.index = ['_'.join(col) for col in s_n.index.values], we use a list comprehension to flatten the labels of the MultiIndex of the Series s_n into labels (names) of the form {col}_{fun}, where {col} is the original column name (level 0) and {fun} is the function name (level 1). This returns: 'col_1_max', 'col_1_min', 'col_2_max', and 'col_2_min'.

We present a generic and chainable custom solution to naming the columns resulting from implicit data transformation operations.
def m_assign(p_df, p_select_fn, p_apply_fn, p_names='{col}_{fn}'):
    df_c = p_df.copy()
    selected_col = p_select_fn(df_c)
    if isinstance(p_apply_fn, dict):
        # Dictionary: keys are the names to use, values are the functions.
        apply_fn_name = list(p_apply_fn.keys())
        apply_fn = list(p_apply_fn.values())
    elif isinstance(p_apply_fn, list):
        apply_fn_name = [fn.__name__ for fn in p_apply_fn]
        apply_fn = p_apply_fn
    else:
        # Single function: the selected columns are overwritten below.
        apply_fn_name = [p_apply_fn.__name__]
        apply_fn = p_apply_fn
    df_n = df_c[selected_col].apply(apply_fn)
    if not callable(p_apply_fn):
        df_n.columns = [p_names.format(col=col, fn=fn)
                        for col in selected_col
                        for fn in apply_fn_name]
    df_c[df_n.columns] = df_n
    return df_c

df_2 = (df
        .pipe(m_assign,
              lambda x: x.columns[x.dtypes == 'float64'],
              {'rnd': round, 'dlt': lambda x: x - x.shift(1)},
              '{fn}_{col}'))
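Here is how this works:
We extend the function m_assign() that we introduced in Function Specification to handle naming the new columns resulting from data transformation operations.
The output of m_assign() is a copy of the input data frame with one or more data transformations carried out as specified by the input parameters.
p_apply_fn can take one of three things: a single function, in which case the selected columns are overwritten; a list of functions, in which case the function names are used in the new column names; or a dictionary whose keys are the names to use and whose values are the functions. For a list or a dictionary, the new column names follow the template in p_names, which defaults to '{col}_{fn}'.
We can pass a different template via the p_names argument, of the form '{col}_{fn}', where {col} refers to the original column names and {fn} refers to the original function names or the names provided as keys of the dictionary.
We use the __name__ attribute of a function object to obtain its name.
We use a list comprehension to fill in the template for each combination of selected column and function name, and assign the result as the column names of the output of apply().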
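The name-resolution logic can be exercised without a data frame. The sketch below uses a hypothetical helper, resolve_names(), that mirrors the three branches of m_assign() and then fills a naming template. The helper and the stand-in lambda function are assumptions for illustration, not part of the solution above.

```python
def resolve_names(p_apply_fn):
    """Hypothetical helper mirroring the name-resolution branches of m_assign()."""
    if isinstance(p_apply_fn, dict):
        # Names come from the dictionary keys.
        return list(p_apply_fn.keys())
    if isinstance(p_apply_fn, list):
        # Names come from each function's __name__ attribute.
        return [fn.__name__ for fn in p_apply_fn]
    return [p_apply_fn.__name__]


from_dict = resolve_names({'rnd': round, 'dlt': lambda x: x - 1})
from_list = resolve_names([abs, round])
from_single = resolve_names(abs)

# Fill a custom template for every (column, function name) combination.
template = '{fn}_{col}'
names = [template.format(col=col, fn=fn)
         for col in ['col_1', 'col_2']
         for fn in from_dict]
print(names)
```

Passing a dictionary is the only one of the three forms that gives lambda functions distinct names, since every lambda reports the same __name__.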