In the implicit data aggregation scenarios we covered in Function Specification, the output columns either had the same names as the input columns or multiple new columns with standardized names were created. In this section, we cover how to override the default behavior and specify output column names. It is often needed to specify output column names that are more appropriate for the domain context.
This section is organized as follows:
This section is complemented by
For a deeper coverage of column naming, see Renaming.
We wish to specify the names of the columns resulting from implicitly applying one function to a set of columns instead of the default behavior of using the names of the original columns.
In this example, we wish to compute the sum of values of the columns col_2
and col_4
for each group and to name the output columns col_2_sum
and col_4_sum
. We are summarizing a data frame df
grouped by the values of the column col_1
.
df_2 = (df
.groupby('col_1')[['col_2', 'col_4']]
.sum()
.add_suffix('_sum')
.reset_index())
Here is how this works:
“_sum”
.add_suffix()
method of Pandas data frames to add a suffix to the column names of the summary table resulting from applying sum()
. See Renaming.reset_index
function is applied to the resulting DataFrame, which removes the 'col_1' column from the index and makes it a regular column. See Grouping.df_2
with one row for each unique value of the column col_1
, and three columns col_1
, col_2_sum
and col_4_sum
. The later two columns contain the sum of values of the corresponding column and group of rows.Extension: Naming Template
We wish to name the output columns by applying a template that is a function of the names of the input columns.
In this example, we wish to compute the sum of values of the columns col_2
and col_4
for each group and to name the output columns total_col_2_v2
and total_col_4_v2
. We are summarizing a data frame df
grouped by the values of the column col_1
.
def rename_columns(p_df, p_template):
p_df.columns = [p_template.format(col=col)
for col in p_df.columns]
return p_df
df_2 = (df
.groupby('col_1')[['col_2', 'col_4']]
.sum()
.pipe(rename_columns, 'total_{col}_v2')
.reset_index())
Here is how this works:
'total_{col}_v2'
where {col}
stands for the original column name.rename_columns()
which accepts a data frame with a single Index
and a naming template and then uses the python function format()
to rename the columns according to the template. See String Interpolation.rename_columns()
in a chained manner via pipe()
.df_2
with one row for each unique value of the column col_1
, and three columns col_1
, total_col_2_v2
and total_col_4_v2
. The later two contain the sum of values of the corresponding column and group of rows.We wish to specify the names of the columns resulting from implicitly applying an anonymous function to a set of columns instead of the default behavior of using the same names as the input columns.
In this example, we have a data frame df
and we wish to calculate the ratio of missing values to all values in each column for each group, where the groups are defined by the values of the column col_1
. We wish to have the output columns be named with the name of the input column followed by “_na_rate”
.
df_2 = (df
.groupby('col_1')[['col_2', 'col_4']]
.agg(lambda x: x.isna().sum() / len(x))
.add_suffix('_na_rate')
.reset_index())
Here is how this works:
“_na_rate”
.df_2
with one row for each unique value of the column col_1
, and three columns col_1
, col_na_rate
and col_4_na_rate
. The later two columns contain the ratio of missing values for the corresponding column and group of rows.Extension: Naming Template
We wish to name the output columns by applying a template that is a function of the names of the input columns and the functions applied.
In this example, we have a data frame df
and we wish to calculate the ratio of missing values to all values in each column for each group, where the groups are defined by the values of the column col_1
. We wish to have the output columns be named according to the template “v2_<col>_na_rate”
where “<col>”
is the input column name.
def rename_columns(p_df, p_template):
p_df.columns = [p_template.format(col=col)
for col in p_df.columns]
return p_df
df_2 = (df
.groupby('col_1')[['col_2', 'col_4']]
.agg(lambda x: x.isna().sum() / len(x))
.pipe(rename_columns, 'v2_{col}_na_rate')
.reset_index())
Here is how this works:
'v2_{col}_na_rate'
where {col}
stands for the original column name.df_2
with one row for each unique value of the column col_1
, and three columns col_1
, v2_col_2_na_rate
and v2_col_4_na_rate
. The later two columns contain the ratio of missing values for the corresponding column and group of rows.We wish to specify the names of the columns resulting from implicitly applying multiple functions to a set of selected columns instead of the default behavior of creating a MultiIndex where the first level is the column names and the second level is the function names.
In this example, we have a data frame df
and we wish to count the number of unique values and compute the sum of values (ignoring NAs) of the columns col_2
and col_4
for each group, where the groups are defined by the values of the column col_1
. We wish to have the output columns be named col_2_vals
, col_2_90th
, col_4_vals
, and col_4_90th
respectively.
def flatten_index(p_df):
c_df = p_df.copy()
c_df.columns = c_df.columns.map('_'.join)
return c_df
df_2 = (df
.groupby('col_1')[['col_2', 'col_4']]
.agg([('vals', 'nunique'),
('90th', lambda x: x.quantile(0.9))])
.pipe(flatten_index)
.reset_index())
Here is how this works:
agg()
to have a MultiIndex
where the first level holds the names of the aggregated columns and the second level holds the names of the aggregation functions.agg()
a list of tuples of the form ('name to use', 'aggregation function')
where the first element is a string specifying the name to use for columns created by the function (the second level of the output’s MultiIndex
) and the second element is the aggregation function. For example, in this case one of the tuples we pass to agg is ('vals', 'nunique')
which specifies that columns created by applying the nunique()
function will be called ‘vals’
.flatten_index()
to convert the MultiIndex
to a single Index
.flatten_index()
function uses map()
to iterate over each entry in the MultiIndex
, which is a tuple holding the values of both levels e.g. ('col_1', 'vals')
, and passes it to ‘_’.join()
to generate the corresponding underscore separated string e.g. col_1_vals
.flatten_index()
.df_2
, with one row for each unique value in the column col_1
, and two columns for each of the original columns col_2
and col_4
in the original df
in addition to a column col_1
. The first of these two new columns will contain the number of unique values in each column, and it’s name will have the suffix _vals
. The second column will contain the 90th percentile of the values in each column for the corresponding group of rows, and it’s name will have the suffix _90th
.Alternative: Without Custom Function
df_2 = (df
.groupby('col_1')[['col_2', 'col_4']]
.agg([('vals', 'nunique'),
('90th', lambda x: x.quantile(0.9))]))
df_2.columns = df_2.columns.map('_'.join)
df_2 = df_2.reset_index()
Here is how this works:
MultiIndex
, we do it via a separate commands.Alternative: String Replacement
df_2 = (df
.groupby('col_1')[['col_2', 'col_4']]
.agg(['nunique', lambda x: x.quantile(0.9)]))
df_2.columns = (df_2.columns.map('_'.join)
.str.replace('nunique','vals')
.str.replace('<lambda_0>','90th'))
df_2 = df_2.reset_index()
Here is how this works:
agg()
and then flattening the Index, we first flatten the Index and then apply string replacement operations on the flattened Index.df_2.columns.map('_'.join)
we flatten the MultiIndex as described in the primary solution above.str.replace('nunique','vals')
we replace the default nunique
with the desired ‘vals’
str.replace('<lambda_0>','90th')
we replace the default name applied to columns produced by the first lambda function '<lambda_0>'
with the desired '90th'
.Extension: Uneven Index
df_2 = (df
.groupby('col_1')[['col_2', 'col_4']]
.agg([('vals', 'nunique'),
('90th', lambda x: x.quantile(0.9))])
.reset_index())
df_2.columns = (df_2
.columns
.map(lambda x: '_'.join(x).rstrip('_')))
Here is how this works:
col_1
here have an empty second level in the MultiIndex
, the solution above will give them the name col_1_
which is not desirable.rstrip('_')
to remove any ‘_’
character occurring at the right end of a column name.Extension: Naming Template
We wish to name the output columns by applying a template that is a function of the names of the input columns and the functions applied.
In this example, we have a data frame df
and we wish to count the number of unique values and compute the sum of values of the columns col_2
and col_4
for each group, where the groups are defined by the values of the column col_1
. We wish to have the output columns be named v2_col_2_vals
, v2_col_2_90th
, v2_col_4_vals
, and v2_col_4_9th
respectively.
def rename_columns(p_df, p_template):
c_df = p_df.copy()
c_df.columns = [p_template.format(col=col, fn=fn)
for (col, fn) in c_df.columns.values]
return c_df
df_2 = (df
.groupby('col_1')[['col_2', 'col_4']]
.agg([('vals', 'nunique'),
('90th', lambda x: x.quantile(0.9))])
.pipe(rename_columns, 'v2_{col}_{fn}')
.reset_index())
Here is how this works:
'v2_{col}_{fn}'
where {col}
stands for the original column name and {fn}
stands for the function name.agg()
a list of tuples to specify the names to use for each function as described in the primary solution above. We can pass a simple list of functions to agg()
if we wish to use default function names.rename_columns()
that accepts a data frame with a MultiIndex
and template and uses the python function format()
to generate and apply new column names following the template. See String Interpolation.df_2
, with one row for each unique value in the column col_1
, and two columns for each of the original columns col_2
and col_4
in the original df
in addition to a column col_1
. The first of these two new columns will contain the number of unique values in each column and the second column will contain the 90th percentile of the values in each column for the corresponding group of rows. The output column names will be v2_col_2_vals
, v2_col_2_90th
, v2_col_4_vals
, and v2_col_4_90th
.