Output Naming

In the implicit data aggregation scenarios we covered in Function Specification, the output columns either had the same names as the input columns or multiple new columns with standardized names were created. In this section, we cover how to override the default behavior and specify output column names. It is often needed to specify output column names that are more appropriate for the domain context.

This section is organized as follows:

  • Named Function where we cover how to specify the names of the columns resulting from implicitly applying one aggregating function to a set of columns.
  • Anonymous Function where we cover how to specify the names of the columns resulting from implicitly applying an anonymous aggregating function to a set of columns.
  • Multiple Functions where we cover how to specify the names of the columns resulting from implicitly applying multiple aggregating functions to a set of columns.

This section is complemented by

  1. Column Selection where we cover how to select the column(s) on each of which we will apply aggregation operations.
  2. Function Specification where we cover how to specify the data aggregation expressions to apply to each of the selected columns.

For a deeper coverage of column naming, see Renaming.

Named Function

We wish to specify the names of the columns resulting from implicitly applying one function to a set of columns instead of the default behavior of using the names of the original columns.

In this example, we wish to compute the sum of values of the columns col_2 and col_4 for each group and to name the output columns col_2_sum and col_4_sum. We are summarizing a data frame df grouped by the values of the column col_1.

df_2 = (df
        .groupby('col_1')[['col_2', 'col_4']]
        .sum()
        .add_suffix('_sum')
        .reset_index())

Here is how this works:

  • The code works as described in Named Aggregation Function with the exception that here we modify the names of the output columns by adding the suffix “_sum”.
  • We use the add_suffix() method of Pandas data frames to add a suffix to the column names of the summary table resulting from applying sum(). See Renaming.
  • The reset_index function is applied to the resulting DataFrame, which removes the 'col_1' column from the index and makes it a regular column. See Grouping.
  • The output is a new data frame df_2 with one row for each unique value of the column col_1, and three columns col_1, col_2_sum and col_4_sum. The later two columns contain the sum of values of the corresponding column and group of rows.

Extension: Naming Template

We wish to name the output columns by applying a template that is a function of the names of the input columns.

In this example, we wish to compute the sum of values of the columns col_2 and col_4 for each group and to name the output columns total_col_2_v2 and total_col_4_v2. We are summarizing a data frame df grouped by the values of the column col_1.

def rename_columns(p_df, p_template):
    p_df.columns = [p_template.format(col=col)
                    for col in p_df.columns]
    return p_df

df_2 = (df
        .groupby('col_1')[['col_2', 'col_4']]
        .sum()
        .pipe(rename_columns, 'total_{col}_v2')
        .reset_index())

Here is how this works:

  • The code works as described in Named Aggregation Function with the exception that here we rename the output columns according to the template 'total_{col}_v2' where {col} stands for the original column name.
  • We define a custom function rename_columns() which accepts a data frame with a single Index and a naming template and then uses the python function format() to rename the columns according to the template. See String Interpolation.
  • We call the custom function rename_columns() in a chained manner via pipe().
  • The output is a new data frame df_2 with one row for each unique value of the column col_1, and three columns col_1, total_col_2_v2 and total_col_4_v2. The later two contain the sum of values of the corresponding column and group of rows.

Anonymous Function

We wish to specify the names of the columns resulting from implicitly applying an anonymous function to a set of columns instead of the default behavior of using the same names as the input columns.

In this example, we have a data frame df and we wish to calculate the ratio of missing values to all values in each column for each group, where the groups are defined by the values of the column col_1. We wish to have the output columns be named with the name of the input column followed by “_na_rate”.

df_2 = (df
        .groupby('col_1')[['col_2', 'col_4']]
        .agg(lambda x: x.isna().sum() / len(x))
        .add_suffix('_na_rate')
        .reset_index())

Here is how this works:

  • The code works as described in Anonymous Aggregation Function with the exception that here we modify the names of the output columns by adding the suffix “_na_rate”.
  • Adding a suffix works as described above under Named Function.
  • The output is a new data frame df_2 with one row for each unique value of the column col_1, and three columns col_1, col_na_rate and col_4_na_rate. The later two columns contain the ratio of missing values for the corresponding column and group of rows.

Extension: Naming Template

We wish to name the output columns by applying a template that is a function of the names of the input columns and the functions applied.

In this example, we have a data frame df and we wish to calculate the ratio of missing values to all values in each column for each group, where the groups are defined by the values of the column col_1. We wish to have the output columns be named according to the template “v2_<col>_na_rate” where “<col>” is the input column name.

def rename_columns(p_df, p_template):
    p_df.columns = [p_template.format(col=col)
                    for col in p_df.columns]
    return p_df

df_2 = (df
        .groupby('col_1')[['col_2', 'col_4']]
        .agg(lambda x: x.isna().sum() / len(x))
        .pipe(rename_columns, 'v2_{col}_na_rate')
        .reset_index())

Here is how this works:

  • The code works as described in Anonymous Aggregation Function with the exception that here we rename the output columns according to the template 'v2_{col}_na_rate' where {col} stands for the original column name.
  • We create a custom function rename_columns() that accepts a data frame and template and uses the python function format() to generate and apply new column names following the template. See String Interpolation.
  • The output is a new data frame df_2 with one row for each unique value of the column col_1, and three columns col_1, v2_col_2_na_rate and v2_col_4_na_rate. The later two columns contain the ratio of missing values for the corresponding column and group of rows.

Multiple Functions

We wish to specify the names of the columns resulting from implicitly applying multiple functions to a set of selected columns instead of the default behavior of creating a MultiIndex where the first level is the column names and the second level is the function names.

In this example, we have a data frame df and we wish to count the number of unique values and compute the sum of values (ignoring NAs) of the columns col_2 and col_4 for each group, where the groups are defined by the values of the column col_1. We wish to have the output columns be named col_2_vals, col_2_90th, col_4_vals, and col_4_90th respectively.

def flatten_index(p_df):
    c_df = p_df.copy()
    c_df.columns = c_df.columns.map('_'.join)
    return c_df

df_2 = (df
        .groupby('col_1')[['col_2', 'col_4']]
        .agg([('vals', 'nunique'),
              ('90th', lambda x: x.quantile(0.9))])
        .pipe(flatten_index)
        .reset_index())

Here is how this works:

  • The code works as described in Multiple Aggregation Functions with the exception that here we rename the output columns.
  • Since we are applying multiple functions, we expect the output of agg() to have a MultiIndex where the first level holds the names of the aggregated columns and the second level holds the names of the aggregation functions.
  • We have the option to pass to agg() a list of tuples of the form ('name to use', 'aggregation function') where the first element is a string specifying the name to use for columns created by the function (the second level of the output’s MultiIndex) and the second element is the aggregation function. For example, in this case one of the tuples we pass to agg is ('vals', 'nunique') which specifies that columns created by applying the nunique() function will be called ‘vals’.
  • We use a custom function flatten_index() to convert the MultiIndex to a single Index.
  • The flatten_index() function uses map() to iterate over each entry in the MultiIndex, which is a tuple holding the values of both levels e.g. ('col_1', 'vals'), and passes it to ‘_’.join() to generate the corresponding underscore separated string e.g. col_1_vals.
  • If it is desired to return a MultiIndex then we can skip the call to flatten_index().
  • The output is a new data frame df_2, with one row for each unique value in the column col_1, and two columns for each of the original columns col_2 and col_4 in the original df in addition to a column col_1. The first of these two new columns will contain the number of unique values in each column, and it’s name will have the suffix _vals. The second column will contain the 90th percentile of the values in each column for the corresponding group of rows, and it’s name will have the suffix _90th.

Alternative: Without Custom Function

df_2 = (df
        .groupby('col_1')[['col_2', 'col_4']]
        .agg([('vals', 'nunique'),
              ('90th', lambda x: x.quantile(0.9))]))

df_2.columns = df_2.columns.map('_'.join)

df_2 = df_2.reset_index()

Here is how this works:

  • This works similarly to the primary solution above except that instead of calling a custom function in a chained manner to flatten the MultiIndex, we do it via a separate commands.
  • We prefer the chained coding style for data manipulation for its code readability benefits.

Alternative: String Replacement

df_2 = (df
        .groupby('col_1')[['col_2', 'col_4']]
        .agg(['nunique', lambda x: x.quantile(0.9)]))

df_2.columns = (df_2.columns.map('_'.join)
                  .str.replace('nunique','vals')
                  .str.replace('<lambda_0>','90th'))

df_2 = df_2.reset_index()

Here is how this works:

  • This code produces the same results as the primary solution above. However, instead of specifying column names to use for each function via passing tuples to agg() and then flattening the Index, we first flatten the Index and then apply string replacement operations on the flattened Index.
  • In df_2.columns.map('_'.join) we flatten the MultiIndex as described in the primary solution above.
  • Column name modification is then done as:
    • In str.replace('nunique','vals') we replace the default nunique with the desired ‘vals’
    • In str.replace('<lambda_0>','90th') we replace the default name applied to columns produced by the first lambda function '<lambda_0>' with the desired '90th'.
    • See String Replacement.

Extension: Uneven Index

df_2 = (df
        .groupby('col_1')[['col_2', 'col_4']]
        .agg([('vals', 'nunique'),
              ('90th', lambda x: x.quantile(0.9))])
        .reset_index())

df_2.columns = (df_2
                .columns
                .map(lambda x: '_'.join(x).rstrip('_')))

Here is how this works:

  • In case some columns like col_1 here have an empty second level in the MultiIndex, the solution above will give them the name col_1_ which is not desirable.
  • To deal with this we add a call to rstrip('_') to remove any ‘_’ character occurring at the right end of a column name.

Extension: Naming Template

We wish to name the output columns by applying a template that is a function of the names of the input columns and the functions applied.

In this example, we have a data frame df and we wish to count the number of unique values and compute the sum of values of the columns col_2 and col_4 for each group, where the groups are defined by the values of the column col_1. We wish to have the output columns be named v2_col_2_vals, v2_col_2_90th, v2_col_4_vals, and v2_col_4_9th respectively.

def rename_columns(p_df, p_template):
    c_df = p_df.copy()
    c_df.columns = [p_template.format(col=col, fn=fn)
                    for (col, fn) in c_df.columns.values]
    return c_df

df_2 = (df
        .groupby('col_1')[['col_2', 'col_4']]
        .agg([('vals', 'nunique'),
              ('90th', lambda x: x.quantile(0.9))])
        .pipe(rename_columns, 'v2_{col}_{fn}')
        .reset_index())

Here is how this works:

  • The code works as described in Multiple Aggregation Functions with the exception that here we rename the output columns according to the template 'v2_{col}_{fn}' where {col} stands for the original column name and {fn} stands for the function name.
  • We pass to agg() a list of tuples to specify the names to use for each function as described in the primary solution above. We can pass a simple list of functions to agg() if we wish to use default function names.
  • We create a custom function rename_columns() that accepts a data frame with a MultiIndex and template and uses the python function format() to generate and apply new column names following the template. See String Interpolation.
  • The output is a new data frame df_2, with one row for each unique value in the column col_1, and two columns for each of the original columns col_2 and col_4 in the original df in addition to a column col_1. The first of these two new columns will contain the number of unique values in each column and the second column will contain the 90th percentile of the values in each column for the corresponding group of rows. The output column names will be v2_col_2_vals, v2_col_2_90th, v2_col_4_vals, and v2_col_4_90th.
PYTHON
I/O