Function Specification

We wish to specify one or more data aggregation functions to apply to each of a selected set of columns without spelling out each data aggregation expression explicitly.

In this section, we cover the following function specification scenarios:

  • Named Function where we cover how to apply a built-in function or a custom function to each of a selected set of columns.
  • Anonymous Function where we cover how to apply a lambda function to each of a selected set of columns.
  • Multiple functions where we cover how to apply each of a set of functions separately to each of a selected set of columns.
  • Multiple Function Sets where we cover how to have different sets of one or more aggregation functions applied to different columns.

This section is complemented by

  • Column Selection where we cover how to select the column(s) to each of which we will implicitly apply data aggregation operations.
  • Output Naming where we cover how to specify the name(s) of output column(s) created by the implicit data aggregation operations.

Named Function

We wish to apply the same named data aggregation function, e.g. sum(), to each of a set of selected columns.

In this example, we have a data frame df and we wish to compute the sum of values of the columns col_2 and col_4 for each group, where the groups are defined by the values of the column col_1.

df_2 = (df
        .groupby('col_1')[['col_2', 'col_4']]
        .agg('sum'))

Here is how this works:

  • The groupby() method groups the rows of the data frame df by the values in the column col_1. The data aggregation operations applied by the subsequent call to agg() are, therefore, applied to each group separately. See Summary Table.
  • agg() applies the function sum() to each of the selected columns, col_2 and col_4, for each group.
  • We can refer to methods that are defined for the Pandas GroupBy object by passing to agg() a string holding their name like we do here for ‘sum’. See Basic Aggregation.
  • The output is a new data frame df_2 with one row for each unique value of the column col_1, and two columns col_2 and col_4 containing the sum of values of that column for the corresponding group of rows.
  • The output columns have the names of the original columns. See Output Naming for how to specify the names of the output columns.

Alternative: Via Pandas Convenience Function

df_2 = df.groupby('col_1')[['col_2', 'col_4']].sum()

Here is how this works:

  • This code performs the same data aggregation operation as the primary solution above. However, we use the Pandas convenience data aggregation method sum() instead of agg().
  • In Pandas, the output of groupby() (a DataFrameGroupBy object) has a set of convenience methods to carry out the most common data aggregation operations which can be used instead of passing a function reference to agg(). Pandas applies the method to each column for each group behind the scenes. One such method is sum() which we use in this example. See Common Aggregation Operations for a coverage of most Pandas convenience data aggregation functions.

Extension: Passing Function Arguments

We wish to apply a data aggregation function that requires one or more parameter arguments to each of a set of selected columns.

df_2 = (df
        .groupby('col_1')[['col_2', 'col_4']]
        .agg('quantile', q=0.9))

Here is how this works:

  • In order to pass arguments to the named function, we can pass those arguments to agg() after the function. agg() will then pass those arguments along to the function.
  • In this example, we pass q=0.9 to agg() which then passes it along to quantile().
  • Alternatively, we can use a lambda lambda function to pass parameters to the named function of interest. In this example, the anonymous function solution would be lambda x: quantile(x, q=0.9). See Anonymous Function below.

Anonymous Function

We wish to apply the same data aggregating lambda function to each of a set of selected columns.

In this example, we have a data frame df and we wish to calculate the ratio of missing values to all values in each column for each group, where the groups are defined by the values of the column col_1.

df_2 = (df
        .groupby('col_1')[['col_2', 'col_4']]
        .agg(lambda x: x.isna().sum() / len(x)))

Here is how this works:

  • The groupby() method groups the rows of the data frame df by the values in the column col_1. The data aggregation operations applied by the subsequent call to agg() are, therefore, applied to each group separately. See Summary Table.
  • agg() applies the lambda function to each of the selected columns (col_2 and col_4) for each group.
  • The lambda function lambda x: x.isna().sum() / len(x) calculates the number of NA values in a column by applying the isna() method to each value in the column and summing the resulting logical vector. It then divides by the number of values of the current column for the current group to obtain the ratio of missing values.
  • The output is a new data frame df_2, with one row for each unique value in the col_1 column, and a column for each selected column, i.e. col_2 and col_4, containing the ratio of missing values for that column for the corresponding group of rows.

Multiple Functions

We wish to perform multiple data aggregation operations to each of a set of selected columns of a data frame individually without having to spell out each data aggregation explicitly.

In this example, we have a data frame df and we wish to count the number of unique values and compute the 90th percentile value of the columns col_2 and col_4 for each group, where the groups are defined by the values of the column col_1.

df_2 = (df
        .groupby('col_1')[['col_2', 'col_4']]
        .agg(['nunique', lambda x: x.quantile(0.9)]))

Here is how this works:

  • The groupby() method groups the rows of the data frame df by the values in the column col_1. The data aggregation operations applied by the subsequent call to agg() are, therefore, applied to each group separately. See Summary Table.
  • In ['col_2', 'col_4'] we select the columns that we wish to aggregate over.
  • We then pass to agg() a list of the aggregating functions or lambda functions that we wish to carry out.
  • The function nunique() returns the number of unique values in the input vector.
  • When passing a list of functions to agg(), one way to pass arguments to one of the functions is to use a lambda function. In this example, the lambda function lambda x: x.quantile(0.9) allows us to calculate the 90th percentile of values of the input vector which requires us to set the argument q to 0.9.
  • The output of calling agg() with a list of functions on a grouped data frame is a data frame with a hierarchical column MultiIndex where:
    • the names of the aggregated columns are the top level of the index which in this case are col_2 and col_4
    • and the aggregation functions are the second level of the index which in this case are nunique and <lambda_0>. In general, columns resulting from lambda functions are called <lambda_0>, <lambda_1>, etc..
  • Note that if we pass a list of 1 aggregated column or a list of 1 aggregation function the output data frame will also have a MultiIndex.
  • The output is a new data frame df_2, with one row for each unique value in the column col_1, and two columns for each of the original columns col_2 and col_4 in the original df. The first of these two new columns will contain the number of unique values in each column, and the second column will contain the 90th percentile of the values in each column for the corresponding group of rows.
  • See Output Naming for how to specify the names of the output columns.

Alternative: Wrapper Function

def quantile_90(x):
    return x.quantile(0.9)

df_2 = (df
        .groupby('col_1')[['col_2', 'col_4']]
        .agg(['nunique', quantile_90]))

Here is how this works:

  • This code performs the same data aggregation operation as the primary solution above. However, we use a wrapper function instead of a lambda function to pass parameters to the quantile() method.
  • Our wrapper function here is a regular function that calls the function quantile() with the desired argument values set; i.e. q=0.9. We then call the wrapper function from inside agg().

Multiple Function Sets

We wish to have different sets of one or more functions applied to different columns.

In this example, we have a data frame df that we wish to summarise over groups defined by the value of the column col_1. We wish to compute: (1) the sum of values of the columns col_2 and col_3 for each group, (2) the mean and standard deviation of values of the column col_4 for each group.

df_2 = (df
  .groupby('col_1')
  .agg({'col_2': 'sum',
                'col_3': 'sum',
        'col_4': ['mean', 'std']}))

Here is how this works:

  • The groupby() method groups the rows of the data frame df by the values in the column col_1. The data aggregation operations applied by the subsequent call to agg() are, therefore, applied to each group separately. See Summary Table.
  • In order to have different sets of functions applied to different columns, we can pass a dictionary to agg() where the keys are the columns names and the values are either a single function or a list of functions. This option to pass a dictionary to agg() allows us to generate extensive summaries with few keystrokes
  • In this example, we pass to agg() a dictionary with three key-value pairs as follows:
    • 'col_2': 'sum' specifies that we wish to compute the sum of values of the column col_2 for each group. Similarly, for 'col_3': 'sum'.
    • 'col_4': ['mean', 'std'] specifies that we wish to compute the mean and standard deviation of values of the column col_4 for each group.
  • The output is a new data frame df_2, with one row for each unique value in the column col_1, and four columns. The first two columns will contain the sum of values in the columns col_2 and col_3 respectively for the corresponding group of rows. The next two columns will mean and standard deviation of values of the column col_4 for the corresponding group of rows.

Alternative: Join Summary Tables

df_a = (df
        .groupby('col_1')[['col_2', 'col_3']]
        .agg('sum'))
df_b = (df
        .groupby('col_1')['col_4']
        .agg(['mean', 'std']))
df_2 = pd.concat([df_a, df_b], axis=1)

Here is how this works:

  • This code performs the same function as the code above but with a different approach. Instead of passing a dictionary to agg(), we generate multiple summary data frames and then join them (column bind them) via concat().
  • We generate two summary data frames as follows:
    • In both aggregation operations, we group by the grouping column of interest col_1.
    • df_a holds the sum of values of the columns col_2 and col_3 for each group.
    • df_b holds the mean and standard deviation of values of the column col_4 for each group.
  • In pd.concat([df_a, df_b], axis=1) we specify that we wish to concatenate the columns of df_b next to the columns of df_a (hence axis=1) matching rows via the index (which is the groups). See Reshaping.~~
PYTHON
I/O