We wish to specify one or more data aggregation functions to apply to each of a selected set of columns without spelling out each data aggregation expression explicitly.

In this section, we cover the following function specification scenarios:

- **Named Function**, where we cover how to apply a built-in function or a custom function to each of a selected set of columns.
- **Anonymous Function**, where we cover how to apply a `lambda` function to each of a selected set of columns.
- **Multiple Functions**, where we cover how to apply each of a set of functions separately to each of a selected set of columns.
- **Multiple Function Sets**, where we cover how to have different sets of one or more aggregation functions applied to different columns.

This section is complemented by:

- **Column Selection**, where we cover how to select the column(s) to each of which we will implicitly apply data aggregation operations.
- **Output Naming**, where we cover how to specify the name(s) of output column(s) created by the implicit data aggregation operations.

We wish to apply the same named data aggregation function, e.g. `sum()`, to each of a set of selected columns.

In this example, we have a data frame `df` and we wish to compute the sum of values of the columns `col_2` and `col_4` for each group, where the groups are defined by the values of the column `col_1`.

```
df_2 = (df
.groupby('col_1')[['col_2', 'col_4']]
.agg('sum'))
```

Here is how this works:

- The `groupby()` method groups the rows of the data frame `df` by the values in the column `col_1`. The data aggregation operations applied by the subsequent call to `agg()` are, therefore, applied to each group separately. See Summary Table.
- `agg()` applies the function `sum()` to each of the selected columns, `col_2` and `col_4`, for each group.
- We can refer to methods that are defined for the Pandas `GroupBy` object by passing to `agg()` a string holding their name, as we do here for `'sum'`. See Basic Aggregation.
- The output is a new data frame `df_2` with one row for each unique value of the column `col_1`, and two columns `col_2` and `col_4` containing the sum of values of that column for the corresponding group of rows.
- The output columns have the names of the original columns. See Output Naming for how to specify the names of the output columns.
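To make this concrete, here is a minimal runnable sketch. The column names match the example; the data values are hypothetical, made up for illustration:

```
import pandas as pd

# Hypothetical sample data (values are for illustration only)
df = pd.DataFrame({
    'col_1': ['a', 'a', 'b'],
    'col_2': [1, 2, 3],
    'col_4': [10, 20, 30],
})

df_2 = (df
.groupby('col_1')[['col_2', 'col_4']]
.agg('sum'))

# One row per group: 'a' -> (3, 30), 'b' -> (3, 30)
print(df_2)
```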

*Alternative: Via Pandas Convenience Function*

```
df_2 = df.groupby('col_1')[['col_2', 'col_4']].sum()
```

Here is how this works:

- This code performs the same data aggregation operation as the primary solution above. However, we use the Pandas convenience data aggregation method `sum()` instead of `agg()`.
- In Pandas, the output of `groupby()` (a `DataFrameGroupBy` object) has a set of convenience methods that carry out the most common data aggregation operations and can be used instead of passing a function reference to `agg()`. Pandas applies the method to each column for each group behind the scenes. One such method is `sum()`, which we use in this example. See Common Aggregation Operations for a coverage of most Pandas convenience data aggregation functions.
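As a quick check, the convenience method and the `agg()` form produce identical results. A sketch on hypothetical sample data:

```
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    'col_1': ['a', 'a', 'b'],
    'col_2': [1, 2, 3],
    'col_4': [10, 20, 30],
})

via_agg = df.groupby('col_1')[['col_2', 'col_4']].agg('sum')
via_method = df.groupby('col_1')[['col_2', 'col_4']].sum()

print(via_agg.equals(via_method))  # True
```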

*Extension: Passing Function Arguments*

We wish to apply a data aggregation function that requires one or more parameter arguments to each of a set of selected columns.

```
df_2 = (df
.groupby('col_1')[['col_2', 'col_4']]
.agg('quantile', q=0.9))
```

Here is how this works:

- In order to pass arguments to the named function, we can pass those arguments to `agg()` after the function name. `agg()` will then pass those arguments along to the function.
- In this example, we pass `q=0.9` to `agg()`, which then passes it along to `quantile()`.
- Alternatively, we can use a `lambda` function to pass parameters to the named function of interest. In this example, the anonymous function solution would be `lambda x: x.quantile(q=0.9)`. See Anonymous Function below.
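The two formulations produce the same result. A sketch on hypothetical sample data:

```
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    'col_1': ['a', 'a', 'b', 'b'],
    'col_2': [1.0, 2.0, 3.0, 4.0],
    'col_4': [10.0, 20.0, 30.0, 40.0],
})

# Passing the argument through agg()
df_2a = (df
.groupby('col_1')[['col_2', 'col_4']]
.agg('quantile', q=0.9))

# Equivalent lambda formulation
df_2b = (df
.groupby('col_1')[['col_2', 'col_4']]
.agg(lambda x: x.quantile(q=0.9)))

# For group 'a', the 90th percentile of [1.0, 2.0] is 1.9 (linear interpolation)
print(df_2a.equals(df_2b))  # True
```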

We wish to apply the same data aggregating lambda function to each of a set of selected columns.

In this example, we have a data frame `df` and we wish to calculate the ratio of missing values to all values in each column for each group, where the groups are defined by the values of the column `col_1`.

```
df_2 = (df
.groupby('col_1')[['col_2', 'col_4']]
.agg(lambda x: x.isna().sum() / len(x)))
```

Here is how this works:

- The `groupby()` method groups the rows of the data frame `df` by the values in the column `col_1`. The data aggregation operations applied by the subsequent call to `agg()` are, therefore, applied to each group separately. See Summary Table.
- `agg()` applies the `lambda` function to each of the selected columns (`col_2` and `col_4`) for each group.
- The lambda function `lambda x: x.isna().sum() / len(x)` calculates the number of `NA` values in a column by applying the `isna()` method to each value in the column and summing the resulting logical vector. It then divides by the number of values of the current column for the current group to obtain the ratio of missing values.
- The output is a new data frame `df_2`, with one row for each unique value in the `col_1` column, and a column for each selected column, i.e. `col_2` and `col_4`, containing the ratio of missing values for that column for the corresponding group of rows.
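Here is a runnable sketch of the missing-value ratio on hypothetical sample data containing a few `NA` values:

```
import pandas as pd
import numpy as np

# Hypothetical sample data with some missing values
df = pd.DataFrame({
    'col_1': ['a', 'a', 'b', 'b'],
    'col_2': [1.0, np.nan, 3.0, 4.0],
    'col_4': [np.nan, np.nan, 7.0, 8.0],
})

df_2 = (df
.groupby('col_1')[['col_2', 'col_4']]
.agg(lambda x: x.isna().sum() / len(x)))

# Group 'a': col_2 has 1 of 2 values missing (0.5), col_4 has 2 of 2 (1.0)
print(df_2)
```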

We wish to apply multiple data aggregation operations to each of a set of selected columns of a data frame without having to spell out each data aggregation explicitly.

In this example, we have a data frame `df` and we wish to count the number of unique values and compute the 90th percentile value of the columns `col_2` and `col_4` for each group, where the groups are defined by the values of the column `col_1`.

```
df_2 = (df
.groupby('col_1')[['col_2', 'col_4']]
.agg(['nunique', lambda x: x.quantile(0.9)]))
```

Here is how this works:

- The `groupby()` method groups the rows of the data frame `df` by the values in the column `col_1`. The data aggregation operations applied by the subsequent call to `agg()` are, therefore, applied to each group separately. See Summary Table.
- In `['col_2', 'col_4']` we select the columns that we wish to aggregate over.
- We then pass to `agg()` a list of the aggregating functions or lambda functions that we wish to carry out.
- The function `nunique()` returns the number of unique values in the input vector.
- When passing a list of functions to `agg()`, one way to pass arguments to one of the functions is to use a `lambda` function. In this example, the lambda function `lambda x: x.quantile(0.9)` allows us to calculate the 90th percentile of values of the input vector, which requires us to set the argument `q` to `0.9`.
- The output of calling `agg()` with a list of functions on a grouped data frame is a data frame with a hierarchical column `MultiIndex` where:
  - the names of the aggregated columns are the top level of the index, which in this case are `col_2` and `col_4`;
  - the aggregation functions are the second level of the index, which in this case are `nunique` and `<lambda_0>`. In general, columns resulting from lambda functions are named `<lambda_0>`, `<lambda_1>`, etc.
- Note that even if we pass a list of one aggregated column or a list of one aggregation function, the output data frame will still have a `MultiIndex`.
- The output is a new data frame `df_2`, with one row for each unique value in the column `col_1`, and two columns for each of the original columns `col_2` and `col_4` in the original `df`. The first of these two new columns will contain the number of unique values in each column, and the second will contain the 90th percentile of the values in each column for the corresponding group of rows.
- See Output Naming for how to specify the names of the output columns.
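The `MultiIndex` structure can be inspected directly. A sketch on hypothetical sample data:

```
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    'col_1': ['a', 'a', 'b'],
    'col_2': [1.0, 1.0, 3.0],
    'col_4': [10.0, 20.0, 30.0],
})

df_2 = (df
.groupby('col_1')[['col_2', 'col_4']]
.agg(['nunique', lambda x: x.quantile(0.9)]))

# The columns form a two-level MultiIndex:
# level 0 holds the column names, level 1 the function names
print(isinstance(df_2.columns, pd.MultiIndex))  # True
print(df_2.columns.get_level_values(0).unique().tolist())  # ['col_2', 'col_4']
```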

*Alternative: Wrapper Function*

```
def quantile_90(x):
    return x.quantile(0.9)

df_2 = (df
.groupby('col_1')[['col_2', 'col_4']]
.agg(['nunique', quantile_90]))
```

Here is how this works:

- This code performs the same data aggregation operation as the primary solution above. However, we use a wrapper function instead of a lambda function to pass parameters to the `quantile()` method.
- Our wrapper function here is a regular function that calls the method `quantile()` with the desired argument values set, i.e. `q=0.9`. We then pass the wrapper function to `agg()`.
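A side benefit of the wrapper approach, sketched below on hypothetical sample data: the wrapper's name becomes the second-level column label, instead of the opaque `<lambda_0>`:

```
import pandas as pd

def quantile_90(x):
    return x.quantile(0.9)

# Hypothetical sample data
df = pd.DataFrame({
    'col_1': ['a', 'a', 'b'],
    'col_2': [1.0, 2.0, 3.0],
    'col_4': [10.0, 20.0, 30.0],
})

df_2 = (df
.groupby('col_1')[['col_2', 'col_4']]
.agg(['nunique', quantile_90]))

# The function's name appears as the second-level column label
print('quantile_90' in df_2.columns.get_level_values(1))  # True
```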

We wish to have different sets of one or more functions applied to different columns.

In this example, we have a data frame `df` that we wish to summarise over groups defined by the value of the column `col_1`. We wish to compute: (1) the sum of values of the columns `col_2` and `col_3` for each group, and (2) the mean and standard deviation of values of the column `col_4` for each group.

```
df_2 = (df
.groupby('col_1')
.agg({'col_2': 'sum',
'col_3': 'sum',
'col_4': ['mean', 'std']}))
```

Here is how this works:

- The `groupby()` method groups the rows of the data frame `df` by the values in the column `col_1`. The data aggregation operations applied by the subsequent call to `agg()` are, therefore, applied to each group separately. See Summary Table.
- In order to have different sets of functions applied to different columns, we can pass a dictionary to `agg()` where the keys are the column names and the values are either a single function or a list of functions. This option to pass a dictionary to `agg()` allows us to generate extensive summaries with few keystrokes.
- In this example, we pass to `agg()` a dictionary with three key-value pairs, as follows:
  - `'col_2': 'sum'` specifies that we wish to compute the sum of values of the column `col_2` for each group. Similarly for `'col_3': 'sum'`.
  - `'col_4': ['mean', 'std']` specifies that we wish to compute the mean and standard deviation of values of the column `col_4` for each group.
- The output is a new data frame `df_2`, with one row for each unique value in the column `col_1`, and four columns. The first two columns will contain the sum of values in the columns `col_2` and `col_3` respectively for the corresponding group of rows. The next two columns will contain the mean and standard deviation of values of the column `col_4` for the corresponding group of rows.
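Here is a runnable sketch of the dictionary form on hypothetical sample data; note that because one of the dictionary values is a list, the output columns form a `MultiIndex`:

```
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    'col_1': ['a', 'a', 'b'],
    'col_2': [1, 2, 3],
    'col_3': [4, 5, 6],
    'col_4': [10.0, 20.0, 30.0],
})

df_2 = (df
.groupby('col_1')
.agg({'col_2': 'sum',
'col_3': 'sum',
'col_4': ['mean', 'std']}))

# Four output columns: ('col_2','sum'), ('col_3','sum'), ('col_4','mean'), ('col_4','std')
print(df_2.shape)  # (2, 4)
```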

*Alternative: Join Summary Tables*

```
df_a = (df
.groupby('col_1')[['col_2', 'col_3']]
.agg('sum'))
df_b = (df
.groupby('col_1')['col_4']
.agg(['mean', 'std']))
df_2 = pd.concat([df_a, df_b], axis=1)
```

Here is how this works:

- This code performs the same function as the code above but with a different approach. Instead of passing a dictionary to `agg()`, we generate multiple summary data frames and then join them (column bind them) via `concat()`.
- We generate two summary data frames as follows:
  - In both aggregation operations, we group by the grouping column of interest, `col_1`.
  - `df_a` holds the sum of values of the columns `col_2` and `col_3` for each group.
  - `df_b` holds the mean and standard deviation of values of the column `col_4` for each group.
- In `pd.concat([df_a, df_b], axis=1)` we specify that we wish to concatenate the columns of `df_b` next to the columns of `df_a` (hence `axis=1`), matching rows via the index (which holds the groups). See Reshaping.
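A runnable sketch of the join approach on hypothetical sample data; note that, unlike the dictionary form, both intermediate summaries have flat column labels, so the joined result does too:

```
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    'col_1': ['a', 'a', 'b'],
    'col_2': [1, 2, 3],
    'col_3': [4, 5, 6],
    'col_4': [10.0, 20.0, 30.0],
})

df_a = (df
.groupby('col_1')[['col_2', 'col_3']]
.agg('sum'))
df_b = (df
.groupby('col_1')['col_4']
.agg(['mean', 'std']))
df_2 = pd.concat([df_a, df_b], axis=1)

# One row per group; columns from both summaries side by side
print(df_2.columns.tolist())  # ['col_2', 'col_3', 'mean', 'std']
```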
