We wish to carry out an individual data aggregation operation e.g. we wish to obtain the sum of the values of a particular numeric column. One common situation where quick individual data aggregation is needed is during interactive inspection of data

We will cover two scenarios:

- Performing a data aggregation operation on an
**entire data frame**(not grouped) reducing one or more columns to a single scalar value. - Performing a data aggregation operation on a
**grouped data frame**reducing one or more columns into a set of values one for each group.

This section is complemented by:

- Summary Table where we cover how to carry out multiple aggregations on and return a summary data frame.
- Common Aggregation Operations where cover of the most common data aggregation operations.

We wish to perform one data aggregation operation on a data frame that is not grouped.

**One Column Function**

We wish to perform one data aggregation operation on a column of a data frame (that is not grouped) reducing that one column to a single scalar value.

In this example we wish to compute the sum of the column `‘col_2'`

of the data frame `df`

.

```
df['col_2'].sum()
```

Here is how this works:

- In
`df['col_2']`

, we select the column that we wish to summarize via the bracket operator returning a`Series`

. See Selecting. - We use the convenience method
`sum()`

of Pandas`Series`

to compute the sum of values of the column`col_2`

returning a single Scalar value. - See Common Aggregation Operations for a coverage of the most common data aggregation operations.

Alternatively,

```
from scipy.stats import gmean
gmean(df['col_2'])
```

Here is how this works:

- We pass the
`Series`

resulting from`df['col_2']`

to the aggregation function that we wish to apply. - The aggregation function in question should accept a Series or an array like object and return a summary value.
- This approach is appropriate when the function we wish to apply doesn't exist as a Pandas
`Series`

convenience function such as the geometric mean function`gmean()`

from`scipy.stats`

in this example.

**Multiple Column Function**

We wish to perform an aggregation operation on an entire data frame (that is not grouped) reducing multiple columns to a single scalar value.

In this example, we wish to compute the weighted average of the column `‘col_2'`

where the weights are given by the column `‘col_3'`

of the data frame `df`

.

```
np.average(df['col_2'], weights=df['col_3'])
```

Here is how this works:

- There are no convenience Pandas data frame methods that operate on multiple columns.
- To compute the weighted average, we use the
`average()`

function from`NumPy`

package. - We pass the column whose average we wish to compute
`df['col_2']`

to the first argument of`average()`

and the column to use as weights to the weights argument in`weights=df['col_3']`

.

*Alternatively:*

```
df.pipe(lambda x: np.average(x['col_2'], weights=x['col_3']))
```

Here is how this works:

- We use the
`pipe()`

method to pass the data frame df to a`lambda`

function that computes the weighted average via the`average(`

) function from`NumPy`

. - This approach may be appropriate if we wish to compute the data aggregation operation as part of a larger chain of data transformation operations.

We wish to perform one data aggregation operation on a grouped data frame.

**One Column Function**

We wish to perform an aggregation on a grouped data frame. The aggregation operation we wish to perform is a function of one column.

In this example, we wish to perform two individual aggregations on a data frame `df`

grouped by the column `‘col_1’`

. Those aggregations are: (1) Obtain the sum of the value of the column `‘col_2'`

for each group. (2) Obtain the 90th percentile value of the column `‘col_2'`

for each group.

```
df.groupby('col_1')['col_2'].agg('sum')
df.groupby('col_1')['col_2'].agg('quantile', 0.9)
```

Here is how this works:

- In
`df.groupby('col_1')`

, we group the data frame`df`

by the values of the column`‘col_1’`

. - We then select the column
`'col_2'`

returning a grouped`Series`

(more precisely a`SeriesGroupBy`

object). - We use the
`agg()`

method of grouped`Series`

(`SeriesGroupBy`

objects) to compute the aggregate value of the`Series`

for each group. - Any aggregating function, i.e. one that takes multiple individual values and returns a single summary value, can be passed to
`agg()`

. - We can pass a qualified function reference, be it a built-in function e.g.
`np.mean()`

or any custom function by its name. We can refer to Pandas convenience functions via a string of their name e.g.`‘quantile’`

instead of its qualified name`SeriesGroupBy.quantile()`

. We cover many examples of this in Common Aggregation Operations. - Pandas implicitly passes the values of the aggregated column for each group to the aggregating function as a
`Series`

. - We can pass any number of positional arguments or named arguments to the function by simply passing them to
`agg()`

like we do in`agg(’quantile’, 0.9)`

. - The output is a
`Series`

where each element is the output of aggregation for each group and where the`Index`

is the group identifier. - If we pass a non aggregating function i.e. the output per group has more than one scalar value e.g.
`sqrt()`

an exception is raised`ValueError: function does not reduce`

.

*Alternatively:*

```
df.groupby('col_1')['col_2'].sum()
df.groupby('col_1')['col_2'].quantile(0.9)
```

Here is how this works:

`Pandas`

provides convenience methods for the most common aggregation functions such as`sum()`

and`quantile()`

. We cover most of those convenience methods in Common Aggregation Operations.- We can call those convenience aggregation methods directly on a grouped Series (
`SeriesGroupBy`

object) like we do here. - The output is exactly the same as using
`agg()`

i.e. a`Series`

where each element is the output of aggregation for each group and where the`Index`

is the group identifier.

**Multiple Column Function**

We wish to perform an aggregation on a grouped data frame. The aggregation operation we wish to perform is a function of multiple columns.

In this example, we wish to compute the weighted mean of the column `‘col_2'`

where the weights are given by the column `‘col_3'`

. We wish to perform the aggregation for each group where the groups are defined by the column `‘col_1'`

.

```
df.groupby(col_1).apply(lambda x: np.average(x['col_2'], weights=x['col_3']))
```

Here is how this works:

`agg()`

which we used for one column aggregation can’t be used to execute data aggregation operations that involve multiple columns. The reason being that`agg()`

acts on one column at a time.- The way to go when we wish to perform a grouped aggregation that involves multiple columns is to use
`apply()`

. - When called on a grouped data frame (a
`DataFrameGroupBy`

object),`apply()`

receives each sub data frame (for each group) as a data frame allowing access to all columns for that group. - In
`df.groupby(col_1)`

, we group the data frame df by the values of the column`col_1`

. - In
`lambda x: np.average(x['col_2'], weights=x['col_3'])`

, we built a`lambda`

function that passes the values of the columns`‘col_2’`

and`‘col_3’`

to the values and weights arguments of the weighted average function`average()`

from`NumPy`

. - Instead of a lambda function, we can create a custom function and call it from within apply(). We show an example of that in Common Aggregation Operations.

PYTHON