On this page, we cover the functions used to carry out some of the most common data aggregation operations.

In its simplest form, a data aggregation function is one which takes multiple individual values and returns a single summary value; e.g. sum.

We cover the data manipulation operations relevant to each data type extensively in dedicated sections (see the data type operation sections in the navigation bar). Think of this page as a brief overview of the most common aggregation functions for each data type, with links to the respective operations section for more detailed coverage.

On this page, we demonstrate the data aggregation operations in a scenario where we summarize a data frame that is grouped by one column into a summary data frame. See Summary Table for a description of this pattern. To see how these data aggregation operations can be applied in different scenarios, see:

- Individual Aggregation, where we cover how to perform an individual aggregation operation; e.g. we wish to obtain the sum of the values of one particular numeric column.
- Groupings, where we cover multiple grouping scenarios.

**Counting**

We wish to count the number of rows in a group or the number of unique values.

In this example, we wish to count the number of rows in each group and the number of unique values of the column `col_2` in each group. We wish to carry out the aggregation on the data frame `df` grouped by the column `col_1`.

```
df_2 = (df
    .groupby('col_1')
    .agg(
        size=('col_2', 'size'),
        col_2_vals=('col_2', 'nunique')
    ))
```

How this works:

- We call `agg()` on the output of `groupby()` to have the data aggregation operations be carried out for each group. See Summary Table.
- As their names imply, given a list of values:
    - `size()` is a `Pandas` convenience function that returns the number of values. See Counting.
    - `nunique()` is a `Pandas` convenience function that returns the number of unique values. See Uniqueness.
- To refer to the `Pandas` convenience methods implemented by `GroupBy`, `agg()` expects us to pass a string representation of their name; e.g. in this case we used `'size'` to refer to the method `SeriesGroupBy.size()`.
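To make the pattern concrete, here is a minimal sketch with a small, made-up data frame (the column names and values are hypothetical):

```python
import pandas as pd

# Hypothetical data frame; 'col_1' is the grouping column.
df = pd.DataFrame({
    'col_1': ['a', 'a', 'b', 'b', 'b'],
    'col_2': [1, 1, 2, 3, 3],
})

df_2 = (df
    .groupby('col_1')
    .agg(
        size=('col_2', 'size'),
        col_2_vals=('col_2', 'nunique')
    ))

# Group 'a' has 2 rows and 1 unique value of col_2;
# group 'b' has 3 rows and 2 unique values.
print(df_2)
```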

*Alternatively:*

```
df_2 = pd.DataFrame()
df_g = df.groupby('col_1')
df_2['size'] = df_g['col_2'].size()
df_2['col_2_vals'] = df_g['col_2'].nunique()
```

How this works:

- In `df_2 = pd.DataFrame()`, we create an empty data frame called `df_2`.
- In `df_g = df.groupby('col_1')`, we create a grouped data frame (a `DataFrameGroupBy` object) which we reuse for each of the following aggregations.
- We use the same functions as above, except here we call them in a different way, i.e. as method calls on grouped `Series` objects (`SeriesGroupBy` objects).
- The results of these operations are stored in the `size` and `col_2_vals` columns of the `df_2` data frame.

```
df['col_1'].value_counts()
```

Here is how this works:

- For the case of computing the number of rows per group, we can use the function `value_counts()` instead of the more verbose `df.groupby('col_1')['col_1'].count()`.
- `value_counts()` can act on a `Series` or a data frame to return the number of occurrences of each unique value or unique combination of values.
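For instance, on a toy data frame (hypothetical values), `value_counts()` returns the per-value row counts directly:

```python
import pandas as pd

df = pd.DataFrame({'col_1': ['a', 'b', 'a', 'a']})

# One entry per unique value, sorted by descending count.
counts = df['col_1'].value_counts()
print(counts)
```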

**Arithmetic**

We wish to aggregate data by computing an arithmetic operation.

In this example, we wish to compute the sum of the values of the column `col_2` for each group and the sum of the product of the values of the columns `col_2` and `col_3` for each group. We wish to carry out the aggregation on the data frame `df` grouped by the column `col_1`.

```
df_2 = (df
    .groupby('col_1')
    .apply(lambda x: pd.Series({
        'col_2_sum': x['col_2'].sum(),
        'col_2_3_ws': (x['col_2'] * x['col_3']).sum()})))
```

Here is how this works:

- As described in Summary Table:
    - We call `apply()` on the output of `groupby()` to have the data aggregation operations be carried out for each group.
    - We use `apply()` because we wish to perform data aggregation operations that involve multiple columns.
    - We have `apply()` return a `Series` for each group, so we may create multiple columns.
- As its name implies, `sum()` computes the sum of values for the vector it's called on.
- For a coverage of arithmetic operations, see Arithmetic Operations.
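A minimal sketch of the above, using made-up values so the sums are easy to verify by hand:

```python
import pandas as pd

df = pd.DataFrame({
    'col_1': ['a', 'a', 'b'],
    'col_2': [1, 2, 3],
    'col_3': [10, 20, 30],
})

df_2 = (df
    .groupby('col_1')
    .apply(lambda x: pd.Series({
        'col_2_sum': x['col_2'].sum(),
        'col_2_3_ws': (x['col_2'] * x['col_3']).sum()})))

# Group 'a': col_2_sum = 1 + 2 = 3, col_2_3_ws = 1*10 + 2*20 = 50
# Group 'b': col_2_sum = 3,         col_2_3_ws = 3*30 = 90
print(df_2)
```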

*Alternatively:*

```
df_2 = pd.DataFrame()
df_g = df.groupby('col_1')
df_2['col_2_sum'] = df_g['col_2'].sum()
df_2['col_2_3_ws'] = \
    df_g.apply(lambda x: (x['col_2'] * x['col_3']).sum())
```

Here is how this works:

- In producing `col_2_3_ws`, we use `apply()` on the grouped data frame, so we may access all columns, because the aggregation we wish to perform involves multiple columns. See "Function of Multiple Columns" in Individual Grouped Aggregation.
- For the rest of this code snippet, see the description of the alternative solution under Counting above.

**Summary Statistics**

We wish to aggregate data by computing summary statistics, e.g. the mean, for each group.

In this example, we wish to compute the mean, median, and standard deviation of the values of the column `col_3` for each group. We wish to carry out the aggregation on the data frame `df` grouped by the column `col_1`.

```
df_2 = (df
    .groupby('col_1')
    .agg(
        col_3_mean=('col_3', 'mean'),
        col_3_median=('col_3', 'median'),
        col_3_std=('col_3', 'std')
    ))
```

Here is how this works:

- We call `agg()` on the output of `groupby()` to have the data aggregation operations be carried out for each group. See Summary Table.
- As their names imply, given a vector of values:
    - `mean()` computes the mean of the values.
    - `median()` computes the median of the values.
    - `std()` computes the standard deviation of the values.
- To refer to the `Pandas` convenience methods implemented by `GroupBy`, `agg()` expects us to pass a string representation of their name; e.g. in this case we used `'mean'` to refer to the method `SeriesGroupBy.mean()`.
- For a coverage of summary statistics, see Summary Statistics.
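As a quick sketch on hypothetical data (note that `std` computes the sample standard deviation, so a single-row group yields `NaN`):

```python
import pandas as pd

df = pd.DataFrame({
    'col_1': ['a', 'a', 'a', 'b'],
    'col_3': [1.0, 2.0, 3.0, 5.0],
})

df_2 = (df
    .groupby('col_1')
    .agg(
        col_3_mean=('col_3', 'mean'),
        col_3_median=('col_3', 'median'),
        col_3_std=('col_3', 'std')
    ))

# Group 'a': mean = 2.0, median = 2.0, sample std = 1.0
# Group 'b': a single value, so std is NaN
print(df_2)
```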

*Alternatively:*

```
df_2 = pd.DataFrame()
df_g = df.groupby('col_1')
df_2['col_3_mean'] = df_g['col_3'].mean()
df_2['col_3_median'] = df_g['col_3'].median()
df_2['col_3_std'] = df_g['col_3'].std()
```

See the description of the alternative solution under Counting above.

**Min and Max**

We wish to aggregate data by computing the maximum or minimum value of a given column.

In this example, we wish to compute the minimum value of the numeric column `col_2` for each group and the maximum value of the numeric column `col_3` for each group. We wish to carry out the aggregation on the data frame `df` grouped by the column `col_1`.

```
df_2 = (df
    .groupby('col_1')
    .agg(
        col_2_min=('col_2', 'min'),
        col_3_max=('col_3', 'max')
    ))
```

Here is how this works:

- We call `agg()` on the output of `groupby()` to have the data aggregation operations be carried out for each group. See Summary Table.
- As their names imply, given a vector of values:
    - `min()` returns the minimum value.
    - `max()` returns the maximum value.
- To refer to the `Pandas` convenience methods implemented by `GroupBy`, `agg()` expects us to pass a string representation of their name; e.g. in this case we used `'min'` to refer to the method `SeriesGroupBy.min()`.
- Obtaining the max and the min are a subset of the larger class of ranking operations, which we cover in Ranking.
- Note that here we are aggregating the values of individual columns in a group; to filter entire rows of a group based on the rank of their values, see Filtering Groups by Rank.
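A small sketch on made-up values; note that the min and the max may come from different rows of the group:

```python
import pandas as pd

df = pd.DataFrame({
    'col_1': ['a', 'a', 'b', 'b'],
    'col_2': [4, 1, 7, 2],
    'col_3': [9, 5, 3, 8],
})

df_2 = (df
    .groupby('col_1')
    .agg(
        col_2_min=('col_2', 'min'),
        col_3_max=('col_3', 'max')
    ))

# Group 'a': min(col_2) = 1, max(col_3) = 9
# Group 'b': min(col_2) = 2, max(col_3) = 8
print(df_2)
```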

*Alternatively:*

```
df_2 = pd.DataFrame()
df_g = df.groupby('col_1')
df_2['col_2_min'] = df_g['col_2'].min()
df_2['col_3_max'] = df_g['col_3'].max()
```

See the description of the alternative solution under Counting above.

**First and Last**

We wish to aggregate data by returning the first or last value that a column takes.

In this example, we wish to return the first value of the column `col_2` for each group, the last value of the column `col_2` for each group, and the second value of the column `col_3` for each group. We wish to carry out the aggregation on the data frame `df` grouped by the column `col_1`.

```
df_2 = (df
    .groupby('col_1')
    .agg(
        col_2_first=('col_2', 'first'),
        col_2_last=('col_2', 'last'),
        col_3_2nd=('col_3', lambda x: x.iloc[1])
    ))
```

Here is how this works:

- We call `agg()` on the output of `groupby()` to have the data aggregation operations be carried out for each group. See Summary Table.
- As their names imply, given a vector of values:
    - `first()` returns the first occurring value.
    - `last()` returns the last occurring value.
- To refer to the `Pandas` convenience methods implemented by `GroupBy`, `agg()` expects us to pass a string representation of their name; e.g. in this case we used `'first'` to refer to the method `GroupBy.first()`.
- In `lambda x: x.iloc[1]`, we use a lambda function to pass arguments to a function. This is the way to pass arguments when using the "Named Aggregation" style of `agg()`.
- There is an `nth()` method that we cover in the alternative solution below.
- It is often important to sort the data in some way before carrying out an aggregation-by-position operation.
- For a more detailed coverage of position-related operations, see Position.
- Note: Here we are aggregating the values of individual columns in a group; to filter entire rows of a group based on their position, see Filtering Groups by Position.
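To illustrate the point about sorting, here is a hedged sketch with a hypothetical `time` column; sorting first makes `first` and `last` refer to chronological order rather than the arbitrary input order:

```python
import pandas as pd

# Hypothetical event log; 'time' is made up for this sketch.
df = pd.DataFrame({
    'col_1': ['a', 'a', 'b', 'b'],
    'time':  [2, 1, 1, 2],
    'col_2': ['x', 'y', 'p', 'q'],
})

df_2 = (df
    .sort_values('time')
    .groupby('col_1')
    .agg(
        col_2_first=('col_2', 'first'),
        col_2_last=('col_2', 'last')
    ))

# Group 'a' in time order is ['y', 'x'], so first = 'y', last = 'x'.
print(df_2)
```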

*Alternatively:*

```
df_2 = pd.DataFrame()
df_g = df.groupby('col_1')
df_2['col_2_first'] = df_g['col_2'].first()
df_2['col_2_last'] = df_g['col_2'].last()
df_2['col_3_2nd'] = df_g['col_3'].nth(1)
```

Here is how this works:

- The functions are as in the solution above, except that we use the convenient `nth()` method, which we can use here because we are working on a grouped `Series` (a `SeriesGroupBy` object), `df_g['col_3']`.
- For the rest of this code snippet, see the description of the alternative solution under Counting above.

**String**

We wish to perform data aggregation operations on a column of a string data type.

In this example, we wish to summarize the data frame `df` over groups defined by the column `col_1`. The aggregation operations we wish to carry out are: (1) Concatenate the unique values of the column `col_2` per group into a comma-separated string. (2) Obtain the last value of `col_3` for each group that matches a given regular expression.

```
df_2 = (df
    .groupby('col_1')
    .agg(
        col_2_distinct=('col_2', lambda x: ','.join(x.unique())),
        col_3_match=('col_3',
                     lambda x: x[x.str.contains(r'\d$')].iloc[-1])
    ))
```

Here is how this works:

- We call `agg()` on the output of `groupby()` to have the data aggregation operations be carried out for each group. See Summary Table.
- In `lambda x: ','.join(x.unique())`, we combine (concatenate) the unique values of a string column for each group into one comma-separated string. The two functions we used are:
    - `unique()` returns the unique values. See Uniqueness.
    - `join()` combines multiple strings into one, separated by the value on which it is applied, in this case `','`.
- In `lambda x: x[x.str.contains(r'\d$')].iloc[-1]`, we return the last value of a string column for each group that matches the regular expression `r'\d$'`.
    - `str.contains()` returns a logical `Series` that is `True` where the value of the string `Series` matches the given pattern.
    - `x[x.str.contains(r'\d$')]` returns a `Series` that is a subset of the input `Series`, containing only the elements that match the given pattern.
    - `iloc[-1]` returns the last occurring value of a given `Series`.
- The data type needs to be string for this to work. If we wish to apply string operations, e.g. concatenation, to say a numerical data type, we need to convert it to string first. We can do that via `x.astype(str)`. See Type Casting.
- For a coverage of string operations (including regular expressions), see String Operations.
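A small sketch with hypothetical string values; the `r'\d$'` pattern matches values ending in a digit:

```python
import pandas as pd

df = pd.DataFrame({
    'col_1': ['a', 'a', 'b', 'b'],
    'col_2': ['x', 'x', 'y', 'z'],
    'col_3': ['k1', 'm', 'n2', 'p3'],
})

df_2 = (df
    .groupby('col_1')
    .agg(
        col_2_distinct=('col_2', lambda x: ','.join(x.unique())),
        col_3_match=('col_3',
                     lambda x: x[x.str.contains(r'\d$')].iloc[-1])
    ))

# Group 'a': unique col_2 is just 'x'; only 'k1' ends in a digit.
# Group 'b': unique col_2 is 'y,z'; last digit-ending col_3 is 'p3'.
print(df_2)
```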

**Logical**

We wish to perform data aggregation operations on a column of a logical data type, i.e. one that takes a value of `True` or `False`.

In this example, for each group, where the groups are defined by the value of the column `col_1`, we wish to obtain the number and ratio of rows where the column `col_2` is `NA`.

```
df_2 = (df
    .groupby('col_1')
    .agg(
        col_2_na_count=('col_2', lambda x: x.isna().sum()),
        col_2_na_rate=('col_2', lambda x: x.isna().mean())
    ))
```

Here is how this works:

- We call `agg()` on the output of `groupby()` to have the data aggregation operations be carried out for each group. See Summary Table.
- We use `lambda` functions passed to `agg()` to realize the aggregation logic. See Custom Operations below.
- For each group, we apply the function `isna()` to the values of the column `col_2`. We then use `sum()` to compute the number of values that are missing for each group and `mean()` to compute the ratio of missing values in each group.
- Summing a boolean vector is basically counting the number of `True` values (which in this case is the number of missing values) because when arithmetic operations are carried out on logical vectors, `True` is regarded as `1` and `False` is regarded as `0`.
- Similarly, taking the mean of a logical vector is basically taking the ratio between the number of `True` values and the total number of values in the input vector.
- For a coverage of logical operations, see Logical Operations.
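A minimal sketch of the count-and-ratio pattern on made-up data with a few `NaN` values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col_1': ['a', 'a', 'b', 'b'],
    'col_2': [1.0, np.nan, np.nan, np.nan],
})

df_2 = (df
    .groupby('col_1')
    .agg(
        col_2_na_count=('col_2', lambda x: x.isna().sum()),
        col_2_na_rate=('col_2', lambda x: x.isna().mean())
    ))

# Group 'a': 1 of 2 values missing -> count 1, rate 0.5
# Group 'b': 2 of 2 values missing -> count 2, rate 1.0
print(df_2)
```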

*Alternatively:*

```
df_2 = pd.DataFrame()
df_g = df.groupby('col_1')
df_2['col_2_na_count'] = \
    df_g['col_2'].agg(lambda x: x.isna().sum())
df_2['col_2_na_rate'] = \
    df_g['col_2'].agg(lambda x: x.isna().mean())
```

Here is how this works:

- We pass the `lambda` functions to `agg()`. However, in this solution, `agg()` is applied directly to the grouped `Series` holding the column we wish to aggregate.
- For the rest of the code, see the description of the alternative solution under Counting above.

**Date Time**

We wish to group a data frame by some unit or interval of time, e.g. group by year or weekday.

In this example, we wish to group the data frame `df` by weekday inferred from the date-time column `col_1`, where `col_1` is provided as a string. We then wish to compute the number of rows in each group and the number of unique hours that exist in each group.

```
df_2 = (df
    .assign(
        col_1=pd.to_datetime(df['col_1']),
        col_1_wd=lambda x: x['col_1'].dt.weekday)
    .groupby('col_1_wd')
    .agg(
        count=('col_2', 'size'),
        hour_count=('col_1', lambda x: x.dt.hour.nunique())
    ))
```

Here is how this works:

- We use `assign()` to execute two data transformations:
    - In `pd.to_datetime(df['col_1'])`, we convert the string representation of date-time that we originally have in the column `col_1` into a date-time object. We save the output as `col_1`, overwriting the original string column.
    - In `lambda x: x['col_1'].dt.weekday`, we identify the weekday for each date-time value of `col_1`. We save the output as a new column `col_1_wd`.
- We then call `groupby()` and pass the name of the newly created column `col_1_wd` to create one group for each weekday.
- We call `agg()` on the output of `groupby()` to have the data aggregation operations be carried out for each group. See Summary Table.
- In `lambda x: x.dt.hour.nunique()`, we extract the hour component of each date-time value for the current group and then count the number of unique hours in each group.
- In addition to `groupby()`, `Pandas` provides two powerful approaches, `resample()` and `pd.Grouper()`, that offer fine control over grouping by date-time over any time span. We cover those along with date-time operations in Date Time Operations.
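A runnable sketch with hypothetical date-time strings; `dt.weekday` numbers Monday as 0:

```python
import pandas as pd

# Hypothetical timestamps; 2023-01-02 is a Monday (weekday 0).
df = pd.DataFrame({
    'col_1': ['2023-01-02 08:00', '2023-01-02 09:00', '2023-01-03 08:00'],
    'col_2': [1, 2, 3],
})

df_2 = (df
    .assign(
        col_1=pd.to_datetime(df['col_1']),
        col_1_wd=lambda x: x['col_1'].dt.weekday)
    .groupby('col_1_wd')
    .agg(
        count=('col_2', 'size'),
        hour_count=('col_1', lambda x: x.dt.hour.nunique())
    ))

# Monday (0): 2 rows over hours {8, 9}; Tuesday (1): 1 row, 1 hour.
print(df_2)
```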

*Alternatively:*

```
df_c = df.copy()
df_c['col_1'] = pd.to_datetime(df['col_1'])
df_2 = pd.DataFrame()
df_g = df_c.groupby(df_c['col_1'].dt.weekday)
df_2['count'] = df_g['col_2'].size()
df_2['hour_count'] = \
    df_g['col_1'].agg(lambda x: x.dt.hour.nunique())
```

Here is how this works:

- We create a deep copy of `df`, which we call `df_c`. It is generally good practice not to alter the input data. This step can be skipped if we wish to modify the input data frame `df`.
- We then transform `df_c` by converting the data type of `col_1` from string (`object`) to datetime (`datetime64[ns]`).
- We extract the weekday from `col_1` inside `groupby()`.
- The data aggregation operations are the same as in the core solution above; for the rest of the code, see the description of the alternative solution under Counting above.

**Custom Operations**

We wish to perform a custom aggregation operation (as opposed to using a built-in function).

There are two approaches to performing custom aggregation operations:

- Via a custom function that we call from within `agg()` or `apply()`. This is appropriate for aggregations that we need to carry out more than once or aggregations that are relatively complex.
- Via a lambda function inside `agg()` or `apply()`. This is appropriate for simple logic that we need to carry out once or twice.

We will look at both approaches in the context of one-column and multiple-column aggregation operations below.

**One Column Operations**

In this example, for each group of the data frame `df`, where the groups are defined by the values of the column `col_1`, we wish to: (1) Compute the mode, i.e. the most frequently occurring value, of the rounded values of the floating-point column `col_2`, and (2) Compute the sum of the numerical substrings of the values of the string column `col_3`.

```
def str_sum(p_col):
    col_sum = (p_col
        .str.extract(r'(\d+$)', expand=False)
        .astype(int)
        .sum())
    return col_sum

df_2 = (df
    .groupby('col_1')
    .agg(
        col_2_rm=('col_2', lambda x: x.round().mode().iloc[0]),
        col_3_sum=('col_3', str_sum)
    ))
```

Here is how this works:

- We call `agg()` on the output of `groupby()` to have the data aggregation operations be carried out for each group, where the groups are defined by the value of the column `col_1`. See Summary Table.
- We use a `lambda` function to carry out the first aggregation. The `lambda` function rounds the values of `col_2` then takes the mode of the rounded values for each group. The mode is the most frequently occurring value in a vector of values. Note that `mode()` returns a `Series` because there can be ties, so it is common to keep only the first mode so that the aggregation yields a single value per group.
- We use a custom function `str_sum()` to carry out the second aggregation. The custom function `str_sum()` takes a `Pandas` `Series` of a string data type as input and returns the sum of all the numbers extracted from the string values in the `Series`. It works as follows:
    - The function uses the `str.extract()` method to extract the numbers that match a given regular expression. See String Operations.
    - It then converts the extracted strings to integers using `astype()`. See Type Casting.
    - Finally, it computes and returns the sum of all the integers using the `sum()` method.
- The resulting summary data frame is stored in `df_2`.
- As mentioned above, custom functions are preferred for more involved logic or logic that we need to carry out multiple times.
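A self-contained sketch of this recipe on made-up data (the values are chosen so every `col_3` string ends in a digit, since `astype(int)` would fail on non-matches):

```python
import pandas as pd

def str_sum(p_col):
    # Sum of the trailing integers extracted from each string value.
    return (p_col
            .str.extract(r'(\d+$)', expand=False)
            .astype(int)
            .sum())

df = pd.DataFrame({
    'col_1': ['a', 'a', 'b'],
    'col_2': [1.2, 1.4, 2.6],
    'col_3': ['x1', 'y2', 'z3'],
})

df_2 = (df
    .groupby('col_1')
    .agg(
        col_2_rm=('col_2', lambda x: x.round().mode().iloc[0]),
        col_3_sum=('col_3', str_sum)
    ))

# Group 'a': rounded col_2 is [1.0, 1.0] -> mode 1.0; 1 + 2 = 3
# Group 'b': rounded col_2 is [3.0] -> mode 3.0; sum 3
print(df_2)
```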

Alternatively:

```
def str_sum(p_col):
    col_sum = (p_col
        .str.extract(r'(\d+$)', expand=False)
        .astype(int)
        .sum())
    return col_sum

df_2 = pd.DataFrame()
df_g = df.groupby('col_1')
df_2['col_2_rm'] = \
    df_g['col_2'].agg(lambda x: x.round().mode().iloc[0])
df_2['col_3_sum'] = df_g['col_3'].agg(str_sum)
```

Here is how this works:

- This code is similar to the primary solution above, but it uses the lambda function and the custom function `str_sum()` in a different way, i.e. as functions passed to `agg()` directly instead of as named aggregations.
- For the rest of the code, see the description of the alternative solution under Counting above.

**Multiple Column Operations**

In this example, for each group of the data frame `df`, where the groups are defined by the values of the column `col_1`, we wish to: (1) Compute the sum of products of the columns `col_2` and `col_3`. (2) Compute the weighted average of `col_2`, weighted by the log of `col_3`. (3) Compute the minimum value of each row, then compute the maximum of those minimum values for each group.

```
def log_weight_avg(p_x, p_w):
    w = np.log10(p_w)
    lwa = np.average(p_x, weights=w)
    return lwa

def max_mins(p_df):
    m_m = (p_df
        .select_dtypes('number')
        .min(axis=1)
        .max())
    return m_m

df_2 = (df
    .groupby('col_1')
    .apply(lambda x: pd.Series({
        'col_2_3_ws': (x['col_2'] * x['col_3']).sum(),
        'col_2_3_swa': log_weight_avg(x['col_2'], x['col_3']),
        'max_mins': max_mins(x)
    })))
```

Here is how this works:

- As described in Summary Table:
    - We call `apply()` on the output of `groupby()` to have the data aggregation operations be carried out for each group.
    - We use `apply()` because we wish to perform data aggregation operations that involve multiple columns.
    - We have `apply()` return a `Series` for each group, so we may create multiple columns.
- We use a `lambda` function to carry out the first aggregation. The lambda function computes the sum of the products of the values of the columns `col_2` and `col_3` for each group.
- We use a custom function `log_weight_avg()` to compute the second aggregation:
    - The `log_weight_avg()` function takes two inputs: a `Pandas` `Series` `p_x` and a `Pandas` `Series` `p_w`.
    - The function first takes the log base 10 of `p_w` using the `log10` function from `NumPy` and stores the result in the `w` variable.
    - It then uses the `average` function from `NumPy` to compute and return the weighted average of `p_x` using the weights in `w`.
- We use a custom function `max_mins()` to compute the third aggregation:
    - This function takes as input the entire group as a data frame.
    - The function first uses the `select_dtypes` method to select only the numeric columns in `p_df`.
    - It then uses the `min` method with the `axis` parameter set to `1` to compute the minimum value of each row.
    - Finally, it uses the `max` method to compute the maximum of those row minimums and returns the result.
- Note the difference between the inputs of the two functions `log_weight_avg()` and `max_mins()`. The first is generic and accepts just the input columns it needs to perform its operation, while the second accepts the entire group as a data frame.
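To check the weighted-average logic in isolation, here is a small sketch of `log_weight_avg()` on made-up values where the arithmetic is easy to follow:

```python
import numpy as np
import pandas as pd

def log_weight_avg(p_x, p_w):
    # Weighted average of p_x, using log10(p_w) as the weights.
    return np.average(p_x, weights=np.log10(p_w))

# Weights: log10([10, 100]) = [1, 2]
# Weighted average: (1*1 + 3*2) / (1 + 2) = 7/3
out = log_weight_avg(pd.Series([1.0, 3.0]), pd.Series([10.0, 100.0]))
print(out)
```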

*Alternatively:*

```
def log_weight_avg(p_x, p_w):
    w = np.log10(p_w)
    lwa = np.average(p_x, weights=w)
    return lwa

def max_mins(p_df):
    m_m = (p_df
        .select_dtypes('number')
        .min(axis=1)
        .max())
    return m_m

df_2 = pd.DataFrame()
df_g = df.groupby('col_1')
df_2['col_2_3_ws'] = \
    df_g.apply(lambda x: (x['col_2'] * x['col_3']).sum())
df_2['col_2_3_swa'] = \
    df_g.apply(lambda x: log_weight_avg(x['col_2'], x['col_3']))
df_2['max_mins'] = \
    df_g.apply(max_mins)
```

Here is how this works:

- This code is similar to the primary solution above, but it uses the `lambda` function and the custom functions `log_weight_avg()` and `max_mins()` in a different way, i.e. as individual aggregations instead of as one call to `apply()`.
- For the rest of the code, see the description of the alternative solution under Counting above.
