On this page, we cover the functions used to carry out some of the most common data aggregation operations.
In its simplest form, a data aggregation function is one that takes multiple individual values and returns a single summary value, e.g. the sum.
We cover the data manipulation operations relevant to each data type extensively in their own sections (see the data type operation sections in the navigation bar). Think of this page as a brief overview of the most common aggregation functions for each data type, with links to the respective operations sections for more detailed coverage.
On this page, we demonstrate the data aggregation operations in a scenario where we summarize a data frame that is grouped by one column into a summary data frame. See Summary Table for a description of this pattern. To see how these data aggregation operations can be applied in different scenarios, see:
We wish to count the number of rows in a group or the number of unique values.
In this example, we wish to count the number of rows in each group and the number of unique values of the column col_2 in each group. We wish to carry out the aggregation on the data frame df grouped by the column col_1.
df_2 = (df
.groupby('col_1')
.agg(
size = ('col_2', 'size'),
col_2_vals = ('col_2', 'nunique')
))
How this works:
- We call agg() on the output of groupby() to have the data aggregation operations be carried out for each group. See Summary Table.
- size() is a Pandas convenience function that returns the number of values. See Counting.
- nunique() is a Pandas convenience function that returns the number of unique values. See Uniqueness.
- For Pandas convenience methods implemented by GroupBy, agg() expects us to pass a string representation of their name, e.g. in this case we used 'size' to refer to the method SeriesGroupBy.size().
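For instance, with a small made-up data frame (the column names follow the example above; the values are arbitrary), the solution above would produce one summary row per group:

import pandas as pd

df = pd.DataFrame({
    'col_1': ['a', 'a', 'b', 'b', 'b'],
    'col_2': ['x', 'x', 'y', 'z', 'z']})

df_2 = (df
    .groupby('col_1')
    .agg(
        size = ('col_2', 'size'),
        col_2_vals = ('col_2', 'nunique')
    ))

# df_2 is indexed by col_1 and holds one row per group:
#        size  col_2_vals
# col_1
# a         2           1
# b         3           2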
Alternatively:
df_2 = pd.DataFrame()
df_g = df.groupby('col_1')
df_2['size'] = df_g['col_2'].size()
df_2['col_2_vals'] = df_g['col_2'].nunique()
How this works:
- With df_2 = pd.DataFrame(), we create an empty data frame called df_2.
- With df_g = df.groupby('col_1'), we create a grouped data frame (a DataFrameGroupBy object) which we reuse for each of the following aggregations.
- Selecting a column of the grouped data frame, e.g. df_g['col_2'], returns a grouped Series (a SeriesGroupBy object) on which we can call the aggregation methods.
- We assign the result of each aggregation to the size and col_2_vals columns of the df_2 dataframe.

If all we need is the number of rows in each group, we can use:

df['col_1'].value_counts()

Here is how this works:
- We use value_counts() instead of the more verbose df.groupby('col_1')['col_1'].count().
- value_counts() can act on a Series or a data frame to return the number of occurrences of each unique value or unique combination of values.
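As a quick illustration with made-up values:

import pandas as pd

df = pd.DataFrame({'col_1': ['a', 'a', 'b', 'b', 'b']})

df['col_1'].value_counts()
# returns a Series with the number of rows per value of col_1:
# b    3
# a    2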
We wish to aggregate data by computing an arithmetic operation.
In this example, we wish to compute the sum of the values of the column col_2 for each group and the sum of the product of the values of the columns col_2 and col_3 for each group. We wish to carry out the aggregation on the data frame df grouped by the column col_1.
df_2 = (df
.groupby('col_1')
.apply(lambda x: pd.Series({
'col_2_sum': x['col_2'].sum(),
'col_2_3_ws': (x['col_2'] * x['col_3']).sum()})
))
Here is how this works:
- We call apply() on the output of groupby() to have the data aggregation operations be carried out for each group.
- We use apply() because we wish to perform data aggregation operations that involve multiple columns.
- We have the function passed to apply() return a Series for each group, so we may create multiple columns.
- sum() computes the sum of values for the vector it's called on.
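For instance, with a small made-up data frame, the solution above would yield:

import pandas as pd

df = pd.DataFrame({
    'col_1': ['a', 'a', 'b'],
    'col_2': [1, 2, 3],
    'col_3': [10, 20, 30]})

df_2 = (df
    .groupby('col_1')
    .apply(lambda x: pd.Series({
        'col_2_sum': x['col_2'].sum(),
        'col_2_3_ws': (x['col_2'] * x['col_3']).sum()})
    ))

#        col_2_sum  col_2_3_ws
# col_1
# a              3          50
# b              3          90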
Alternatively:
df_2 = pd.DataFrame()
df_g = df.groupby('col_1')
df_2['col_2_sum'] = df_g['col_2'].sum()
df_2['col_2_3_ws'] = \
df_g.apply(lambda x: (x['col_2'] * x['col_3']).sum())
Here is how this works:
- For 'col_2_3_ws', we use apply() on the grouped data frame, so we may access all columns, because the aggregation we wish to perform involves multiple columns. See "Function of Multiple Columns" in Individual Grouped Aggregation.

We wish to aggregate data by computing summary statistics, e.g. the mean, for each group.
In this example, we wish to compute the mean, median, and standard deviation of the values of the column col_3 for each group. We wish to carry out the aggregation on the data frame df grouped by the column col_1.
df_2 = (df
.groupby('col_1')
.agg(
col_3_mean = ('col_3', 'mean'),
col_3_median = ('col_3', 'median'),
col_3_std = ('col_3', 'std')
))
Here is how this works:
- We call agg() on the output of groupby() to have the data aggregation operations be carried out for each group. See Summary Table.
- mean() computes the mean of values.
- median() computes the median of values.
- std() computes the standard deviation of values.
- For Pandas convenience methods implemented by GroupBy, agg() expects us to pass a string representation of their name, e.g. in this case we used 'mean' to refer to the method SeriesGroupBy.mean().
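For instance, with a small made-up data frame:

import pandas as pd

df = pd.DataFrame({
    'col_1': ['a', 'a', 'b', 'b'],
    'col_3': [1.0, 3.0, 10.0, 20.0]})

df_2 = (df
    .groupby('col_1')
    .agg(
        col_3_mean = ('col_3', 'mean'),
        col_3_median = ('col_3', 'median'),
        col_3_std = ('col_3', 'std')
    ))

#        col_3_mean  col_3_median  col_3_std
# col_1
# a             2.0           2.0   1.414214
# b            15.0          15.0   7.071068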
Alternatively:
df_2 = pd.DataFrame()
df_g = df.groupby('col_1')
df_2['col_3_mean'] = df_g['col_3'].mean()
df_2['col_3_median'] = df_g['col_3'].median()
df_2['col_3_std'] = df_g['col_3'].std()
See the description of the alternative solution under Counting above.
We wish to aggregate data by computing the maximum or minimum value of a given column.
In this example, we wish to compute the minimum value of the numeric column col_2 for each group and the maximum value of the numeric column col_3 for each group. We wish to carry out the aggregation on the data frame df grouped by the column col_1.
df_2 = (df
.groupby('col_1')
.agg(
col_2_min = ('col_2', 'min'),
col_3_max = ('col_3', 'max')
))
Here is how this works:
- We call agg() on the output of groupby() to have the data aggregation operations be carried out for each group. See Summary Table.
- min() returns the minimum value.
- max() returns the maximum value.
- For Pandas convenience methods implemented by GroupBy, agg() expects us to pass a string representation of their name, e.g. in this case we used 'min' to refer to the method SeriesGroupBy.min().
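For instance, with made-up values:

import pandas as pd

df = pd.DataFrame({
    'col_1': ['a', 'a', 'b', 'b'],
    'col_2': [5, 3, 8, 1],
    'col_3': [10, 40, 20, 30]})

df_2 = (df
    .groupby('col_1')
    .agg(
        col_2_min = ('col_2', 'min'),
        col_3_max = ('col_3', 'max')
    ))

#        col_2_min  col_3_max
# col_1
# a              3         40
# b              1         30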
Alternatively:
df_2 = pd.DataFrame()
df_g = df.groupby('col_1')
df_2['col_2_min'] = df_g['col_2'].min()
df_2['col_3_max'] = df_g['col_3'].max()
See the description of the alternative solution under Counting above.
We wish to aggregate data by returning the first or last value that a column takes.
In this example, we wish to return the first value of the column col_2 for each group, the last value of the column col_2 for each group, and the second value of the column col_3 for each group. We wish to carry out the aggregation on the data frame df grouped by the column col_1.
df_2 = (df
.groupby('col_1')
.agg(
col_2_first = ('col_2', 'first'),
col_2_last = ('col_2', 'last'),
col_3_2nd = ('col_3', lambda x: x.iloc[1])
))
Here is how this works:
- We call agg() on the output of groupby() to have the data aggregation operations be carried out for each group. See Summary Table.
- first() returns the first occurring value.
- last() returns the last occurring value.
- For Pandas convenience methods implemented by GroupBy, agg() expects us to pass a string representation of their name, e.g. in this case we used 'first' to refer to the method GroupBy.first().
- In lambda x: x.iloc[1], we use a lambda function to return the second value of each group. A lambda function is also the way to pass arguments to a function when using the "Named Aggregation" style of agg().
- Pandas also offers an nth() method, which we cover in the alternative solution below.
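For instance, with made-up values (note that lambda x: x.iloc[1] assumes every group has at least two rows):

import pandas as pd

df = pd.DataFrame({
    'col_1': ['a', 'a', 'a', 'b', 'b'],
    'col_2': ['u', 'v', 'w', 'x', 'y'],
    'col_3': [1, 2, 3, 4, 5]})

df_2 = (df
    .groupby('col_1')
    .agg(
        col_2_first = ('col_2', 'first'),
        col_2_last = ('col_2', 'last'),
        col_3_2nd = ('col_3', lambda x: x.iloc[1])
    ))

#       col_2_first col_2_last  col_3_2nd
# col_1
# a               u          w          2
# b               x          y          5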
Alternatively:
df_2 = pd.DataFrame()
df_g = df.groupby('col_1')
df_2['col_2_first'] = df_g['col_2'].first()
df_2['col_2_last'] = df_g['col_2'].last()
df_2['col_3_2nd'] = df_g['col_3'].nth(1)
Here is how this works:
- Pandas offers the nth() method, which we can use here because we are working on a grouped Series (a SeriesGroupBy object), df_g['col_3'].
We wish to perform data aggregation operations on a column of a string data type.
In this example, we wish to summarize the data frame df over groups defined by the column 'col_1'. The aggregation operations we wish to carry out are: (1) Concatenate the unique values of the column 'col_2' per group into a comma separated string. (2) Obtain the last value of 'col_3' for each group that matches a given regular expression.
df_2 = (df
.groupby('col_1')
.agg(
col_2_distinct = ('col_2', lambda x: ','.join(x.unique())),
col_3_match = ('col_3',
        lambda x: x[x.str.contains(r'\d$')].iloc[-1])
))
Here is how this works:
- We call agg() on the output of groupby() to have the data aggregation operations be carried out for each group. See Summary Table.
- In lambda x: ','.join(x.unique()), we combine (concatenate) the unique values of a string column for each group into one comma separated string. The two functions we use are:
  - unique() returns the unique values. See Uniqueness.
  - join() combines multiple strings into one, separated by the value on which it is applied, in this case ','.
- In lambda x: x[x.str.contains(r'\d$')].iloc[-1], we return the last value of a string column for each group that matches the regular expression '\d$', i.e. ends with a digit.
  - str.contains() returns a logical Series that is True where the value of the string Series matches the given pattern.
  - x[x.str.contains(r'\d$')] returns a Series that is a subset of the input Series, containing only the elements that match the given pattern.
  - iloc[-1] returns the last occurring value of a given Series.
- If the column were not already of a string data type, we would first cast it to string via x.astype(str). See Type Casting.
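For instance, with made-up values (every group here has at least one col_3 value ending in a digit, which iloc[-1] requires):

import pandas as pd

df = pd.DataFrame({
    'col_1': ['a', 'a', 'b', 'b'],
    'col_2': ['red', 'blue', 'red', 'red'],
    'col_3': ['x1', 'y', 'z2', 'w3']})

df_2 = (df
    .groupby('col_1')
    .agg(
        col_2_distinct = ('col_2', lambda x: ','.join(x.unique())),
        col_3_match = ('col_3',
            lambda x: x[x.str.contains(r'\d$')].iloc[-1])
    ))

#       col_2_distinct col_3_match
# col_1
# a           red,blue          x1
# b                red          w3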
We wish to perform data aggregation operations on a column of a logical data type, i.e. one that takes a value of True or False.
In this example, for each group, where the groups are defined by the value of the column 'col_1', we wish to obtain the number and ratio of rows where the column 'col_2' is NA.
df_2 = (df
.groupby('col_1')
.agg(
col_2_na_count =
('col_2', lambda x: x.isna().sum()),
col_2_na_rate =
('col_2', lambda x: x.isna().mean())
))
Here is how this works:
- We call agg() on the output of groupby() to have the data aggregation operations be carried out for each group. See Summary Table.
- We use lambda functions passed to agg() to realize the aggregation logic. See Custom Operations below.
- We apply isna() to the values of the column 'col_2'. We then use sum() to compute the number of values that are missing for each group and mean() to compute the ratio of missing values in each group.
- sum() computes the number of True values (which in this case is the number of missing values) because when arithmetic operations are carried out on logical vectors, True is regarded as 1 and False is regarded as 0.
- mean() computes the ratio between the number of True values and the total number of values in the input vector.
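For instance, with made-up values:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'col_1': ['a', 'a', 'b', 'b', 'b'],
    'col_2': [1.0, np.nan, np.nan, np.nan, 5.0]})

df_2 = (df
    .groupby('col_1')
    .agg(
        col_2_na_count = ('col_2', lambda x: x.isna().sum()),
        col_2_na_rate = ('col_2', lambda x: x.isna().mean())
    ))

#        col_2_na_count  col_2_na_rate
# col_1
# a                   1       0.500000
# b                   2       0.666667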
Alternatively:
df_2 = pd.DataFrame()
df_g = df.groupby('col_1')
df_2['col_2_na_count'] = \
df_g['col_2'].agg(lambda x: x.isna().sum())
df_2['col_2_na_rate'] = \
df_g['col_2'].agg(lambda x: x.isna().mean())
Here is how this works:
- As in the primary solution, we pass lambda functions to agg(). However, in this solution, agg() is applied directly to the grouped Series holding the column we wish to aggregate.
We wish to group a data frame by some unit or interval of time, e.g. group by year or weekday.
In this example, we wish to group the data frame df by weekday, inferred from the date-time column 'col_1', where 'col_1' is provided as a string. We then wish to compute the number of rows in each group and the number of unique hours that exist in each group.
df_2 = (df
.assign(
col_1 = pd.to_datetime(df['col_1']),
col_1_wd = lambda x: x['col_1'].dt.weekday)
.groupby('col_1_wd')
.agg(
count = ('col_2', 'size'),
hour_count = ('col_1', lambda x: x.dt.hour.nunique())
))
Here is how this works:
- We use assign() to execute two data transformations:
  - In pd.to_datetime(df['col_1']), we convert the string representation of date-time that we originally have in the column col_1 into a date-time object. We save the output as col_1, overwriting the original string column.
  - In lambda x: x['col_1'].dt.weekday, we identify the weekday for each date-time value of 'col_1'. We save the output as a new column col_1_wd.
- We then call groupby() and pass the name of the newly created column 'col_1_wd' to create one group for each weekday.
- We call agg() on the output of groupby() to have the data aggregation operations be carried out for each group. See Summary Table.
- In lambda x: x.dt.hour.nunique(), we extract the hour component of each datetime value for the current group and then count the number of unique hours in each group.
- Beyond groupby(), Pandas provides two powerful approaches, resample() and pd.Grouper(), that offer fine control on grouping by date-time over any time span. We cover those along with date-time operations in Date Time Operations.
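For instance, with a couple of made-up timestamps (2023-05-01 is a Monday, so its weekday code is 0):

import pandas as pd

df = pd.DataFrame({
    'col_1': ['2023-05-01 08:30', '2023-05-01 14:00', '2023-05-02 09:15'],
    'col_2': [1, 2, 3]})

df_2 = (df
    .assign(
        col_1 = pd.to_datetime(df['col_1']),
        col_1_wd = lambda x: x['col_1'].dt.weekday)
    .groupby('col_1_wd')
    .agg(
        count = ('col_2', 'size'),
        hour_count = ('col_1', lambda x: x.dt.hour.nunique())
    ))

#           count  hour_count
# col_1_wd
# 0             2           2
# 1             1           1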
Alternatively:
df_c = df.copy()
df_c['col_1'] = pd.to_datetime(df['col_1'])
df_2 = pd.DataFrame()
df_g = df_c.groupby(df_c['col_1'].dt.weekday)
df_2['count'] = \
df_g['col_2'].size()
df_2['hour_count'] = \
df_g['col_1'].agg(lambda x: x.dt.hour.nunique())
Here is how this works:
- We start by creating a copy of the data frame df, which we call df_c. It is generally good practice not to alter the input data. This step can be skipped if we wish to modify the input data frame df.
- We modify df_c by converting the data type of col_1 from string (object) to datetime (datetime64[ns]).
- We group by the weekday values extracted from col_1 inside groupby().
We wish to perform a custom aggregation operation (as opposed to using a built-in function).
There are two approaches to performing custom aggregation operations:
- Define a custom function and pass it to agg() or apply(). This is appropriate for aggregations that we need to carry out more than once or aggregations that are relatively complex.
- Use a lambda function inside agg() or apply(). This is appropriate for simple logic that we need to carry out once or twice.

We will look at both approaches in the context of one column and multiple column aggregation operations below.
One Column Operations
In this example, for each group of the data frame df, where the groups are defined by the values of the column col_1, we wish to: (1) Compute the mode, i.e. the most frequently occurring value, of the rounded values of the floating point data type column col_2, and (2) Compute the sum of the numerical substrings of the values of the string column col_3.
def str_sum(p_col):
    col_sum = (p_col
        .str.extract(r'(\d+$)', expand=False)
        .astype(int)
        .sum())
    return col_sum
df_2 = (df
.groupby('col_1')
.agg(
col_2_rm = ('col_2', lambda x: x.round().mode()),
col_3_sum = ('col_3', str_sum)
))
Here is how this works:
- We call agg() on the output of groupby() to have the data aggregation operations be carried out for each group, where the groups are defined by the value of the column 'col_1'. See Summary Table.
- We use a lambda function to carry out the first aggregation. The lambda function rounds the values of 'col_2' then takes the mode of the rounded values for each group. The mode is the most frequently occurring value in a vector of values.
- We use the custom function str_sum() to carry out the second aggregation. The custom function str_sum() takes a Pandas series of a string data type as input and returns the sum of all the numbers extracted from the string values in the series. It works as follows:
  - It uses the str.extract() method to extract the numbers that match a given regular expression. See String Operations.
  - It casts the extracted values to integers via astype(). See Type Casting.
  - It adds up the extracted numbers via the sum() method.
- The result of each aggregation becomes a column of the summary data frame df_2.
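As a quick standalone check of the custom function with made-up string values (note that any value without a trailing number would yield NaN from str.extract() and make astype(int) fail):

import pandas as pd

def str_sum(p_col):
    col_sum = (p_col
        .str.extract(r'(\d+$)', expand=False)
        .astype(int)
        .sum())
    return col_sum

str_sum(pd.Series(['item10', 'item5', 'box7']))
# 22, i.e. 10 + 5 + 7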
Alternatively:
def str_sum(p_col):
    col_sum = (p_col
        .str.extract(r'(\d+$)', expand=False)
        .astype(int)
        .sum())
    return col_sum
df_2 = pd.DataFrame()
df_g = df.groupby('col_1')
df_2['col_2_rm'] = \
df_g['col_2'].agg(lambda x: x.round().mode())
df_2['col_3_sum'] = df_g['col_3'].agg(str_sum)
Here is how this works:
- We use the lambda function and the custom function str_sum() in a different way, i.e. as functions passed directly to agg() instead of as a named aggregation.

Multiple Column Operations
In this example, for each group of the data frame df, where the groups are defined by the values of the column col_1, we wish to: (1) Compute the sum of the products of the columns col_2 and col_3. (2) Compute the weighted average of col_2, weighted by the log of col_3. (3) Compute the min value of each row and then compute the max value of those min values for each group.
def log_weight_avg(p_x, p_w):
    w = np.log10(p_w)
    lwa = np.average(p_x, weights=w)
    return lwa

def max_mins(p_df):
    m_m = (p_df
        .select_dtypes('number')
        .min(axis=1)
        .max())
    return m_m
df_2 = (df
.groupby('col_1')
.apply(lambda x: pd.Series({
'col_2_3_ws': (x['col_2'] * x['col_3']).sum(),
'col_2_3_swa': log_weight_avg(x['col_2'], x['col_3']),
'max_mins': max_mins(x)
})
))
Here is how this works:
- We call apply() on the output of groupby() to have the data aggregation operations be carried out for each group.
- We use apply() because we wish to perform data aggregation operations that involve multiple columns.
- We have the lambda function passed to apply() return a Series for each group, so we may create multiple columns.
- We use a lambda function to carry out the first aggregation. The lambda function computes the sum of the products of the values of the columns col_2 and col_3 for each group.
- We use the custom function log_weight_avg() to compute the second aggregation.
  - The log_weight_avg() function takes two inputs: a Pandas series p_x and a Pandas series p_w.
  - It computes the log of p_w using the log10 function from NumPy and stores the result in the w variable.
  - It then uses the average function from NumPy to compute and return the weighted average of p_x using the weights in w.
- We use the custom function max_mins() to compute the third aggregation.
  - It uses the select_dtypes method to select only the numeric columns in p_df.
  - It uses the min method with the axis parameter set to 1 to compute the minimum value of each row.
  - It then uses the max method to compute the maximum value of the minimum values of each row and returns the result.
- Note that the two custom functions log_weight_avg() and max_mins() are defined differently. The first is generic and accepts just the input columns it needs to perform its operation, while the second accepts the entire group as a data frame.
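As a quick standalone check of log_weight_avg() with made-up numbers:

import numpy as np
import pandas as pd

def log_weight_avg(p_x, p_w):
    w = np.log10(p_w)
    lwa = np.average(p_x, weights=w)
    return lwa

log_weight_avg(pd.Series([1.0, 3.0]), pd.Series([10.0, 1000.0]))
# 2.5, since the weights are log10(10) = 1 and log10(1000) = 3,
# giving (1*1 + 3*3) / (1 + 3)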
Alternatively:
def log_weight_avg(p_x, p_w):
    w = np.log10(p_w)
    lwa = np.average(p_x, weights=w)
    return lwa

def max_mins(p_df):
    m_m = (p_df
        .select_dtypes('number')
        .min(axis=1)
        .max())
    return m_m
df_2 = pd.DataFrame()
df_g = df.groupby('col_1')
df_2['col_2_3_ws'] = \
df_g.apply(lambda x: (x['col_2'] * x['col_3']).sum())
df_2['col_2_3_swa'] = \
df_g.apply(lambda x: log_weight_avg(x['col_2'], x['col_3']))
df_2['max_mins'] = \
df_g.apply(max_mins)
Here is how this works:
- We use the lambda function and the custom functions log_weight_avg() and max_mins() in a different way, i.e. as individual aggregations instead of as one call to apply().