Common Operations

On this page, we cover the functions used to carry out some of the most common data aggregation operations.

In its simplest form, a data aggregation function is one that takes multiple individual values and returns a single summary value, e.g. sum().

We cover the data manipulation operations relevant to each data type extensively in dedicated sections (see the data type operation sections in the navigation bar). Think of this page as a brief overview of the most common aggregation functions for each data type, with links to the respective operations sections for more detailed coverage.

On this page, we demonstrate the data aggregation operations in a scenario where we summarize a data frame that is grouped by one column into a summary data frame. See Summary Table for a description of this pattern. To see how these data aggregation operations can be applied in different scenarios, see:

  • Individual Aggregation where we cover how to perform an individual aggregation operation, e.g. obtaining the sum of the values of one particular numeric column.
  • Groupings where we cover multiple grouping scenarios.

Counting

We wish to count the number of rows in a group or the number of unique values.

In this example, we wish to count the number of rows in each group and the number of unique values of the column col_2 in each group. We wish to carry out the aggregation on the data frame df grouped by the column col_1.

df_2 = (df
  .groupby('col_1')
  .agg(
    size = ('col_2', 'size'),
    col_2_vals = ('col_2', 'nunique')
  ))

How this works:

  • We call agg() on the output of groupby() to have the data aggregation operations be carried out for each group. See Summary Table.
  • As their names imply, given a vector of values:
    • size() is a Pandas convenience function that returns the number of values. See Counting.
    • nunique() is a Pandas convenience function that returns the number of unique values. See Uniqueness.
  • To refer to the Pandas convenience methods implemented by GroupBy, agg() expects us to pass a string representation of their names, e.g. in this case we use 'size' to refer to the method SeriesGroupBy.size().
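
To make the output concrete, here is a minimal sketch with a small hypothetical data frame (the column names match those used above):

import pandas as pd

df = pd.DataFrame({
    'col_1': ['a', 'a', 'b', 'b', 'b'],
    'col_2': ['x', 'y', 'y', 'y', 'z']})

df_2 = (df
  .groupby('col_1')
  .agg(
    size = ('col_2', 'size'),
    col_2_vals = ('col_2', 'nunique')
  ))

# df_2 holds one row per group:
#        size  col_2_vals
# col_1
# a         2           2
# b         3           2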

Alternatively:

df_2 = pd.DataFrame()
df_g = df.groupby('col_1')
df_2['size'] = df_g['col_2'].size()
df_2['col_2_vals'] = df_g['col_2'].nunique()

How this works:

  • In df_2 = pd.DataFrame(), we create an empty dataframe called df_2.
  • In df_g = df.groupby('col_1'), we create a grouped data frame (a DataFrameGroupBy object) which we reuse for each of the following aggregations.
  • We use the same functions as above except here we call them in a different way i.e. as method calls on grouped Series objects (SeriesGroupBy objects).
  • The results of these operations are stored in the size and col_2_vals columns of the df_2 dataframe.
Alternatively, for the row counts alone:

df['col_1'].value_counts()

Here is how this works:

  • For the case of computing the number of rows per group, we can use the method value_counts() instead of the more verbose df.groupby('col_1')['col_1'].count().
  • value_counts() can act on a Series or a data frame to return the number of occurrences of each unique value or unique combination of values.
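
For example, with the hypothetical data frame from the sketch above:

counts = df['col_1'].value_counts()
# b    3
# a    2
# value_counts() sorts by count in descending order by default.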

Arithmetic

We wish to aggregate data by computing an arithmetic operation.

In this example, we wish to compute the sum of the values of the column col_2 for each group and the sum of the product of the values of the columns col_2 and col_3 for each group. We wish to carry out the aggregation on the data frame df grouped by the column col_1.

df_2 = (df
  .groupby('col_1')
  .apply(lambda x: pd.Series({
    'col_2_sum': x['col_2'].sum(),
    'col_2_3_ws': (x['col_2'] * x['col_3']).sum()})
  ))

Here is how this works:

  • As described in Summary Table:
    • We call apply() on the output of groupby() to have the data aggregation operations be carried out for each group.
    • We use apply() because we wish to perform data aggregation operations that involve multiple columns.
    • We have apply() return a Series for each group, so we may create multiple columns.
  • As its name implies, sum() computes the sum of values for the vector it's called on.
  • For coverage of arithmetic operations, see Arithmetic Operations.
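
Here is a minimal sketch with a hypothetical numeric data frame to illustrate the output:

import pandas as pd

df = pd.DataFrame({
    'col_1': ['a', 'a', 'b'],
    'col_2': [1.0, 2.0, 3.0],
    'col_3': [10.0, 20.0, 30.0]})

df_2 = (df
  .groupby('col_1')
  .apply(lambda x: pd.Series({
    'col_2_sum': x['col_2'].sum(),
    'col_2_3_ws': (x['col_2'] * x['col_3']).sum()})
  ))

# For group 'a': col_2_sum = 1 + 2 = 3, col_2_3_ws = 1*10 + 2*20 = 50.
# For group 'b': col_2_sum = 3, col_2_3_ws = 3*30 = 90.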

Alternatively:

df_2 = pd.DataFrame()
df_g = df.groupby('col_1')
df_2['col_2_sum'] = df_g['col_2'].sum()
df_2['col_2_3_ws'] = \
  df_g.apply(lambda x: (x['col_2'] * x['col_3']).sum())

Here is how this works:

  • In producing 'col_2_3_ws', we use apply() on the grouped data frame so that we may access all columns, because the aggregation we wish to perform involves multiple columns. See “Function of Multiple Columns” in Individual Grouped Aggregation.
  • For the rest of this code snippet, see the description of the alternative solution under Counting above.

Summary Statistics

We wish to aggregate data by computing summary statistics, e.g. the mean, for each group.

In this example, we wish to compute the mean, median, and standard deviation of the values of the column col_3 for each group. We wish to carry out the aggregation on the data frame df grouped by the column col_1.

df_2 = (df
  .groupby('col_1')
  .agg(
    col_3_mean = ('col_3', 'mean'),
    col_3_median = ('col_3', 'median'),
    col_3_std = ('col_3', 'std')
  ))

Here is how this works:

  • We call agg() on the output of groupby() to have the data aggregation operations be carried out for each group. See Summary Table.
  • As their names imply, given a vector of values:
    • mean() computes the mean of values.
    • median() computes the median of values.
    • std() computes the standard deviation of values.
  • To refer to the Pandas convenience methods implemented by GroupBy, agg() expects us to pass a string representation of their names, e.g. in this case we use 'mean' to refer to the method SeriesGroupBy.mean().
  • For coverage of summary statistics, see Summary Statistics.
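
As a brief variation, when all the statistics apply to the same column, agg() also accepts a list of function names; the resulting columns take the function names as labels:

df_2 = df.groupby('col_1')['col_3'].agg(['mean', 'median', 'std'])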

Alternatively:

df_2 = pd.DataFrame()
df_g = df.groupby('col_1')
df_2['col_3_mean'] = df_g['col_3'].mean()
df_2['col_3_median'] = df_g['col_3'].median()
df_2['col_3_std'] = df_g['col_3'].std()

See the description of the alternative solution under Counting above.

Range

We wish to aggregate data by computing the maximum or minimum value of a given column.

In this example we wish to compute the minimum value of the numeric column col_2 for each group and the maximum value of the numeric column col_3 for each group. We wish to carry out the aggregation on the data frame df grouped by the column col_1.

df_2 = (df
  .groupby('col_1')
  .agg(
    col_2_min = ('col_2', 'min'),
    col_3_max = ('col_3', 'max')
  ))

Here is how this works:

  • We call agg() on the output of groupby() to have the data aggregation operations be carried out for each group. See Summary Table.
  • As their names imply, given a vector of values:
    • min() returns the minimum value.
    • max() returns the maximum value.
  • To refer to the Pandas convenience methods implemented by GroupBy, agg() expects us to pass a string representation of their names, e.g. in this case we use 'min' to refer to the method SeriesGroupBy.min().
  • Obtaining the maximum and minimum is a subset of the larger class of ranking operations, which we cover in Ranking.
  • Note that here we aggregate the values of individual columns in a group; to filter entire rows in a group based on the rank of their values, see Filtering Groups by Rank.
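
As with summary statistics, a list of function names is a compact alternative when we need both ends of the range of a single column:

df_2 = df.groupby('col_1')['col_2'].agg(['min', 'max'])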

Alternatively:

df_2 = pd.DataFrame()
df_g = df.groupby('col_1')
df_2['col_2_min'] = df_g['col_2'].min()
df_2['col_3_max'] = df_g['col_3'].max()

See the description of the alternative solution under Counting above.

Position

We wish to aggregate data by returning the first or last value that a column takes.

In this example, we wish to return the first value of the column col_2 for each group, the last value of the column col_2 for each group, and the second value of the column col_3 for each group. We wish to carry out the aggregation on the data frame df grouped by the column col_1.

df_2 = (df
  .groupby('col_1')
  .agg(
    col_2_first = ('col_2', 'first'),
    col_2_last = ('col_2', 'last'),
    col_3_2nd = ('col_3', lambda x: x.iloc[1])
  ))

Here is how this works:

  • We call agg() on the output of groupby() to have the data aggregation operations be carried out for each group. See Summary Table.
  • As their names imply, given a vector of values:
    • first() returns the first occurring value.
    • last() returns the last occurring value.
  • To refer to the Pandas convenience methods implemented by GroupBy, agg() expects us to pass a string representation of their names, e.g. in this case we use 'first' to refer to the method GroupBy.first().
  • In lambda x: x.iloc[1], we use a lambda function to pass arguments to a function. This is the way to pass arguments when using the “Named Aggregation” style of agg().
  • There is an nth() method that we cover in the alternative solution below.
  • It is often important to sort the data in some way before carrying out an aggregation by position operation; see the sketch after this list.
  • For a more detailed coverage of position related operations, see Position.
  • Note: here we aggregate the values of individual columns in a group; to filter entire rows in a group based on their position, see Filtering Groups by Position.
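
Here is a minimal sketch of sorting before a positional aggregation; col_0 is a hypothetical column that defines the desired order:

df_2 = (df
  .sort_values('col_0')   # col_0: hypothetical ordering column, e.g. a timestamp
  .groupby('col_1')
  .agg(
    col_2_first = ('col_2', 'first'),
    col_2_last = ('col_2', 'last')
  ))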

Alternatively:

df_2 = pd.DataFrame()
df_g = df.groupby('col_1')
df_2['col_2_first'] = df_g['col_2'].first()
df_2['col_2_last'] = df_g['col_2'].last()
df_2['col_3_2nd'] = df_g['col_3'].nth(1)

Here is how this works:

  • The functions are as in the solution above, except that we use the nth() method, which we can call directly here because we are working on a grouped Series (SeriesGroupBy object) df_g['col_3'].
  • For the rest of this code snippet, see the description of the alternative solution under Counting above.

String

We wish to perform data aggregation operations on a column of a string data type.

In this example, we wish to summarize the data frame df over groups defined by the column col_1. The aggregation operations we wish to carry out are: (1) Concatenate the unique values of the column col_2 per group into a comma separated string. (2) Obtain the last value of col_3 for each group that matches a given regular expression.

df_2 = (df
  .groupby('col_1')
  .agg(
    col_2_distinct = ('col_2', lambda x: ','.join(x.unique())),
    col_3_match = ('col_3',
                   lambda x: x[x.str.contains(r'\d$')].iloc[-1])
  ))

Here is how this works:

  • We call agg() on the output of groupby() to have the data aggregation operations be carried out for each group. See Summary Table.
  • In lambda x: ','.join(x.unique()) we combine (concatenate) the unique values of a string column for each group into one comma separated string. The two functions we used are:
    • unique() returns the unique values. See Uniqueness.
    • join() combines multiple strings into one, separated by the string on which it is called, in this case ','.
  • In lambda x: x[x.str.contains(r'\d$')].iloc[-1] we return the last value of the string column in each group that matches the regular expression r'\d$', i.e. ends in a digit.
    • str.contains() returns a logical Series that is True where the value of the string Series matches the given pattern.
    • x[x.str.contains(r'\d$')] returns a Series that is a subset of the input Series containing only the elements that match the given pattern.
    • iloc[-1] returns the last occurring value of a given Series.
  • The data type needs to be string for this to work. If we wish to apply string operations, e.g. concatenation, to a column of a numerical data type, we need to convert it to string first via x.astype(str). See Type Casting and the sketch after this list.
  • For a coverage of string operations (including regular expressions), see String Operations.
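
Here is a minimal sketch of casting before concatenating; col_4 is a hypothetical numeric column:

df_2 = (df
  .groupby('col_1')
  .agg(
    # col_4: hypothetical numeric column, cast to string before joining
    col_4_distinct = ('col_4', lambda x: ','.join(x.astype(str).unique()))
  ))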

Logical

We wish to perform data aggregation operations on a column of a logical data type; i.e. one that takes a value of True or False.

In this example, for each group, where the groups are defined by the value of the column col_1, we wish to obtain the number and ratio of rows where the column col_2 is NA.

df_2 = (df
  .groupby('col_1')
  .agg(
    col_2_na_count = 
      ('col_2', lambda x: x.isna().sum()),
    col_2_na_rate = 
      ('col_2', lambda x: x.isna().mean())
  ))

Here is how this works:

  • We call agg() on the output of groupby() to have the data aggregation operations be carried out for each group. See Summary Table.
  • We use lambda functions passed to agg() to realize the aggregation logic. See Custom Operations below.
  • For each group, we apply the function isna() to the values of the column col_2. We then use sum() to compute the number of missing values in each group and mean() to compute the ratio of missing values in each group.
  • Summing a boolean vector is effectively counting the number of True values (which in this case is the number of missing values) because when arithmetic operations are carried out on logical vectors, True is regarded as 1 and False is regarded as 0.
  • Similarly, taking the mean of a logical vector is effectively taking the ratio between the number of True values and the total number of values in the input vector.
  • For coverage of logical operations, see Logical Operations.
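
Here is a minimal sketch with a hypothetical data frame containing missing values:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col_1': ['a', 'a', 'b', 'b'],
    'col_2': [1.0, np.nan, np.nan, np.nan]})

df_2 = (df
  .groupby('col_1')
  .agg(
    col_2_na_count = ('col_2', lambda x: x.isna().sum()),
    col_2_na_rate = ('col_2', lambda x: x.isna().mean())
  ))

# For group 'a': 1 of 2 values is NA -> count 1, rate 0.5.
# For group 'b': 2 of 2 values are NA -> count 2, rate 1.0.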

Alternatively:

df_2 = pd.DataFrame()
df_g = df.groupby('col_1')
df_2['col_2_na_count'] = \
    df_g['col_2'].agg(lambda x: x.isna().sum())
df_2['col_2_na_rate'] = \
    df_g['col_2'].agg(lambda x: x.isna().mean())

Here is how this works:

  • We pass the lambda functions to agg(). However, in this solution, agg() is applied directly to the grouped Series holding the column we wish to aggregate.
  • For the rest of the code, see the description of the alternative solution under Counting above.

Date Time

We wish to group a data frame by some unit or interval of time, e.g. by year or by weekday.

In this example, we wish to group the data frame df by weekday inferred from the date-time column col_1, where col_1 is provided as a string. We then wish to compute the number of rows in each group and the number of unique hours that exist in each group.

df_2 = (df
  .assign(
    col_1 = pd.to_datetime(df['col_1']),
    col_1_wd = lambda x: x['col_1'].dt.weekday)
  .groupby('col_1_wd')
  .agg(
    count = ('col_2', 'size'),
    hour_count = ('col_1', lambda x: x.dt.hour.nunique())
  ))

Here is how this works:

  • We use assign() to execute two data transformations:
    • In pd.to_datetime(df['col_1']) we convert the string representation of date-time that we originally have in the column col_1 into a datetime object. We save the output as col_1, overwriting the original string column.
    • In lambda x: x['col_1'].dt.weekday we identify the weekday for each date-time value of col_1. We save the output as a new column col_1_wd.
  • We then call groupby() and pass the name of the newly created column 'col_1_wd' to create one group for each weekday.
  • We call agg() on the output of groupby() to have the data aggregation operations be carried out for each group. See Summary Table.
  • In lambda x: x.dt.hour.nunique(), we extract the hour component of each datetime value for the current group and then count the number of unique hours in each group.
  • In addition to groupby(), Pandas provides two powerful approaches, resample() and pd.Grouper(), that offer fine-grained control over grouping by date-time across any time span. We cover those along with date-time operations in Date Time Operations; a brief sketch of pd.Grouper() follows this list.
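
Here is a minimal sketch of pd.Grouper(), assuming col_1 has already been converted to a datetime data type; it counts the rows that fall within each calendar week:

df_2 = (df
  .groupby(pd.Grouper(key='col_1', freq='W'))  # one group per calendar week
  .size())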

Alternatively:

df_c = df.copy()
df_c['col_1'] = pd.to_datetime(df['col_1'])
df_2 = pd.DataFrame()
df_g = df_c.groupby(df_c['col_1'].dt.weekday)
df_2['count'] = \
    df_g['col_2'].size()
df_2['hour_count'] = \
    df_g['col_1'].agg(lambda x: x.dt.hour.nunique())

Here is how this works:

  • We create a deep copy of df which we call df_c. It is generally good practice not to alter the input data. This can be skipped if we wish to modify the input data frame df.
  • We then transform df_c by converting the data type of col_1 from string (object) to datetime (datetime64[ns]).
  • We extract the weekday from col_1 inside groupby().
  • The data aggregation operations are the same as the core solution above and for the rest of the code, see the description of the alternative solution under Counting above.

Custom Operations

We wish to perform a custom aggregation operation (as opposed to using a built-in function).

There are two approaches to performing custom aggregation operations:

  • Via a custom function that we call from within agg() or apply(). This is appropriate for aggregations that we need to carry out more than once or aggregations that are relatively complex.
  • Via a lambda function inside agg() or apply(). This is appropriate for simple logic that we need to carry out once or twice.

We will look at both approaches in the context of one column and multiple column aggregation operations below.

One Column Operations

In this example, for each group of the data frame df, where the groups are defined by the values of the column col_1, we wish to: (1) Compute the mode, i.e. the most frequently occurring value, of the rounded values of the floating point column col_2. (2) Compute the sum of the trailing numerical substrings of the values of the string column col_3.

def str_sum(p_col):
    col_sum = (p_col
               .str.extract(r'(\d+$)', expand=False)
               .astype(int)
               .sum())
    return col_sum

df_2 = (df
    .groupby('col_1')
    .agg(
        col_2_rm = ('col_2', lambda x: x.round().mode().iloc[0]),
        col_3_sum = ('col_3', str_sum)
    ))

Here is how this works:

  • We call agg() on the output of groupby() to have the data aggregation operations be carried out for each group where the groups are defined by the value of the column ‘col_1’. See Summary Table.
  • We use a lambda function to carry out the first aggregation. The lambda function rounds the values of col_2, takes the mode of the rounded values for each group, then takes the first mode via iloc[0] so that the aggregation reduces to a single value (mode() returns a Series because there can be ties). The mode is the most frequently occurring value in a vector of values.
  • We use a custom function str_sum() to carry out the second aggregation. The custom function str_sum() takes a Pandas Series of a string data type as input and returns the sum of the trailing numbers extracted from the string values in the Series. It works as follows:
    • The function uses the str.extract() method to extract from each value the trailing numerical substring that matches the given regular expression. See String Operations.
    • It then converts the extracted strings to integers using astype(). See Type Casting.
    • Finally, it computes and returns the sum of all the integers using the sum() method.
  • The resulting summary dataframe is stored in df_2.
  • As mentioned above, custom functions are preferred for more involved logic or logic that we need to carry out multiple times.
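
Here is a minimal sketch of str_sum() on a hypothetical Series of strings that end in digits:

import pandas as pd

s = pd.Series(['item_12', 'item_3'])   # hypothetical input values
str_sum(s)                             # extracts 12 and 3 and returns 15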

Alternatively:

def str_sum(p_col):
    col_sum = (p_col
               .str.extract(r'(\d+$)', expand=False)
               .astype(int)
               .sum())
    return col_sum

df_2 = pd.DataFrame()
df_g = df.groupby('col_1')
df_2['col_2_rm'] = \
    df_g['col_2'].agg(lambda x: x.round().mode().iloc[0])
df_2['col_3_sum'] = df_g['col_3'].agg(str_sum)

Here is how this works:

  • This code is similar to the primary solution above, but it uses the lambda function and the custom function str_sum() in a different way, i.e. as functions passed directly to agg() on grouped Series instead of as named aggregations.
  • For the rest of the code, see the description of the alternative solution under Counting above.

Multiple Column Operations

In this example, for each group of the data frame df, where the groups are defined by the values of the column col_1, we wish to: (1) Compute the sum of the products of the columns col_2 and col_3. (2) Compute the weighted average of col_2 with weights given by the log of col_3. (3) Compute the minimum value of each row, then the maximum of those row minimums for each group.

def log_weight_avg(p_x, p_w):
    w = np.log10(p_w)
    lwa = np.average(p_x, weights=w)
    return lwa

def max_mins(p_df):
    m_m = (p_df
           .select_dtypes('number')
           .min(axis=1)
           .max())
    return m_m

df_2 = (df
  .groupby('col_1')
  .apply(lambda x: pd.Series({
    'col_2_3_ws': (x['col_2'] * x['col_3']).sum(),
    'col_2_3_swa': log_weight_avg(x['col_2'], x['col_3']),
    'max_mins': max_mins(x)
    })
  ))

Here is how this works:

  • As described in Summary Table:
    • We call apply() on the output of groupby() to have the data aggregation operations be carried out for each group.
    • We use apply() because we wish to perform data aggregation operations that involve multiple columns.
    • We have apply() return a Series for each group, so we may create multiple columns.
  • We use a lambda function to carry out the first aggregation. The lambda function computes the sum of the products of the values of the columns col_2 and col_3 for each group.
  • We use a custom function log_weight_avg() to compute the second aggregation.
    • The log_weight_avg() function takes two inputs: a Pandas Series of values p_x and a Pandas Series of weights p_w.
    • The function first takes the log base 10 of p_w using the log10 function from NumPy and stores the result in the w variable.
    • It then uses the average function from NumPy to compute and return the weighted average of p_x using the weights in w.
  • We use a custom function max_mins() to compute the third aggregation.
    • This function takes as input the entire group as a data frame.
    • The function first uses the select_dtypes() method to select only the numeric columns in p_df.
    • It then uses the min() method with the axis parameter set to 1 to compute the minimum value of each row.
    • Finally, it uses the max() method to compute the maximum of those row minimums and returns the result.
  • Note the difference between the inputs of the two functions log_weight_avg() and max_mins(). The first is generic and accepts just the input columns it needs to perform its operation while the second accepts the entire group as a data frame.
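
Here is a minimal sketch of log_weight_avg() on hypothetical inputs:

import numpy as np
import pandas as pd

x = pd.Series([1.0, 3.0])       # hypothetical values
w = pd.Series([10.0, 100.0])    # hypothetical raw weights
log_weight_avg(x, w)            # log10 weights are 1 and 2, so the result
                                # is (1*1 + 3*2) / (1 + 2) = 7/3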

Alternatively:

def log_weight_avg(p_x, p_w):
    w = np.log10(p_w)
    lwa = np.average(p_x, weights=w)
    return lwa

def max_mins(p_df):
    m_m = (p_df
           .select_dtypes('number')
           .min(axis=1)
           .max())
    return m_m

df_2 = pd.DataFrame()
df_g = df.groupby('col_1')
df_2['col_2_3_ws'] = \
    df_g.apply(lambda x: (x['col_2'] * x['col_3']).sum())
df_2['col_2_3_swa'] = \
    df_g.apply(lambda x: log_weight_avg(x['col_2'], x['col_3']))
df_2['max_mins'] = \
    df_g.apply(max_mins)

Here is how this works:

  • This code is similar to the primary solution above, but it uses the lambda function and the custom functions log_weight_avg() and max_mins() in a different way, i.e. as individual aggregations instead of as one call to apply().
  • For the rest of the code, see the description of the alternative solution under Counting above.