In this page, we show how to identify the columns to which a data aggregation operation is applied. We will cover the following scenarios:
- Aggregating over all columns
- Aggregating over a set of explicitly selected columns
- Aggregating over a set of implicitly selected columns
- Aggregating over all but a set of excluded columns
Note: In this page, we demonstrate data aggregation in a context where a data frame is grouped by one column via groupby(). It is also possible to aggregate over the entire data frame without grouping, in which case the output will be a Series. See Basic Aggregation.

We wish to apply a data aggregation operation to each column of a data frame.
In this example, we have a data frame df and we wish to count the number of unique values in each column for each group, where the groups are defined by the values of the column col_1.
df_2 = df.groupby('col_1').agg('nunique')
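To make this concrete, here is a minimal sketch on a small hypothetical data frame (the data and the columns col_2 and col_3 below are illustrative, not from the original example):

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    'col_1': ['a', 'a', 'b', 'b'],
    'col_2': [1, 1, 2, 3],
    'col_3': ['x', 'y', 'x', 'x'],
})

# Count unique values in every non-grouping column, per group
df_2 = df.groupby('col_1').agg('nunique')

# Group 'a' has one unique value in col_2 (1, 1) and two in col_3 ('x', 'y');
# group 'b' has two unique values in col_2 (2, 3) and one in col_3 ('x', 'x')
print(df_2)
```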
Here is how this works:
- The groupby() method groups the rows of the data frame df by the values in the column col_1. The data aggregation operations applied by the subsequent call to agg() are, therefore, applied to each group separately. See Summary Table.
- We call agg() on the entire grouped data frame (DataFrameGroupBy object) produced by df.groupby('col_1'), which means that the data aggregation operation passed to agg(), in this case nunique(), is applied to each column in each group separately.
- The output is a data frame df_2 with one row for each unique value of the column col_1, and a column for each column in the original df data frame containing the number of unique values in that column for the corresponding group of rows.

Alternative: Via Pandas Convenience Function
df_2 = df.groupby('col_1').nunique()
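A quick sketch, on hypothetical data, showing that the convenience method produces the same result as the agg() form:

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    'col_1': ['a', 'a', 'b'],
    'col_2': [1, 2, 2],
})

# The convenience method and the agg() form are equivalent
via_convenience = df.groupby('col_1').nunique()
via_agg = df.groupby('col_1').agg('nunique')
assert via_convenience.equals(via_agg)
```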
Here is how this works:
- This alternative uses the convenience method nunique() instead of agg().
- The object returned by groupby() (a DataFrameGroupBy object) has a set of convenience methods that carry out the most common data aggregation operations and can be used instead of agg(). One such method is nunique(). Pandas applies the method to each column for each group behind the scenes. See Common Aggregation Operations.

We wish to apply a data aggregation operation to each of a set of explicitly selected columns.
In this example, we have a data frame df
and we wish to count the number of unique values of the columns col_2
and col_4
for each group, where the groups are defined by the values of the column col_1
.
df_2 = (df
.groupby('col_1')[['col_2', 'col_4']]
.agg('nunique'))
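As an illustration on a small hypothetical data frame (the column col_3 below exists only to show that unselected columns are dropped):

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    'col_1': ['a', 'a', 'b'],
    'col_2': [1, 2, 2],
    'col_3': [10, 10, 10],   # not selected, absent from the output
    'col_4': ['x', 'x', 'y'],
})

df_2 = (df
        .groupby('col_1')[['col_2', 'col_4']]
        .agg('nunique'))

# Only col_2 and col_4 appear in the output
print(df_2)
```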
Here is how this works:
- The groupby() method groups the rows of the data frame df by the values in the column col_1. The data aggregation operations applied by the subsequent call to agg() are, therefore, applied to each group separately. See Summary Table.
- In ['col_2', 'col_4'], we use the bracket operator to select the desired columns by their name. It is worth noting that we can't use the selector loc[] on the output of groupby(). See Basic Selection for a coverage of explicit column selection scenarios.
- agg() applies the function nunique() to the selected columns (col_2 and col_4) for each group.
- The output is a data frame df_2 with one row for each unique value of the column col_1, and two columns col_2 and col_4 containing the number of unique values in the respective column for the corresponding group of rows.

Alternative: Select then Group
df_2 = (df
.loc[:, ['col_1', 'col_2', 'col_4']]
.groupby('col_1')
.nunique())
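A minimal sketch on hypothetical data; note how the grouping column col_1 must be part of the selection and ends up as the index of the result:

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    'col_1': ['a', 'a', 'b'],
    'col_2': [1, 2, 2],
    'col_3': [0, 0, 0],      # excluded by the selection
    'col_4': ['x', 'y', 'y'],
})

df_2 = (df
        .loc[:, ['col_1', 'col_2', 'col_4']]
        .groupby('col_1')
        .nunique())

# col_1 becomes the index; only col_2 and col_4 are aggregated
print(df_2)
```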
Here is how this works:
- We first select the columns of interest via loc[], then group via groupby(). We must include the grouping column, here col_1, in our selection.
- When we apply the data aggregation operation, here nunique(), the columns available for aggregation are the selected non-grouping columns, i.e. col_2 and col_4.

We wish to apply a data aggregation operation to each column in a set of implicitly selected columns. Implicit column selection is when we do not spell out the column names or positions explicitly but rather identify the columns via a property of their name or their data.
In this example, we have a data frame df
and we wish to compute the sum of values of all numerical columns for each group, where the groups are defined by the values of the column col_1
.
selected_cols = df.select_dtypes('number').columns
df_2 = (df
.groupby('col_1')[selected_cols]
.agg('sum'))
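Here is a minimal sketch on hypothetical data; col_1 (the grouping column) and col_3 are non-numeric, so select_dtypes() leaves them out of the aggregation:

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    'col_1': ['a', 'a', 'b'],
    'col_2': [1, 2, 3],
    'col_3': ['x', 'y', 'z'],  # non-numeric, excluded by select_dtypes
    'col_4': [0.5, 0.5, 1.0],
})

# Identify the numeric columns, then sum them per group
selected_cols = df.select_dtypes('number').columns
df_2 = (df
        .groupby('col_1')[selected_cols]
        .agg('sum'))

print(df_2)
```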
Here is how this works:
- In df.select_dtypes('number').columns, we select all columns whose data type is numeric (i.e. integer or floating point). See Implicit Selection for a coverage of the most common scenarios of implicit column selection, including by name pattern, by data type, and by criteria satisfied by the column's data.
- The groupby() method groups the rows of the data frame df by the values in the column col_1. The data aggregation operations applied by the subsequent call to agg() are, therefore, applied to each group separately. See Summary Table.
- In [selected_cols], we use the bracket operator to select, for each group, the columns that were identified earlier.
- sum() computes the sum for each of the selected columns for each group.
- The output is a data frame df_2, with one row for each unique value in the col_1 column, and a column for each numeric column in the original df data frame containing the sum of the values in that column for the corresponding group of rows.

Alternative: Via Pandas Convenience Function
selected_cols = df.select_dtypes('number').columns
df_2 = df.groupby('col_1')[selected_cols].sum()
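As with nunique() earlier, a quick sketch on hypothetical data confirming that the convenience method matches the agg() form:

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    'col_1': ['a', 'a', 'b'],
    'col_2': [1, 2, 3],
    'col_3': [10.0, 20.0, 30.0],
})

selected_cols = df.select_dtypes('number').columns
via_convenience = df.groupby('col_1')[selected_cols].sum()
via_agg = df.groupby('col_1')[selected_cols].agg('sum')
assert via_convenience.equals(via_agg)
```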
Here is how this works:
- This alternative uses the convenience method sum() instead of agg().
- The object returned by groupby() (a DataFrameGroupBy object) has a set of convenience methods that carry out the most common data aggregation operations and can be used instead of agg(). One such method is sum(). Pandas applies the method to each column for each group behind the scenes. See Common Aggregation Operations.

Alternative: Select then Group
In this example, we have a data frame df
and we wish to compute the sum of values of columns whose names start with ‘x_’
for each group, where the groups are defined by the values of the column col_1
.
df_2 = (df
.loc[:, df.columns.str.startswith('x_')]
.groupby('x_col_1')
.sum())
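A minimal sketch on hypothetical data; here the grouping column x_col_1 is itself matched by the name-pattern selection, while y_col is dropped:

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    'x_col_1': ['a', 'a', 'b'],
    'x_col_2': [1, 2, 3],
    'y_col':   [9, 9, 9],   # name does not start with 'x_', excluded
})

df_2 = (df
        .loc[:, df.columns.str.startswith('x_')]
        .groupby('x_col_1')
        .sum())

print(df_2)
```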
Here is how this works:
- We first select the columns of interest via loc[], then group via groupby(). In this example, the grouping column happens to be returned by the same implicit column selection operation.
- When we apply the data aggregation operation, here sum(), the columns available for aggregation are the selected columns other than the grouping column x_col_1.

We wish to apply a data aggregation operation to all but a set of excluded columns.
In this example, we have a data frame df
and for each group, we wish to count the number of unique values of all columns except col_2
and col_4
, where the groups are defined by the values of the column col_1
.
selected_cols = \
df.columns.difference(['col_2', 'col_4'])
df_2 = (df
.groupby('col_1')[selected_cols]
.agg('nunique'))
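A minimal sketch on hypothetical data; note that in this sketch we also list the grouping column col_1 in the exclusion set, so that it appears only as the group index of the output:

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    'col_1': ['a', 'a', 'b'],
    'col_2': [1, 2, 3],
    'col_3': ['x', 'y', 'y'],
    'col_4': [0, 0, 1],
})

# Keep every column except col_2 and col_4 (and the grouping column col_1)
selected_cols = df.columns.difference(['col_1', 'col_2', 'col_4'])
df_2 = (df
        .groupby('col_1')[selected_cols]
        .agg('nunique'))

# Only col_3 remains to be aggregated
print(df_2)
```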
Here is how this works:
- In df.columns.difference(['col_2', 'col_4']), we identify, by name, the columns we wish to exclude. See Exclude Columns for a coverage of column exclusion scenarios.
- The output is a data frame df_2, with one row for each unique value in the column col_1, and a column for each column of the original data frame df excluding col_2 and col_4 (and col_1) containing the number of unique values in that column for the corresponding group of rows.