We wish to get a summary of a numeric column (e.g. the mean and standard deviation) for each group where the groups are defined by the combination of values of two categorical columns.
While we can explicitly compute all the common summary statistics for a numeric column over groups (see below), it would be efficient during data inspection to use a single function that given a numeric column and one or more grouping columns, computes the common summary statistics over groups.
df.groupby(['col_1', 'col_2'])['col_3'].describe()
Here is how this works:
groupby()
to the data frame df
to create a grouped data frame where the grouping variables are [col_1, col_2]
.col_3
via the bracket operator [’col_3’]
.describe()
to the grouped Series
object returned by the bracket operator.describe()
returns the common summary statistics of numeric variables which are count mean std and the common percentiles.We wish to compute the mean of a numerical column over groups defined by two categorical columns.
In this example, we wish to compute the mean of the numeric column col_3
for each group where the groups are defined by the values of the two categorical columns col_1
and col_2.
pd.crosstab(df['col_1'], df['col_2'],
values = df['col_3'],
aggfunc = 'mean')
Here is how this works:
crosstab()
to summarize multiple variables in data inspection contexts.crosstab()
is capable of aggregating a numeric variable over grouping established by two (or more) categorical variables.crosstab()
are as follows (in order):df['col_1']
.df['col_2']
.values = df['col_3']
.aggfun = “mean”
.aggfunc
argument of crosstab()
. See Summary Statistics for the common summary statistics functions..round(2)
.nan
values can make inspecting the output a little hard on the eye. It often helps to add .fillna(’-’)
to replace nans with a dash making the actual numbers visually standout.We wish to compute the sum of a numerical column over groups defined by two categorical columns.
In this example, we wish to compute the sum of the numeric column col_3
for each group where the groups are defined by the values of the two categorical columns col_1
and col_2.
pd.crosstab(df['col_1'], df['col_2'],
values = df['col_3'],
aggfunc = "sum")\
.fillna(0)
Here is how this works:
aggfunc
argument of crosstab()
is set to "sum"
.fillna(0)
so that combinations of values of col_1
and col_2
that do not exist in the data would get a sum of 0
and not be missing NaN
.We wish to obtain the ratio between the sum of values of a numeric variable for each group to the total sum of values of the numeric variable where the groups are defined by two categorical columns.
In this example, we compute the ratio of the sum of values of a numeric column col_3
for each group defined by col_1
and col_2
to the total sum of values of col_3
.
pd.crosstab(df['col_1'], df['col_2'],
values = df['col_3'],
aggfunc = 'sum',
normalize='all')
Here is how this works:
crosstab()
function to compute the sum of the numerical col_3
against the two grouping columns col_1
and col_2
.normalize
argument of the crosstab()
function to ‘all’
so it normalizes the sums computed earlier over the total sum of col_3
.normalize=’index’
and to normalize across columns we set normalize=’columns’
.We wish to compute multiple aggregation functions on a numeric column for each group where the groups are defined by two columns.
In this example, we compute sum and mean for a numeric column col_3
for each group where the groups are defined by col_1
and col_2
.
pd.crosstab(df['col_1'], df['col_2'],
values = df['col_3'],
aggfunc = ['sum', 'mean'])
Here is how this works:
crosstab()
as described above with the exception that we pass a list of two functions to the argument aggfunc
like so aggfunc = ['sum', 'mean']
.MultiIndex
for the columns where the top level is for the functions, which are here sum
and mean
, and the second level is for the values of the column(s) passed to the second argument of crosstab(),
which is here col_2
.