Summarize Numeric by 2 Factors

We wish to get a summary of a numeric column (e.g. the mean and standard deviation) for each group where the groups are defined by the combination of values of two categorical columns.

Column Summary

While we can explicitly compute all the common summary statistics for a numeric column over groups (see below), it would be efficient during data inspection to use a single function that given a numeric column and one or more grouping columns, computes the common summary statistics over groups.

df.groupby(['col_1', 'col_2'])['col_3'].describe()

Here is how this works:

  • We apply groupby() to the data frame df to create a grouped data frame where the grouping variables are [col_1, col_2].
  • We then select the numeric column of interest col_3 via the bracket operator [’col_3’].
  • We then apply describe() to the grouped Series object returned by the bracket operator.
  • describe() returns the common summary statistics of numeric variables which are count mean std and the common percentiles.

Mean

We wish to compute the mean of a numerical column over groups defined by two categorical columns.

In this example, we wish to compute the mean of the numeric column col_3 for each group where the groups are defined by the values of the two categorical columns col_1 and col_2.

pd.crosstab(df['col_1'], df['col_2'], 
    values = df['col_3'], 
    aggfunc = 'mean')

Here is how this works:

  • We recommend the use of crosstab() to summarize multiple variables in data inspection contexts.
  • crosstab() is capable of aggregating a numeric variable over grouping established by two (or more) categorical variables.
  • The arguments of crosstab() are as follows (in order):
    • The grouping variable to use for the rows of cross table; here df['col_1'].
    • The grouping variable to use for the columns of the cross table; here df['col_2'].
    • The numeric column to summarize in the cells of the cross table; here values = df['col_3'].
    • The aggregation function to use; here aggfun = “mean”.
  • Any aggregation function can be passed to the aggfunc argument of crosstab(). See Summary Statistics for the common summary statistics functions.
  • Here are two useful extensions to make this easier to inspect visually:
    • To round the output to say two digits we can add .round(2).
    • Sometimes having many nan values can make inspecting the output a little hard on the eye. It often helps to add .fillna(’-’) to replace nans with a dash making the actual numbers visually standout.

Sum

We wish to compute the sum of a numerical column over groups defined by two categorical columns.

In this example, we wish to compute the sum of the numeric column col_3 for each group where the groups are defined by the values of the two categorical columns col_1 and col_2.

pd.crosstab(df['col_1'], df['col_2'], 
    values = df['col_3'], 
    aggfunc = "sum")\
    .fillna(0)

Here is how this works:

  • This works similarly as above except that the aggregation function aggfunc argument of crosstab() is set to "sum".
  • We apply the function fillna(0) so that combinations of values of col_1 and col_2 that do not exist in the data would get a sum of 0 and not be missing NaN.

Proportion

We wish to obtain the ratio between the sum of values of a numeric variable for each group to the total sum of values of the numeric variable where the groups are defined by two categorical columns.

In this example, we compute the ratio of the sum of values of a numeric column col_3 for each group defined by col_1 and col_2 to the total sum of values of col_3.

pd.crosstab(df['col_1'], df['col_2'], 
    values = df['col_3'], 
    aggfunc = 'sum', 
    normalize='all')

Here is how this works:

  • We use the crosstab() function to compute the sum of the numerical col_3 against the two grouping columns col_1 and col_2.
  • We set the normalize argument of the crosstab() function to ‘all’ so it normalizes the sums computed earlier over the total sum of col_3.
  • To normalize across rows we set normalize=’index’ and to normalize across columns we set normalize=’columns’.

Multiple Functions

We wish to compute multiple aggregation functions on a numeric column for each group where the groups are defined by two columns.

In this example, we compute sum and mean for a numeric column col_3 for each group where the groups are defined by col_1 and col_2.

pd.crosstab(df['col_1'], df['col_2'], 
    values = df['col_3'], 
    aggfunc = ['sum', 'mean'])

Here is how this works:

  • We use crosstab() as described above with the exception that we pass a list of two functions to the argument aggfunc like so aggfunc = ['sum', 'mean'].
  • The output has a MultiIndex for the columns where the top level is for the functions, which are here sum and mean, and the second level is for the values of the column(s) passed to the second argument of crosstab(), which is here col_2.
PYTHON
I/O