Summarize Numeric by Factor

We wish to get a summary of a numeric column (e.g. the mean and standard deviation) for each group where the groups are defined by the values of a categorical column.

Column Summary

While we can explicitly compute all the common summary statistics for a numeric column over groups (see below), it would be efficient during data inspection to use a single function that given a numeric column and one or more grouping columns, computes the common summary statistics over groups.

In this example, we group by col_1 then compute the common summary statistic of a numeric column named col_2.

df.groupby('col_1')['col_2'].describe()

Here is how this works:

We apply groupby() to df to create a grouped data frame where the grouping variable is col_1.
We then select the numeric column of interest col_2 via the bracket operator [’col_2’].
We then apply describe() to the grouped Series object returned by the bracket operator.
describe() returns the common summary statistics of numeric variables which are count mean std and the common percentiles.

Mean

We wish to compute the mean of a numerical column over groups defined by one categorical column.

In this example, we wish to compute the mean of the numeric column col_2 for each group where the groups are defined by the values of the categorical column col_1.

df.groupby('col_1')['col_2'].mean()

Here is how this works:

We apply groupby() to df to create a grouped data frame where the grouping variable is col_1.
We then select the numeric column of interest col_2 via the bracket operator [’col_2’].
We then apply mean() to the grouped Series object returned by the bracket operator to return the mean of the numeric variable col_2 for each group defined by the variable col_1.
To round the output to say two digits we can add .round(2).
See Summary Statistics for how to compute all the common summary statistics.

Sum

We wish to compute the sum of values of a numerical column over groups defined by one categorical column.

In this example, we wish to compute the sum of the values of the numeric column col_2 for each group where the groups are defined by the values of the categorical column col_1.

df.groupby('col_1')['col_2'].sum()

Here is how this works:

This works similarly as above but we use sum() instead of mean().

Proportion

We wish to obtain the ratio between the sum of values of a numeric variable for each group to the total sum of values of the numeric variable where the groups are defined by a grouping variable.

In this example, we compute the ratio of the sum of values of a numeric column col_2 for each group defined by col_1 to the total sum of values of col_2.

df.groupby('col_1')['col_2'].sum() / df['col_2'].sum()

Here is how this works:

We compute the sum of values of col_2 for each group where the groups are defined by col_1 as we did above.
We then divide by the sum of values of col_2 computed via df['col_2'].sum().

Optima.io Reference beta

Summarize Numeric by Factor

Column Summary

Mean

Sum

Proportion