Summarize Numeric by Factor

We wish to get a summary of a numeric column (e.g. the mean and standard deviation) for each group where the groups are defined by the values of a categorical column.

Column Summary

While we can explicitly compute all the common summary statistics for a numeric column over groups (see below), it would be efficient during data inspection to use a single function that given a numeric column and one or more grouping columns, computes the common summary statistics over groups.

In this example, we group by col_1 then compute the common summary statistic of a numeric column named col_2.

df.groupby('col_1')['col_2'].describe()

Here is how this works:

  • We apply groupby() to df to create a grouped data frame where the grouping variable is col_1.
  • We then select the numeric column of interest col_2 via the bracket operator [’col_2’].
  • We then apply describe() to the grouped Series object returned by the bracket operator.
  • describe() returns the common summary statistics of numeric variables which are count mean std and the common percentiles.

Mean

We wish to compute the mean of a numerical column over groups defined by one categorical column.

In this example, we wish to compute the mean of the numeric column col_2 for each group where the groups are defined by the values of the categorical column col_1.

df.groupby('col_1')['col_2'].mean()

Here is how this works:

  • We apply groupby() to df to create a grouped data frame where the grouping variable is col_1.
  • We then select the numeric column of interest col_2 via the bracket operator [’col_2’].
  • We then apply mean() to the grouped Series object returned by the bracket operator to return the mean of the numeric variable col_2 for each group defined by the variable col_1.
  • To round the output to say two digits we can add .round(2).
  • See Summary Statistics for how to compute all the common summary statistics.

Sum

We wish to compute the sum of values of a numerical column over groups defined by one categorical column.

In this example, we wish to compute the sum of the values of the numeric column col_2 for each group where the groups are defined by the values of the categorical column col_1.

df.groupby('col_1')['col_2'].sum()

Here is how this works:

This works similarly as above but we use sum() instead of mean().

Proportion

We wish to obtain the ratio between the sum of values of a numeric variable for each group to the total sum of values of the numeric variable where the groups are defined by a grouping variable.

In this example, we compute the ratio of the sum of values of a numeric column col_2 for each group defined by col_1 to the total sum of values of col_2.

df.groupby('col_1')['col_2'].sum() / df['col_2'].sum()

Here is how this works:

  • We compute the sum of values of col_2 for each group where the groups are defined by col_1 as we did above.
  • We then divide by the sum of values of col_2 computed via df['col_2'].sum().
PYTHON
I/O