We wish to get a summary of a numeric column (e.g. the mean and standard deviation) for each group where the groups are defined by the values of a categorical column.
While we can explicitly compute all the common summary statistics for a numeric column over groups (see below), it would be efficient during data inspection to use a single function that given a numeric column and one or more grouping columns, computes the common summary statistics over groups.
In this example, we group by col_1
then compute the common summary statistic of a numeric column named col_2
.
df.groupby('col_1')['col_2'].describe()
Here is how this works:
groupby()
to df
to create a grouped data frame where the grouping variable is col_1
.col_2
via the bracket operator [’col_2’]
.describe()
to the grouped Series
object returned by the bracket operator.describe()
returns the common summary statistics of numeric variables which are count mean std and the common percentiles.We wish to compute the mean of a numerical column over groups defined by one categorical column.
In this example, we wish to compute the mean of the numeric column col_2
for each group where the groups are defined by the values of the categorical column col_1
.
df.groupby('col_1')['col_2'].mean()
Here is how this works:
groupby()
to df
to create a grouped data frame where the grouping variable is col_1
.col_2
via the bracket operator [’col_2’]
.mean()
to the grouped Series
object returned by the bracket operator to return the mean of the numeric variable col_2
for each group defined by the variable col_1
..round(2)
.We wish to compute the sum of values of a numerical column over groups defined by one categorical column.
In this example, we wish to compute the sum of the values of the numeric column col_2
for each group where the groups are defined by the values of the categorical column col_1
.
df.groupby('col_1')['col_2'].sum()
Here is how this works:
This works similarly as above but we use sum()
instead of mean()
.
We wish to obtain the ratio between the sum of values of a numeric variable for each group to the total sum of values of the numeric variable where the groups are defined by a grouping variable.
In this example, we compute the ratio of the sum of values of a numeric column col_2
for each group defined by col_1
to the total sum of values of col_2
.
df.groupby('col_1')['col_2'].sum() / df['col_2'].sum()
Here is how this works:
col_2
for each group where the groups are defined by col_1
as we did above.df['col_2'].sum()
.