We wish to get a summary of a numeric column (e.g. the mean and standard deviation) for each group where the groups are defined by the combination of values of two categorical columns.
While we can explicitly compute all the common summary statistics for a numeric column over groups (see below), it would be efficient during data inspection to use a single function that given a numeric column and one or more grouping columns, computes the common summary statistics over groups.
library(skimr)
df %>% group_by(col_1 col_2) %>% skim(col_3)
Here is how this works:
group_by()
to “partition” the data frame into groups according to the values of one or more grouping columns passed to group_by()
which in this case are col_1
and col_2
.skim()
as well as pass the numerical column whose value is to be summarized, in this case col_3
.skim()
is a great convenience. With one command, we get a consolidated report that has the most common summary statistics like row count, mean, standard deviation, minimum value, maximum value, and percentiles.skim()
, from the skimr
package, is a more powerful alternative to R’s built in summary()
function.We wish to compute the mean of a numerical column over groups defined by two categorical columns.
In this example, we wish to compute the mean of the numeric column col_3
for each group where the groups are defined by the values of the two categorical columns col_1
and col_2.
df %>%
group_by(col_1, col_2) %>%
summarize(col_3 = mean(col_3, na.rm=T)) %>%
pivot_wider(names_from = col_2, values_from = col_3)
Here is how this works:
group_by()
to the data frame df
specifying the grouping columns col_1
and col_2
.summarize()
to apply an aggregation function mean()
to col_3
for each group.na.rm = TRUE
so mean()
would ignore NA
values and return the mean of the rest. See Summary Statistics for how to compute all the common summary statistics in R.summarize()
, it is often helpful to view the output as a cross table with the values of the first categorical column col_1
are shown as rows and the values of the second categorical column col_2
are shown as columns. We use pivot_wider()
to pivot the values of col_2
from being over rows to being over columns. See Reshaping for more details.We wish to compute the sum of a numerical column over groups defined by two categorical columns.
In this example, we wish to compute the sum of the numeric column col_3
for each group where the groups are defined by the values of the two categorical columns col_1
and col_2.
df %>%
group_by(col_1, col_2) %>%
summarize(col_3 = sum(col_3, na.rm=T)) %>%
pivot_wider(names_from = col_2, values_from = col_3, values_fill = 0)
Here is how this works:
sum()
instead of mean()
.values_fill = 0
of pivot_wider()
so that combinations of values of col_1
and col_2
that do not exist in the data would get a sum of 0
and not NA
.We wish to obtain the ratio between the sum of values of a numeric variable for each group to the total sum of values of the numeric variable where the groups are defined by two categorical columns.
In this example, we compute the ratio of the sum of values of a numeric column col_3
for each group defined by col_1
and col_2
to the total sum of values of col_3
.
df %>%
group_by(col_1, col_2) %>%
summarize(col_3 = sum(col_3, na.rm=T)) %>%
mutate(col_3 = col_3 / sum(col_3)) %>%
pivot_wider(names_from = col_2, values_from = col_3, values_fill = 0)
Here is how this works:
group_by()
and summarize()
to apply sum()
to the values of col_3
over groups defined by col_1
and col_2
.mutate()
to the resulting summary to compute the ratio of the sum of values of col_3
for each group (which in the summary is in the col_3
column) to the total value of col_3
(which we compute via sum(col_3)
).pivot_wider()
to go from the tidy form returned by mutate()
to a cross table form with the values of the first categorical column col_1
shown as rows and the values of the second categorical column col_2
shown as columns.We wish to compute multiple aggregation functions on a numeric column for each group where the groups are defined by two columns.
In this example, we compute sum and mean for a numeric column col_3
for each group where the groups are defined by col_1
and col_2
.
df %>%
group_by(col_1, col_2) %>%
summarize(col_3_sum = sum(col_3, na.rm=T),
col_3_mean = mean(col_3, na.rm=T))
Here is how this works:
group_by()
and summarize()
as we did above.sum()
and mean()
.