We wish to get a summary of a numeric column (e.g. the mean and standard deviation) for each group where the groups are defined by the values of a categorical column.
While we can compute each of the common summary statistics for a numeric column over groups explicitly (see below), during data inspection it is more efficient to use a single function that, given a numeric column and one or more grouping columns, computes the common summary statistics for each group.
library(dplyr)
library(skimr)
df %>% group_by(col_1) %>% skim(col_2)
Here is how this works:
We use group_by() to “partition” the data frame into groups according to the values of one or more grouping columns passed to group_by(), which in this case is col_1.
We then pass the grouped data frame to skim(), along with the numeric column whose values are to be summarized, in this case col_2.
skim() is a great convenience. With one command, we get a consolidated report that has the most common summary statistics like row count, mean, standard deviation, minimum value, maximum value, and percentiles.
skim(), from the skimr package, is a more powerful alternative to R’s built-in summary() function.
We wish to compute the mean of a numerical column over groups defined by one categorical column.
In this example, we wish to compute the mean of the numeric column col_2 for each group where the groups are defined by the values of the categorical column col_1.
df %>%
  group_by(col_1) %>%
  summarize(col_2 = mean(col_2, na.rm = TRUE))
Here is how this works:
We apply group_by() to the data frame df, specifying the grouping column col_1.
We then use summarize() to apply an aggregation function (here mean()) to each group.
We set na.rm = TRUE so that mean() ignores NA values and returns the mean of the rest.
We wish to compute the sum of values of a numerical column over groups defined by one categorical column.
In this example, we wish to compute the sum of the values of the numeric column col_2 for each group where the groups are defined by the values of the categorical column col_1.
df %>%
  group_by(col_1) %>%
  summarize(col_2 = sum(col_2, na.rm = TRUE))
Here is how this works:
This works the same way as the mean example above, but we use sum() instead of mean().
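To make the grouped mean and sum concrete, here is a minimal sketch on a small made-up data frame. The data values and the output column names col_2_mean and col_2_sum are illustrative assumptions, not part of the original example. It also shows what happens when na.rm = TRUE is left out.

```r
library(dplyr)

# Hypothetical example data; the values are made up for illustration.
df <- data.frame(
  col_1 = c("a", "a", "b", "b", "b"),
  col_2 = c(1, 3, 2, NA, 4)
)

# Mean and sum of col_2 per group, ignoring NA values.
group_summary <- df %>%
  group_by(col_1) %>%
  summarize(
    col_2_mean = mean(col_2, na.rm = TRUE),
    col_2_sum  = sum(col_2, na.rm = TRUE)
  )

# Without na.rm = TRUE, a group that contains an NA yields NA.
group_summary_na <- df %>%
  group_by(col_1) %>%
  summarize(col_2_mean = mean(col_2))

print(group_summary)
print(group_summary_na)
```

Both statistics can be computed in a single summarize() call because each named argument produces its own output column.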
We wish to obtain, for each group, the ratio of the sum of values of a numeric variable within the group to the total sum of values of that variable, where the groups are defined by a grouping variable.
In this example, we compute the ratio of the sum of values of a numeric column col_2 for each group defined by col_1 to the total sum of values of col_2.
df %>%
  group_by(col_1) %>%
  summarize(col_2 = sum(col_2, na.rm = TRUE)) %>%
  mutate(col_2 = col_2 / sum(col_2))
Here is how this works:
We use group_by() and summarize() to apply sum() to the values of col_2 over groups defined by col_1.
We then apply mutate() to the resulting summary to compute the ratio of the sum of values of col_2 for each group (which in the summary is in the col_2 column) to the total value of col_2 (which we compute via sum(col_2)).
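The ratio computation above can be sketched on a small made-up data frame. The data values and the result name group_share are illustrative assumptions.

```r
library(dplyr)

# Hypothetical example data; the values are made up for illustration.
df <- data.frame(
  col_1 = c("a", "a", "b", "b", "c"),
  col_2 = c(1, 3, 2, 4, 10)
)

# Step 1: sum col_2 within each group (a = 4, b = 6, c = 10).
# Step 2: divide each group sum by the total of all group sums (20).
group_share <- df %>%
  group_by(col_1) %>%
  summarize(col_2 = sum(col_2, na.rm = TRUE)) %>%
  mutate(col_2 = col_2 / sum(col_2))

print(group_share)
```

This relies on summarize() dropping the last level of grouping: with a single grouping column the summary is no longer grouped, so sum(col_2) inside mutate() runs over all groups and the resulting ratios add up to 1. With more than one grouping column, the summary would stay grouped by the remaining columns, and the denominator would then be computed within each of those remaining groups.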