Summarize Numeric by Factor

We wish to get a summary of a numeric column (e.g. the mean and standard deviation) for each group where the groups are defined by the values of a categorical column.

Column Summary

While we can explicitly compute each of the common summary statistics for a numeric column over groups (see below), during data inspection it is more efficient to use a single function that, given a numeric column and one or more grouping columns, computes the common summary statistics for each group.

library(dplyr)  # provides group_by() and the %>% pipe
library(skimr)
df %>% group_by(col_1) %>% skim(col_2)

Here is how this works:

We use group_by() to “partition” the data frame into groups according to the values of one or more grouping columns, which in this case is col_1.
We then pass the grouped data frame to skim() along with the numeric column whose values are to be summarized, in this case col_2.
skim() is a great convenience. With one command, we get a consolidated report that has the most common summary statistics for each group: the number of missing values, mean, standard deviation, minimum value, maximum value, and quartiles.
skim(), from the skimr package, is a more detailed alternative to R’s built-in summary() function.
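
To make the steps above concrete, here is a minimal sketch on a small made-up data frame. The values in df are hypothetical and exist only for illustration; the column names col_1 and col_2 follow the example above, and col_2_summary is just a name chosen here.

```r
library(dplyr)
library(skimr)

# Hypothetical toy data: col_1 is the grouping column, col_2 the numeric column.
df <- tibble(
  col_1 = c("a", "a", "a", "b", "b"),
  col_2 = c(1, 2, NA, 10, 20)
)

# skim() returns one row of summary statistics per group of col_1.
col_2_summary <- df %>%
  group_by(col_1) %>%
  skim(col_2)

col_2_summary
```

The result is itself a data frame, so it can be filtered or passed on in a pipeline like any other.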

Mean

We wish to compute the mean of a numerical column over groups defined by one categorical column.

In this example, we wish to compute the mean of the numeric column col_2 for each group where the groups are defined by the values of the categorical column col_1.

df %>% 
 group_by(col_1) %>% 
 summarize(col_2 = mean(col_2, na.rm = TRUE))

Here is how this works:

We first apply group_by() to the data frame df specifying the grouping column col_1.
We then pass the grouped data frame to summarize() to apply an aggregation function (here mean()) to each group.
We set the argument na.rm = TRUE so that mean() ignores NA values and returns the mean of the remaining values.
See Summary Statistics for how to compute all the common summary statistics in R.
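
As a sketch of the same pattern on concrete values, the example below computes the mean and the standard deviation in a single summarize() call. The data in df and the output names col_2_mean and col_2_sd are assumptions made for illustration.

```r
library(dplyr)

# Hypothetical toy data with one missing value in group "a".
df <- tibble(
  col_1 = c("a", "a", "a", "b", "b"),
  col_2 = c(1, 2, NA, 10, 20)
)

# Several aggregations can share one summarize() call.
group_stats <- df %>%
  group_by(col_1) %>%
  summarize(
    col_2_mean = mean(col_2, na.rm = TRUE),  # NA in group "a" is ignored
    col_2_sd   = sd(col_2, na.rm = TRUE)
  )

# Without na.rm = TRUE, any group containing an NA yields NA.
group_means_with_na <- df %>%
  group_by(col_1) %>%
  summarize(col_2 = mean(col_2))

group_stats
```

Here group "a" has a mean of 1.5 once the NA is dropped, whereas in group_means_with_na its value is NA.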

Sum

We wish to compute the sum of values of a numerical column over groups defined by one categorical column.

In this example, we wish to compute the sum of the values of the numeric column col_2 for each group where the groups are defined by the values of the categorical column col_1.

df %>%
    group_by(col_1) %>% 
    summarize(col_2 = sum(col_2, na.rm = TRUE))

Here is how this works:

This works the same way as the Mean example above, except that we use sum() instead of mean().
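
One behavior worth keeping in mind is shown in the sketch below, which uses made-up data: with na.rm = TRUE, a group whose values are all NA sums to 0 rather than NA. The data frame and the name group_sums are hypothetical.

```r
library(dplyr)

# Hypothetical toy data: group "c" contains only a missing value.
df <- tibble(
  col_1 = c("a", "a", "b", "b", "c"),
  col_2 = c(1, 2, 10, NA, NA)
)

group_sums <- df %>%
  group_by(col_1) %>%
  summarize(col_2 = sum(col_2, na.rm = TRUE))

# Group "c" sums to 0 because sum() of no remaining values is 0.
group_sums
```

If an all-NA group should be distinguishable from a group that truly sums to zero, it may help to also count the non-missing values per group, for example with sum(!is.na(col_2)).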

Proportion

We wish to compute, for each group defined by a grouping variable, the ratio of the group's sum of a numeric variable to the total sum of that variable over all groups.

In this example, we compute the ratio of the sum of values of a numeric column col_2 for each group defined by col_1 to the total sum of values of col_2.

df %>%
    group_by(col_1) %>% 
    summarize(col_2 = sum(col_2, na.rm = TRUE)) %>%
    mutate(col_2 = col_2 / sum(col_2))

Here is how this works:

This works similarly to the above. We use group_by() and summarize() to apply sum() to the values of col_2 over groups defined by col_1.
We then apply mutate() to the resulting summary to compute the ratio of the sum of values of col_2 for each group (which in the summary is in the col_2 column) to the total value of col_2 (which we compute via sum(col_2)).
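
The arithmetic can be checked on a small made-up example. The data in df and the name group_shares are assumptions for illustration. Because summarize() drops the single grouping level, the sum(col_2) inside mutate() is taken over all rows of the summary, that is, over all groups.

```r
library(dplyr)

# Hypothetical toy data: group sums are a = 4, b = 4, c = 2 (total 10).
df <- tibble(
  col_1 = c("a", "a", "b", "b", "c"),
  col_2 = c(1, 3, 4, NA, 2)
)

group_shares <- df %>%
  group_by(col_1) %>%
  summarize(col_2 = sum(col_2, na.rm = TRUE)) %>%
  # The summary is no longer grouped, so sum(col_2) is the grand total.
  mutate(col_2 = col_2 / sum(col_2))

group_shares
```

The resulting proportions are 0.4, 0.4, and 0.2, which add up to 1.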

Optima.io Reference beta
