Summarize Numeric by Factor

We wish to get a summary of a numeric column (e.g. the mean and standard deviation) for each group where the groups are defined by the values of a categorical column.

Column Summary

While we can compute each common summary statistic for a numeric column over groups explicitly (see below), during data inspection it is more efficient to use a single function that, given a numeric column and one or more grouping columns, computes the common summary statistics for each group.

library(dplyr)  # provides %>% and group_by()
library(skimr)
df %>% group_by(col_1) %>% skim(col_2)

Here is how this works:

  • We use group_by() to “partition” the data frame into groups according to the values of one or more grouping columns, in this case col_1.
  • We then pass the grouped data frame to skim() along with the numeric column to be summarized, in this case col_2.
  • skim() is a great convenience. With one command, we get a consolidated report of the most common summary statistics, such as the row count, mean, standard deviation, minimum, maximum, and percentiles.
  • skim(), from the skimr package, is a more powerful alternative to R’s built-in summary() function.
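To make the contents of such a report concrete, here is a minimal sketch that computes a few of the same statistics by hand with summarize(). The data frame and its values are made up for illustration, the column names p0, p50, and p100 are chosen here to resemble percentile labels, and this is not how skim() works internally. It assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical toy data: two groups, one missing value in group "b".
df <- tibble(
  col_1 = c("a", "a", "a", "b", "b"),
  col_2 = c(1, 2, 3, 10, NA)
)

manual_summary <- df %>%
  group_by(col_1) %>%
  summarize(
    n_missing = sum(is.na(col_2)),            # number of NA values per group
    mean      = mean(col_2, na.rm = TRUE),
    sd        = sd(col_2, na.rm = TRUE),      # NA when only one value remains
    p0        = min(col_2, na.rm = TRUE),     # minimum
    p50       = median(col_2, na.rm = TRUE),  # median
    p100      = max(col_2, na.rm = TRUE)      # maximum
  )

manual_summary
```

Writing the statistics out this way is more verbose than a single skim() call, which is why skim() is convenient for quick inspection.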

Mean

We wish to compute the mean of a numerical column over groups defined by one categorical column.

In this example, we wish to compute the mean of the numeric column col_2 for each group where the groups are defined by the values of the categorical column col_1.

df %>% 
 group_by(col_1) %>% 
 summarize(col_2 = mean(col_2, na.rm = TRUE))

Here is how this works:

  • We first apply group_by() to the data frame df specifying the grouping column col_1.
  • We then pass the grouped data frame to summarize() to apply an aggregation function (here mean()) to each group.
  • We set the argument na.rm = TRUE so that mean() ignores NA values and returns the mean of the remaining values.
  • See Summary Statistics for how to compute all the common summary statistics in R.
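The effect of na.rm = TRUE is easiest to see side by side. The sketch below uses a small made-up data frame (the values and the output column names mean_keep_na and mean_drop_na are illustrative assumptions) and assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical toy data; group "b" contains a missing value.
df <- tibble(
  col_1 = c("a", "a", "b", "b"),
  col_2 = c(1, 3, 10, NA)
)

group_means <- df %>%
  group_by(col_1) %>%
  summarize(
    mean_keep_na = mean(col_2),               # NA if any value in the group is NA
    mean_drop_na = mean(col_2, na.rm = TRUE)  # NA values are dropped first
  )

group_means
```

For group "b", mean_keep_na is NA while mean_drop_na is 10, because only the non-missing value contributes to the mean.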

Sum

We wish to compute the sum of values of a numerical column over groups defined by one categorical column.

In this example, we wish to compute the sum of the values of the numeric column col_2 for each group where the groups are defined by the values of the categorical column col_1.

df %>%
    group_by(col_1) %>% 
    summarize(col_2 = sum(col_2, na.rm = TRUE))

Here is how this works:

This works the same way as the Mean example above, except that we use sum() instead of mean().
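One behavior worth keeping in mind: with na.rm = TRUE, a group whose values are all NA sums to 0 rather than NA. A minimal sketch on made-up data (the values are illustrative assumptions; dplyr is assumed to be installed):

```r
library(dplyr)

# Hypothetical toy data; group "b" has only a missing value.
df <- tibble(
  col_1 = c("a", "a", "b"),
  col_2 = c(1, 2, NA)
)

group_sums <- df %>%
  group_by(col_1) %>%
  summarize(col_2 = sum(col_2, na.rm = TRUE))  # sum of zero remaining values is 0

group_sums
```

If a sum of 0 for an all-missing group would be misleading in your analysis, consider also reporting the number of non-missing values per group.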

Proportion

We wish to compute, for each group, the ratio of the group's sum of a numeric variable to the total sum of that variable across all groups, where the groups are defined by a grouping variable.

In this example, we compute the ratio of the sum of values of a numeric column col_2 for each group defined by col_1 to the total sum of values of col_2.

df %>%
    group_by(col_1) %>% 
    summarize(col_2 = sum(col_2, na.rm = TRUE)) %>%
    mutate(col_2 = col_2 / sum(col_2))

Here is how this works:

  • This works similarly to the Sum example above. We use group_by() and summarize() to apply sum() to the values of col_2 over groups defined by col_1.
  • We then apply mutate() to the resulting summary to divide each group's sum (held in the col_2 column of the summary) by the total sum of col_2, which we compute via sum(col_2). Because summarize() drops the single grouping level, sum(col_2) here runs over all groups rather than within each group.
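As a worked example on made-up data, the sketch below follows the same steps but stores the ratio in a separate column, named share here purely for illustration, so that the group sums remain visible next to the proportions. It assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical toy data: three groups with sums 4, 4, and 2.
df <- tibble(
  col_1 = c("a", "a", "b", "c"),
  col_2 = c(1, 3, 4, 2)
)

group_share <- df %>%
  group_by(col_1) %>%
  summarize(col_2 = sum(col_2, na.rm = TRUE)) %>%  # one row per group
  mutate(share = col_2 / sum(col_2))               # divide by the overall total

group_share
```

The overall total is 10, so the shares are 0.4, 0.4, and 0.2, and they add up to 1.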