Summarize Numeric by 2 Factors

We wish to get a summary of a numeric column (e.g. the mean and standard deviation) for each group where the groups are defined by the combination of values of two categorical columns.

Column Summary

While we can explicitly compute all the common summary statistics for a numeric column over groups (see below), it would be efficient during data inspection to use a single function that given a numeric column and one or more grouping columns, computes the common summary statistics over groups.

library(skimr)
df %>% group_by(col_1 col_2) %>% skim(col_3)

Here is how this works:

  • We use group_by() to “partition” the data frame into groups according to the values of one or more grouping columns passed to group_by() which in this case are col_1 and col_2.
  • We then pass the grouped data frame to the function skim() as well as pass the numerical column whose value is to be summarized, in this case col_3.
  • skim() is a great convenience. With one command, we get a consolidated report that has the most common summary statistics like row count, mean, standard deviation, minimum value, maximum value, and percentiles.
  • skim(), from the skimr package, is a more powerful alternative to R’s built in summary() function.

Mean

We wish to compute the mean of a numerical column over groups defined by two categorical columns.

In this example, we wish to compute the mean of the numeric column col_3 for each group where the groups are defined by the values of the two categorical columns col_1 and col_2.

df %>% 
 group_by(col_1, col_2) %>% 
 summarize(col_3 = mean(col_3, na.rm=T)) %>% 
 pivot_wider(names_from = col_2, values_from = col_3)

Here is how this works:

  • We first apply group_by() to the data frame df specifying the grouping columns col_1 and col_2.
  • We then pass the grouped data frame to summarize() to apply an aggregation function mean() to col_3 for each group.
  • We set the argument na.rm = TRUE so mean() would ignore NA values and return the mean of the rest. See Summary Statistics for how to compute all the common summary statistics in R.
  • While we could leave it at the output of summarize(), it is often helpful to view the output as a cross table with the values of the first categorical column col_1 are shown as rows and the values of the second categorical column col_2 are shown as columns. We use pivot_wider() to pivot the values of col_2 from being over rows to being over columns. See Reshaping for more details.

Sum

We wish to compute the sum of a numerical column over groups defined by two categorical columns.

In this example, we wish to compute the sum of the numeric column col_3 for each group where the groups are defined by the values of the two categorical columns col_1 and col_2.

df %>%
    group_by(col_1, col_2) %>% 
    summarize(col_3 = sum(col_3, na.rm=T)) %>% 
    pivot_wider(names_from = col_2, values_from = col_3, values_fill = 0)

Here is how this works:

  • This works similarly as above but we use sum() instead of mean().
  • We set the argument values_fill = 0 of pivot_wider() so that combinations of values of col_1 and col_2 that do not exist in the data would get a sum of 0 and not NA.

Proportion

We wish to obtain the ratio between the sum of values of a numeric variable for each group to the total sum of values of the numeric variable where the groups are defined by two categorical columns.

In this example, we compute the ratio of the sum of values of a numeric column col_3 for each group defined by col_1 and col_2 to the total sum of values of col_3.

df %>%
    group_by(col_1, col_2) %>% 
    summarize(col_3 = sum(col_3, na.rm=T)) %>% 
  mutate(col_3 = col_3 / sum(col_3)) %>%
    pivot_wider(names_from = col_2, values_from = col_3, values_fill = 0)

Here is how this works:

  • This works similarly to the above. We use group_by() and summarize() to apply sum() to the values of col_3 over groups defined by col_1 and col_2.
  • We then apply mutate() to the resulting summary to compute the ratio of the sum of values of col_3 for each group (which in the summary is in the col_3 column) to the total value of col_3 (which we compute via sum(col_3)).
  • As above, we use pivot_wider() to go from the tidy form returned by mutate() to a cross table form with the values of the first categorical column col_1 shown as rows and the values of the second categorical column col_2 shown as columns.

Multiple Functions

We wish to compute multiple aggregation functions on a numeric column for each group where the groups are defined by two columns.

In this example, we compute sum and mean for a numeric column col_3 for each group where the groups are defined by col_1 and col_2.

df %>%
    group_by(col_1, col_2) %>% 
    summarize(col_3_sum = sum(col_3, na.rm=T),
                        col_3_mean = mean(col_3, na.rm=T))

Here is how this works:

  • We use group_by() and summarize() as we did above.
  • In this scenario, we apply two aggregation functions sum() and mean().
  • We chose not to pivot because the code to do so in this situation is a bit too involved for an inspection context (see Reshaping).
R
I/O