Grouped Data Frame Summary

We wish to generate the common summary statistics for every column in a data frame for each group, where the groups are defined by one or more grouping columns. Examples of summary statistics are quantiles for numeric columns and the unique value count for non-numeric columns. While we can compute each of these statistics for each column individually, during data inspection it is more efficient to use a function that, given a grouped data frame, computes the common statistics appropriate for each column’s data type.

library(dplyr)
library(skimr)

df %>% group_by(col_1) %>% skim()

Here is how this works:

  • We use group_by() to “partition” the data frame into groups according to the values of one or more grouping columns passed to group_by(), which in this case is col_1.
  • We then pass the grouped data frame to the function skim(), which computes its summary separately for each group.
  • skim(), from the skimr package, is a more detailed alternative to R’s built-in summary() function.
  • skim() separately describes numerical and non-numerical variables. In particular, it returns the following:
    1. Data Summary: observation count, column count.
    2. For each numerical column: missing count, completeness rate, mean, standard deviation, percentiles, and a small visual histogram.
    3. For each non-numerical column: missing count, completeness rate, unique count, among others.
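
The steps above can be tried on a small example. This is a minimal sketch, not a definitive recipe: the data frame and its values are made up for illustration, and the column names col_2 and col_3 are assumptions (only col_1 appears in the original snippet). yank() is a skimr helper that extracts the rows for a single column type.

```r
library(dplyr)
library(skimr)

# Hypothetical example data: one grouping column, one numeric column,
# and one character column, each with a missing value.
df <- tibble(
  col_1 = c("a", "a", "a", "b", "b", "b"),
  col_2 = c(1.5, 2.0, NA, 4.0, 5.5, 6.0),
  col_3 = c("x", "y", "x", "z", "z", NA)
)

# Summarize every non-grouping column separately for each value of col_1.
grouped_summary <- df %>%
  group_by(col_1) %>%
  skim()

# Printing shows the data summary followed by one section per column type.
print(grouped_summary)

# Keep only the summary rows for numeric columns.
numeric_summary <- yank(grouped_summary, "numeric")
print(numeric_summary)
```

The result contains one row per combination of summarized column and group, so the statistics for col_2 in group "a" and group "b" can be compared side by side.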