We wish to get a summary of a numeric column (e.g. the mean and standard deviation) for each group where the groups are defined by the combination of values of two categorical columns.

While we can explicitly compute all the common summary statistics for a numeric column over groups (see below), it would be efficient during data inspection to use a single function that given a numeric column and one or more grouping columns, computes the common summary statistics over groups.

```
library(skimr)
df %>% group_by(col_1 col_2) %>% skim(col_3)
```

Here is how this works:

- We use
`group_by()`

to “partition” the data frame into groups according to the values of one or more grouping columns passed to`group_by()`

which in this case are`col_1`

and`col_2`

. - We then pass the grouped data frame to the function
`skim()`

as well as pass the numerical column whose value is to be summarized, in this case`col_3`

. `skim()`

is a great convenience. With one command, we get a consolidated report that has the most common summary statistics like row count, mean, standard deviation, minimum value, maximum value, and percentiles.`skim()`

, from the`skimr`

package, is a more powerful alternative to R’s built in`summary()`

function.

We wish to compute the mean of a numerical column over groups defined by two categorical columns.

In this example, we wish to compute the mean of the numeric column `col_3`

for each group where the groups are defined by the values of the two categorical columns `col_1`

and `col_2.`

```
df %>%
group_by(col_1, col_2) %>%
summarize(col_3 = mean(col_3, na.rm=T)) %>%
pivot_wider(names_from = col_2, values_from = col_3)
```

Here is how this works:

- We first apply
`group_by()`

to the data frame`df`

specifying the grouping columns`col_1`

and`col_2`

. - We then pass the grouped data frame to
`summarize()`

to apply an aggregation function`mean()`

to`col_3`

for each group. - We set the argument
`na.rm = TRUE`

so`mean()`

would ignore`NA`

values and return the mean of the rest. See Summary Statistics for how to compute all the common summary statistics in R. - While we could leave it at the output of
`summarize()`

, it is often helpful to view the output as a cross table with the values of the first categorical column`col_1`

are shown as rows and the values of the second categorical column`col_2`

are shown as columns. We use`pivot_wider()`

to pivot the values of`col_2`

from being over rows to being over columns. See Reshaping for more details.

We wish to compute the sum of a numerical column over groups defined by two categorical columns.

In this example, we wish to compute the sum of the numeric column `col_3`

for each group where the groups are defined by the values of the two categorical columns `col_1`

and `col_2.`

```
df %>%
group_by(col_1, col_2) %>%
summarize(col_3 = sum(col_3, na.rm=T)) %>%
pivot_wider(names_from = col_2, values_from = col_3, values_fill = 0)
```

Here is how this works:

- This works similarly as above but we use
`sum()`

instead of`mean()`

. - We set the argument
`values_fill = 0`

of`pivot_wider()`

so that combinations of values of`col_1`

and`col_2`

that do not exist in the data would get a sum of`0`

and not`NA`

.

We wish to obtain the ratio between the sum of values of a numeric variable for each group to the total sum of values of the numeric variable where the groups are defined by two categorical columns.

In this example, we compute the ratio of the sum of values of a numeric column `col_3`

for each group defined by `col_1`

and `col_2`

to the total sum of values of `col_3`

.

```
df %>%
group_by(col_1, col_2) %>%
summarize(col_3 = sum(col_3, na.rm=T)) %>%
mutate(col_3 = col_3 / sum(col_3)) %>%
pivot_wider(names_from = col_2, values_from = col_3, values_fill = 0)
```

Here is how this works:

- This works similarly to the above. We use
`group_by()`

and`summarize()`

to apply`sum()`

to the values of`col_3`

over groups defined by`col_1`

and`col_2`

. - We then apply
`mutate()`

to the resulting summary to compute the ratio of the sum of values of`col_3`

for each group (which in the summary is in the`col_3`

column) to the total value of`col_3`

(which we compute via`sum(col_3)`

). - As above, we use
`pivot_wider()`

to go from the tidy form returned by`mutate()`

to a cross table form with the values of the first categorical column`col_1`

shown as rows and the values of the second categorical column`col_2`

shown as columns.

We wish to compute multiple aggregation functions on a numeric column for each group where the groups are defined by two columns.

In this example, we compute sum and mean for a numeric column `col_3`

for each group where the groups are defined by `col_1`

and `col_2`

.

```
df %>%
group_by(col_1, col_2) %>%
summarize(col_3_sum = sum(col_3, na.rm=T),
col_3_mean = mean(col_3, na.rm=T))
```

Here is how this works:

- We use
`group_by()`

and`summarize()`

as we did above. - In this scenario, we apply two aggregation functions
`sum()`

and`mean()`

. - We chose not to pivot because the code to do so in this situation is a bit too involved for an inspection context (see Reshaping).

R