We wish to get a summary of a numeric column (e.g. the mean and standard deviation) for each group where the groups are defined by the values of a categorical column.

While we can explicitly compute each of the common summary statistics for a numeric column over groups (see below), during data inspection it is more efficient to use a single function that, given a numeric column and one or more grouping columns, computes the common summary statistics for each group.

```
library(dplyr)  # provides %>% and group_by()
library(skimr)  # provides skim()
df %>% group_by(col_1) %>% skim(col_2)
```

Here is how this works:

- We use `group_by()` to “partition” the data frame into groups according to the values of one or more grouping columns passed to `group_by()`, which in this case is `col_1`.
- We then pass the grouped data frame to the function `skim()`, along with the numeric column whose values are to be summarized, in this case `col_2`.
- `skim()` is a great convenience. With one command, we get a consolidated report that has the most common summary statistics like row count, mean, standard deviation, minimum value, maximum value, and percentiles.
- `skim()`, from the `skimr` package, is a more powerful alternative to R’s built-in `summary()` function.
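To make the steps above concrete, here is a minimal sketch on a small made-up data frame. The values in `df` and the name `report` are assumptions for illustration only; they are not part of the original example.

```
library(dplyr)
library(skimr)

# Hypothetical toy data: two groups in col_1, one missing value in col_2
df <- data.frame(
  col_1 = c("a", "a", "b", "b", "b"),
  col_2 = c(1, 3, 2, NA, 4)
)

# One row of summary statistics per group of col_1
report <- df %>%
  group_by(col_1) %>%
  skim(col_2)

report
```

The report should contain one row per value of `col_1`, with columns such as `n_missing` and `numeric.mean` holding the per-group statistics.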

We wish to compute the mean of a numerical column over groups defined by one categorical column.

In this example, we wish to compute the mean of the numeric column `col_2` for each group where the groups are defined by the values of the categorical column `col_1`.

```
df %>%
  group_by(col_1) %>%
  summarize(col_2 = mean(col_2, na.rm = TRUE))
```

Here is how this works:

- We first apply `group_by()` to the data frame `df`, specifying the grouping column `col_1`.
- We then pass the grouped data frame to `summarize()` to apply an aggregation function (here `mean()`) to each group.
- We set the argument `na.rm = TRUE` so that `mean()` ignores `NA` values and returns the mean of the rest.
- See Summary Statistics for how to compute all the common summary statistics in R.
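As a quick check of the effect of `na.rm = TRUE`, the sketch below runs the same pipeline on a small made-up data frame. The values in `df` and the names `group_means` and `group_means_with_na` are assumptions for illustration.

```
library(dplyr)

# Hypothetical toy data: group "b" contains one NA
df <- data.frame(
  col_1 = c("a", "a", "b", "b", "b"),
  col_2 = c(1, 3, 2, NA, 4)
)

# With na.rm = TRUE the NA is dropped before averaging
group_means <- df %>%
  group_by(col_1) %>%
  summarize(col_2 = mean(col_2, na.rm = TRUE))

# Without na.rm = TRUE the mean of any group containing NA is itself NA
group_means_with_na <- df %>%
  group_by(col_1) %>%
  summarize(col_2 = mean(col_2))

group_means          # a: 2, b: 3
group_means_with_na  # a: 2, b: NA
```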

We wish to compute the sum of values of a numerical column over groups defined by one categorical column.

In this example, we wish to compute the sum of the values of the numeric column `col_2` for each group where the groups are defined by the values of the categorical column `col_1`.

```
df %>%
  group_by(col_1) %>%
  summarize(col_2 = sum(col_2, na.rm = TRUE))
```

Here is how this works:

This works similarly to the mean example above, but we use `sum()` instead of `mean()`.
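The sketch below applies the same pattern to a small made-up data frame and shows one behavior worth keeping in mind: with `na.rm = TRUE`, a group whose values are all `NA` sums to 0 rather than `NA`. The values in `df` and the name `group_sums` are assumptions for illustration.

```
library(dplyr)

# Hypothetical toy data: group "c" has only a missing value
df <- data.frame(
  col_1 = c("a", "a", "b", "b", "c"),
  col_2 = c(1, 3, 2, 4, NA)
)

group_sums <- df %>%
  group_by(col_1) %>%
  summarize(col_2 = sum(col_2, na.rm = TRUE))

group_sums  # a: 4, b: 6, c: 0 (the sum of an empty set of values is 0)
```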

We wish to obtain, for each group, the ratio of the group's sum of a numeric variable to the total sum of that variable, where the groups are defined by a grouping variable.

In this example, we compute the ratio of the sum of values of a numeric column `col_2` for each group defined by `col_1` to the total sum of values of `col_2`.

```
df %>%
  group_by(col_1) %>%
  summarize(col_2 = sum(col_2, na.rm = TRUE)) %>%
  mutate(col_2 = col_2 / sum(col_2))
```

Here is how this works:

- This works similarly to the above. We use `group_by()` and `summarize()` to apply `sum()` to the values of `col_2` over groups defined by `col_1`.
- We then apply `mutate()` to the resulting summary to compute the ratio of the sum of values of `col_2` for each group (which in the summary is in the `col_2` column) to the total value of `col_2` (which we compute via `sum(col_2)`).
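The two steps above can be traced on a small made-up data frame. The values in `df` and the name `group_shares` are assumptions for illustration. Because `summarize()` drops the single grouping level, `sum(col_2)` inside `mutate()` runs over all rows of the summary, so the resulting ratios add up to 1.

```
library(dplyr)

# Hypothetical toy data: group sums are a = 4 and b = 6 (the NA is ignored)
df <- data.frame(
  col_1 = c("a", "a", "b", "b", "b"),
  col_2 = c(1, 3, 2, NA, 4)
)

group_shares <- df %>%
  group_by(col_1) %>%
  summarize(col_2 = sum(col_2, na.rm = TRUE)) %>%  # per-group sums: 4 and 6
  mutate(col_2 = col_2 / sum(col_2))               # divide by the total, 10

group_shares  # a: 0.4, b: 0.6
```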
