In this page, we will cover the functions to use to carry out some of the most common data aggregation operations.

In its simplest form, a data aggregation function is one which takes multiple individual values and returns a single summary value; e.g. sum.

We have extended sections where we cover the data manipulation operations relevant to each data type quite extensively (see the data type operation sections in the navigation bar). Think of this page as a brief overview of the most common aggregation functions for each data type with links to the respective operations section for a coverage.

Note: In this page, we demonstrate the data aggregation operations in a scenario where a data frame is grouped by one column. We can drop the call to `group_by()`

to aggregate over the entire data frame (without grouping). We can also group by multiple columns or in myriad other ways (See Grouping).

We wish to count the number of rows or the number of unique values in a group.

In this example, we wish to count the number of rows in each group and the number of unique values of the column `col_2`

in each group. We wish to carry out the aggregation on the data frame `df`

grouped by the column `col_1`

.

```
df_2 = df %>%
group_by(col_1) %>%
summarize(
count = n(),
col_2_vals = n_distinct(col_2))
```

Here is how this works:

- We call
`group_by()`

prior to calling`summarize()`

to have the data aggregation operation carried out on each group separately. `n()`

gives the current group size. See Counting.`n_distinct()`

gives the number of unique values. See Uniqueness.

*Alternatively,*

```
df %>% count(col_1)
```

Here is how this works:

- For the case of computing the number of rows per group, we can use the convenience
`count()`

function instead of the more verbose`df %>% group_by(col_1) %>% summarize(count = n())`

. `count()`

can take one or more columns and computes the number of occurrences of each unique value or each unique combination of values.

We wish to aggregate data by computing an arithmatic operation, e.g. sum, for each group.

In this example, we wish to compute the sum of values of the column `col_2`

for each group and the sum of the product of the values of the columns `col_3`

and `col_4`

for each group. We wish to carry out the aggregation on the data frame `df`

grouped by the column `col_1`

.

```
df_2 = df %>%
group_by(col_1) %>%
summarize(
col_2_sum = sum(col_2, na.rm=T),
col_3_4_ws = sum(col_3 * col_4, na.rm=T))
```

Here is how this works:

- We call
`group_by()`

prior to calling`summarize()`

to have the data aggregation operation carried out on each group separately. - We then use the built-in function
`sum()`

to compute the sum of values of`col_2`

. We set the argument`na.rm = TRUE`

so`sum()`

would ignore`NA`

values and return the sum of the rest. - In
`sum(col_3 * col_4))`

, we first multiply the values of`col_3`

and`col_4`

for each row then sum the results (for each group). - For a coverage of arithmatic operations, see Arithmatic Operations.

We wish to aggregate data by computing summary statistics, e.g. the mean, for each group.

In this example we wish to compute the mean, median, and standard deviation of values of the column `col_3`

for each group. We wish to carry out the aggregation on the data frame `df`

grouped by the column `col_1`

.

```
df_2 = df %>%
group_by(col_1) %>%
summarize(
col_3_mean = mean(col_3),
col_3_median = median(col_3),
col_3_std = sd(col_3)
)
```

Here is how this works:

- We call
`group_by()`

prior to calling`summarize()`

to have the data aggregations operation carried out on each group separately. - As their name implies, given a vector of values:
`mean()`

computes the mean of values.`median()`

computes the median of values.`sd()`

computes the standard deviation of values.

- All three descriptive statistics functions have an argument
`na.rm`

which is set to`FALSE`

by default meaning that if there is any`NA`

value, the result is`NA`

. If we wish to ignore`NA`

values, we can set`na.rm=TRUE`

(like we did with`sum()`

in Arithmatic above). - For a coverage of summary statistics, see Summary Statistics

We wish to aggregate data by computing the maximum or minimum value of a given column for each group.

In this example we wish to compute the minimum value of the numeric column `col_3`

for each group and the maximum value of the numeric column `col_4`

for each group. We wish to carry out the aggregation on the data frame `df`

grouped by the column `col_1`

.

```
df_2 = df %>%
group_by(col_1) %>%
summarize(
col_3_min = min(col_3),
col_4_max = max(col_4))
```

Here is how this works:

- We call
`group_by()`

prior to calling`summarize()`

to have the data aggregation operation carried out on each group separately. - As their name implies, given a vector of values:
`min()`

returns the minimum value.`max()`

returns the maximum value.

`max()`

and the`min()`

belong to the larger class of ranking operations, which we cover in Ranking.- Note that here we are aggregating the values of individual columns in a group, to filter entire rows in a group based on the rank of the values of some columns, see Filtering Groups by Rank.

We wish to aggregate data by returning the first or last value that a column takes for each group.

In this example, we wish to return the first value of the column `col_2`

for each group, the last value of the column `col_2`

for each group, and the second value of the column `col_3`

for each group. We wish to carry out the aggregation on the data frame `df`

grouped by the column `col_1`

.

```
df_2 = df %>%
group_by(col_1) %>%
summarize(
col_2_first = first(col_2),
col_2_last = last(col_2),
col_3_2nd = nth(col_3, 2))
```

Here is how this works:

- We call
`group_by()`

prior to calling`summarize()`

to have the data aggregation operation carried out on each group separately. - As their name implies, given a vector of values:
`first()`

returns the first occurring value.`last()`

returns the last occurring value.`nth()`

returns the nth occurring value.

- It is often important to sort the data in some way before carrying out an aggregation by position operation.
- For a coverage of position related operations, see Position.
- Note: Here we are aggregating the values of individual columns in a group, to filter entire rows in a group based on their position, see Filtering Groups by Position.

We wish to perform data aggregation operations on a column of a string data type.

In this example, we wish to summarize the data frame df over groups defined by the column `col_1`

. The aggregation operations we wish to carry out are: (1) Concatenate the unique values of the column `col_2`

per group into a comma separated string. (2) Obtain the last value of `col_3`

for each group that matches a given regular expression.

```
df_2 = df %>%
group_by(col_1) %>%
summarize(
col_2_distinct =
unique(col_2) %>% str_c(collapse = ', '),
col_3_match =
str_subset(col_3, '\\d+$') %>% last()
)
```

Here is how this works:

- We call
`group_by()`

prior to calling`summarize()`

to have the data aggregation operation carried out on each group separately. - In
`unique(col_2) %>% str_c(collapse = ', ')`

we combine (concatenate) the unique values of the string column`col_2`

for each group into one comma separated string. The two functions we used are:`unique()`

returns the unique values as a vector. See Uniqueness.`str_c()`

combines multiple strings into one separated by the value passed to the`collapse`

argument.

- In
`str_subset(col_3, '\\d+$') %>% last()`

we return the last value of`col_3`

for each group that matches the regular expression`'\\d+$'`

.`str_subset()`

returns the subset of values of a string vector that match the given pattern.`last()`

returns the last occurring value of a given vector.

- For a coverage of string operations (including regular expressions), see String Operations.

We wish to perform data aggregation operations on a column of a logical data type; i.e. one that takes a value of `TRUE`

or `FALSE`

.

In this example, for each group, where the groups are defined by the value of the column `col_1`

, we wish to obtain the number and ratio of rows where the column `col_2`

is `NA`

.

```
df_2 = df %>%
group_by(col_1) %>%
summarize(
col_2_na_count =
col_2 %>% is.na() %>% sum(),
col_2_na_rate =
col_2 %>% is.na() %>% mean()
)
```

Here is how this works:

- We call
`group_by()`

prior to calling`summarize()`

to have the data aggregation operation carried out on each group separately. - We pass the column
`col_2`

to the function`is.na()`

from base R which returns a logical vector that is`TRUE`

where`col_2`

is`NA`

and`FALSE`

otherwise. - We pass the boolean vector returned by
`is.na()`

to`sum()`

to obtain the number of missing values. Summing a boolean vector is basically counting the number of`TRUE`

values (which in this case is the number of missing values) because when arithmatic operations are carried out on logical vectors,`TRUE`

is regarded as`1`

and`FALSE`

is regarded as`0`

. - Similarly, we use
`mean()`

to obtain the proportion of missing values of`col_2`

for each group. Since`TRUE`

is regarded as`1`

and`FALSE`

is regarded as`0`

, taking the mean of a logical vector is basically taking the ratio between the number of`TRUE`

values and the total number of values in the input vector. - For a coverage of logical operations, see Logical Operations

We wish to group a data frame by some unit or interval of time e.g. group by year or weekday.

In this example, we wish to group the data frame `df`

by weekday inferred from the date-time column `col_1`

, where `col_1`

is provided as a string. We then wish to compute the number of rows in each group and the number of unique hours that `col_1`

takes in each group.

```
df_2 = df %>%
mutate(col_1 = ymd_hms(col_1)) %>%
group_by(week_day = wday(col_1, label = TRUE)) %>%
summarize(
count = n(),
hour_count = hour(col_1) %>% n_distinct())
```

Here is how this works:

- We call
`group_by()`

prior to calling`summarize()`

to have the data aggregation operations carried out on each group separately. - In
`col_1 = ymd_hms(col_1)`

, we convert the data type of the column`col_1`

from string to`datetime`

. - In
`wday(col_1, label = TRUE)`

, we identify the weekday of each datetime value of`col_1`

. We set label=TRUE to have string labels, e.g.`“Mon”`

, instead of integers. - We do the extraction of weekday inside
`group_by()`

which is a good idea when we do not want to retain the column beyond grouping. We cover this in more detail in Grouping. - We use
`n()`

to obtain the number of rows in each group. - In
`hour(col_1)`

, we extract the hour of each datetime value of`col_1`

for the current group. - We use
`n_distinct()`

to count the number of unique values of`hour(col_1)`

for each group. - For a coverage of date time operations, see Date Time Operations.

We wish to perform a custom aggregation operation i.e. one for which there is no built-in function.

There are two approaches to performing custom aggregation operations:

- Via a custom function that we call from within
`summarize()`

. This is appropriate for aggregations that we need to carry out more than once or aggregations that are relatively complex. - Via embedded logic inside
`summarize()`

. This is appropriate for simple logic that we need to carry out once or twice.

We will look at an example of both in the following example.

In this example, we wish to:

- compute the mode, i.e. most frequently occurring value, of the column
`col_2`

- compute the sum of the numerical substrings of the values of the string column
`col_3`

for each group of the data frame `df`

, where the groups are defined by the values of the column `col_1`

.

```
mode <- function(.x) {
.x %>%
table() %>%
sort(decreasing = TRUE) %>%
names() %>%
first()
}
df_2 = df %>%
group_by(col_1) %>%
summarize(
col_2_mode = mode(col_2),
col_3_sum = col_3 %>%
str_extract('\\d+') %>%
parse_integer() %>%
sum())
```

Here is how this works:

- We call
`group_by()`

prior to calling`summarize()`

to have the data aggregation operations carried out on each group separately. - We use a custom function to carry out the first aggregation:
- We created the custom function
`mode()`

that accepts a vector and returns the most commonly occurring value. We cover how this function works in Counting. - In
`col_2_mode = mode(col_2)`

, we call the`mode()`

function from within`summarize()`

and pass to it`col_2`

. Since we called`group_by()`

before`summarize()`

,`mode()`

will be called once for each group to return the most frequently occurring value of`col_2`

for that group.

- We created the custom function
- We use embedded logic to carry out the second aggregation:
- We carry out three functions in sequence each acting on the output of the one prior:
`str_extract('\\d+')`

extracts the first integer sub string occurring in each value of`col_3`

`parse_integer()`

converts the stings containing integers extracted by`str_extract()`

into proper integers- which are then passed to
`sum()`

to reduce to a single value for the current group

- We could have done this via
`sum(parse_integer(str_extract(col_3, '\\d+')))`

. However, instead of nesting functions, our preferred approach is to chain them via the pipe`%>%`

operator. Note that what is being piped through is not the data frame`df`

, but the values of the column`col_3`

for each group.

- We carry out three functions in sequence each acting on the output of the one prior:
- As mentioned above, custom functions are preferred over embedded logic for more involved logic or logic that we need to carry out multiple times.

R