On this page, we cover the functions used to carry out some of the most common data aggregation operations.
In its simplest form, a data aggregation function is one that takes multiple individual values and returns a single summary value, e.g. sum.
We cover the data manipulation operations relevant to each data type extensively in dedicated sections (see the data type operation sections in the navigation bar). Think of this page as a brief overview of the most common aggregation functions for each data type, with links to the respective operations sections for fuller coverage.
Note: On this page, we demonstrate the data aggregation operations in a scenario where a data frame is grouped by one column. We can drop the call to group_by()
to aggregate over the entire data frame (without grouping). We can also group by multiple columns or in myriad other ways (see Grouping).
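For instance, here is a minimal sketch (using the same hypothetical df, col_1, and col_2 names as the examples below) of aggregating without any grouping and of grouping by two columns:
# Aggregate over the entire data frame (no grouping): returns a single row
df %>% summarize(count = n())
# Group by two columns: one row per combination of col_1 and col_2
df %>%
  group_by(col_1, col_2) %>%
  summarize(count = n())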
We wish to count the number of rows or the number of unique values in a group.
In this example, we wish to count the number of rows in each group and the number of unique values of the column col_2
in each group. We wish to carry out the aggregation on the data frame df
grouped by the column col_1
.
df_2 = df %>%
group_by(col_1) %>%
summarize(
count = n(),
col_2_vals = n_distinct(col_2))
Here is how this works:
- We call group_by() prior to calling summarize() to have the data aggregation operation carried out on each group separately.
- n() gives the current group size. See Counting.
- n_distinct() gives the number of unique values. See Uniqueness.

Alternatively,
df %>% count(col_1)
Here is how this works:
- We use the count() function instead of the more verbose df %>% group_by(col_1) %>% summarize(count = n()).
- count() can take one or more columns and computes the number of occurrences of each unique value or each unique combination of values.
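As a quick illustration, here is a hypothetical toy data frame and the (abbreviated) result of both approaches:
library(tidyverse)
df = tibble(
  col_1 = c('a', 'a', 'b'),
  col_2 = c(1, 1, 2))
df %>% group_by(col_1) %>% summarize(count = n(), col_2_vals = n_distinct(col_2))
# group 'a': count = 2, col_2_vals = 1; group 'b': count = 1, col_2_vals = 1
df %>% count(col_1)
# group 'a': n = 2; group 'b': n = 1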
We wish to aggregate data by computing an arithmetic operation, e.g. sum, for each group.
In this example, we wish to compute the sum of values of the column col_2
for each group and the sum of the product of the values of the columns col_3
and col_4
for each group. We wish to carry out the aggregation on the data frame df
grouped by the column col_1
.
df_2 = df %>%
group_by(col_1) %>%
summarize(
col_2_sum = sum(col_2, na.rm = TRUE),
col_3_4_ws = sum(col_3 * col_4, na.rm = TRUE))
Here is how this works:
- We call group_by() prior to calling summarize() to have the data aggregation operation carried out on each group separately.
- We call sum() to compute the sum of values of col_2. We set the argument na.rm = TRUE so sum() would ignore NA values and return the sum of the rest.
- In sum(col_3 * col_4), we first multiply the values of col_3 and col_4 for each row, then sum the results (for each group).
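To make the behavior of na.rm = TRUE concrete, here is a hypothetical toy example (values chosen arbitrarily):
library(tidyverse)
df = tibble(
  col_1 = c('a', 'a', 'b'),
  col_2 = c(10, NA, 5),
  col_3 = c(2, 3, 4),
  col_4 = c(1, 2, 3))
df %>%
  group_by(col_1) %>%
  summarize(
    col_2_sum = sum(col_2, na.rm = TRUE),
    col_3_4_ws = sum(col_3 * col_4, na.rm = TRUE))
# group 'a': col_2_sum = 10 (the NA is ignored), col_3_4_ws = 2*1 + 3*2 = 8
# group 'b': col_2_sum = 5, col_3_4_ws = 4*3 = 12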
We wish to aggregate data by computing summary statistics, e.g. the mean, for each group.
In this example we wish to compute the mean, median, and standard deviation of values of the column col_3
for each group. We wish to carry out the aggregation on the data frame df
grouped by the column col_1
.
df_2 = df %>%
group_by(col_1) %>%
summarize(
col_3_mean = mean(col_3),
col_3_median = median(col_3),
col_3_std = sd(col_3)
)
Here is how this works:
- We call group_by() prior to calling summarize() to have the data aggregation operations carried out on each group separately.
- mean() computes the mean of values.
- median() computes the median of values.
- sd() computes the standard deviation of values.
- All three functions accept the argument na.rm, which is set to FALSE by default, meaning that if there is any NA value, the result is NA. If we wish to ignore NA values, we can set na.rm = TRUE (like we did with sum() in Arithmetic above).
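For instance, a sketch of the same aggregation with missing values in col_3 ignored (only needed if col_3 may contain NA values):
df_2 = df %>%
  group_by(col_1) %>%
  summarize(
    col_3_mean = mean(col_3, na.rm = TRUE),
    col_3_median = median(col_3, na.rm = TRUE),
    col_3_std = sd(col_3, na.rm = TRUE))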
We wish to aggregate data by computing the maximum or minimum value of a given column for each group.
In this example we wish to compute the minimum value of the numeric column col_3
for each group and the maximum value of the numeric column col_4
for each group. We wish to carry out the aggregation on the data frame df
grouped by the column col_1
.
df_2 = df %>%
group_by(col_1) %>%
summarize(
col_3_min = min(col_3),
col_4_max = max(col_4))
Here is how this works:
- We call group_by() prior to calling summarize() to have the data aggregation operation carried out on each group separately.
- min() returns the minimum value.
- max() returns the maximum value.
- max() and min() belong to the larger class of ranking operations, which we cover in Ranking.
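min() and max() also accept na.rm. As a sketch, here is a hypothetical variant that ignores missing values and additionally computes the range of col_4:
df_2 = df %>%
  group_by(col_1) %>%
  summarize(
    col_3_min = min(col_3, na.rm = TRUE),
    col_4_max = max(col_4, na.rm = TRUE),
    col_4_range = max(col_4, na.rm = TRUE) - min(col_4, na.rm = TRUE))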
We wish to aggregate data by returning the first or last value that a column takes for each group.
In this example, we wish to return the first value of the column col_2
for each group, the last value of the column col_2
for each group, and the second value of the column col_3
for each group. We wish to carry out the aggregation on the data frame df
grouped by the column col_1
.
df_2 = df %>%
group_by(col_1) %>%
summarize(
col_2_first = first(col_2),
col_2_last = last(col_2),
col_3_2nd = nth(col_3, 2))
Here is how this works:
- We call group_by() prior to calling summarize() to have the data aggregation operation carried out on each group separately.
- first() returns the first occurring value.
- last() returns the last occurring value.
- nth() returns the nth occurring value.
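As a quick illustration, here is a hypothetical toy data frame and what each function returns per group:
library(tidyverse)
df = tibble(
  col_1 = c('a', 'a', 'b'),
  col_2 = c('x', 'y', 'z'),
  col_3 = c(1, 2, 3))
df %>%
  group_by(col_1) %>%
  summarize(
    col_2_first = first(col_2),  # 'x' for group 'a', 'z' for group 'b'
    col_2_last = last(col_2),    # 'y' for group 'a', 'z' for group 'b'
    col_3_2nd = nth(col_3, 2))   # 2 for group 'a', NA for group 'b' (it has no 2nd value)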
We wish to perform data aggregation operations on a column of a string data type.
In this example, we wish to summarize the data frame df over groups defined by the column col_1
. The aggregation operations we wish to carry out are: (1) Concatenate the unique values of the column col_2
per group into a comma separated string. (2) Obtain the last value of col_3
for each group that matches a given regular expression.
df_2 = df %>%
group_by(col_1) %>%
summarize(
col_2_distinct =
unique(col_2) %>% str_c(collapse = ', '),
col_3_match =
str_subset(col_3, '\\d+$') %>% last()
)
Here is how this works:
- We call group_by() prior to calling summarize() to have the data aggregation operation carried out on each group separately.
- In unique(col_2) %>% str_c(collapse = ', '), we combine (concatenate) the unique values of the string column col_2 for each group into one comma separated string. The two functions we used are:
  - unique() returns the unique values as a vector. See Uniqueness.
  - str_c() combines multiple strings into one separated by the value passed to the collapse argument.
- In str_subset(col_3, '\\d+$') %>% last(), we return the last value of col_3 for each group that matches the regular expression '\\d+$'.
  - str_subset() returns the subset of values of a string vector that match the given pattern.
  - last() returns the last occurring value of a given vector.
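As an illustration, here is a hypothetical toy data frame and the per-group results of the two string aggregations:
library(tidyverse)
df = tibble(
  col_1 = c('a', 'a', 'b'),
  col_2 = c('x', 'x', 'y'),
  col_3 = c('id 7', 'id 12', 'id A'))
df %>%
  group_by(col_1) %>%
  summarize(
    col_2_distinct = unique(col_2) %>% str_c(collapse = ', '),
    col_3_match = str_subset(col_3, '\\d+$') %>% last())
# group 'a': col_2_distinct = 'x', col_3_match = 'id 12'
# group 'b': col_2_distinct = 'y', col_3_match = NA (no value ends in digits)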
We wish to perform data aggregation operations on a column of a logical data type; i.e. one that takes a value of TRUE or FALSE.
In this example, for each group, where the groups are defined by the value of the column col_1
, we wish to obtain the number and ratio of rows where the column col_2
is NA
.
df_2 = df %>%
group_by(col_1) %>%
summarize(
col_2_na_count =
col_2 %>% is.na() %>% sum(),
col_2_na_rate =
col_2 %>% is.na() %>% mean()
)
Here is how this works:
- We call group_by() prior to calling summarize() to have the data aggregation operation carried out on each group separately.
- We pass col_2 to the function is.na() from base R, which returns a logical vector that is TRUE where col_2 is NA and FALSE otherwise.
- We pipe the output of is.na() to sum() to obtain the number of missing values. Summing a logical vector is basically counting the number of TRUE values (which in this case is the number of missing values) because when arithmetic operations are carried out on logical vectors, TRUE is regarded as 1 and FALSE is regarded as 0.
- We pipe the output of is.na() to mean() to obtain the proportion of missing values of col_2 for each group. Since TRUE is regarded as 1 and FALSE is regarded as 0, taking the mean of a logical vector is basically taking the ratio between the number of TRUE values and the total number of values in the input vector.
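The same pattern works for any logical condition, not just is.na(). For instance, assuming col_3 is numeric, a sketch of counting and computing the share of rows where col_3 is positive:
df_2 = df %>%
  group_by(col_1) %>%
  summarize(
    col_3_pos_count = sum(col_3 > 0, na.rm = TRUE),
    col_3_pos_rate = mean(col_3 > 0, na.rm = TRUE))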
We wish to group a data frame by some unit or interval of time, e.g. group by year or weekday.
In this example, we wish to group the data frame df
by weekday inferred from the date-time column col_1
, where col_1
is provided as a string. We then wish to compute the number of rows in each group and the number of unique hours that col_1
takes in each group.
df_2 = df %>%
mutate(col_1 = ymd_hms(col_1)) %>%
group_by(week_day = wday(col_1, label = TRUE)) %>%
summarize(
count = n(),
hour_count = hour(col_1) %>% n_distinct())
Here is how this works:
- We call group_by() prior to calling summarize() to have the data aggregation operations carried out on each group separately.
- In col_1 = ymd_hms(col_1), we convert the data type of the column col_1 from string to datetime.
- In wday(col_1, label = TRUE), we identify the weekday of each datetime value of col_1. We set label = TRUE to have string labels, e.g. “Mon”, instead of integers.
- Note that we create the new grouping column week_day directly inside group_by(), which is a good idea when we do not want to retain the column beyond grouping. We cover this in more detail in Grouping.
- We call n() to obtain the number of rows in each group.
- Via hour(col_1), we extract the hour of each datetime value of col_1 for the current group.
- We then call n_distinct() to count the number of unique values of hour(col_1) for each group.
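The same approach extends to other time units supported by lubridate. For example, a sketch of grouping by calendar month instead of weekday:
df_2 = df %>%
  mutate(col_1 = ymd_hms(col_1)) %>%
  group_by(month = month(col_1, label = TRUE)) %>%
  summarize(count = n())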
We wish to perform a custom aggregation operation, i.e. one for which there is no built-in function.
There are two approaches to performing custom aggregation operations:
- Creating a custom function and passing it to summarize(). This is appropriate for aggregations that we need to carry out more than once or aggregations that are relatively complex.
- Writing the aggregation logic directly inside summarize(). This is appropriate for simple logic that we need to carry out once or twice.

We will look at both approaches in the example below.
In this example, for each group of the data frame df, where the groups are defined by the values of the column col_1, we wish to:
- Compute the most frequently occurring value (the mode) of the column col_2.
- Compute the sum of the integers embedded in the string values of the column col_3.
mode <- function(.x) {
.x %>%
table() %>%
sort(decreasing = TRUE) %>%
names() %>%
first()
}
df_2 = df %>%
group_by(col_1) %>%
summarize(
col_2_mode = mode(col_2),
col_3_sum = col_3 %>%
str_extract('\\d+') %>%
parse_integer() %>%
sum())
Here is how this works:
- We call group_by() prior to calling summarize() to have the data aggregation operations carried out on each group separately.
- We define a custom function mode() that accepts a vector and returns the most commonly occurring value. We cover how this function works in Counting.
- In col_2_mode = mode(col_2), we call the mode() function from within summarize() and pass to it col_2. Since we called group_by() before summarize(), mode() will be called once for each group to return the most frequently occurring value of col_2 for that group.
- In col_3_sum, we carry out the aggregation logic directly inside summarize():
  - str_extract('\\d+') extracts the first integer substring occurring in each value of col_3.
  - parse_integer() converts the strings containing integers extracted by str_extract() into proper integers.
  - We then call sum() to reduce the values to a single value for the current group.
- We could have written the col_3_sum logic as sum(parse_integer(str_extract(col_3, '\\d+'))). However, instead of nesting functions, our preferred approach is to chain them via the pipe %>% operator. Note that what is being piped through is not the data frame df, but the values of the column col_3 for each group.
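As a quick illustration, here is a hypothetical toy data frame run through the custom aggregation above (it reuses the mode() helper defined earlier):
library(tidyverse)
df = tibble(
  col_1 = c('a', 'a', 'a', 'b'),
  col_2 = c('x', 'y', 'y', 'z'),
  col_3 = c('3 apples', '4 apples', '5 apples', '10 apples'))
df %>%
  group_by(col_1) %>%
  summarize(
    col_2_mode = mode(col_2),  # assumes the custom mode() defined above is in scope
    col_3_sum = col_3 %>% str_extract('\\d+') %>% parse_integer() %>% sum())
# group 'a': col_2_mode = 'y', col_3_sum = 3 + 4 + 5 = 12
# group 'b': col_2_mode = 'z', col_3_sum = 10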