Common Operations

In this page, we will cover the functions used to carry out some of the most common data aggregation operations.

In its simplest form, a data aggregation function is one which takes multiple individual values and returns a single summary value; e.g. sum.

We cover the data manipulation operations relevant to each data type extensively in dedicated sections (see the data type operation sections in the navigation bar). Think of this page as a brief overview of the most common aggregation functions for each data type, with links to the respective operations sections for more detailed coverage.

Note: In this page, we demonstrate the data aggregation operations in a scenario where a data frame is grouped by one column. We can drop the call to group_by() to aggregate over the entire data frame (without grouping). We can also group by multiple columns or in myriad other ways (See Grouping).
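
For instance, here is a minimal sketch of both alternatives (same hypothetical data frame df and placeholder column names):

# Aggregate over the entire data frame (no grouping); returns a single summary row.
df %>% summarize(count = n())

# Group by two columns; returns one row per combination of col_1 and col_2.
df %>%
  group_by(col_1, col_2) %>%
  summarize(count = n())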

Counting

We wish to count the number of rows or the number of unique values in a group.

In this example, we wish to count the number of rows in each group and the number of unique values of the column col_2 in each group. We wish to carry out the aggregation on the data frame df grouped by the column col_1.

df_2 = df %>% 
  group_by(col_1) %>%
  summarize(
    count = n(),
    col_2_vals = n_distinct(col_2))

Here is how this works:

  • We call group_by() prior to calling summarize() to have the data aggregation operation carried out on each group separately.
  • n() gives the current group size. See Counting.
  • n_distinct() gives the number of unique values. See Uniqueness.

Alternatively,

df %>% count(col_1)

Here is how this works:

  • For the case of computing the number of rows per group, we can use the convenience count() function instead of the more verbose df %>% group_by(col_1) %>% summarize(count = n()).
  • count() can take one or more columns and computes the number of occurrences of each unique value or each unique combination of values.
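
For example, here is a brief sketch of counting unique combinations of two columns, with the most frequent first (sort = TRUE is an optional argument of count()):

# Number of rows for each unique combination of col_1 and col_2,
# sorted so the most frequent combinations come first.
df %>% count(col_1, col_2, sort = TRUE)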

Arithmetic

We wish to aggregate data by computing an arithmetic operation, e.g. sum, for each group.

In this example, we wish to compute the sum of values of the column col_2 for each group and the sum of the product of the values of the columns col_3 and col_4 for each group. We wish to carry out the aggregation on the data frame df grouped by the column col_1.

df_2 = df %>% 
  group_by(col_1) %>%
  summarize(
    col_2_sum = sum(col_2, na.rm = TRUE),
    col_3_4_ws = sum(col_3 * col_4, na.rm = TRUE))

Here is how this works:

  • We call group_by() prior to calling summarize() to have the data aggregation operation carried out on each group separately.
  • We then use the built-in function sum() to compute the sum of the values of col_2. We set the argument na.rm = TRUE so that sum() ignores NA values and returns the sum of the rest (see the sketch after this list).
  • In sum(col_3 * col_4), we first multiply the values of col_3 and col_4 for each row and then sum the results for each group.
  • For full coverage of arithmetic operations, see Arithmetic Operations.
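
To illustrate the effect of na.rm, here is a minimal base R sketch on a toy vector (not on df):

sum(c(1, 2, NA))                # NA, because one value is missing
sum(c(1, 2, NA), na.rm = TRUE)  # 3, the NA value is ignored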

Summary Statistics

We wish to aggregate data by computing summary statistics, e.g. the mean, for each group.

In this example we wish to compute the mean, median, and standard deviation of values of the column col_3 for each group. We wish to carry out the aggregation on the data frame df grouped by the column col_1.

df_2 = df %>% 
  group_by(col_1) %>%
  summarize(
    col_3_mean = mean(col_3),
    col_3_median = median(col_3),
    col_3_std = sd(col_3)
  )

Here is how this works:

  • We call group_by() prior to calling summarize() to have the data aggregation operation carried out on each group separately.
  • As their names imply, given a vector of values:
    • mean() computes the mean of values.
    • median() computes the median of values.
    • sd() computes the standard deviation of values.
  • All three descriptive statistics functions have an na.rm argument, which is FALSE by default, meaning that if there is any NA value the result is NA. If we wish to ignore NA values, we can set na.rm = TRUE (as we did with sum() in Arithmetic above); a variant with na.rm = TRUE is sketched after this list.
  • For full coverage of summary statistics, see Summary Statistics.
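
If col_3 may contain NA values, a variant of the call above (a sketch; only needed when the data is incomplete) would be:

df_2 = df %>%
  group_by(col_1) %>%
  summarize(
    col_3_mean = mean(col_3, na.rm = TRUE),
    col_3_median = median(col_3, na.rm = TRUE),
    col_3_std = sd(col_3, na.rm = TRUE)
  )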

Range

We wish to aggregate data by computing the maximum or minimum value of a given column for each group.

In this example we wish to compute the minimum value of the numeric column col_3 for each group and the maximum value of the numeric column col_4 for each group. We wish to carry out the aggregation on the data frame df grouped by the column col_1.

df_2 = df %>% 
  group_by(col_1) %>%
  summarize(
    col_3_min = min(col_3),
    col_4_max = max(col_4))

Here is how this works:

  • We call group_by() prior to calling summarize() to have the data aggregation operation carried out on each group separately.
  • As their names imply, given a vector of values:
    • min() returns the minimum value.
    • max() returns the maximum value.
  • max() and min() belong to the larger class of ranking operations, which we cover in Ranking.
  • Note that here we are aggregating the values of individual columns in a group; to filter entire rows in a group based on the rank of the values of some columns, see Filtering Groups by Rank.
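
As a small extension (a sketch under the same assumptions), min() and max() can be combined within summarize() to compute the span of values per group:

df_2 = df %>%
  group_by(col_1) %>%
  summarize(col_3_span = max(col_3) - min(col_3))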

Position

We wish to aggregate data by returning the first or last value that a column takes for each group.

In this example, we wish to return the first value of the column col_2 for each group, the last value of the column col_2 for each group, and the second value of the column col_3 for each group. We wish to carry out the aggregation on the data frame df grouped by the column col_1.

df_2 = df %>% 
  group_by(col_1) %>%
  summarize(
    col_2_first = first(col_2),
    col_2_last = last(col_2),
    col_3_2nd = nth(col_3, 2))

Here is how this works:

  • We call group_by() prior to calling summarize() to have the data aggregation operation carried out on each group separately.
  • As their names imply, given a vector of values:
    • first() returns the first occurring value.
    • last() returns the last occurring value.
    • nth() returns the nth occurring value.
  • It is often important to sort the data in some way before carrying out an aggregation by position operation (see the sketch after this list).
  • For full coverage of position-related operations, see Position.
  • Note: Here we are aggregating the values of individual columns in a group; to filter entire rows in a group based on their position, see Filtering Groups by Position.
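
Following up on the note about sorting, here is a sketch (under the hypothetical assumption that we want the value of col_2 at the lowest value of col_3 in each group):

df_2 = df %>%
  arrange(col_3) %>%    # sort rows before grouping and aggregating
  group_by(col_1) %>%
  summarize(col_2_first = first(col_2))

Equivalently, we could keep the original row order and write first(col_2, order_by = col_3) inside summarize().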

String

We wish to perform data aggregation operations on a column of a string data type.

In this example, we wish to summarize the data frame df over groups defined by the column col_1. The aggregation operations we wish to carry out are: (1) Concatenate the unique values of the column col_2 per group into a comma separated string. (2) Obtain the last value of col_3 for each group that matches a given regular expression.

df_2 = df %>% 
  group_by(col_1) %>%
  summarize(
    col_2_distinct = 
            unique(col_2) %>% str_c(collapse = ', '),
    col_3_match =
      str_subset(col_3, '\\d+$') %>% last()
  )

Here is how this works:

  • We call group_by() prior to calling summarize() to have the data aggregation operation carried out on each group separately.
  • In unique(col_2) %>% str_c(collapse = ', ') we combine (concatenate) the unique values of the string column col_2 for each group into one comma separated string. The two functions we used are:
    • unique() returns the unique values as a vector. See Uniqueness.
    • str_c() combines multiple strings into one separated by the value passed to the collapse argument.
  • In str_subset(col_3, '\\d+$') %>% last() we return the last value of col_3 for each group that matches the regular expression '\\d+$'.
    • str_subset() returns the subset of values of a string vector that match the given pattern.
    • last() returns the last occurring value of a given vector.
  • For full coverage of string operations (including regular expressions), see String Operations.
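
A small variation on the concatenation above (a sketch) sorts the unique values first so that the resulting string is in a predictable order:

df_2 = df %>%
  group_by(col_1) %>%
  summarize(
    col_2_distinct = unique(col_2) %>% sort() %>% str_c(collapse = ', '))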

Logical

We wish to perform data aggregation operations on a column of a logical data type; i.e. one that takes a value of TRUE or FALSE.

In this example, for each group, where the groups are defined by the value of the column col_1, we wish to obtain the number and ratio of rows where the column col_2 is NA.

df_2 = df %>% 
  group_by(col_1) %>%
  summarize(
    col_2_na_count = 
      col_2 %>% is.na() %>% sum(),
    col_2_na_rate =
      col_2 %>% is.na() %>% mean()
  )

Here is how this works:

  • We call group_by() prior to calling summarize() to have the data aggregation operation carried out on each group separately.
  • We pass the column col_2 to the function is.na() from base R which returns a logical vector that is TRUE where col_2 is NA and FALSE otherwise.
  • We pass the logical vector returned by is.na() to sum() to obtain the number of missing values. Summing a logical vector effectively counts the number of TRUE values (in this case, the number of missing values) because when arithmetic operations are carried out on logical vectors, TRUE is treated as 1 and FALSE as 0.
  • Similarly, we use mean() to obtain the proportion of missing values of col_2 for each group. Since TRUE is treated as 1 and FALSE as 0, taking the mean of a logical vector gives the ratio of the number of TRUE values to the total number of values in the input vector. The sketch after this list extends the same idea to any logical condition.
  • For full coverage of logical operations, see Logical Operations.
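
The same pattern extends to any logical condition (a sketch assuming col_3 is numeric and using a hypothetical threshold of 10):

df_2 = df %>%
  group_by(col_1) %>%
  summarize(
    col_3_over_10_count = sum(col_3 > 10, na.rm = TRUE),
    col_3_over_10_rate = mean(col_3 > 10, na.rm = TRUE))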

Date Time Aggregation

We wish to group a data frame by some unit or interval of time e.g. group by year or weekday.

In this example, we wish to group the data frame df by weekday inferred from the date-time column col_1, where col_1 is provided as a string. We then wish to compute the number of rows in each group and the number of unique hours that col_1 takes in each group.

df_2 = df %>%
  mutate(col_1 = ymd_hms(col_1)) %>%
  group_by(week_day = wday(col_1, label = TRUE)) %>%
  summarize(
    count = n(),
    hour_count = hour(col_1) %>% n_distinct())

Here is how this works:

  • We call group_by() prior to calling summarize() to have the data aggregation operations carried out on each group separately.
  • In col_1 = ymd_hms(col_1), we convert the data type of the column col_1 from string to datetime.
  • In wday(col_1, label = TRUE), we identify the weekday of each datetime value of col_1. We set label=TRUE to have string labels, e.g. “Mon”, instead of integers.
  • We do the extraction of weekday inside group_by() which is a good idea when we do not want to retain the column beyond grouping. We cover this in more detail in Grouping.
  • We use n() to obtain the number of rows in each group.
  • In hour(col_1), we extract the hour of each datetime value of col_1 for the current group.
  • We use n_distinct() to count the number of unique values of hour(col_1) for each group.
  • For full coverage of date-time operations, see Date Time Operations.
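
Grouping by other units or intervals of time follows the same pattern; for instance, here is a sketch of grouping by calendar month using lubridate's floor_date():

df_2 = df %>%
  mutate(col_1 = ymd_hms(col_1)) %>%
  group_by(month = floor_date(col_1, unit = 'month')) %>%
  summarize(count = n())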

Custom Operations

We wish to perform a custom aggregation operation, i.e. one for which there is no built-in function.

There are two approaches to performing custom aggregation operations:

  • Via a custom function that we call from within summarize(). This is appropriate for aggregations that we need to carry out more than once or aggregations that are relatively complex.
  • Via embedded logic inside summarize(). This is appropriate for simple logic that we need to carry out once or twice.

We will see both approaches in the example below.

In this example, we wish to:

  • compute the mode, i.e. most frequently occurring value, of the column col_2
  • compute the sum of the numerical substrings of the values of the string column col_3

for each group of the data frame df, where the groups are defined by the values of the column col_1.

# Return the most frequently occurring value (the mode) of a vector.
mode <- function(.x) {
  .x %>% 
    table() %>% 
    sort(decreasing = TRUE) %>% 
    names() %>% 
    first()
}

df_2 = df %>%
  group_by(col_1) %>%
  summarize(
    col_2_mode = mode(col_2), 
    col_3_sum = col_3 %>% 
      str_extract('\\d+') %>% 
      parse_integer() %>% 
      sum())

Here is how this works:

  • We call group_by() prior to calling summarize() to have the data aggregation operations carried out on each group separately.
  • We use a custom function to carry out the first aggregation:
    • We created the custom function mode() that accepts a vector and returns the most commonly occurring value. We cover how this function works in Counting.
    • In col_2_mode = mode(col_2), we call the mode() function from within summarize() and pass to it col_2. Since we called group_by() before summarize(), mode() will be called once for each group to return the most frequently occurring value of col_2 for that group.
  • We use embedded logic to carry out the second aggregation:
    • We carry out three functions in sequence, each acting on the output of the one prior:
      • str_extract('\\d+') extracts the first integer substring occurring in each value of col_3.
      • parse_integer() converts the strings extracted by str_extract() into proper integers.
      • sum() then reduces those integers to a single value for the current group.
    • We could have done this via sum(parse_integer(str_extract(col_3, '\\d+'))). However, instead of nesting functions, our preferred approach is to chain them via the pipe %>% operator. Note that what is being piped through is not the data frame df, but the values of the column col_3 for each group.
  • As mentioned above, custom functions are preferred over embedded logic for more involved logic or logic that we need to carry out multiple times.
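
As a quick illustration of the custom function on its own (a toy vector, not a column of df):

mode(c('a', 'b', 'b', 'c'))
# returns "b", the most frequently occurring value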