Function Specification

We wish to specify one or more data transformation functions to apply to each of the selected columns without spelling out each data transformation expression explicitly.

In this section, we cover the following function specification scenarios:

  • A named function which may be a built-in function or a custom function.
  • An anonymous function which is a one-sided formula defined via ~.
  • Multiple functions each of which is applied separately to each of the selected columns.
  • A non-vectorized function, i.e. one that acts on one row at a time.

This section is complemented by

  • Column Selection, where we cover how to select the columns to which we will apply the data transformation logic.
  • Output Naming where we cover how to specify the name(s) of output column(s) created by the implicit data transformation operations.

Named Function

We wish to transform a set of columns of a data frame by applying the same data transformation named function to each column.

In this example, we wish to apply the function round() to round the values of each of the columns col_1, col_2, and col_3 to the nearest integer.

library(dplyr)

df_2 = df %>%
  mutate(across(c(col_1, col_2, col_3), round))

Here is how this works:

  • We pass the data frame df to the function mutate().
  • Inside mutate() we use across() as follows:
    • The first argument to across() is a selection of columns. In this example, we use c(col_1, col_2, col_3) to select the columns col_1, col_2, and col_3.
    • The second argument to across() is the data transformation expression that we wish to apply to each column selected in the first argument. In this case the data transformation we wish to apply is the named function round().
  • The function round() is applied to each of the columns col_1, col_2, and col_3.
  • The output columns overwrite the original columns. See Output Naming for how to append new columns instead.
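
The same pattern applies to a custom named function. The following is a minimal sketch, assuming a small made-up data frame and a hypothetical helper rescale_01() (neither comes from the original example); it is meant only to illustrate that any function taking a column and returning a column of the same length can be passed by name.

```r
library(dplyr)

# Hypothetical sample data; the column names follow the example above.
df <- tibble(
  col_1 = c(1.2, 2.7, -3.4),
  col_2 = c(0.4, 5.6, 9.9),
  col_3 = c(10.1, 20.6, 30.4)
)

# Hypothetical custom function: rescale a column to the range [0, 1].
rescale_01 <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Built-in named function: the three columns are overwritten in place.
df_rounded <- df %>%
  mutate(across(c(col_1, col_2, col_3), round))

# Custom named function: passed by name, without parentheses.
df_rescaled <- df %>%
  mutate(across(c(col_1, col_2, col_3), rescale_01))
```

In both cases the function receives a whole column at a time, so it must be vectorized, i.e. return one output value per input value.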

Anonymous Function

We wish to transform a set of columns of a data frame by applying the same data transformation anonymous function to each column.

In this example, we wish to apply an arithmetic operation to each of the columns col_1, col_2, and col_3.

df_2 = df %>% 
  mutate(across(c(col_1, col_2, col_3), 
                ~ . * 22 / 7))

Here is how this works:

  • We use across() inside mutate() to carry out implicit data transformation as described in the “Named Function” scenario above.
  • In ~ . * 22 / 7, we specify an anonymous function that multiplies the values of the input column by 22 and divides by 7.
  • The anonymous function ~ . * 22 / 7 is applied to each of the selected columns.
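
To make the formula shorthand more concrete, here is a small sketch, assuming made-up sample data, that compares the one-sided formula with an ordinary base R anonymous function. The two forms are expected to give the same result; inside the formula, . stands for the current column.

```r
library(dplyr)

# Hypothetical sample data; multiples of 7 keep the arithmetic exact.
df <- tibble(col_1 = c(7, 14), col_2 = c(21, 28), col_3 = c(35, 42))

# One-sided formula, as in the example above; `.` is the current column.
df_formula <- df %>%
  mutate(across(c(col_1, col_2, col_3), ~ . * 22 / 7))

# The same transformation written as a base R anonymous function.
df_function <- df %>%
  mutate(across(c(col_1, col_2, col_3), function(x) x * 22 / 7))
```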

Multiple Functions

We wish to perform multiple data transformation operations to each of a set of columns of a data frame without having to spell out each data transformation explicitly.

In this example, we wish to apply two data transformations to each of the columns col_1 and col_2 of the data frame df. The two transformations are round(), to round the values to the nearest integer, and abs(), to obtain the absolute value.

df_2 = df %>%
  mutate(across(c(col_1, col_2), 
                list(round, abs)))

Here is how this works:

  • We use across() inside mutate() to carry out implicit data transformation as described in the “Named Function” scenario above.
  • In list(round, abs), we specify the functions that we wish to carry out on each of the selected columns.
  • When we pass a list of functions to across(), the input columns remain unaffected, and n new columns are created for each input column. In this case, n is 2 since we have two data transformation functions.
  • The name of an output column is the name of the input column followed by the suffix "_i", where i is the position of the function in the input list. For example, for an input column col_1, the two output columns resulting from the transformation above are named col_1_1 and col_1_2. See Output Naming for how to specify the names of output columns of an implicit data transformation operation.
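
The naming behavior described above can be checked on a small example. This is a sketch under the assumption of made-up sample data; the list names rounded and absolute in the second call are hypothetical and only illustrate that, with a named list, the list names are used in place of the positions.

```r
library(dplyr)

# Hypothetical sample data.
df <- tibble(col_1 = c(-1.6, 2.3), col_2 = c(4.8, -0.2))

# Unnamed list: output names use the position of each function in the list.
df_2 <- df %>%
  mutate(across(c(col_1, col_2), list(round, abs)))
names(df_2)

# Named list: output names use the list names instead of the positions.
df_3 <- df %>%
  mutate(across(c(col_1, col_2), list(rounded = round, absolute = abs)))
names(df_3)
```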

Non-Vectorized Function

One Variable

We wish to transform a set of columns by applying to each a non-vectorized data transformation operation, i.e. one that must be called on one value at a time rather than on a whole column. In other words, the operation we wish to apply acts on each value of each column separately to generate a corresponding set of transformed output columns.

In this example, we wish to obtain an estimate of the average number of events to expect after 100 trials assuming that we have a Poisson process where the rate of event occurrence is constant. The rates of event occurrence are given by the individual values of columns col_1 and col_2.

df_2 = df %>% 
  rowwise() %>% 
  mutate(across(c(col_1, col_2), 
                ~ mean(rpois(100, .))))

Here is how this works:

  • We apply rowwise() before mutate() so mutate() acts on each row individually.
  • In across(c(col_1, col_2), … we specify that we wish to apply the non-vectorized transformation function to both col_1 and col_2 individually.
  • Therefore, the function ~ mean(rpois(100, .)) will be applied to each individual value of the columns col_1 and col_2.
  • We use rpois() to simulate the number of events that can occur given a particular mean rate of event occurrence.
  • In rpois(100, .), we simulate 100 draws from a Poisson distribution where the mean rate of event occurrence is defined by the . which is the current value of the current column.
  • We then take the mean() of those 100 simulated values to obtain the desired estimate.
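
To see why rowwise() matters here, the sketch below (assuming made-up rates and an arbitrary seed) runs the same expression with and without it. Without rowwise(), mean() collapses the draws for the whole column into a single number that is repeated in every row; with rowwise(), each row gets its own estimate.

```r
library(dplyr)

set.seed(42)  # arbitrary seed so that the simulation is reproducible

# Hypothetical event rates.
df <- tibble(col_1 = c(1, 5, 10), col_2 = c(0.5, 2, 20))

# Without rowwise(): `.` is the whole column, so rpois() recycles the rates
# and mean() returns one value that is repeated in every row.
df_collapsed <- df %>%
  mutate(across(c(col_1, col_2), ~ mean(rpois(100, .))))

# With rowwise(): `.` is a single rate, so each row gets its own estimate.
df_2 <- df %>%
  rowwise() %>%
  mutate(across(c(col_1, col_2), ~ mean(rpois(100, .)))) %>%
  ungroup()  # drop the row-wise grouping once it is no longer needed
```

Each value in df_2 should land close to the rate it was simulated from, while every row of df_collapsed holds the same number per column.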

Multiple Variables - Same Data Type

We wish to apply a non-vectorized data transformation operation to multiple columns all of which have the same data type.

In this example, we wish to create two new columns: col_4 to hold the mean of the values of all numeric columns and col_5 to hold the number of numeric columns whose value is NA for the current row.

df_2 = df %>%
  rowwise() %>%
  mutate(
    col_4 = mean(c_across(where(is.numeric))),
    # col_4 is numeric and already exists at this point, so we exclude it
    col_5 = sum(is.na(c_across(where(is.numeric) & !any_of("col_4"))))
  )

Here is how this works:

  • We apply rowwise() before mutate() so mutate() acts on each row individually.
  • c_across() works with rowwise() to allow us to select the columns on which to perform row-wise aggregations just like we would inside a select() operation (i.e. uses tidyselect semantics). See Selecting.
  • c_across() uses vctrs::vec_c() to combine the selected values, which gives safer outputs but means it can only combine columns of compatible data types.
    • If we try to select columns of incompatible data types, we get an error like “Can't combine col_1 <integer> and col_2 <character>”.
    • While this is often the right thing to do, there are situations where we wish to work with columns of multiple data types e.g. for a case like col_5 here to count the number of columns with missing values for the current row. See “Multiple Variables - Different Data Types” for how to execute a non-vectorized data transformation operation on multiple columns of different data types.
    • Note that this is a constraint of c_across() and not of rowwise(): base R's c() coerces its inputs to a common type, so sum(is.na(c(col_1, col_2, col_3))) works even if the columns have different data types.
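
The points above can be illustrated on a small data frame with mixed column types. This is a sketch under assumed sample data; it selects the numeric columns explicitly by name, passes na.rm = TRUE to mean() (an addition for illustration), and adds a hypothetical column col_6 that counts missing values across all three columns using base R's c().

```r
library(dplyr)

# Hypothetical sample data with two numeric columns and one character column.
df <- tibble(
  col_1 = c(1, NA, 3),
  col_2 = c(4, 5, NA),
  col_3 = c("a", NA, "c")
)

df_2 <- df %>%
  rowwise() %>%
  mutate(
    # Row-wise mean of the numeric columns, ignoring missing values.
    col_4 = mean(c_across(c(col_1, col_2)), na.rm = TRUE),
    # Number of missing values among the numeric columns.
    col_5 = sum(is.na(c_across(c(col_1, col_2)))),
    # c() coerces to a common type, so mixing types is allowed here.
    col_6 = sum(is.na(c(col_1, col_2, col_3)))
  ) %>%
  ungroup()

# c_across() refuses to combine numeric and character columns.
mixed <- tryCatch(
  df %>% rowwise() %>% mutate(n = sum(is.na(c_across(everything())))),
  error = function(e) "error"
)
```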

Multiple Variables - Different Data Types

We wish to apply a non-vectorized data transformation operation to multiple columns that may have different data types.

In this example, we wish to create a new column, col_4, which holds the number of columns whose value is NA for the current row.

library(purrr)

df_2 = df %>% 
  mutate(col_4 = pmap_int(., ~sum(is.na(c(...)))))

Here is how this works:

  • As discussed in “Multiple Variables - Same Data Type” above, c_across() requires that the selected columns be of a common data type. An alternative approach that allows us to work with columns of different data types is to use the purrr family of map functions.
  • The first step to using the purrr family of map functions is to identify the right mapping function for the inputs and output of the situation at hand. In this case, since we have many input columns and one output of an integer data type, we opted for the pmap_int() mapping function.
  • pmap_int() expects:
    • a list of columns (more precisely, a list of equal-length vectors, which is what a data frame is) and
    • a function or an anonymous function (one-sided formula) that accepts as many inputs as the number of input columns and returns a single integer value.
  • The anonymous function ~sum(is.na(c(...))) works as follows:
    • It accepts the values of all columns for the current row via the dots argument (...).
    • It wraps them into a vector via c() because is.na() expects a vector and not individual values.
    • The vector is passed to is.na().
    • The output of is.na() is a logical vector with as many elements (TRUE or FALSE) as the number of columns.
    • sum() acts on the logical vector output of is.na() treating TRUE as 1 and FALSE as 0 hence counting the number of TRUE values i.e. the number of missing values.
  • pmap_int() iterates over rows and passes the corresponding values of all columns to the anonymous function and finally returns the output as a vector of integers of the same size as the number of rows.
  • The output of pmap_int() is then assigned to the new column col_4.
  • We cover mapping over a list and the purrr map family of functions in detail in List Operations.
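
The steps above can be traced on a small example. This is a sketch assuming made-up sample data with three columns of different types; the second call, which restricts the count to a subset of columns via select(), is an illustrative variation and not part of the original example.

```r
library(dplyr)
library(purrr)

# Hypothetical sample data with integer, character, and logical columns.
df <- tibble(
  col_1 = c(1L, NA, 3L),
  col_2 = c("a", NA, NA),
  col_3 = c(TRUE, FALSE, NA)
)

# Count missing values across all columns for each row.
df_2 <- df %>%
  mutate(col_4 = pmap_int(., ~ sum(is.na(c(...)))))

# Variation: count missing values across a subset of the columns only.
df_3 <- df %>%
  mutate(col_4 = pmap_int(select(., col_1, col_2), ~ sum(is.na(c(...)))))
```

Here the first row has no missing values, the second row is missing col_1 and col_2, and the third row is missing col_2 and col_3.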

Alternatively,

# Count the number of missing values among the arguments passed for one row
fun <- function(...) {
  return(sum(is.na(c(...))))
}

df_2 = df %>% 
    mutate(col_4 = pmap_int(., fun))

Here is how this works:

  • We isolated the transformation logic into a function fun() that we can then call from within pmap_int().
  • If an anonymous function (one-sided formula) gets too complicated, it is easy to make mistakes. It is, therefore, recommended to separate it out into a named function.