We wish to specify one or more data transformation functions to apply to each of the selected columns without spelling out each data transformation expression explicitly.
In this section, we cover the following function specification scenarios:
.This section is complemented by
Named Function
We wish to transform a set of columns of a data frame by applying the same named data transformation function to each column.
In this example, we wish to apply the function round() to round the values of each of the columns col_1, col_2, and col_3 to the nearest integer.
library(dplyr)
df_2 = df %>%
  mutate(across(c(col_1, col_2, col_3), round))
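To try this call on concrete data, here is a minimal sketch. The sample data frame, including the extra character column col_4, is made up for illustration and is not part of the original example.

```r
library(dplyr)

# Hypothetical sample data (an assumption, for illustration only).
df <- tibble(
  col_1 = c(1.2, 2.7, -3.4),
  col_2 = c(0.4, 10.6, 5.1),
  col_3 = c(-0.8, 7.3, 2.9),
  col_4 = c("a", "b", "c")
)

# Round the three selected columns in place; col_4 is not selected
# and is therefore left untouched.
df_2 <- df %>%
  mutate(across(c(col_1, col_2, col_3), round))
```

Because no new names are requested, the rounded values overwrite the selected columns rather than being added alongside them.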
Here is how this works:
- We pass the data frame df to the function mutate().
- Inside mutate() we use across() as follows:
  - The first argument to across() is a selection of columns. In this example, we use c(col_1, col_2, col_3) to select the columns col_1, col_2, and col_3.
  - The second argument to across() is the data transformation expression that we wish to apply to each column selected in the first argument. In this case the data transformation we wish to apply is the named function round().
- The function round() is applied to each of the columns col_1, col_2, and col_3.
Anonymous Function
We wish to transform a set of columns of a data frame by applying the same anonymous data transformation function to each column.
In this example, we wish to apply an arithmetic operation to each of the columns col_1, col_2, and col_3.
df_2 = df %>%
  mutate(across(c(col_1, col_2, col_3),
                ~ . * 22 / 7))
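As a minimal sketch, the formula shorthand can be compared with a fully written anonymous function to confirm that both give the same result. The sample values and the names res_formula and res_function are assumptions chosen for illustration.

```r
library(dplyr)

# Hypothetical sample data (an assumption, for illustration only).
df <- tibble(col_1 = c(7, 14), col_2 = c(21, 28), col_3 = c(0, 3.5))

# Formula shorthand: `.` stands for the current column.
res_formula <- df %>%
  mutate(across(c(col_1, col_2, col_3), ~ . * 22 / 7))

# The same transformation written as a regular anonymous function.
res_function <- df %>%
  mutate(across(c(col_1, col_2, col_3), function(x) x * 22 / 7))
```

The two forms are interchangeable here. The formula is shorter, while function(x) makes the argument explicit.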
Here is how this works:
- We use across() inside mutate() to carry out implicit data transformation as described in the “Named Function” scenario above.
- In ~ . * 22 / 7, we specify an anonymous function that multiplies the values of the input column by 22 and divides by 7.
- The anonymous function ~ . * 22 / 7 is applied to each of the selected columns.
Multiple Functions
We wish to perform multiple data transformation operations on each of a set of columns of a data frame without having to spell out each data transformation explicitly.
In this example, we wish to apply two data transformations to each of the columns col_1 and col_2 of the data frame df. The two transformations are: round() to round the values to the nearest integer and abs() to obtain the absolute value.
df_2 = df %>%
  mutate(across(c(col_1, col_2),
                list(round, abs)))
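The following minimal sketch shows which columns this call produces. The sample data is made up, and the second call, which passes a named list, goes beyond the example above and is included only to contrast the resulting column names.

```r
library(dplyr)

# Hypothetical sample data (an assumption, for illustration only).
df <- tibble(col_1 = c(-1.6, 2.2), col_2 = c(3.7, -4.1))

# Unnamed list: output columns get positional suffixes (_1, _2).
df_unnamed <- df %>%
  mutate(across(c(col_1, col_2), list(round, abs)))

# Named list (an addition for comparison): output columns use the
# list names as suffixes instead of positions.
df_named <- df %>%
  mutate(across(c(col_1, col_2), list(round = round, abs = abs)))

names(df_unnamed)
names(df_named)
```

In both cases col_1 and col_2 keep their original values, and four new columns are appended.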
Here is how this works:
- We use across() inside mutate() to carry out implicit data transformation as described in the “Named Function” scenario above.
- In list(round, abs), we specify the functions that we wish to carry out on each of the selected columns.
- When we pass a list of functions to across(), the input columns will remain unaffected, and n new columns will be created for each input column. In this case, n is 2 since we have two data transformation functions.
- By default, each output column is named after its input column with the suffix “_i” where i is the position of the function in the input list, e.g. if we have an input column col_1, the names of the two output columns resulting from the transformation above will be col_1_1 and col_1_2. See Output Naming for how to specify the names of output columns of an implicit data transformation operation.
One Variable
We wish to transform a set of columns by applying to each a non-vectorized data transformation operation that acts on one column at a time. In other words, the operation we wish to apply acts on each value of each column separately to generate a corresponding set of transformed output columns.
In this example, we wish to obtain an estimate of the average number of events to expect after 100 trials assuming that we have a Poisson process where the rate of event occurrence is constant. The rates of event occurrence are given by the individual values of columns col_1 and col_2.
df_2 = df %>%
  rowwise() %>%
  mutate(across(c(col_1, col_2),
                ~ mean(rpois(100, .))))
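To see why rowwise() matters here, the sketch below runs the same transformation with and without it. The sample rates, the seed, and the trailing ungroup() are assumptions added for illustration.

```r
library(dplyr)

set.seed(42)  # arbitrary seed so the simulation is repeatable

# Hypothetical event rates (an assumption, for illustration only).
df <- tibble(col_1 = c(0.5, 2, 10), col_2 = c(1, 5, 20))

# Without rowwise(): rpois() recycles the whole column of rates and
# mean() collapses all 100 draws into one number, which mutate()
# then repeats on every row.
df_vectorized <- df %>%
  mutate(across(c(col_1, col_2), ~ mean(rpois(100, .))))

# With rowwise(): each rate gets its own 100 draws and its own mean.
# ungroup() drops the row-wise grouping afterwards.
df_rowwise <- df %>%
  rowwise() %>%
  mutate(across(c(col_1, col_2), ~ mean(rpois(100, .)))) %>%
  ungroup()
```

In df_vectorized every row of a column holds the same value, whereas df_rowwise holds one estimate per original rate.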
Here is how this works:
- We use rowwise() before mutate() so mutate() acts on each row individually.
- In across(c(col_1, col_2), … we specify that we wish to apply the non-vectorized transformation function to both col_1 and col_2 individually.
- The anonymous function ~ mean(rpois(100, .)) will be applied to each individual value of the columns col_1 and col_2.
- We use rpois() to simulate the number of events that can occur given a particular mean rate of event occurrence.
- In rpois(100, .), we simulate 100 draws from a Poisson distribution where the mean rate of event occurrence is defined by the . which is the current value of the current column.
- We then take the mean() of those 100 simulated values to obtain the desired estimate.
Multiple Variables - Same Data Type
We wish to apply a non-vectorized data transformation operation on multiple columns all of which have the same data type.
In this example, we wish to create two new columns: col_4 to hold the mean of the values of all numeric columns and col_5 to hold the number of numeric columns whose value is NA for the current row.
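df_2 = df %>%
  rowwise() %>%
  mutate(
    col_4 = mean(c_across(where(is.numeric))),
    # Exclude col_4, created on the line above, so that it is not
    # counted as one of the input columns.
    col_5 = sum(is.na(c_across(where(is.numeric) & !col_4)))
  )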
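A minimal sketch of the same row-wise aggregation on concrete data follows. The sample values are made up, and three choices differ from the example above: the input columns are selected by name range, na.rm = TRUE is passed to mean(), and the row-wise grouping is dropped with ungroup() at the end.

```r
library(dplyr)

# Hypothetical sample data (an assumption, for illustration only).
df <- tibble(
  col_1 = c(1, NA, 3),
  col_2 = c(4, 5, NA),
  col_3 = c(7, NA, NA)
)

df_2 <- df %>%
  rowwise() %>%
  mutate(
    # Selecting col_1:col_3 by name keeps the new columns out of
    # the aggregation. na.rm = TRUE skips missing values.
    col_4 = mean(c_across(col_1:col_3), na.rm = TRUE),
    col_5 = sum(is.na(c_across(col_1:col_3)))
  ) %>%
  ungroup()
```

Here col_4 holds the row means of the non-missing values and col_5 holds the per-row count of missing values.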
Here is how this works:
- We use rowwise() before mutate() so mutate() acts on each row individually.
- c_across() works with rowwise() to allow us to select the columns on which to perform row-wise aggregations just like we would inside a select() operation (i.e. uses tidyselect semantics). See Selecting.
- c_across() uses vctrs::vec_c() to enforce safer outputs but that means it can only select columns of the same data types.
- If the selected columns have incompatible data types, we get an error such as “Can't combine col_1 <integer> and col_2 <character>”.
- This is why we limit col_5 here to counting the numeric columns with missing values for the current row. See “Multiple Variables - Different Data Types” for how to execute a non-vectorized data transformation operation on multiple columns of different data types.
- If we list the columns explicitly, we can use c() instead of c_across() so this works sum(is.na(c(col_1, col_2, col_3))) even if the columns have different data types.
Multiple Variables - Different Data Types
We wish to apply a non-vectorized data transformation operation on multiple columns that may have different data types.
In this example, we wish to create a new column, col_4, which holds the number of columns whose value is NA for the current row.
library(purrr)
df_2 = df %>%
  mutate(col_4 = pmap_int(., ~ sum(is.na(c(...)))))
Here is how this works:
- c_across() requires that the selected columns be of a common data type. An alternative approach that allows us to work with columns of different data types is to use the purrr family of map functions.
- The key to using the purrr family of map functions is to identify the right mapping function for the inputs and output of the situation at hand. In this case, since we have many input columns and one output of an integer data type, we opted for the pmap_int() mapping function.
- pmap_int() expects:
  - A list of inputs to iterate over in parallel. Here we pass the data frame itself via the . placeholder, so each column is one input.
  - A function to apply to each set of inputs, which here is the anonymous function ~sum(is.na(c(...))).
- The anonymous function ~sum(is.na(c(...))) works as follows:
  - The ... captures the values of all columns for the current row.
  - We wrap ... in c() because is.na() expects a vector and not individual values.
  - The resulting vector is passed to is.na().
  - The output of is.na() is a logical vector with as many elements (TRUE or FALSE) as the number of columns.
  - sum() acts on the logical vector output of is.na() treating TRUE as 1 and FALSE as 0 hence counting the number of TRUE values i.e. the number of missing values.
- pmap_int() iterates over rows and passes the corresponding values of all columns to the anonymous function and finally returns the output as a vector of integers of the same size as the number of rows.
- The output of pmap_int() is then assigned to the new column col_4.
- We cover the purrr map family of functions in detail in List Operations.
Alternatively,
# Named helper that counts the missing values among its arguments.
fun <- function(...) {
  return(sum(is.na(c(...))))
}

df_2 = df %>%
  mutate(col_4 = pmap_dbl(., fun))
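Here is how this works:
- Instead of an anonymous function, we define a named function fun() that we can then call from within pmap_dbl().
- Note that pmap_dbl() returns a vector of doubles, whereas pmap_int() in the first solution returns a vector of integers.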
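The row-wise missing-value count described above can be sketched on a small data frame with mixed column types. The sample data and the helper name count_na() are assumptions made for illustration, and the rowSums() line is a base R cross-check rather than part of the original approach.

```r
library(dplyr)
library(purrr)

# Hypothetical sample data with mixed column types (an assumption).
df <- tibble(
  col_1 = c(1L, NA, 3L),
  col_2 = c("a", NA, NA),
  col_3 = c(TRUE, FALSE, NA)
)

# Hypothetical helper: counts the missing values among its arguments.
count_na <- function(...) sum(is.na(c(...)))

# pmap_int() calls count_na() once per row with that row's values.
df_2 <- df %>%
  mutate(col_4 = pmap_int(., count_na))

# Vectorized cross-check: is.na() on a data frame returns a logical
# matrix, and rowSums() counts the TRUE values in each row.
check <- rowSums(is.na(df))
```

For this particular task the vectorized rowSums(is.na(df)) gives the same counts without iterating over rows. The pmap approach is more general because the helper can be any function of a row's values.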