Most data transformation involves operating on columns with vectorized functions, i.e. functions that accept a vector, perform an operation on each element of that vector, and return a vector of the same size as the input, eliminating the need for a loop. There are times, though, when we need to operate on rows in a non-vectorized manner, e.g. if we wish to obtain the mean value of some columns for each row.
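To make the distinction concrete, here is a minimal sketch on a small made-up data frame. The sample values and the names col_sum, vectorized and not_per_row are illustrative assumptions, not part of the original example. It contrasts a vectorized operation with what happens when mean() is used inside a plain mutate().

```r
library(dplyr)

# Hypothetical sample data, not from the original text
df = data.frame(col_1 = c(1, 4, 7), col_2 = c(3, 8, 5))

# The + operator is vectorized: it works element by element,
# so each row gets its own result without a loop.
vectorized = df %>%
  mutate(col_sum = col_1 + col_2)

# mean() collapses its whole input into a single number. Without rowwise(),
# c(col_1, col_2) holds all six values, so every row receives the same
# overall mean instead of a per-row mean.
not_per_row = df %>%
  mutate(col_3 = mean(c(col_1, col_2)))

vectorized
not_per_row
```

In this sketch every row of not_per_row receives the same value, 28/6 (about 4.67), rather than the per-row means 2, 6 and 6.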
In this example, we have a data frame df with two numerical columns col_1 and col_2, and we wish to create a new column col_3 where each value is the mean of the values of the columns col_1 and col_2 for the same row. We also wish to create a column col_4 where each value is the minimum of the values of the columns col_1 and col_2 for the same row.
    df_2 = df %>%
      rowwise() %>%
      mutate(
        col_3 = mean(c(col_1, col_2), na.rm = TRUE),
        col_4 = min(col_1, col_2, na.rm = TRUE)
      )
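To try the snippet above on concrete data, the following is a minimal sketch. The sample data frame, including its missing value, is an assumption made for illustration. It also appends ungroup() so that later steps return to regular column-wise behavior.

```r
library(dplyr)

# Hypothetical sample data with one missing value, not from the original text
df = data.frame(col_1 = c(1, 4, NA), col_2 = c(3, 8, 5))

df_2 = df %>%
  rowwise() %>%
  mutate(
    col_3 = mean(c(col_1, col_2), na.rm = TRUE),  # per-row mean, ignoring NA
    col_4 = min(col_1, col_2, na.rm = TRUE)       # per-row minimum, ignoring NA
  ) %>%
  ungroup()  # switch back to regular column-wise operation

df_2
```

With this sample data, col_3 is 2, 6, 5 and col_4 is 1, 4, 5. In the third row only the non-missing value 5 is used.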
Here is how this works:
- rowwise() switches the mode of execution of the operations that follow from column-wise to row-wise, which allows us to apply a non-vectorized function one row at a time.
- After rowwise(), the expressions inside mutate() are applied one row at a time (instead of the usual execution on entire columns).
- In mean(c(col_1, col_2)), the mean of the values of the columns col_1 and col_2 is computed for each row.
- In min(col_1, col_2), the minimum of the values of the columns col_1 and col_2 is selected for each row.
- Depending on the function, the column values may need to be wrapped in c() or a list(). For instance, here we wrap the column names in c() for mean() but pass the columns directly to min() because of their signatures:
  - The signature of mean() (default method) is mean(x, trim = 0, na.rm = FALSE, ...), i.e. it expects a single vector-like object holding the numerical values to be averaged.
  - The signature of min() is min(..., na.rm = FALSE), where the dots argument (...) accepts any number of individual values (or vectors).
- The argument na.rm = TRUE instructs both mean() and min() to ignore any missing values (NA).
- Any operation that follows rowwise(), such as filter(), will also be carried out in a non-vectorized manner. To switch back to regular vectorized operation, add ungroup() to the chain.
Alternatively:
We can use any of the map family of functions from the purrr library inside mutate() to apply any non-vectorized function to each element of one or more columns.
    df_2 = df %>%
      mutate(
        col_3 = map2_dbl(col_1, col_2, ~mean(c(.x, .y), na.rm = TRUE)),
        col_4 = map2_dbl(col_1, col_2, min, na.rm = TRUE)
      )
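As a quick check of the map2_dbl() version, here is a minimal sketch on made-up data. The sample data frame is an assumption for illustration only.

```r
library(dplyr)
library(purrr)

# Hypothetical sample data with one missing value, not from the original text
df = data.frame(col_1 = c(1, 4, NA), col_2 = c(3, 8, 5))

df_2 = df %>%
  mutate(
    # The anonymous function wraps each pair of values in c() for mean()
    col_3 = map2_dbl(col_1, col_2, ~ mean(c(.x, .y), na.rm = TRUE)),
    # min() receives each pair of values directly as its first two arguments
    col_4 = map2_dbl(col_1, col_2, min, na.rm = TRUE)
  )

df_2
```

With this sample data, col_3 is 2, 6, 5 and col_4 is 1, 4, 5. Because the data frame is never switched to row-wise mode here, no ungroup() is needed afterwards.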
Here is how this works:
- The key to using the purrr map family of functions is to identify the right mapping function for the inputs and output of the situation at hand. In this case, since we have two input columns and the output is a double-precision number, we opted for the map2_dbl() mapping function.
- map2_dbl() expects:
  - two vectors (or lists) of the same length, here the columns col_1 and col_2, and
  - the function to apply to each pair of values, optionally followed by extra arguments for that function (such as na.rm = TRUE).
- map2_dbl() iterates over the two columns one row at a time, passes the corresponding values of the two columns to the function, and finally returns the output as a vector of the same size as each of the inputs.
- While min() can accept any number of inputs, mean() expects a single vector-like input. Therefore:
  - mean(): We used an anonymous function ~mean(c(.x, .y)) to wrap the values of the two columns into a vector, which is then passed to mean().
  - min(): We let map2_dbl() pass the column values directly as the first two arguments of min().
- We cover the purrr map family of functions in detail in List Operations.
- An advantage of this approach over rowwise() is that we can include both vectorized and non-vectorized operations inside the same call to mutate().
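Two points from the list above can be checked directly. The sketch below is illustrative only: the sample data and the names wrong_mean, right_mean, smallest and running_total are assumptions, not part of the original example. The first part shows why mean() needs its values wrapped in c() while min() does not. The second part mixes a vectorized function, cumsum(), with a non-vectorized one in a single mutate().

```r
library(dplyr)
library(purrr)

# Part 1: why mean() needs c() while min() does not (base R only)
wrong_mean = mean(3, 5)     # only 3 is averaged; 5 is matched to another argument (trim)
right_mean = mean(c(3, 5))  # both values are averaged
smallest   = min(3, 5)      # min() takes the values directly through its dots argument

# Part 2: vectorized and non-vectorized operations in one mutate()
# Hypothetical sample data, not from the original text
df = data.frame(col_1 = c(1, 4, 7), col_2 = c(3, 8, 5))

df_2 = df %>%
  mutate(
    running_total = cumsum(col_1),                     # vectorized, sees the whole column
    col_4 = map2_dbl(col_1, col_2, min, na.rm = TRUE)  # applied to one pair of values at a time
  )

df_2
```

Here wrong_mean is 3 while right_mean is 4, which is why the anonymous function wraps the two values in c(). In df_2, running_total is 1, 5, 12 and col_4 is 1, 4, 5. Under rowwise(), cumsum() would see only one value at a time and would simply return col_1 unchanged.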