Non-Vectorized Transformation

Most data transformation operates on columns with vectorized functions, i.e. functions that accept a vector, perform an operation on each element, and return a vector of the same length, which removes the need for an explicit loop. Sometimes, though, we need to operate on rows in a non-vectorized manner, e.g. to obtain the mean of several columns for each row.
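
To make the distinction concrete, here is a minimal sketch on a small made-up data frame (the values and the column names vectorized_mean and collapsed_mean are assumptions for illustration). Arithmetic operators work element by element, whereas mean() collapses everything it receives into a single number, so calling it directly inside mutate() produces one overall mean repeated on every row rather than a mean per row.

```r
library(dplyr)

# Hypothetical data for illustration only
df = tibble(col_1 = c(1, 4, 7), col_2 = c(3, 2, 9))

result = df %>%
  mutate(
    # `+` and `/` are vectorized: one result per row
    vectorized_mean = (col_1 + col_2) / 2,
    # mean() is not: it receives all six values at once and returns a single
    # number, which mutate() then recycles to every row
    collapsed_mean = mean(c(col_1, col_2))
  )
```

Here vectorized_mean differs from row to row, while collapsed_mean holds the same value in every row. The second behavior is what row-wise evaluation is meant to avoid.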

In this example, we have a data frame df with two numerical columns, col_1 and col_2. We wish to create a new column col_3 where each value is the mean of col_1 and col_2 for the same row, and a column col_4 where each value is the minimum of col_1 and col_2 for the same row.

library(dplyr)

df_2 = df %>%
    rowwise() %>%   # evaluate what follows one row at a time
    mutate(
        col_3 = mean(c(col_1, col_2), na.rm = TRUE),   # mean per row
        col_4 = min(col_1, col_2, na.rm = TRUE)        # minimum per row
    )

Here is how this works:

  • rowwise() switches the operations that follow from column-wise to row-wise execution, which allows us to apply a non-vectorized function one row at a time.
  • Because of rowwise(), each expression inside mutate() is evaluated once per row, with col_1 and col_2 referring to that row's values instead of the entire columns.
  • mean(c(col_1, col_2)) computes the mean of the col_1 and col_2 values of each row.
  • min(col_1, col_2) returns the smaller of the col_1 and col_2 values of each row.
  • Depending on the signature of the function we wish to use, we may need to wrap the inputs in a vector c() or a list(). For instance, here we wrap the column names in c() for mean() but pass the columns directly to min() because of their signatures:
    • The signature of mean() is mean(x, na.rm = FALSE, ...), i.e. it expects a single vector-like object holding the numerical values to be averaged.
    • The signature of min() is min(..., na.rm = FALSE), where the ... accepts any number of individual values (or vectors).
  • The argument na.rm = TRUE instructs both mean() and min() to ignore any missing values (NA).
  • Note that any operations carried out after rowwise() will also be carried out row by row. To switch back to regular vectorized operation, add ungroup() to the chain.
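
The points about function signatures and about ungroup() can be checked with a short sketch. The data frame is made up for illustration, and sum() and the column names row_total and grand_total are assumptions standing in for any summarizing function and its output.

```r
library(dplyr)

# mean() takes the values to average as ONE vector in its first argument.
# A second positional value is not treated as data (in the default method
# it is matched to another parameter), so only the first value is used.
wrong_mean = mean(3, 5)      # not the mean of 3 and 5
right_mean = mean(c(3, 5))   # the mean of 3 and 5

# min() collects any number of values through `...`
smallest = min(3, 5)

# Hypothetical data for illustration only
df = tibble(col_1 = c(1, 4), col_2 = c(3, 2))

df_2 = df %>%
  rowwise() %>%
  mutate(row_total = sum(col_1, col_2)) %>%   # evaluated once per row
  ungroup() %>%
  mutate(grand_total = sum(col_1, col_2))     # evaluated once over whole columns
```

After ungroup(), the second mutate() sees the complete columns again, so grand_total is the same in every row, while row_total differs from row to row.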

Alternatively:

We can use any of the map family of functions from the purrr package inside mutate() to apply a non-vectorized function to each element of one or more columns.

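library(dplyr)
library(purrr)

df_2 = df %>%
  mutate(
    # mean() needs a single vector, so wrap the two values in c()
    col_3 = map2_dbl(col_1, col_2, ~mean(c(.x, .y), na.rm = TRUE)),
    # min() accepts the two values directly; na.rm is passed along to it
    col_4 = map2_dbl(col_1, col_2, min, na.rm = TRUE)
  )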

Here is how this works:

  • The first step in using the purrr map family of functions is to identify the right mapping function for the inputs and output at hand. Since we have two input columns and the output is a double-precision number, we opted for map2_dbl().
  • map2_dbl() expects:
    • two columns (more precisely, two vectors or lists of the same length) and
    • a function, or an anonymous function written as a one-sided formula, that accepts two inputs and returns a single numerical value.
  • map2_dbl() iterates over the two columns one row at a time, passes the corresponding pair of values to the function, and returns the results as a numeric vector of the same length as the inputs.
  • As described above, the way we structure the call depends on the signature of the function. While min() can accept any number of inputs, mean() expects a single vector-like input. Therefore:
    • For mean(): We used an anonymous function ~mean(c(.x, .y)) to wrap the values of the two columns into a vector which is then passed to mean().
    • For min(): We let map2_dbl() pass the column values directly to the first two arguments of min().
  • We cover mapping over a list and the purrr map family of functions in detail in List Operations.
  • One advantage of the map functions over rowwise() is that we can include both vectorized and non-vectorized operations inside the same call to mutate().
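
As an illustration of that last point, the following sketch combines an ordinary vectorized expression with a map2_dbl() call in a single mutate(). The data and the column name col_diff are made up for illustration.

```r
library(dplyr)
library(purrr)

# Hypothetical data for illustration only
df = tibble(col_1 = c(1, NA, 7), col_2 = c(3, 2, 9))

df_2 = df %>%
  mutate(
    # Vectorized: computed on the whole columns at once
    col_diff = col_2 - col_1,
    # Non-vectorized: mean() is called once per row through map2_dbl()
    col_3 = map2_dbl(col_1, col_2, ~mean(c(.x, .y), na.rm = TRUE))
  )
```

Because no rowwise() is involved, the data frame is not switched into row-wise mode, so there is nothing to undo with ungroup() afterwards.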