Common Transformation Scenarios

The most common data transformations involve applying a vectorized function (i.e. one that operates on an entire column at a time) to one or more columns of a data frame to output a new column.

We will cover three common scenarios for creating new columns:

  • Scalars where we create a new column entirely composed of the same value. This may be a starting point of a larger transformation or to identify rows of a data frame before appending rows from multiple data frames.
  • Operations where we create a new column by applying arithmetic or logical operations on existing columns. For instance, creating a new column that is the ratio of two existing columns.
  • Functions where we create a new column by applying one or more functions to existing columns. For instance, creating a new column that is the log of an existing column.

Scalars

We wish to create a new column that is entirely composed of the same value.

In this example, we create a new column col_3 that has the value 2022 for all rows of the data frame df.

df_2 = df %>%
    mutate(col_3 = 2022)

Here is how this works:

  • The workhorse of data transformation in the tidyverse is the function mutate() (referred to as a verb in tidyverse lingo).
  • mutate() expects a data frame, which in this case is df, and a data transformation expression, which in this case is col_3 = 2022.
  • We pass the data frame df to mutate() via the pipe %>%.
  • Let’s take a closer look at the data transformation expression col_3 = 2022:
    • col_3 is the name of the output column.
    • = is the assignment operator assigning the output of the expression on the right side to the variable on the left side.
    • 2022 is the data transformation to be carried out which here is simply to assign the same scaler value to each row.
  • The output of mutate() is a copy of the data frame with the new column(s) added or with existing data frames overwritten if the transformed column name is the same as any existing column names.
  • We can assign a scaler value of any data type including numeric, string, logical, factor, and datetime.

Operations

We wish to create a new column by applying arithmetic or logical operations on existing columns.

In this example, we create a new column col_3 that is the ratio of two existing columns col_1 and col_2.

df_2 = df %>%
    mutate(col_3 = col_1 / col_2)

Here is how this works:

  • col_3 is the name of the output column that will be created (or overwritten if a column by the same name exists in the data frame)
  • col_1 / col_2 is the data transformation to be carried out which is to divide the value of the column col_1 for each row by the value of the column col_2 of the same row.
  • mutate() carries out the specified transformation on the given data frame df. See the “Scalars” scenario above for more on mutate().
  • While there are operations that work for different data types, most often it is arithmetic operations (such as +, -, *, and /) or comparison operations (such as >, <, and ==) applied to numeric data which we cover in Numeric Operations.

Functions

Built-In Functions

We wish to create a new column by applying one or more built-in functions on existing columns.

In this example, we create a new column col_3 that is the log of an existing column col_1.

df_2 = df %>%
    mutate(col_3 = log10(col_1))

Here is how this works:

  • In col_3 = log10(col_1), we specify the data transformation expression where
    • col_3: The name of the output column that will be created (or overwritten if a column by the same name exists in the data frame)
    • log10(col_1): The data transformation to be carried out which is to apply the function log10() to take the log base 10 of each value of the column col_1.
  • mutate() carries out the specified transformation on the given data frame df. See the “Scalars” scenario above for more on mutate().
  • We cover functions for manipulating different data types in their respective operations sections; Numeric, String, Logical, Factor, and Datetime Operations.

Custom Functions

We wish to create a new column by applying one or more custom functions on existing columns.

In this example, we create a new column col_3 via a custom function that accepts a numerical column and returns a column of the same size that is the ratio of each value of the input column to the sum of all values of that column.

m_fun <- function(var) {  
  var / sum(var)
}

df_2 = df %>%
    mutate(col_3 = m_fun(col_1))

Here is how this works:

  • We create a custom function m_fun() which accepts a column passed to its var argument.
  • In var / sum(var), inside the custom function, we compute the ratio of each value of the input column to the sum of all values of the input column.
  • We can call the custom function as part of a data transformation expression inside mutate() just like we would a built-in function (see “Built-In Functions” above).
R
I/O