Common Transformation Scenarios

The most common data transformations involve applying a vectorized function (i.e. one that operates on an entire column at a time) to one or more columns of a data frame to output a new column.

We will cover three common scenarios for creating new columns:

Scalars where we create a new column entirely composed of the same value. This may be a starting point of a larger transformation or to identify rows of a data frame before appending rows from multiple data frames.
Operations where we create a new column by applying arithmetic or logical operations on existing columns. For instance, creating a new column that is the ratio of two existing columns.
Functions where we create a new column by applying one or more functions to existing columns. For instance, creating a new column that is the log of an existing column.

Scalars

We wish to create a new column that is entirely composed of the same value.

In this example, we create a new column col_3 that has the value 2022 for all rows of the data frame df.

df_2 = df %>%
    mutate(col_3 = 2022)

Here is how this works:

The workhorse of data transformation in the tidyverse is the function mutate() (referred to as a verb in tidyverse lingo).
mutate() expects a data frame, which in this case is df, and a data transformation expression, which in this case is col_3 = 2022.
We pass the data frame df to mutate() via the pipe %>%.
Let’s take a closer look at the data transformation expression col_3 = 2022:
- col_3 is the name of the output column.
- = is the assignment operator assigning the output of the expression on the right side to the variable on the left side.
- 2022 is the data transformation to be carried out which here is simply to assign the same scaler value to each row.
The output of mutate() is a copy of the data frame with the new column(s) added or with existing data frames overwritten if the transformed column name is the same as any existing column names.
We can assign a scaler value of any data type including numeric, string, logical, factor, and datetime.

Operations

We wish to create a new column by applying arithmetic or logical operations on existing columns.

In this example, we create a new column col_3 that is the ratio of two existing columns col_1 and col_2.

df_2 = df %>%
    mutate(col_3 = col_1 / col_2)

Here is how this works:

col_3 is the name of the output column that will be created (or overwritten if a column by the same name exists in the data frame)
col_1 / col_2 is the data transformation to be carried out which is to divide the value of the column col_1 for each row by the value of the column col_2 of the same row.
mutate() carries out the specified transformation on the given data frame df. See the “Scalars” scenario above for more on mutate().
While there are operations that work for different data types, most often it is arithmetic operations (such as +, -, *, and /) or comparison operations (such as >, <, and ==) applied to numeric data which we cover in Numeric Operations.

Functions

Built-In Functions

We wish to create a new column by applying one or more built-in functions on existing columns.

In this example, we create a new column col_3 that is the log of an existing column col_1.

df_2 = df %>%
    mutate(col_3 = log10(col_1))

Here is how this works:

In col_3 = log10(col_1), we specify the data transformation expression where
- col_3: The name of the output column that will be created (or overwritten if a column by the same name exists in the data frame)
- log10(col_1): The data transformation to be carried out which is to apply the function log10() to take the log base 10 of each value of the column col_1.
mutate() carries out the specified transformation on the given data frame df. See the “Scalars” scenario above for more on mutate().
We cover functions for manipulating different data types in their respective operations sections; Numeric, String, Logical, Factor, and Datetime Operations.

Custom Functions

We wish to create a new column by applying one or more custom functions on existing columns.

In this example, we create a new column col_3 via a custom function that accepts a numerical column and returns a column of the same size that is the ratio of each value of the input column to the sum of all values of that column.

m_fun <- function(var) {  
  var / sum(var)
}

df_2 = df %>%
    mutate(col_3 = m_fun(col_1))

Here is how this works:

We create a custom function m_fun() which accepts a column passed to its var argument.
In var / sum(var), inside the custom function, we compute the ratio of each value of the input column to the sum of all values of the input column.
We can call the custom function as part of a data transformation expression inside mutate() just like we would a built-in function (see “Built-In Functions” above).

Optima.io Reference beta

Common Transformation Scenarios

Scalars

Operations

Functions