The most common data transformations involve applying a vectorized function (i.e. one that operates on an entire column at a time) to one or more columns of a data frame to output a new column.
We will cover three common scenarios for creating new columns:
We wish to create a new column that is entirely composed of the same value.
In this example, we create a new column col_3
that has the value 2022 for all rows of the data frame df
.
df_2 = df %>%
mutate(col_3 = 2022)
Here is how this works:
tidyverse
is the function mutate()
(referred to as a verb in tidyverse
lingo).mutate()
expects a data frame, which in this case is df
, and a data transformation expression, which in this case is col_3 = 2022
.df
to mutate()
via the pipe %>%
.col_3 = 2022
:col_3
is the name of the output column.=
is the assignment operator assigning the output of the expression on the right side to the variable on the left side.2022
is the data transformation to be carried out which here is simply to assign the same scaler value to each row.mutate()
is a copy of the data frame with the new column(s) added or with existing data frames overwritten if the transformed column name is the same as any existing column names.We wish to create a new column by applying arithmetic or logical operations on existing columns.
In this example, we create a new column col_3
that is the ratio of two existing columns col_1
and col_2
.
df_2 = df %>%
mutate(col_3 = col_1 / col_2)
Here is how this works:
col_3
is the name of the output column that will be created (or overwritten if a column by the same name exists in the data frame)col_1 / col_2
is the data transformation to be carried out which is to divide the value of the column col_1
for each row by the value of the column col_2
of the same row.mutate()
carries out the specified transformation on the given data frame df
. See the “Scalars” scenario above for more on mutate()
.+
, -
, *
, and /
) or comparison operations (such as >
, <
, and ==
) applied to numeric data which we cover in Numeric Operations.Built-In Functions
We wish to create a new column by applying one or more built-in functions on existing columns.
In this example, we create a new column col_3
that is the log of an existing column col_1
.
df_2 = df %>%
mutate(col_3 = log10(col_1))
Here is how this works:
col_3 = log10(col_1)
, we specify the data transformation expression wherecol_3
: The name of the output column that will be created (or overwritten if a column by the same name exists in the data frame)log10(col_1)
: The data transformation to be carried out which is to apply the function log10()
to take the log base 10 of each value of the column col_1
.mutate()
carries out the specified transformation on the given data frame df
. See the “Scalars” scenario above for more on mutate()
.Custom Functions
We wish to create a new column by applying one or more custom functions on existing columns.
In this example, we create a new column col_3
via a custom function that accepts a numerical column and returns a column of the same size that is the ratio of each value of the input column to the sum of all values of that column.
m_fun <- function(var) {
var / sum(var)
}
df_2 = df %>%
mutate(col_3 = m_fun(col_1))
Here is how this works:
m_fun()
which accepts a column passed to its var
argument.var / sum(var)
, inside the custom function, we compute the ratio of each value of the input column to the sum of all values of the input column.mutate()
just like we would a built-in function (see “Built-In Functions” above).