In this section, we cover the five common string combining scenarios that we come across in data manipulation contexts:
We wish to concatenate the values of one column to the corresponding values of another column.
In this example, we wish to concatenate the values of the column col_1
to the corresponding values of the column col_2
with a comma as a separator into a new column col_3
.
df_2 = df %>%
mutate(col_3 = str_c(col_1, col_2, sep = ', '))
Here is how this works:
str_c()
function from the stringr
package (part of the tidyverse
) to concatenate the values in the column col_1
with the corresponding values in the column col_2
with a separator ', ‘
in between.str_c()
function takes the following arguments:col_1
and col_2
.sep
that specifies the separator to insert between the values in the concatenated columns; which in this case is sep = ', ‘
.df_2
will have the same number of rows as the original data frame df
but with an additional columncol_3
holding the concatenation of col_1
and col_2
with a comma and space ,
as a separator.Extension: More than 2 Columns
We wish to concatenate the values of multiple columns; i.e. concatenate the values of those columns for each row.
In this example, we wish to concatenate the values of the column col_1
to the corresponding values of the column col_2
and col_3
with a comma as a separator into a new column col_4
.
df_2 = df %>%
mutate(col_4 = str_c(col_1, col_2, col_3,
sep = ', '))
Here is how this works:
str_c()
the names of all the columns whose values we wish to concatenate; which in this case is col_1, col_2, col_3
.df
with an additional columncol_4
holding the concatenation of col_1,
col_2
and col_3
with a comma and space ,
as a separator.Extension: Handling Missing Values
We wish to concatenate the values of two columns and substitute missing values with a given token (string).
In this example, we wish to concatenate the values of the column col_1
to the corresponding values of the column col_2
with a comma as a separator into a new column col_3
. We wish to substitute missing values with ‘-’
.
df = df %>%
mutate(
col_3 = str_c(str_replace_na(col_1, '-'),
str_replace_na(col_2, '-'),
sep = ', '))
Here is how this works:
str_c()
on columns that have missing values, any concatenation operation where any of the values being concatenated is a missing value would have returned a missing value as the result of the operation.‘-’
as a replacement for missing values.str_replace_na()
to replace missing values in the columns to be concatenated, which here are col_1
and col_2
, with a particular replacement string, which here is '-'
.We wish to concatenate all values of a string column into one string with an optional separator.
In this example, we wish to concatenate the values of the column col_2
into a single string for each group where the groups are defined by the values of the column col_1
.
df_2 = df %>%
group_by(col_1) %>%
summarize(col_3 = str_c(col_2, collapse = ', '))
Here is how this works:
group_by()
function groups the rows of the data frame df
by the values in the column col_1
. The data aggregation operations applied by the subsequent call to summarize()
are, therefore, applied to each group separately. See Summary Table.str_c()
reduce a vector to a single string value that is the concatenation of the vector’s values, we need to pass some value to the argument collapse
, which here we set to collapse = ', '
.str_c()
then str_c()
returns the original vector as is.Alternative: Using a different function
df_2 = df %>%
group_by(col_1) %>%
summarize(col_3 = str_flatten(col_2, collapse = ', '))
Here is how this works:
str_flatten()
instead of str_c()
.str_flatten()
performs the same function as str_c()
. It may just be a bit more memorable to the use case of collapsing the values of a string vector to a single string value.We wish to create a new string column by inserting the value of one or more string variables into a string template.
In this example, we wish to create a new column col_3
from the values of the columns col_1
and col_2
where each value follows the template '{col_1} vs {col_2}'
for the corresponding rows.
df_2 = df %>%
mutate(col_3 = str_glue('{col_1} vs {col_2}'))
Here is how this works:
str_glue()
function from the stringr
package (part of the tidyverse
) to create a new column col_3 by inserting values of the columns col_1
and col_2
into the template '{col_1} vs {col_2}'
.str_glue()
is quite convenient to work with. We use it as follows:'{col_1} vs {col_2}'
.col_1
and col_2
by name.str_glue()
is that it accepts an argument .x
that specifies the environment in which it will be evaluated. When we run str_glue()
inside a dplyr verb like mutate()
, it is inferred that the environment is the current data frame.df_2
will have the same number of rows as the original data frame df
but with an additional columncol_3
holding the interpolation of the columns col_1
and col_2
in the template '{col_1} vs {col_2}'
.We wish to create a new string column by repeating the values in an existing string column a set number of times.
In this example, we wish to create a new column col_3
where each value is the repetition (concatenation to itself) of the corresponding value in the col_2
3 times.
df_2 = df %>%
mutate(col_3 = str_dup(col_2, 3))
Here is how this works:
str_dup()
function from the stringr
package (part of the tidyverse
) to create a new column col_3
by repeating (concatenating to itself) each value in the column col_2
a specified number of times.str_dup()
method takes an integer argument, which in this case is 3, specifying the number of times each value of the calling string column will be repeated.df
with an additional columncol_3
holding the values from the column col_2
each repeated 3 times.We wish to pad a string with a specified character for up to a given width.
In this example, we wish to create a new column col_3
that is a padding of the values of the column col_2 to get each to 4 characters.
df = df %>%
mutate(col_3 = str_pad(col_2,
4,
side='left',
pad = '*'))
Here is how this works:
str.pad()
method, from the str
accessor set of string manipulation methods of Pandas Series
, to pad the values of the column col_2
by adding the character *
to the left so each string value is 4
characters long.str.pad()
method takes three arguments:width
which is the desired length of the output padded string and which in this case is 4
.side
is the side of the string on which the padding should be applied and takes one of three values 'left'
, 'right'
, and 'both'
and which in this case is left
.pad
which specified the character to use for padding the input string to match the given width
and which in this case is '*'
.df
with an additional columncol_3
holding the values from the column col_2
with padding applied to the left side of each value.