Combining

In this section, we cover the five common string combining scenarios that we come across in data manipulation contexts:

  1. Concatenating: Append two or more strings to each other.
  2. Collapsing: Concatenate all elements of a string vector into one string. A common application is to aggregate the values of a string column into a single string.
  3. Interpolating: Insert the value of one or more string variables into a string template.
  4. Repeating: Concatenate a string with itself a given number of times.
  5. Padding a string with a given character up to a given width.

Concatenating

We wish to concatenate the values of one column to the corresponding values of another column.

In this example, we wish to concatenate the values of the column col_1 to the corresponding values of the column col_2 with a comma as a separator into a new column col_3.

df_2 = df %>%
    mutate(col_3 = str_c(col_1, col_2, sep = ', '))

Here is how this works:

  • We use the str_c() function from the stringr package (part of the tidyverse) to concatenate the values in the column col_1 with the corresponding values in the column col_2 with a separator ', ‘ in between.
  • The str_c() function takes the following arguments:
    • The two columns whose values for each row we wish to concatenate; which in this case are col_1 and col_2.
    • A named argument sep that specifies the separator to insert between the values in the concatenated columns; which in this case is sep = ', ‘.
  • The output data frame df_2 will have the same number of rows as the original data frame df but with an additional columncol_3 holding the concatenation of col_1 and col_2 with a comma and space , as a separator.

Extension: More than 2 Columns

We wish to concatenate the values of multiple columns; i.e. concatenate the values of those columns for each row.

In this example, we wish to concatenate the values of the column col_1 to the corresponding values of the column col_2 and col_3 with a comma as a separator into a new column col_4.

df_2 = df %>%
  mutate(col_4 = str_c(col_1, col_2, col_3, 
                       sep = ', '))

Here is how this works:

  • This works similarly to the code in the primary solution above except that, since we wish to concatenate the values of more than two columns, we pass to str_c() the names of all the columns whose values we wish to concatenate; which in this case is col_1, col_2, col_3.
  • The output is the original data frame df with an additional columncol_4 holding the concatenation of col_1, col_2 and col_3 with a comma and space , as a separator.

Extension: Handling Missing Values

We wish to concatenate the values of two columns and substitute missing values with a given token (string).

In this example, we wish to concatenate the values of the column col_1 to the corresponding values of the column col_2 with a comma as a separator into a new column col_3. We wish to substitute missing values with ‘-’.

df = df %>%
    mutate(
        col_3 = str_c(str_replace_na(col_1, '-'),
        str_replace_na(col_2, '-'), 
        sep = ', '))

Here is how this works:

  • If we run str_c() on columns that have missing values, any concatenation operation where any of the values being concatenated is a missing value would have returned a missing value as the result of the operation.
  • One way to deal with this it to replace missing values with some particular string (which can be an empty string). In this example, we wish to use ‘-’ as a replacement for missing values.
  • We use the function str_replace_na() to replace missing values in the columns to be concatenated, which here are col_1 and col_2, with a particular replacement string, which here is '-'.
  • See Missing Values for a general coverage of working with missing values.

Collapsing

We wish to concatenate all values of a string column into one string with an optional separator.

In this example, we wish to concatenate the values of the column col_2 into a single string for each group where the groups are defined by the values of the column col_1.

df_2 = df %>%
  group_by(col_1) %>%
  summarize(col_3 = str_c(col_2, collapse = ', '))

Here is how this works:

  • The group_by() function groups the rows of the data frame df by the values in the column col_1. The data aggregation operations applied by the subsequent call to summarize() are, therefore, applied to each group separately. See Summary Table.
  • To have str_c() reduce a vector to a single string value that is the concatenation of the vector’s values, we need to pass some value to the argument collapse, which here we set to collapse = ', '.
  • If we do not pass any value to the collapse argument of str_c() then str_c() returns the original vector as is.
  • For dealing with missing values, see “Extension: Handling Missing Values” under Concatenating above.

Alternative: Using a different function

df_2 = df %>%
  group_by(col_1) %>% 
  summarize(col_3 = str_flatten(col_2, collapse = ', '))

Here is how this works:

  • This code performs the same function as the primary solution above except that we use str_flatten() instead of str_c().
  • str_flatten() performs the same function as str_c(). It may just be a bit more memorable to the use case of collapsing the values of a string vector to a single string value.

Interpolating

We wish to create a new string column by inserting the value of one or more string variables into a string template.

In this example, we wish to create a new column col_3 from the values of the columns col_1 and col_2 where each value follows the template '{col_1} vs {col_2}' for the corresponding rows.

df_2 = df %>%
  mutate(col_3 = str_glue('{col_1} vs {col_2}'))

Here is how this works:

  • We use the str_glue() function from the stringr package (part of the tidyverse) to create a new column col_3 by inserting values of the columns col_1 and col_2 into the template '{col_1} vs {col_2}'.
  • The str_glue() is quite convenient to work with. We use it as follows:
    • We pass to it the static string template that we wish to use which in this case is '{col_1} vs {col_2}'.
    • We simply refer to the columns whose values we wish to insert into the template by name, in this case we refer to the columns col_1 and col_2 by name.
  • The reason we can refer to columns by name inside str_glue() is that it accepts an argument .x that specifies the environment in which it will be evaluated. When we run str_glue() inside a dplyr verb like mutate(), it is inferred that the environment is the current data frame.
  • The output data frame df_2 will have the same number of rows as the original data frame df but with an additional columncol_3 holding the interpolation of the columns col_1 and col_2 in the template '{col_1} vs {col_2}'.

Repeating

We wish to create a new string column by repeating the values in an existing string column a set number of times.

In this example, we wish to create a new column col_3 where each value is the repetition (concatenation to itself) of the corresponding value in the col_2 3 times.

df_2 = df %>% 
  mutate(col_3 = str_dup(col_2, 3))

Here is how this works:

  • We use the str_dup() function from the stringr package (part of the tidyverse) to create a new column col_3 by repeating (concatenating to itself) each value in the column col_2 a specified number of times.
  • The str_dup() method takes an integer argument, which in this case is 3, specifying the number of times each value of the calling string column will be repeated.
  • The output is the original data frame df with an additional columncol_3 holding the values from the column col_2 each repeated 3 times.

Padding

We wish to pad a string with a specified character for up to a given width.

In this example, we wish to create a new column col_3 that is a padding of the values of the column col_2 to get each to 4 characters.

df = df %>% 
  mutate(col_3 = str_pad(col_2, 
                         4, 
                         side='left', 
                         pad = '*'))

Here is how this works:

  • We use the str.pad() method, from the str accessor set of string manipulation methods of Pandas Series, to pad the values of the column col_2 by adding the character * to the left so each string value is 4 characters long.
  • The str.pad() method takes three arguments:
    • width which is the desired length of the output padded string and which in this case is 4.
    • side is the side of the string on which the padding should be applied and takes one of three values 'left', 'right', and 'both' and which in this case is left.
    • pad which specified the character to use for padding the input string to match the given width and which in this case is '*'.
  • The output is the original data frame df with an additional columncol_3 holding the values from the column col_2 with padding applied to the left side of each value.
R
I/O