Process String Parts

In some situations, we don’t merely wish to add new columns corresponding to some or all parts resulting from the split of a string column. Rather, we wish to apply some processing on the parts to generate the desired output.

In this section we cover three common scenarios of processing parts produced by string splitting:

  1. Concatenate Parts: We cover how to create new columns by concatenating some of the parts resulting from splitting the values of a string column.
  2. Set Data Type: We cover how to set the data type of parts resulting from splitting the values of a string column.
  3. Custom Logic: We cover how to apply any arbitrary processing logic to the parts resulting from splitting the values of a string column.

This section is complemented by:

  • Split an input string around a delimiter which may be a substring or a regex pattern e.g. break a sentence into words.
  • Select the desired parts out of the parts returned by the prior splitting step e.g. pick the second part and drop the rest.

Concatenate Parts

We wish to create new columns by concatenating some of the parts resulting from splitting the values of a string column.

In this example, we will pick the third and first parts resulting from splitting a string column and concatenate them into a new column.

df_2 = df %>%
  mutate(
    str_parts = str_split(col_1, '-', simplify=TRUE),
    col_2 = str_c(str_parts[, 3], str_parts[, 1], sep = '_')) %>% 
  select(-str_parts)

Here is how this works:

  • We use str_split() with simplify=TRUE to return a matrix of parts with one row for each element in col_1. We store this matrix in a column str_parts. See Split.
  • We select the matrix columns corresponding to the two parts (resulting from the split step) that we wish to concatenate, which are the third and the first in this example, via str_parts[, 3] and str_parts[, 1] respectively.
  • We then use the function str_c() to concatenate the selected vectors element-wise to produce a new vector as output and store that in a new column col_2.

Alternative: via unite()

df_2 = df %>%
  separate(col_1, c('str_part_1', NA, 'str_part_3'), remove=FALSE) %>%
  unite('col_2', str_part_3, str_part_1, sep = '_')

Here is how this works:

  • For this special case, where we wish to concatenate together the values of multiple vectors in an element-wise manner, we can use the convenience function unite() from the tidyr package which is specially built for this purpose.
  • We use the sister function of unite(), separate() to extract and create two new columns str_part_1 and str_part_3 holding the first and third parts resulting from splitting the string column col_1. See Alternative Solution under Selecting Some Parts.
  • In this case, we pass to the function unite() three inputs:
    • The name of the new column to create, which in this case is ‘col_2’.
    • A comma-separated list of the columns whose values we wish to concatenate, which in this case is str_part_1 and str_part_3.
    • A separate to use between the parts, which in this case is ‘-’.

Set Data Type

We wish to set the data type of parts resulting from splitting the values of a string column.

df_2 = df %>%
  mutate(
    str_parts = str_split(col_1, ' ', simplify = TRUE),
    col_2 = as.numeric(str_parts[, 2]),
    col_3 = str_parts[, 3]) %>% 
  select(-str_parts)

Here is how this works:

  • We can apply any of the data type conversion functions to a selected part to convert the resulting column to the desired data type. See Data Type Conversion.
  • In as.numeric(str_parts[, 2]), we use the function as.numeric() to covert the vector, that is the second part of the values of the column col_1, to a numeric data type.

Alternative: via separate()

df_2 = df %>%
  separate(col_1, 
           c(NA, 'col_2', 'col_3'), 
           sep=' ',
           convert=TRUE,
           remove=FALSE)

Here is how this works:

  • When our objective is to split a string column around a given delimiter into multiple columns, one solution is to use the function separate(). See Selecting Parts.
  • The separate() function has a convert argument which when set to convert=TRUE will convert the data type of the new columns to the most appropriate data type i.e. integer, double, or logical.

Apply Custom Logic

We wish to apply some arbitrary processing logic to the parts resulting from splitting the values of a string column. In other words, we wish to apply some custom logic to determine how to use the output of a string split operation.

In this example, the elements of column col_1 are strings of the form '4_2g, 16_3g, 12g'. We wish to multiply the numbers within the comma-separated segments and add up the results. So for this case, we would get 4 * 2 + 16 * 3 + 12 = 68.

flatten_quantity <- function(parts) {
  quantity = str_split(parts, '_') %>%
    map_dbl(~(str_extract_all(., '\\d+') %>% 
                as.integer() %>%
                prod())) %>%
    sum()
  return(quantity)
}

df_2 = df %>%
  mutate(col_2 = map_dbl(str_split(col_1, ', '), flatten_quantity))

Here is how this works:

  • We use the function str_split() to break down each element of the column col_1 into parts around the delimiter ‘, ‘. The output is a list (of the same size as col_1) of vectors (containing the parts of the respective element of col_1). See Split.
  • We use map_dbl() to iterate over the list of vectors returned by str_split() and call a custo function flatten_quantity() on each vector. We opted for map_dbl() because we expect a single numeric output for each element of the column col_1. See Working with Lists.
  • The custom function flatten_quantity() works as follows:
    • It accepts a vector of strings which are the parts resulting from splitting an element of the column col_1.
    • It splits each part on the underscore ‘_’ character.
    • Extracts all the digits from the string
    • Converts each of the resulting string sub-parts to an integer.
    • Computes the product of the sub-parts.
    • Then adds up the results to obtain a result for the current element of the column col_1.
R
I/O