Select Split Parts

In Split, we saw how to split each string element in a data frame column into parts around a given string delimiter pattern. We will see in this section how to select one, all, or some of the parts resulting from splitting a string.

This section is organized as follows:

  • One Part; where we cover how to select any one part from the parts obtained by splitting a string given its index.
  • All Parts; where we cover how to select all parts obtained from splitting a string.
  • Some Parts; where we cover how to select some of the parts returned by splitting a string.

This section is complemented by:

  • Split an input string around a delimiter which may be a substring or a regex pattern e.g. break a sentence into words.
  • Process: An optional step where we process the selected parts resulting from the prior selection process e.g. concatenate the first and third parts with an underscore in between.

One Part

We wish to select any one part from the parts obtained by splitting a string given its index.

In this example, we wish to split each element in the column col_1 around the hyphen ‘-’ character and return the second part.

df_2 = df %>%
  mutate(
    col_2 = str_split(col_1, '-', simplify = TRUE)[, 2])

Here is how this works:

  • We use str_split() with simplify=TRUE to return a matrix of parts with one row for each element in col_1. See Split.
  • We use [, 2] to select the second column of the matrix returned by str_split().

Extension: Select Last Part

In this example, we wish to split each element in the column col_1 around the hyphen ‘-’ character and return the last part.

df_2 = df %>%
  mutate(
    col_3 = map_chr(str_split(col_1, '-'), ~.[length(.)]))

Here is how this works:

  • The challenge here is when the number of parts returned by splitting different elements of the column col_1 is not the same.
  • To deal with that, we use str_split() with simplify=FALSE (the default) to return a list of vectors (of parts) for each element in col_1.
  • We then use map_chr() to iterate on each list and return the last element of the part vector. See Working with Lists.
  • We use the anonymous function (one-sided formula) ~.[length(.)] to select the last element in a vector.

Extension: Select nth Part Relative to End

In this example, we wish to split each element in the column col_1 around the hyphen ‘-’ character and return the second to last part.

df_2 = df %>%
  mutate(col_2 = map_chr(
    str_split(col_1, "-"), 
    ~ifelse(length(.) > 1, .[length(.) - 1], NA)))

Here is how this works:

  • This is similar to the extension above except that we run a check to ensure that a vector is sufficiently long before we attempt to extract an element.
  • If for a particular value of col_1, the returned vector of parts is not long enough, we return NA.

All Parts

We wish to select all parts obtained from splitting a string.

In this example, we wish to split each element of the column col_1 around the hyphen character ‘-’ and store each part in a new column. We expect three parts, and we wish to call the three resulting columns 'col_2', 'col_3', and 'col_4'.

df_2 = df %>%
  mutate(
    str_parts = str_split(col_1, '-', n=3, simplify=TRUE),
    col_2 = str_parts[, 1],
    col_3 = str_parts[, 2],
    col_4 = str_parts[, 3]) %>% 
  select(-str_parts)

Here is how this works:

  • We use str_split() while setting simplify=TRUE to split each element of the column col_1 into parts and return the result as a matrix. See Split.
  • We then explicitly capture each column of the said matrix into a column of the data frame.
  • Note that we set n=3 in the call to str_split() so we are confident that we will get three parts and accordingly a matrix of three columns as output (filled with NA if necessary) and won’t get any unexpected index out of bounds errors on indexing.
  • This approach is suitable when the number of parts is known and small. In cases where the number of parts is either unknown or large, we recommend the first extension below.

Alternative: via separate()

df_2 = df %>% 
  separate(
    col_1, 
    c('col_2', 'col_3', 'col_4'), 
    '-',
    extra='merge',
    remove=FALSE)

Here is how this works:

  • For this special case where we wish to split values of a column and return the output as a set of new columns, we can use the function separate() from the tidyr package which is especially built for this purpose.
  • In this case, we pass to the function seprate() three inputs:
    • The column whose values are to be split, which in this case is the column col_1.
    • A vector with the names to assign to the new columns to be created.
    • The separator to split around, which in this case is '-'. Note: We can ommit this and separate() will identify it as the separator automatically.
    • We set extra='merge' to separate() creates only as many parts as the number of column names passed and where the last part contains the rest of the string
    • We set remove=FALSE so the input column, col_1, is not dropped in the output. The default is remove=TRUE.
  • If all we aim to do is split a column into a specific number of new columns, then using separate() usually provides a cleaner and more concise solution than using str_split() and mutate() and takes care of creating multiple new columns.

Extension: Arbitrary Number of Parts

df_2 = df %>%
  mutate(
    as_tibble(str_split(col_1, '-', simplify=TRUE)))

Here is how this works:

  • We use as_tibble() to convert the matrix returned by str_split() to a data frame (a tibble). When a data frame is returned to mutate() without being assigned to any column name, the data frame is unpacked to individual columns which is what we are looking for. See Multi-Value Transformation.
  • The resulting columns will have names V1, V2, V3, etc.. which is the default for as_tibble().

Alternative: Implicit Assignment

df_2 = df %>%
  mutate(
    as_tibble(
      str_split(col_1, '-', simplify=TRUE),
      .name_repair = ~c('col_2', 'col_3', 'col_4')))

Here is how this works:

  • This works similarly to the extension right above except that we assign names to the produced columns.
  • Since the matrix returns by str_split() has no column names, we use the .name_repair argument of as_tibble() to specify the names to assign to the columns of the data frame being created. In this case, we pass what is called a “look-up formula” specifying the column name for each column (three columns in this case) as ~c('country', 'method', 'year'). See Multi-Value Transformation for a description of this pattern.
  • Note that for column naming via .name_repair to work, we must be confident of the number of parts returned by splitting. If we are not sure of the number of parts to expect (or do not wish to name the resulting columns), we can drop the call to .name_repair. The resulting columns will be named V1, V2, V3, etc.. by default.

Alternative: Bind Columns

df_2 = df %>%
  bind_cols(df %>% 
              pull(col_1) %>% 
              str_split('-', simplify = TRUE) %>% 
              as_tibble())

Here is how this works:

  • We set simplify=TRUE so str_split() returns a matrix. We then convert the matrix to a data frame via as_tibble().
  • We then use bind_cols() to column bind the new columns to the original data frame.
  • The resulting columns will have names V1, V2, and V3 which is the default for as_tibble().

Some Parts

We wish to select some of the parts returned by splitting a string.

In this example, we wish to select parts 1 and 3 resulting from splitting each element of the column col_1 around the hyphen character ‘-’ as new columns.

df_2 = df %>%
  mutate(
    str_parts = str_split(col_1, '-', n=3, simplify=TRUE),
    col_2 = str_parts[, 1],
    col_4 = str_parts[, 3]) %>% 
  select(-str_parts)

Here is how this works:

  • We use str_split() while setting simplify=TRUE to split each element of the column col_1 into parts and return the result as a matrix. See Split.
  • We then explicitly capture each column of the said matrix into a column of the data frame. See All Parts above.

Alternative: via separate()

df_2 = df %>% 
  separate(
    col_1, 
    c('col_2', NA, 'col_3'), 
    '-',
    remove=FALSE)

Here is how this works:

  • For this special case where we wish to split values of a column and return the output as a set of new columns, we can use the function separate() from the tidyr package which is especially built for this purpose.
  • In this case, we pass to the function seprate() three inputs:
    • The column whose values are to be split, which in this case is the column col_1.
    • A vector with the names to assign to the new columns to be created.
    • The separator to split around, which in this case is '-'. Note: We can ommit this and separate() will identify it as the separator automatically.
    • We set remove=FALSE so the input column, col_1, is not dropped in the output. The default is remove=TRUE.
  • In the vector of column names c('col_2', NA, 'col_3'), we specify that we wish to capture the first and third parts and store those in columns named col_2 and col_3. Having NA as the second element of the column name vector specifies that we wish to drop that part.
  • If all we aim to do is split a column into a specific number of new columns, then using separate() usually provides a cleaner and more concise solution than using str_split() and mutate() and takes care of creating multiple new columns.

Extension: Arbitrary Number of Parts

In this example, we wish to drop parts 1 and 3 and return the rest as new columns.

df_2 = df %>%
  mutate(
    str_split(col_1, '-', simplify=TRUE) %>% 
      as_tibble() %>% 
      select(-1, -3))

Here is how this works:

  • Once the matrix resulting from str_split() is converted to a data frame via as_tibble(), we can use select() on the resulting data frame.
  • select() allows us a wide set of column selection options such as dropping particular parts which is what we do here via select(-1, -3). See Selection.

Extension: Relative to End

When we wish to extract the second last and third last parts of each string regardless of how many parts resulted from the splitting of the string.

get_nth_end <- function(p_vec, p_offset) {
  ifelse(length(p_vec) > p_offset, p_vec[length(p_vec) - p_offset], NA)
}

df_2 = df %>%
  mutate(
    str_parts = str_split(col_1, '-'),
    col_2 = map_chr(str_parts, get_nth_end, p_offset=1),
    col_3 = map_chr(str_parts, get_nth_end, p_offset=2)) %>% 
  select(-str_parts)

Here is how this works:

  • The challenge here is when the number of parts returned by splitting different elements of the column col_1 is not the same.
  • To deal with that, we use str_split() with simplify=FALSE (the default) to return a list of vectors (of parts) for each element in col_1.
  • We then use map_chr() to iterate on each list and return the last element of the part vector. See Working with Lists.
  • We create a custom function get_nth_end() that expects a vector and an offset value and returns the element of the vector at that offset from the end of the vector. If the vector is smaller than the offset, the function returns NA.
R
I/O