Split a String

This section covers two common scenarios for splitting a string into parts:

  • Free Split: Split around a given delimiter, executing as many splits as there are instances of the delimiter. The number of resulting substrings is therefore not fixed in advance. For instance, splitting ‘a, b, c, d’ around the delimiter ‘, ’ yields four substrings: ‘a’, ‘b’, ‘c’, and ‘d’.
  • Fixed Split: Split around a given delimiter, executing a fixed, preset number of splits. The maximum number of resulting substrings is known in advance. For instance, splitting ‘a, b, c, d’ around the delimiter ‘, ’ with the number of splits set to two yields three substrings: ‘a’, ‘b’, and ‘c, d’.
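
As a quick illustration of the two scenarios, here is a minimal sketch using str_split() from stringr on the example string above. The variable names are ours. Note that str_split() counts pieces rather than splits, so two splits correspond to n = 3.

```r
library(stringr)

x = "a, b, c, d"

# Free split: split at every occurrence of the delimiter.
free_parts = str_split(x, ", ")[[1]]
# "a" "b" "c" "d"

# Fixed split: two splits means at most three pieces, hence n = 3.
fixed_parts = str_split(x, ", ", n = 3)[[1]]
# "a" "b" "c, d"
```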

We then cover four extensions that may be applied to either of the above scenarios:

  • Split Words: Split a string into individual words.
  • Split Characters: Split a string into individual characters.
  • Multiple Delimiters: Split a string around any of a set of delimiters, e.g. split a string around any hyphen ‘-’ or underscore ‘_’ character.
  • Delimiter Column: Given a vector of strings to split and a vector of delimiters of equal size, we wish to split each element in the first vector around the corresponding delimiter in the second vector. This is often the case when we wish to split the values of a given column of a data frame using delimiters provided by another column.

This section is complemented by:

  • Select: Select the desired parts out of the parts returned by the prior splitting step, e.g. pick the second part and drop the rest.
  • Process: An optional step where we process the parts chosen in the prior selection step, e.g. concatenate the first and third parts with an underscore in between.

Free Split

We wish to split a string around a given delimiter executing as many splits as there are instances of the delimiter. The number of resulting substrings is, therefore, not fixed in advance.

In this example, we wish to split each element in the column col_1 around hyphen ‘-’ characters and add a new column to the data frame df for each part where the nth part goes into the nth new column.

library(dplyr)    # mutate(), %>%, as_tibble()
library(stringr)  # str_split()

df_2 = df %>% 
  mutate(col_1 %>% 
           str_split('-', simplify = TRUE) %>% 
           as_tibble())

Here is how this works:

  • We use the function str_split() from the stringr package to split each element in a string vector, which here is the column col_1, into parts around a given delimiter, which here is the hyphen character ‘-’.
  • Note: The delimiter pattern can be a string (character sequence) or a regular expression.
  • The function str_split() gives us two options for the structure of its output via its simplify argument:
    • The default i.e. simplify=FALSE, is to return a list of vectors; as many vectors as the number of elements in the input string vector and each vector has as many elements as the number of parts obtained from each element.
    • When simplify=TRUE, str_split() returns a character matrix with one row for each input string and one column for each part by position, i.e. first parts go in the first column. There will be as many columns as the largest number of parts obtained from any value of the column col_1. Rows for values that yield fewer parts are padded with empty strings.
  • Since our aim in this example is to create new columns, it's more convenient to work with the output of str_split() structured as a matrix, and therefore we set simplify=TRUE.
  • We use as_tibble() to convert the matrix returned by str_split() to a data frame (a tibble). Because the matrix has no column names, as_tibble() has to generate them, and recent versions of tibble may issue a warning when doing so. When a data frame is returned to mutate() without being assigned to any column name, the data frame is unpacked into individual columns, which is what we are looking for. See Multi-Value Transformation.
  • The output data frame df_2 will be a copy of the input data frame df with additional columns added holding the parts of the elements of the column col_1.
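
To make the two output structures concrete, here is a small self-contained sketch. The sample data and the part_ column names are hypothetical, chosen only for illustration. Naming the matrix columns before the conversion is one possible way to control the names of the new columns.

```r
library(dplyr)
library(stringr)
library(tibble)

# Hypothetical sample data.
df = tibble(col_1 = c("a-b-c", "x-y", "z"))

# Default (simplify = FALSE): a list with one character vector per input string.
parts_list = str_split(df$col_1, "-")

# simplify = TRUE: a character matrix, padded with "" up to the widest row.
parts_matrix = str_split(df$col_1, "-", simplify = TRUE)

# Give the columns explicit names before converting (hypothetical names).
colnames(parts_matrix) = paste0("part_", seq_len(ncol(parts_matrix)))

# An unnamed data frame passed to mutate() is unpacked into separate columns.
df_2 = df %>%
  mutate(as_tibble(parts_matrix))
```

Here df_2 holds col_1 followed by part_1, part_2, and part_3, with empty strings where an element has fewer than three parts.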

Fixed Split

We wish to split a string around a given delimiter executing a fixed pre-set number of splits. The maximum number of resulting substrings is known in advance.

In this example, we wish to split each element in the column col_1 around the first hyphen ‘-’ character only, yielding at most two parts, and add a new column to the data frame df for each part, where the nth part goes into the nth new column.

df_2 = df %>% 
  mutate(col_1 %>% 
           str_split('-', n=2, simplify = TRUE) %>% 
           as_tibble())

Here is how this works:

  • This code works similarly to the Free Split scenario above, except that we set the argument n to specify the maximum number of parts that we wish to obtain.
  • In this case, we set n=2, meaning we wish to perform at most one split to obtain at most two parts. Any remaining delimiters are left untouched in the last part.
  • If an element yields fewer parts than the specified number, empty strings are returned to make up the difference.
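
The behavior of n can be checked on a small vector. This is a minimal sketch with made-up input. It also shows str_split_fixed(), a stringr function that always returns a matrix and can be used in place of str_split() with simplify = TRUE.

```r
library(stringr)

# Hypothetical input with two, one, and zero hyphens.
x = c("a-b-c", "x-y", "z")

# At most two parts per element: only the first hyphen is used.
parts = str_split(x, "-", n = 2, simplify = TRUE)
# Row 1: "a" "b-c"
# Row 2: "x" "y"
# Row 3: "z" ""

# str_split_fixed() gives the same matrix here.
parts_fixed = str_split_fixed(x, "-", n = 2)
```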

Split Words

We wish to split a string into individual words.

df_2 = df %>% 
  mutate(col_1 %>% 
           str_split('\\W+', simplify = TRUE) %>% 
           as_tibble())

Here is how this works:

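This code works similarly to the Free Split scenario above, except that we use the regular expression \W+ as the delimiter (written '\\W+' in an R string, since the backslash must be escaped). It matches one or more consecutive non-word characters, such as spaces and punctuation, which act as separators between words. Note that the underscore counts as a word character, and that a string starting or ending with a non-word character yields an empty string as its first or last part.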

Alternative: Via boundary()

df_2 = df %>% 
  mutate(col_1 %>% 
           str_split(boundary("word"), simplify = TRUE) %>% 
           as_tibble())

Here is how this works:

In boundary("word"), we match the boundaries between words. boundary() is a stringr function that matches boundaries between characters, lines, sentences, or words depending on the argument passed to it.
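
The two approaches can differ at the edges of a string. The following sketch, on a made-up sentence, compares them: the regular expression leaves a trailing empty string after the final punctuation mark, while boundary("word") returns only the words.

```r
library(stringr)

x = "Hello, world!"

# Split around runs of non-word characters.
by_regex = str_split(x, "\\W+")[[1]]
# "Hello" "world" ""

# Split at word boundaries.
by_boundary = str_split(x, boundary("word"))[[1]]
# "Hello" "world"
```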

Split Characters

We wish to split a string into individual characters.

df = tibble(
  col_1 = c("ABC", "XYZ", "KLM"))

df_2 = df %>% 
  mutate(col_1 %>% 
           str_split('', simplify = TRUE) %>% 
           as_tibble())

Here is how this works:

This code works similarly to the Free Split scenario above except that we use '' as the delimiter which denotes splitting around individual characters.
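
As a small check of this behavior, here is a sketch on strings of different lengths (made-up input). Shorter strings are padded with empty strings, just as in the Free Split scenario.

```r
library(stringr)

x = c("ABC", "XY")

# An empty pattern splits each string into its individual characters.
chars = str_split(x, "", simplify = TRUE)
# Row 1: "A" "B" "C"
# Row 2: "X" "Y" ""
```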

Multiple Delimiters

We wish to split a string around any of a set of delimiters.

In this example, we wish to split each element of the column col_1 into its parts which we wish to return as new columns. The delimiter can be a hyphen ‘-’ or an underscore ‘_’.

df_2 = df %>% 
  mutate(col_1 %>% 
           str_split('-|_', simplify = TRUE) %>% 
           as_tibble())

Here is how this works:

  • We use the function str_split() to split each element in a string vector, which here is the column col_1, into parts around a given delimiter. See Free Split above.
  • We use the or operator | of regular expressions to build a regular expression that captures all the delimiter patterns we wish to look for or’ed together. In this case that regular expression is '-|_'.
  • Alternatively, if our delimiters are single characters, we can use the bracket operator like so [-_].
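
The two ways of writing the pattern can be compared directly. This is a minimal sketch on made-up input; for single-character delimiters both patterns should give the same result.

```r
library(stringr)

x = c("a-b_c", "d_e")

# Alternation: a hyphen or an underscore.
by_alternation = str_split(x, "-|_", simplify = TRUE)

# Character class: any one character out of the set.
by_class = str_split(x, "[-_]", simplify = TRUE)
# Row 1: "a" "b" "c"
# Row 2: "d" "e" ""
```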

Delimiter Column

Given a vector of strings to split and a vector of equal size of delimiters, we wish to split each element in the first vector around the corresponding delimiter in the second vector. This is often a case when we wish to split the values of a given column of a data frame using delimiters provided by another column.

df_2 = df %>% 
  mutate(col_1 %>% 
           str_split(col_2, simplify = TRUE) %>% 
           as_tibble())

Here is how this works:

  • We use the function str_split() to split each element in a string vector into parts around a given delimiter. See Free Split above.
  • The function str_split() is vectorized over the input string, which here are the values of the column col_1, and the delimiter pattern, which here are the values of the column col_2. Therefore, the function str_split() will use the corresponding value of the column col_2 as the delimiter when splitting a value in the column col_1.
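
Here is a small sketch of this element-wise pairing on hypothetical data. Since the values of a delimiter column are interpreted as regular expressions by default, delimiters such as ‘.’ or ‘|’ would not be matched literally. Wrapping the delimiters in stringr's fixed() is one way to treat them as plain text. We assume here that matching literally is what is wanted.

```r
library(dplyr)
library(stringr)

# Hypothetical data: each row carries its own delimiter.
df = tibble(
  col_1 = c("a-b", "c_d_e"),
  col_2 = c("-", "_"))

# Each element of col_1 is split around the matching element of col_2.
parts = str_split(df$col_1, df$col_2, simplify = TRUE)
# Row 1: "a" "b" ""
# Row 2: "c" "d" "e"

# Delimiters that are regex metacharacters: match them literally with fixed().
literal_parts = str_split(c("a.b", "c|d"), fixed(c(".", "|")), simplify = TRUE)
# Row 1: "a" "b"
# Row 2: "c" "d"
```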