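There are two common scenarios for splitting a string into parts, both of which we cover in this section:
- Free Split: we split around every instance of the delimiter. For example, splitting ‘a, b, c, d’ around ‘, ’ as the delimiter, we would get four substrings: a, b, c, and d.
- Split with a fixed number of splits: we cap the number of splits performed. For example, splitting ‘a, b, c, d’ around ‘, ’ as the delimiter with the number of splits set to two, we would get three substrings: a, b, and ‘c, d’.
We then cover four extensions that may be applied to either of the above scenarios:
- Splitting a string into individual words.
- Splitting a string into individual characters.
- Splitting around any of a set of delimiters, e.g. a hyphen ‘-’ character or an underscore ‘_’ character.
- Splitting each string around its own delimiter, taken from a second vector of delimiters.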
We wish to split a string around a given delimiter executing as many splits as there are instances of the delimiter. The number of resulting substrings is, therefore, not fixed in advance.
In this example, we wish to split each element in the column col_1 around hyphen ‘-’ characters and add a new column to the data frame df for each part, where the nth part goes into the nth new column.
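library(dplyr)
library(stringr)
# Split col_1 on every hyphen and unpack the parts into new columns
df_2 = df %>%
  mutate(col_1 %>%
           str_split('-', simplify = TRUE) %>%
           as_tibble())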
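To make the free split concrete, here is a minimal, self-contained sketch. The sample data frame and the part_ column names are assumptions made for illustration; they are not part of the original example. Naming the matrix columns explicitly is a design choice here, so that the new column names do not depend on how as_tibble() names the columns of an unnamed matrix.

```r
library(dplyr)
library(stringr)
library(tibble)

# Hypothetical sample data; the values are illustrative only
df = tibble(col_1 = c("a-b-c", "d-e", "f"))

# Default output (simplify = FALSE): a list with one character vector per input string
parts_list = str_split(df$col_1, "-")

# simplify = TRUE: one row per input string, one column per part;
# strings that yield fewer parts are padded with empty strings
parts = str_split(df$col_1, "-", simplify = TRUE)

# Name the columns explicitly (the "part_" prefix is an assumption)
colnames(parts) = paste0("part_", seq_len(ncol(parts)))

# Attach the parts to the original data frame as new columns
df_2 = bind_cols(df, as_tibble(parts))
```

In this sketch, df_2 ends up with the columns col_1, part_1, part_2, and part_3.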
Here is how this works:
- We use str_split() from the stringr package to split each element in a string vector, which here is the column col_1, into parts around a given delimiter, which here is the hyphen character ‘-’.
- str_split() gives us two options for the structure of its output via its simplify argument:
  - The default, simplify=FALSE, is to return a list of vectors: as many vectors as there are elements in the input string vector, where each vector has as many elements as the number of parts obtained from the corresponding element.
  - With simplify=TRUE, str_split() returns a matrix with one row for each input string and one column for each part by location, i.e. first parts go in the first column. There will be as many columns as the largest number of parts obtained from any value of the column col_1. Values that yield fewer parts are padded with empty strings in the remaining columns.
- We need the output of str_split() structured as a matrix, and therefore we set simplify=TRUE.
- We use as_tibble() to convert the matrix returned by str_split() to a data frame (a tibble). When a data frame is returned to mutate() without being assigned to any column name, the data frame is unpacked into individual columns, which is what we are looking for. See Multi-Value Transformation.
- The output df_2 will be a copy of the input data frame df with additional columns holding the parts of the elements of the column col_1.

We wish to split a string around a given delimiter, executing a fixed, pre-set number of splits. The maximum number of resulting substrings is known in advance.
In this example, we wish to split each element in the column col_1 around the first hyphen ‘-’ character only, and add a new column to the data frame df for each of the resulting parts, where the nth part goes into the nth new column.
df_2 = df %>%
  mutate(col_1 %>%
           str_split('-', n=2, simplify = TRUE) %>%
           as_tibble())
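As a small sketch of how n behaves, the snippet below uses made-up input strings (an assumption for illustration). It also revisits the ‘a, b, c, d’ case: two splits correspond to n = 3, since n counts parts rather than splits.

```r
library(stringr)

# Hypothetical input; the values are illustrative only
x = c("a-b-c-d", "e-f", "g")

# n = 2 caps the output at two parts, i.e. at most one split per string;
# with simplify = TRUE the result is a matrix with two columns
parts = str_split(x, "-", n = 2, simplify = TRUE)

# Two splits yield three parts, so we ask for n = 3
three_parts = str_split("a, b, c, d", ", ", n = 3)[[1]]
```

Here the first row of parts holds "a" and "b-c-d", the last row holds "g" and an empty string, and three_parts holds "a", "b", and "c, d".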
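Here is how this works:
- This works similarly to the Free Split scenario above, except that we use the argument n to specify the maximum number of parts that we wish to obtain.
- In this case, we set n=2, meaning we wish to perform at most one split to obtain at most two parts.

We wish to split a string into individual words.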
df_2 = df %>%
  mutate(col_1 %>%
           str_split('\\W+', simplify = TRUE) %>%
           as_tibble())
Here is how this works:
This code works similarly to the Free Split scenario above, except that we use the regular expression \W+ (written '\\W+' as an R string) as the delimiter. It matches one or more consecutive non-word characters, i.e. the possible separators between words.
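One detail worth seeing in action: when a string starts or ends with non-word characters, splitting on '\\W+' leaves an empty string at that end. The sketch below uses a made-up sentence (an assumption for illustration) and drops the empty pieces afterwards.

```r
library(stringr)

# Hypothetical input sentence
x = "This is a sentence."

# The trailing "." is matched by \W+ and acts as a delimiter,
# so the last element of the result is an empty string
words = str_split(x, "\\W+")[[1]]

# Drop empty pieces if they are not wanted
words_clean = words[words != ""]
```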
Alternative: Via boundary()
df_2 = df %>%
  mutate(col_1 %>%
           str_split(boundary("word"), simplify = TRUE) %>%
           as_tibble())
Here is how this works:
In boundary("word"), we match the boundaries between words. boundary() is a stringr function that matches boundaries between characters, lines, sentences, or words, depending on the argument passed to it.
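A short sketch of the boundary("word") approach on a made-up sentence (an assumption for illustration). Unlike splitting on a non-word regular expression, the trailing punctuation does not leave an empty piece behind in this case.

```r
library(stringr)

# Hypothetical input sentence
x = "This is a sentence."

# Split at word boundaries; punctuation and whitespace are not returned as pieces
words = str_split(x, boundary("word"))[[1]]
```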
We wish to split a string into individual characters.
df = tibble(
  col_1 = c("ABC", "XYZ", "KLM"))
df_2 = df %>%
  mutate(col_1 %>%
           str_split('', simplify = TRUE) %>%
           as_tibble())
Here is how this works:
This code works similarly to the Free Split scenario above, except that we use the empty string '' as the delimiter, which splits each string into its individual characters.
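To show what the character split returns when the strings differ in length, here is a minimal sketch with made-up input (an assumption for illustration).

```r
library(stringr)

# Hypothetical input with strings of different lengths
x = c("ABC", "XY")

# One column per character position; shorter strings are padded with ""
chars = str_split(x, "", simplify = TRUE)
```

The stringr documentation describes the empty pattern '' as equivalent to boundary("character"), so either form should be usable here.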
We wish to split a string around any of a set of delimiters.
In this example, we wish to split each element of the column col_1 into its parts, which we wish to return as new columns. The delimiter can be a hyphen ‘-’ or an underscore ‘_’.
df_2 = df %>%
  mutate(col_1 %>%
           str_split('-|_', simplify = TRUE) %>%
           as_tibble())
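As a quick check of the pattern on made-up input (an assumption for illustration), the sketch below splits on either delimiter and compares the alternation '-|_' with the character class '[-_]'.

```r
library(stringr)

# Hypothetical input mixing both delimiters
x = c("a-b_c", "d_e", "f-g")

# Alternation: split on a hyphen or an underscore
parts_or = str_split(x, "-|_", simplify = TRUE)

# Character class: the same set of single-character delimiters
parts_class = str_split(x, "[-_]", simplify = TRUE)
```

Both calls produce the same matrix for this input. The alternation form is the more general one, since it also handles delimiters longer than one character.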
Here is how this works:
- We use str_split() to split each element in a string vector, which here is the column col_1, into parts around a given delimiter. See Free Split above.
- We use the or operator | of regular expressions to build a regular expression that captures all the delimiter patterns we wish to look for, or’ed together. In this case, that regular expression is '-|_'.
- Since both delimiters here are single characters, the character class [-_] is an equivalent alternative.

Given a vector of strings to split and a vector of delimiters of equal size, we wish to split each element in the first vector around the corresponding delimiter in the second vector. This is often the case when we wish to split the values of a given column of a data frame using delimiters provided by another column.
df_2 = df %>%
  mutate(col_1 %>%
           str_split(col_2, simplify = TRUE) %>%
           as_tibble())
Here is how this works:
- We use str_split() to split each element in a string vector into parts around a given delimiter. See Free Split above.
- str_split() is vectorized over both the input string, which here are the values of the column col_1, and the delimiter pattern, which here are the values of the column col_2. Therefore, str_split() will use the corresponding value of the column col_2 as the delimiter when splitting a value in the column col_1.
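The element-wise pairing of strings and delimiters can be sketched as follows. The data frame below is made up for illustration (an assumption, not the original data).

```r
library(stringr)
library(tibble)

# Hypothetical data: each row carries its own delimiter in col_2
df = tibble(
  col_1 = c("a-b", "c_d", "e:f:g"),
  col_2 = c("-", "_", ":"))

# The i-th string is split around the i-th delimiter
parts = str_split(df$col_1, df$col_2, simplify = TRUE)
```

Note that each delimiter is interpreted as a regular expression, so a delimiter containing special characters such as ‘.’ or ‘|’ would need to be escaped first.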