In Split, we saw how to split each string element in a data frame column into parts around a given string delimiter pattern. We will see in this section how to select one, all, or some of the parts resulting from splitting a string.
This section is organized as follows: we first select one part by its index, then all parts, and finally some of the parts.
We wish to select any one part from the parts obtained by splitting a string given its index.
In this example, we wish to split each element in the column col_1 around the hyphen ‘-’ character and return the second part.
df_2 = df %>%
  mutate(
    col_2 = str_split(col_1, '-', simplify = TRUE)[, 2])
Here is how this works:
- We use str_split() with simplify=TRUE to return a matrix of parts with one row for each element in col_1. See Split.
- We use [, 2] to select the second column of the matrix returned by str_split().
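As a quick check of the steps above, here is a minimal sketch on a made-up data frame (the sample values in df are assumptions for illustration). It also shows that, with simplify=TRUE, rows with fewer than two parts yield an empty string rather than NA, which na_if() can convert if NA is preferred.

```r
library(dplyr)
library(stringr)

# Hypothetical sample data; the third element contains no hyphen.
df <- tibble(col_1 = c("a-b-c", "d-e", "f"))

df_2 <- df %>%
  mutate(
    # Second column of the matrix of parts; short rows are padded with "".
    col_2 = str_split(col_1, "-", simplify = TRUE)[, 2],
    # Optional: turn the "" padding into NA.
    col_2_na = na_if(col_2, ""))

print(df_2)
```

Note that indexing a column that does not exist at all (for example, [, 2] when no element contains a hyphen) raises a subscript out of bounds error.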
Extension: Select Last Part
In this example, we wish to split each element in the column col_1 around the hyphen ‘-’ character and return the last part.
df_2 = df %>%
  mutate(
    col_3 = map_chr(str_split(col_1, '-'), ~.[length(.)]))
Here is how this works:
- The number of parts obtained by splitting each element of col_1 is not necessarily the same, so we cannot select the last part with a fixed column index.
- We use str_split() with simplify=FALSE (the default) to return a list with one vector of parts for each element in col_1.
- We use map_chr() to iterate over the list and return the last element of each vector of parts. See Working with Lists.
- We use ~.[length(.)] to select the last element in a vector.
Extension: Select nth Part Relative to End
In this example, we wish to split each element in the column col_1 around the hyphen ‘-’ character and return the second-to-last part.
df_2 = df %>%
  mutate(col_2 = map_chr(
    str_split(col_1, "-"),
    # NA_character_ keeps the result a character value for map_chr()
    ~ifelse(length(.) > 1, .[length(.) - 1], NA_character_)))
Here is how this works:
- This works like the solution above, except that when, for an element of col_1, the returned vector of parts is not long enough, we return NA.
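The two extensions above can be sketched together on made-up data whose elements split into different numbers of parts (the sample values are assumptions for illustration). As a cross-check, the sketch also calls str_split_i(), which, assuming stringr 1.5.0 or later is installed, extracts a single part directly and accepts a negative index to count from the end.

```r
library(dplyr)
library(purrr)
library(stringr)

# Hypothetical sample data with 3, 2, and 1 parts respectively.
df <- tibble(col_1 = c("a-b-c", "d-e", "f"))

df_2 <- df %>%
  mutate(
    parts = str_split(col_1, "-"),
    # Last part of each vector of parts.
    last_part = map_chr(parts, ~ .x[length(.x)]),
    # Second-to-last part, or NA when there is only one part.
    second_last = map_chr(
      parts,
      ~ if (length(.x) > 1) .x[length(.x) - 1] else NA_character_),
    # Same selections via str_split_i() (assumes stringr >= 1.5.0).
    last_part_i = str_split_i(col_1, "-", -1),
    second_last_i = str_split_i(col_1, "-", -2)) %>%
  select(-parts)

print(df_2)
```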
We wish to select all parts obtained from splitting a string.
In this example, we wish to split each element of the column col_1 around the hyphen character ‘-’ and store each part in a new column. We expect three parts, and we wish to call the three resulting columns 'col_2', 'col_3', and 'col_4'.
df_2 = df %>%
  mutate(
    str_parts = str_split(col_1, '-', n=3, simplify=TRUE),
    col_2 = str_parts[, 1],
    col_3 = str_parts[, 2],
    col_4 = str_parts[, 3]) %>%
  select(-str_parts)
Here is how this works:
- We use str_split() while setting simplify=TRUE to split each element of the column col_1 into parts and return the result as a matrix. See Split.
- We set n=3 in the call to str_split() so we are confident that we will get three parts and, accordingly, a matrix of three columns as output (padded with empty strings if necessary) and won’t get any unexpected index out of bounds errors on indexing.
Alternative: via separate()
df_2 = df %>%
  separate(
    col_1,
    c('col_2', 'col_3', 'col_4'),
    '-',
    extra='merge',
    remove=FALSE)
Here is how this works:
- We use separate() from the tidyr package, which is built for this purpose.
- We pass to separate() three inputs:
  - The column to split, here col_1.
  - The names of the columns to create, here c('col_2', 'col_3', 'col_4').
  - The separator, here '-'. Note: We can omit this because, by default, separate() splits on any sequence of non-alphanumeric characters.
- Setting extra='merge' tells separate() to create only as many parts as the number of column names passed, with the last part containing the rest of the string.
- We set remove=FALSE so the input column, col_1, is not dropped from the output. The default is remove=TRUE.
- separate() usually provides a cleaner and more concise solution than using str_split() and mutate(), and it takes care of creating multiple new columns.
Extension: Arbitrary Number of Parts
df_2 = df %>%
  mutate(
    as_tibble(str_split(col_1, '-', simplify=TRUE)))
Here is how this works:
- We use as_tibble() to convert the matrix returned by str_split() to a data frame (a tibble). When a data frame is returned to mutate() without being assigned to any column name, the data frame is unpacked into individual columns, which is what we are looking for. See Multi-Value Transformation.
- The new columns are named V1, V2, V3, etc., which is the default for as_tibble().
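When the number of parts is not known in advance, one option is to name the matrix columns before converting, since recent versions of tibble may warn when as_tibble() receives a matrix without column names. The following is a minimal sketch under that assumption; the sample data and the part_ prefix are hypothetical choices.

```r
library(dplyr)
library(stringr)

# Hypothetical sample data with 3, 2, and 4 parts respectively.
df <- tibble(col_1 = c("a-b-c", "d-e", "f-g-h-i"))

# One matrix column per part; shorter rows are padded with "".
parts <- str_split(df$col_1, "-", simplify = TRUE)

# Name the columns part_1, part_2, ... however many there are.
colnames(parts) <- paste0("part_", seq_len(ncol(parts)))

# An unnamed data frame passed to mutate() is unpacked into columns.
df_2 <- df %>%
  mutate(as_tibble(parts))

print(df_2)
```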
Alternative: Implicit Assignment
df_2 = df %>%
  mutate(
    as_tibble(
      str_split(col_1, '-', simplify=TRUE),
      .name_repair = ~c('col_2', 'col_3', 'col_4')))
Here is how this works:
- Since the matrix returned by str_split() has no column names, we use the .name_repair argument of as_tibble() to specify the names to assign to the columns of the data frame being created. In this case, we pass what is called a “look-up formula” specifying the column name for each column (three columns in this case) as ~c('col_2', 'col_3', 'col_4'). See Multi-Value Transformation for a description of this pattern.
- For .name_repair to work, we must be confident of the number of parts returned by splitting. If we are not sure of the number of parts to expect (or do not wish to name the resulting columns), we can drop the .name_repair argument. The resulting columns will be named V1, V2, V3, etc. by default.
Alternative: Bind Columns
df_2 = df %>%
  bind_cols(df %>%
    pull(col_1) %>%
    str_split('-', simplify = TRUE) %>%
    as_tibble())
Here is how this works:
- We set simplify=TRUE so str_split() returns a matrix. We then convert the matrix to a data frame via as_tibble().
- We use bind_cols() to column bind the new columns to the original data frame.
- The new columns are named V1, V2, and V3, which is the default for as_tibble().
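To compare how the str_split() and separate() approaches above treat strings with more or fewer than three parts, here is a minimal sketch on made-up data (the sample values are assumptions). The fill='right' argument is not part of the original solution; it is added only to avoid the warning that separate() gives by default when a row has too few parts.

```r
library(dplyr)
library(stringr)
library(tidyr)

# Hypothetical sample data: one element with four parts, one with two.
df <- tibble(col_1 = c("a-b-c-d", "e-f"))

# str_split() with n = 3: extra pieces stay in the third part,
# and missing parts are padded with "".
parts <- str_split(df$col_1, "-", n = 3, simplify = TRUE)

# separate() with extra = 'merge': extra pieces are merged the same way,
# but missing parts become NA.
df_sep <- df %>%
  separate(
    col_1,
    c("col_2", "col_3", "col_4"),
    "-",
    extra = "merge",
    fill = "right",
    remove = FALSE)

print(parts)
print(df_sep)
```

The practical difference is in the short rows: the matrix route yields empty strings while separate() yields NA, which matters for any later is.na() checks.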
We wish to select some of the parts returned by splitting a string.
In this example, we wish to select parts 1 and 3 resulting from splitting each element of the column col_1 around the hyphen character ‘-’ as new columns.
df_2 = df %>%
  mutate(
    str_parts = str_split(col_1, '-', n=3, simplify=TRUE),
    col_2 = str_parts[, 1],
    col_4 = str_parts[, 3]) %>%
  select(-str_parts)
Here is how this works:
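- We use str_split() while setting simplify=TRUE to split each element of the column col_1 into parts and return the result as a matrix. See Split.
- We then keep only the first and third columns of that matrix via str_parts[, 1] and str_parts[, 3].
Alternative: via separate()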
df_2 = df %>%
  separate(
    col_1,
    c('col_2', NA, 'col_3'),
    '-',
    remove=FALSE)
Here is how this works:
- We use separate() from the tidyr package, which is built for this purpose.
- We pass to separate() three inputs:
  - The column to split, here col_1.
  - The names of the columns to create, here c('col_2', NA, 'col_3').
  - The separator, here '-'. Note: We can omit this because, by default, separate() splits on any sequence of non-alphanumeric characters.
- We set remove=FALSE so the input column, col_1, is not dropped from the output. The default is remove=TRUE.
- By passing c('col_2', NA, 'col_3'), we specify that we wish to capture the first and third parts and store those in columns named col_2 and col_3. Having NA as the second element of the column name vector specifies that we wish to drop that part.
- separate() usually provides a cleaner and more concise solution than using str_split() and mutate(), and it takes care of creating multiple new columns.
Extension: Arbitrary Number of Parts
In this example, we wish to drop parts 1 and 3 and return the rest as new columns.
df_2 = df %>%
  mutate(
    str_split(col_1, '-', simplify=TRUE) %>%
      as_tibble() %>%
      select(-1, -3))
Here is how this works:
- Since the matrix returned by str_split() is converted to a data frame via as_tibble(), we can use select() on the resulting data frame.
- select() gives us a wide set of column selection options, such as dropping particular parts, which is what we do here via select(-1, -3). See Selection.
Extension: Relative to End
In this example, we wish to extract the second-to-last and third-to-last parts of each string regardless of how many parts result from splitting it.
# Return the element p_offset positions before the last one,
# or NA if the vector is too short.
get_nth_end <- function(p_vec, p_offset) {
  ifelse(length(p_vec) > p_offset, p_vec[length(p_vec) - p_offset], NA_character_)
}
df_2 = df %>%
  mutate(
    str_parts = str_split(col_1, '-'),
    col_2 = map_chr(str_parts, get_nth_end, p_offset=1),
    col_3 = map_chr(str_parts, get_nth_end, p_offset=2)) %>%
  select(-str_parts)
Here is how this works:
- The number of parts obtained by splitting each element of col_1 is not necessarily the same, so we cannot rely on a fixed column index.
- We use str_split() with simplify=FALSE (the default) to return a list with one vector of parts for each element in col_1.
- We use map_chr() to iterate over the list and apply get_nth_end() to each vector of parts. See Working with Lists.
- We define a function get_nth_end() that expects a vector and an offset value and returns the element of the vector at that offset from the end of the vector. If the vector is not longer than the offset, the function returns NA.
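A possible variation, offered as a sketch and not as part of the original solution: reversing each vector of parts turns an offset from the end into an ordinary index, and indexing a character vector past its length already yields NA, so no explicit length check is needed. The helper name get_nth_end_rev() and the sample data are hypothetical.

```r
library(dplyr)
library(purrr)
library(stringr)

# Hypothetical helper: offset 0 is the last element, 1 the second-to-last, etc.
# Indexing beyond the end of a character vector returns NA_character_.
get_nth_end_rev <- function(p_vec, p_offset) {
  rev(p_vec)[p_offset + 1]
}

# Hypothetical sample data with 3, 2, and 1 parts respectively.
df <- tibble(col_1 = c("a-b-c", "d-e", "f"))

df_2 <- df %>%
  mutate(
    str_parts = str_split(col_1, "-"),
    col_2 = map_chr(str_parts, get_nth_end_rev, p_offset = 1),
    col_3 = map_chr(str_parts, get_nth_end_rev, p_offset = 2)) %>%
  select(-str_parts)

print(df_2)
```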