In some situations, we don’t merely wish to add new columns corresponding to some or all parts resulting from the split of a string column. Rather, we wish to apply some processing on the parts to generate the desired output.
In this section we cover three common scenarios of processing parts produced by string splitting:
This section is complemented by:
We wish to create new columns by concatenating some of the parts resulting from splitting the values of a string column.
In this example, we will pick the third and first parts resulting from splitting a string column and concatenate them into a new column.
df_2 = df %>%
mutate(
str_parts = str_split(col_1, '-', simplify=TRUE),
col_2 = str_c(str_parts[, 3], str_parts[, 1], sep = '_')) %>%
select(-str_parts)
Here is how this works:
str_split()
with simplify=TRUE
to return a matrix of parts with one row for each element in col_1
. We store this matrix in a column str_parts
. See Split.str_parts[, 3]
and str_parts[, 1]
respectively.str_c()
to concatenate the selected vectors element-wise to produce a new vector as output and store that in a new column col_2
.Alternative: via unite()
df_2 = df %>%
separate(col_1, c('str_part_1', NA, 'str_part_3'), remove=FALSE) %>%
unite('col_2', str_part_3, str_part_1, sep = '_')
Here is how this works:
unite()
from the tidyr
package which is specially built for this purpose.unite()
, separate()
to extract and create two new columns str_part_1
and str_part_3
holding the first and third parts resulting from splitting the string column col_1
. See Alternative Solution under Selecting Some Parts.unite()
three inputs:‘col_2’
.str_part_1
and str_part_3
.‘-’
.We wish to set the data type of parts resulting from splitting the values of a string column.
df_2 = df %>%
mutate(
str_parts = str_split(col_1, ' ', simplify = TRUE),
col_2 = as.numeric(str_parts[, 2]),
col_3 = str_parts[, 3]) %>%
select(-str_parts)
Here is how this works:
as.numeric(str_parts[, 2])
, we use the function as.numeric()
to covert the vector, that is the second part of the values of the column col_1, to a numeric data type.Alternative: via separate()
df_2 = df %>%
separate(col_1,
c(NA, 'col_2', 'col_3'),
sep=' ',
convert=TRUE,
remove=FALSE)
Here is how this works:
separate()
. See Selecting Parts.separate()
function has a convert
argument which when set to convert=TRUE
will convert the data type of the new columns to the most appropriate data type i.e. integer, double, or logical.We wish to apply some arbitrary processing logic to the parts resulting from splitting the values of a string column. In other words, we wish to apply some custom logic to determine how to use the output of a string split operation.
In this example, the elements of column col_1
are strings of the form '4_2g, 16_3g, 12g'
. We wish to multiply the numbers within the comma-separated segments and add up the results. So for this case, we would get 4 * 2 + 16 * 3 + 12 = 68
.
flatten_quantity <- function(parts) {
quantity = str_split(parts, '_') %>%
map_dbl(~(str_extract_all(., '\\d+') %>%
as.integer() %>%
prod())) %>%
sum()
return(quantity)
}
df_2 = df %>%
mutate(col_2 = map_dbl(str_split(col_1, ', '), flatten_quantity))
Here is how this works:
str_split()
to break down each element of the column col_1
into parts around the delimiter ‘, ‘
. The output is a list (of the same size as col_1
) of vectors (containing the parts of the respective element of col_1
). See Split.map_dbl()
to iterate over the list of vectors returned by str_split()
and call a custo function flatten_quantity()
on each vector. We opted for map_dbl()
because we expect a single numeric output for each element of the column col_1
. See Working with Lists.flatten_quantity()
works as follows:col_1
.‘_
’ character.col_1
.