We wish to extract a substring, from a subject string, given the locations of its start and end characters.
We will cover the following scenarios:
We wish to extract a single character from a string given its index.
In this example, we wish to extract the first character from each value of the column col_1.
df_2 = df %>%
mutate(col_2 = str_sub(col_1, 1, 1))
Here is how this works:
We use str_sub() from the stringr package to extract the first character from the left from each element of the column col_1.
We pass to str_sub() the following:
  The column col_1, from whose elements we wish to extract particular characters.
  The start and end locations of the substring we wish to extract. Since in this case we wish to extract a single character, both start and end are the same.
The output data frame df_2 will be a copy of the input data frame df with an added column col_2 which contains the first character of the corresponding element of the column col_1.
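To make the behavior concrete, here is a minimal sketch run on a small made-up data frame. Only the column names come from the text; the sample values are assumptions chosen to include an empty string and a missing value.

```r
library(dplyr)
library(stringr)

# Hypothetical sample data (not from the original text).
df <- tibble(col_1 = c("apple", "kiwi", "", NA))

# start and end are both 1, so exactly one character is extracted.
df_2 <- df %>%
  mutate(col_2 = str_sub(col_1, 1, 1))

df_2$col_2
# "a" "k" "" NA
```

An empty string yields an empty string and a missing value stays missing, so no special handling should be needed for either case.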
Extension: Right Indexing
We wish to extract a character from a string by indexing relative to the end (right side) of the string.
In this example, we wish to extract the last character from each value of the column col_1.
df = df %>%
mutate(col_4 = str_sub(col_1, -1))
Here is how this works:
We pass -1 to refer to the first character from the end i.e. the last character.
We pass only the start index and not the end because we can leverage the fact that end is the last character by default.

We wish to extract a substring from a string given its start and end location indices.
In this example, we wish to extract the substring starting at the 2nd character and ending at the 5th character from each value of the column col_1.
df_4 = df %>%
mutate(col_2 = str_sub(col_1, 2, 5))
Here is how this works:
We use str_sub() from the stringr package to extract a substring from each element of the column col_1.
We pass to str_sub() the following:
  The column col_1, from whose elements we wish to extract particular characters.
  The start and end locations of the substring we wish to extract, which in this case are 2 and 5. Note that both start and end indices are inclusive i.e. the characters denoted by the start and end indices will be included in the extracted substring.
The output data frame df_4 will be a copy of the input data frame df with an added column col_2 which contains the extracted substring (characters 2 through to 5) from the corresponding element of the column col_1.
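The inclusive bounds are easiest to see on a few sample strings. The sketch below uses made-up values (an assumption, not data from the text), including strings shorter than the requested range.

```r
library(stringr)

# Hypothetical sample strings of different lengths.
x <- c("abcdefgh", "abc", "a")

# Characters 2 through 5, both ends included.
sub <- str_sub(x, 2, 5)

sub
# "bcde" "bc" ""
```

The first result has four characters (positions 2, 3, 4 and 5). When a string is shorter than the requested range, str_sub() returns whatever part of the range exists rather than raising an error.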
Extension: nth to End
We wish to drop the first n characters and keep the rest.
df_2 = df %>%
mutate(col_2 = str_sub(col_1, 3))
Here is how this works:
We extract the substring between start=3 and the last character, which drops the first two characters.
Since end is the last character by default, we are passing a value for start only.
Extension: Start to nth
We wish to keep the first n characters and drop the rest.
df_2 = df %>%
mutate(col_2 = str_sub(col_1, end = 2))
Here is how this works:
We extract the substring between start=1 and end=2.
Since start=1 by default, we are passing a value for end only.
Extension: Start to nth from End
We wish to drop the last n characters and keep the rest.
df_2 = df %>%
mutate(col_2 = str_sub(col_1, end = -3))
Here is how this works:
We extract the substring between start=1 and end=-3 (the third from the end).
Since start=1 by default, we are passing a value for end only.
Extension: nth from End to End
We wish to keep the last n characters and drop the rest.
df_2 = df %>%
mutate(col_2 = str_sub(col_1, -2))
Here is how this works:
We extract the substring between start=-2 (the second from the end) and the last character.
Since end is the last character by default, we are passing a value for start only.

We wish to extract multiple substrings given their start and end location indices.
In this example, we wish to extract three substrings given their start and end locations from each value of the column col_1. We wish to obtain the extracted substrings as three new columns named a, b, and c.
start_locs = c(1, 2, 3)
end_locs = c(2, 4, 6)
df_2 = df %>%
mutate(
map_dfr(str_sub_all(col_1, start_locs, end_locs),
~set_names(., c('a', 'b', 'c')))
)
Here is how this works:
We use str_sub_all() from the stringr package to extract multiple substrings from a string by location.
We pass to str_sub_all() the following as input:
  The column col_1, from whose elements we wish to extract particular substrings.
  The start locations in start_locs and the end locations in end_locs.
The output of str_sub_all() is a list of character vectors.
  There is one character vector for each element of col_1.
  Where a start location falls beyond the end of the string, an empty string "" is returned.
We use map_dfr() to iterate over the elements of the output of str_sub_all() and:
  Name the elements of each character vector using the set_names() function.
  The names are passed to set_names(), which in this case are 'a', 'b', 'c'.
  Combine the results into a data frame with one row for each element of col_1, and one column for each substring in the character vectors returned by str_sub_all().
This relies on map_dfr() and iterating with the map_*() family of functions.
Since the data frame is returned to mutate() without being assigned to any column name, the data frame is unpacked to individual columns which is what we are looking for. See Multi-Value Transformation.
The output data frame df_2 will be a copy of the input data frame df with three added columns 'a', 'b', and 'c' holding the three substrings extracted from the corresponding elements of the column col_1.
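The intermediate list is worth inspecting once. The sketch below runs the same steps on a made-up two-row data frame (the sample values and the helper name pieces are assumptions), and it assumes stringr 1.5.0 or later, where str_sub_all() is available.

```r
library(dplyr)
library(purrr)
library(stringr)

# Hypothetical sample data; the second string is shorter than the last range.
df <- tibble(col_1 = c("abcdefg", "xy"))

start_locs <- c(1, 2, 3)
end_locs <- c(2, 4, 6)

# One character vector per string, one element per (start, end) pair.
pieces <- str_sub_all(df$col_1, start_locs, end_locs)
# pieces[[1]]: "ab" "bcd" "cdef"
# pieces[[2]]: "xy" "y"   ""

# Name each vector, row-bind into a data frame, and let mutate() unpack it.
df_2 <- df %>%
  mutate(map_dfr(pieces, ~ set_names(., c("a", "b", "c"))))
```

The third range starts past the end of "xy", so that cell is an empty string rather than a missing value.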
Alternative: Multiple calls to str_sub()
df_2 <- df %>%
mutate(
a = str_sub(col_1, start = 1, end = 2),
b = str_sub(col_1, start = 2, end = 4),
c = str_sub(col_1, start = 3, end = 6)
)
Here is how this works:
Instead of using str_sub_all(), we can make individual calls to str_sub().
The output is a data frame df_2 that is a copy of the input data frame df with three added columns 'a', 'b', and 'c' holding the three substrings extracted from the corresponding elements of the column col_1.
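As a quick check that the two approaches agree, the sketch below compares the three str_sub() calls with a single str_sub_all() call on made-up data. The sample values and the names via_calls and via_all are assumptions, and stringr 1.5.0 or later is assumed for str_sub_all().

```r
library(dplyr)
library(stringr)

# Hypothetical sample data.
df <- tibble(col_1 = c("abcdefg", "xy"))

# One str_sub() call per output column.
df_2 <- df %>%
  mutate(
    a = str_sub(col_1, start = 1, end = 2),
    b = str_sub(col_1, start = 2, end = 4),
    c = str_sub(col_1, start = 3, end = 6)
  )

# The same substrings as plain character matrices, one row per string.
via_calls <- unname(as.matrix(df_2[c("a", "b", "c")]))
via_all <- do.call(rbind, str_sub_all(df$col_1, c(1, 2, 3), c(2, 4, 6)))

all(via_calls == via_all)
```

Separate calls tend to read more clearly for a handful of fixed ranges, while str_sub_all() scales better when the locations are held in vectors.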

We wish to extract substrings from each element of a string column by location where the start and / or end locations are provided as values of columns in a data frame.
In this example, we wish to create a new column col_4 where each element is a substring extracted from the corresponding value of the column col_1, where the start and end locations are provided by the corresponding values of the columns col_2 and col_3 respectively.
df_2 = df %>%
mutate(
col_4 = str_sub(col_1, col_2, col_3)
)
Here is how this works:
We use str_sub() to extract a substring given its locations.
str_sub() is vectorized over both the string and the locations. Therefore, we can pass:
  The column col_1 as the strings to extract from.
  The columns col_2 and col_3 as the start and end locations.
The output of str_sub() is a vector of the same size as col_1 where each element is a substring extracted from the corresponding element of col_1 between the start location provided by col_2 and the end location provided by col_3.
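A small worked case shows the row-wise pairing of strings and locations. The sample values below are made up for illustration; only the column names come from the text.

```r
library(dplyr)
library(stringr)

# Hypothetical sample data: each row carries its own start and end locations.
df <- tibble(
  col_1 = c("abcdef", "ghijkl", "mnopqr"),
  col_2 = c(1, 2, 3),
  col_3 = c(2, 4, 6)
)

# Row i uses col_2[i] and col_3[i] as the bounds for col_1[i].
df_2 <- df %>%
  mutate(col_4 = str_sub(col_1, col_2, col_3))

df_2$col_4
# "ab" "hij" "opqr"
```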
One Substring
We wish to extract a substring by location from a given string where the locations are obtained in a dynamic manner for said string.
df_2 = df %>%
mutate(
col_2 = str_sub(col_1, str_locate(col_1, "(X){2,}"))
)
Here is how this works:
We use str_locate() to obtain the locations of the first sequence of 2 or more 'X' characters, as specified by the regex "(X){2,}", from each element in the column col_1. See Locating.
We then pass these locations to str_sub() to extract the corresponding substrings. See Substring above.
Multiple Substrings
df_2 = df %>%
mutate(
col_2 = str_sub_all(
col_1,
str_locate_all(col_1, "(X){2,}")) %>%
map_chr(str_flatten, collapse='-')
)
Here is how this works:
We use str_locate_all() to identify the locations of all matching patterns in each element of the column col_1 and return those as a matrix with two columns for the start and end locations. See Locating.
We then pass the output to str_sub_all(), which accepts the matrix returned by str_locate_all() as is without us needing to extract the start and end locations manually, to extract the corresponding substrings. See Multiple Substrings above.
Finally, we use map_chr() with str_flatten() to join the substrings extracted from each string into a single string separated by '-'.
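To see both dynamic cases side by side, here is a minimal sketch on made-up strings in which every value contains at least one match. The sample values and the names first_run and all_runs are assumptions, and stringr 1.5.0 or later is assumed for str_sub_all().

```r
library(dplyr)
library(purrr)
library(stringr)

# Hypothetical sample data; every string has at least one run of 2+ 'X'.
df <- tibble(col_1 = c("aXXbXXXc", "XXXXd", "XaXX"))

# One substring: the first run only.
first_run <- str_sub(df$col_1, str_locate(df$col_1, "(X){2,}"))
# "XX" "XXXX" "XX"

# Multiple substrings: every run, joined with '-'.
all_runs <- str_sub_all(df$col_1, str_locate_all(df$col_1, "(X){2,}")) %>%
  map_chr(str_flatten, collapse = "-")
# "XX-XXX" "XXXX" "XX"
```

Strings with no match are deliberately left out of this sketch. How they come through depends on what str_locate() and str_locate_all() return in that case, which is worth checking separately before relying on it.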