By Location

We wish to extract a substring from a string given the locations of its start and end characters.

We will cover the following scenarios:

  • Character: Extract a single character given its index in the parent string.
  • Substring: Extract a substring given its start and end indices in the parent string.
  • Multiple Substrings: Extract multiple substrings given their start and end indices in the parent string.
  • Location Columns: The start and / or end locations are provided as values of columns in a data frame.
  • Dynamic Locations: We first locate a substring (via a pattern match) and then extract it.

Character

We wish to extract a single character from a string given its index.

In this example, we wish to extract the first character from each value of the column col_1.

df_2 = df %>% 
  mutate(col_2 = str_sub(col_1, 1, 1))

Here is how this works:

  • We use the function str_sub() from the stringr package to extract the first character from the left from each element of the column col_1.
  • We pass to str_sub() the following:
    • The vector of strings, which in this case is the column col_1, from whose elements we wish to extract particular characters.
    • The start and end locations of the substring we wish to extract. Since we wish to extract a single character in this case, start and end are the same.
  • The output data frame df_2 will be a copy of the input data frame df with an added column col_2 which contains the first character of the corresponding element of the column col_1.
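To make the behavior concrete, here is a minimal sketch on a plain character vector. The vector x and its values are hypothetical and stand in for the column col_1.

```r
library(stringr)

# Hypothetical sample values standing in for col_1
x = c("apple", "kiwi")

# start and end are both 1, so only the first character is kept
first_chars = str_sub(x, 1, 1)
first_chars
```

Because str_sub() is vectorized over its first argument, the same call works unchanged inside mutate().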

Extension: Right Indexing

We wish to extract a character from a string by indexing relative to the end (right side) of the string.

In this example, we wish to extract the last character from each value of the column col_1.

df_2 = df %>% 
  mutate(col_2 = str_sub(col_1, -1))

Here is how this works:

  • We can use negative indices to index relative to the right end of the string. In this case we use -1 to refer to the first character from the end i.e. the last character.
  • In this solution we pass only the start index and not the end because we can leverage the fact that end is the last character by default.
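As an illustration of negative indices, the sketch below uses a hypothetical vector x. It also shows that a negative start and a negative end can be combined.

```r
library(stringr)

# Hypothetical sample values standing in for col_1
x = c("apple", "kiwi")

# -1 is the last character; end defaults to the last character as well
last_chars = str_sub(x, -1)

# Third-from-last through second-from-last character
near_end = str_sub(x, -3, -2)
```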

Substring

We wish to extract a substring from a string given its start and end location indices.

In this example, we wish to extract the substring starting at the 2nd character and ending at the 5th character from each value of the column col_1.

df_2 = df %>% 
  mutate(col_2 = str_sub(col_1, 2, 5))

Here is how this works:

  • We use the function str_sub() from the stringr package to extract a substring from each element of the column col_1.
  • We pass to str_sub() the following:
    • The vector of strings, which in this case is the column col_1, from whose elements we wish to extract particular characters.
    • The start and end locations of the substring we wish to extract, which in this case are 2 and 5. Note that both start and end indices are inclusive i.e. the characters denoted by the start and end indices will be included in the extracted substring.
  • The output data frame df_2 will be a copy of the input data frame df with an added column col_2 which contains the extracted substring (characters 2 through to 5) from the corresponding element of the column col_1.
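The following sketch, on a hypothetical vector x, shows that both bounds are inclusive and what happens when a string is shorter than the requested end location.

```r
library(stringr)

# Hypothetical sample values standing in for col_1
x = c("abcdefg", "xyz")

# Characters 2 through 5, both inclusive; a string shorter than
# the end location is simply cut off at its last character
subs = str_sub(x, 2, 5)
subs
```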

Extension: nth to End

We wish to drop the first n characters and keep the rest.

df_2 = df %>% 
  mutate(col_2 = str_sub(col_1, 3))

Here is how this works:

  • Our objective is to drop the first n characters (which in this example is 2) and keep the rest; i.e. to capture the characters between start=3 and the last character.
  • Since end is last by default, we are passing a value for start only.

Extension: Start to nth

We wish to keep the first n characters and drop the rest.

df_2 = df %>% 
  mutate(col_2 = str_sub(col_1, end = 2))

Here is how this works:

  • Our objective is to keep the first n characters (which in this example is 2) and drop the rest; i.e. to capture the characters between start=1 and end=2.
  • Since start=1 by default, we are passing a value for end only.

Extension: Start to nth from End

We wish to drop the last n characters and keep the rest.

df_2 = df %>% 
  mutate(col_2 = str_sub(col_1, end = -3))

Here is how this works:

  • Our objective is to drop the last n characters (which in this example is 2) and keep the rest; i.e. to capture the characters between start=1 and end=-3 (the third from the end).
  • Since start=1 by default, we are passing a value for end only.

Extension: nth from End to End

We wish to keep the last n characters and drop the rest.

df_2 = df %>% 
  mutate(col_2 = str_sub(col_1, -2))

Here is how this works:

  • Our objective is to keep the last n characters (which in this example is 2) and drop the rest; i.e. to capture the characters between start=-2 (the second from the end) and the last character.
  • Since end is the last character by default, we are passing a value for start only.
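The four extensions above can be compared side by side on a single string. This is a sketch on a hypothetical value x, with n = 2 throughout as in the examples above.

```r
library(stringr)

# Hypothetical sample value
x = "abcdefg"

drop_first_2 = str_sub(x, 3)         # nth to end: start = 3, default end
keep_first_2 = str_sub(x, end = 2)   # start to nth: default start = 1
drop_last_2  = str_sub(x, end = -3)  # start to nth from end
keep_last_2  = str_sub(x, -2)        # nth from end to end
```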

Multiple Substrings

We wish to extract multiple substrings given their start and end location indices.

In this example, we wish to extract three substrings given their start and end locations from each value of the column col_1. We wish to obtain the extracted substrings as three new columns named a, b, and c.

start_locs = c(1, 2, 3)
end_locs = c(2, 4, 6)

df_2 = df %>% 
  mutate(
    map_dfr(str_sub_all(col_1, start_locs, end_locs), 
            ~set_names(., c('a', 'b', 'c')))
  )

Here is how this works:

  • We use the function str_sub_all() from the stringr package to extract multiple substrings from a string by location.
  • We pass to str_sub_all() the following as input:
    • The vector of strings, which in this case is the column col_1, from whose elements we wish to extract particular substrings.
    • Two vectors of equal size holding the start and the corresponding end locations respectively of the substrings that we wish to extract. In this case, the start locations are specified in start_locs and the end locations in end_locs.
  • The output of str_sub_all() is a list of character vectors.
    • The list has as many elements as the input, which in this case is the column col_1.
    • Each vector has the substrings extracted and will have as many elements as the number of elements in the start (and end) location vectors, which in this case is 3.
  • If a string is shorter than the locations of the substring to be extracted, an empty string “” is returned.
  • We use map_dfr() to iterate over the list returned by str_sub_all():
    • In each iteration, we name the elements of the vector of substrings via the set_names() function.
    • map_dfr() then treats each named vector as a one-row data frame whose column names are the names assigned via set_names(), which in this case are 'a', 'b', and 'c'.
    • The outputs from all iterations are combined into a single data frame which has one row for each element in col_1, and one column for each substring, i.e. one for each pair of start and end locations.
    • See Working with Lists for more on map_dfr() and iterating with the map_*() family of functions.
  • When a data frame is returned to mutate() without being assigned to any column name, the data frame is unpacked to individual columns which is what we are looking for. See Multi-Value Transformation.
  • The output data frame df_2 will be a copy of the input data frame df with three added columns 'a', 'b', and 'c' holding the three substrings extracted from the corresponding elements of the column col_1.
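To see the list that str_sub_all() produces before it is reshaped, here is a small sketch on a hypothetical vector x. It assumes a version of stringr that provides str_sub_all() (to my knowledge 1.5.0 or later), and contrasts it with str_sub(), which pairs locations with strings element by element.

```r
library(stringr)

# Hypothetical sample values standing in for col_1
x = c("abcdefg", "hijklmn", "xy")
start_locs = c(1, 2, 3)
end_locs = c(2, 4, 6)

# Every (start, end) pair is applied to every string,
# giving one character vector of three substrings per string
all_subs = str_sub_all(x, start_locs, end_locs)

# By contrast, str_sub() pairs the i-th location with the i-th string
paired = str_sub(x, start_locs, end_locs)
```

Note how the short string "xy" yields an empty string for the locations that lie beyond its end.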

Alternative: Multiple calls to str_sub()

df_2 = df %>% 
  mutate(
    a = str_sub(col_1, start = 1, end = 2),
    b = str_sub(col_1, start = 2, end = 4),
    c = str_sub(col_1, start = 3, end = 6)
  )

Here is how this works:

  • Instead of using str_sub_all(), we can make individual calls to str_sub().
  • The output will be the same as the primary solution above; i.e. a data frame df_2 that is a copy of the input data frame df with three added columns 'a', 'b', and 'c' holding the three substrings extracted from the corresponding elements of the column col_1.

Location Columns

We wish to extract substrings from each element of a string column by location where the start and / or end locations are provided as values of columns in a data frame.

In this example, we wish to create a new column col_4 where each element is a substring extracted from the corresponding value of the column col_1 where the start and end location are provided by the corresponding values of the columns col_2 and col_3 respectively.

df_2 = df %>% 
  mutate(
    col_4 = str_sub(col_1, col_2, col_3)
  )

Here is how this works:

  • We use str_sub() to extract a substring given its locations.
  • str_sub() is vectorized over both the string and the locations. Therefore, we can pass:
    • A vector of strings from which substrings will be extracted, which in this case is col_1
    • Vectors of start and end locations that are of the same size as the input vector of strings which in this case are col_2 and col_3.
  • The output of str_sub() is a vector of the same size as col_1 where each element is a substring extracted from the corresponding element of col_1, between the start location provided by col_2 and the end location provided by col_3.
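A minimal end-to-end sketch of the row-wise pairing follows. The data frame and its values are hypothetical. Only the column names follow the example above.

```r
library(dplyr)
library(stringr)

# Hypothetical input: each row carries its own start and end location
df = tibble(
  col_1 = c("abcdef", "ghijkl", "mnopqr"),
  col_2 = c(1, 2, 3),
  col_3 = c(3, 4, 6)
)

# Row i uses col_2[i] and col_3[i] as the bounds for col_1[i]
df_2 = df %>%
  mutate(col_4 = str_sub(col_1, col_2, col_3))
```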

Dynamic Locations

One Substring

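We wish to extract a substring by location from a given string where the locations are obtained dynamically for that string.

In this example, we wish to extract the first sequence of two or more 'X' characters from each value of the column col_1.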

df_2 = df %>% 
  mutate(
    col_2 = str_sub(col_1, str_locate(col_1, "(X){2,}"))
  )

Here is how this works:

  • We use str_locate() to obtain the locations of the first sequence of 2 or more ‘X’ characters, as specified by the regex ‘(X){2,}’, from each element in the column col_1. See Locating.
  • We pass the extracted locations to str_sub() to extract the corresponding substrings. See Substring above.
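The intermediate matrix is worth inspecting. The sketch below uses a hypothetical vector x, one element of which has no match, to show what str_locate() returns and how str_sub() consumes it.

```r
library(stringr)

# Hypothetical sample values; the second one contains no match
x = c("aXXXb", "abc", "XaXXb")

# One row per string, with columns "start" and "end";
# strings without a match get NA in both columns
locs = str_locate(x, "(X){2,}")

# A two-column matrix passed as `start` supplies both bounds
out = str_sub(x, locs)
```

A string with no match produces NA rather than an empty string, which may need handling downstream.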

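Multiple Substrings

We wish to extract all substrings of a given string whose locations are obtained dynamically, and to combine them into a single string.

In this example, we wish to extract every sequence of two or more 'X' characters from each value of the column col_1 and join them with '-'.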

df_2 = df %>% 
  mutate(
    col_2 = str_sub_all(
        col_1, 
        str_locate_all(col_1, "(X){2,}")) %>% 
      map_chr(str_flatten, collapse='-')
  )

Here is how this works:

  • We use str_locate_all() to identify the locations of all matches of the pattern in each element of the column col_1. It returns a list with one matrix per element, where each matrix has two columns holding the start and end locations. See Locating.
  • We then use str_sub_all(), which accepts the list of matrices returned by str_locate_all() as is, without us needing to extract the start and end locations manually, to extract the corresponding substrings. See Multiple Substrings above.
  • Finally, we use map_chr() with str_flatten() to combine the substrings extracted from each string into a single string separated by '-'.
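To make the intermediate structures visible, the sketch below walks through the same steps one string at a time on a hypothetical vector x. It is an illustrative variant, not the solution above: it swaps in map2() with str_sub() in place of str_sub_all(), so that each location matrix is paired explicitly with its string.

```r
library(stringr)
library(purrr)

# Hypothetical sample values, each containing two matches
x = c("aXXbXXXc", "XXaXXXX")

# A list with one two-column (start, end) matrix per input string
locs = str_locate_all(x, "(X){2,}")

# For each string, pass its matrix as `start` to get all its matches
subs = map2(x, locs, str_sub)

# Collapse each string's matches into a single '-'-separated value
joined = map_chr(subs, str_flatten, collapse = "-")
```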