Locating

We wish to obtain the start and end locations (as integers) of a given pattern in a target string.

In this section we will cover the following four common string pattern location scenarios:

  • First Match where we cover how to obtain the start and end locations of the first occurrence of a given pattern in a given string.
  • All Matches where we cover how to obtain the start and end locations of all occurrences of a given pattern in a given string.
  • Last Match where we cover how to obtain the start and end locations of the last occurrence of a given pattern in a given string.
  • nth Match where we cover how to obtain the start and end locations of the nth occurrence of a given pattern in a given string.

For each of these four scenarios, we will cover two cases:

  • Substring where the pattern is a plain character sequence
  • Regular Expressions where the pattern is a regular expression

In addition, we cover the following scenarios which can be applied to extend any of the above:

  • Ignore Case where we cover how to ignore the case (of both pattern and string) while matching.
  • Complement where we cover how to obtain the locations of the non-matching segments of the string (the inverse of the matching locations).
  • Pattern Column where we cover how to match a vector of strings against a vector of patterns of the same size. This is often needed when we wish to locate the value of a column in another column for each row.
  • Multiple Patterns where we cover how to extend any of the above scenarios to check against multiple patterns at a time.

Locating a pattern in a string is often followed by additional steps to achieve a goal. For instance, a common next step after locating patterns in a string is to substitute them with other values. We cover that in Substituting.

First Match

Substring

We wish to obtain the location of the first match of a substring in a string.

In this example, we wish to create a new column col_2 where each row value holds the location of the first character of the first match of the string 'gm' in the corresponding string value of the column col_1.

df_2 = df %>%
  mutate(col_2 = str_locate(col_1, fixed('gm'))[, 'start'])

Here is how this works:

  • We use the str_locate() function from the Stringr package (part of the tidyverse) to identify the location of the first match of the substring 'gm' in each value of the column col_1.
  • The function str_locate() takes as input:
    • A string value or a vector of string values to look into which in this case is col_1.
    • A pattern to look for which in this case is the substring 'gm'. We wrap the substring in the helper fixed() because by default str_locate() assumes that the pattern passed is a regular expression. fixed() specifies that the pattern is a fixed string.
  • The output of str_locate() is a matrix with as many rows as the size of the first input which in this case is the column col_1 and two columns; start and end holding the location of the first character and the last character of the first match of the given pattern in the string corresponding to the current row.
  • The output of a string locating function is the integer indices of the match of the pattern in a given string.
    • In R indexing is 1-based. For instance, a match that occurs at the very beginning of the string would have a start of 1.
    • The end index is that of the last character of the match in the given string (inclusive).
  • If there is no match, the value of start and end will be NA for that particular row of the output matrix.
  • In [, 'start'], we select the first column, which holds the start location of the substring 'gm' in the corresponding value of col_1. Alternatively, we could use [, 1] to refer to the first column of the matrix.
  • The output is a new data frame df_2 that is a copy of df with an additional column col_2 containing the starting position of the first occurrence of the pattern 'gm' in each element of col_1.
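
To make the shape of the output concrete, here is a minimal sketch on a small made-up vector. The values in x are assumptions chosen for illustration; they do not come from the data frame df used in this section.

```r
library(stringr)

# Hypothetical input values, chosen only for illustration
x <- c('10gm', 'gm20gm', 'none')

# One row per input string, with columns 'start' and 'end'
locs <- str_locate(x, fixed('gm'))

# Keep only the start locations; a string without a match yields NA
starts <- locs[, 'start']
starts
# expected: 3 1 NA
```

Note that the second string contains two occurrences of 'gm', yet only the first one (starting at 1) is reported.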

Alternative: Convert to Data Frame then Pull

df_2 = df %>%
  mutate(col_2 = str_locate(col_1, 'gm') %>% 
           as_tibble() %>% 
           pull(start))

Here is how this works:

  • Having gotten accustomed to working with data frames (tibbles), working with a matrix in R may feel awkward. In this solution, we present an approach that uses data frames.
  • An alternative approach to obtaining the start location of the first occurrence of a pattern in a string is:
    • Convert the matrix produced by str_locate(col_1, 'gm') to a tibble via as_tibble()
    • then extract the start column via pull(start)
    • then assign that to the new column col_2
  • The output is the same as in the primary solution above.

Regular Expression

We wish to obtain the location of the first match of a regular expression in a string.

In this example, we wish to create two new columns start and end where each row value holds the location of the first character and the last character, respectively, of the first match of a sequence of integers in the corresponding string value of the column col_1.

df_2 = df %>%
  mutate(as_tibble(str_locate(col_1, '\\d+')))

Here is how this works:

  • We use the str_locate() function from the Stringr package (part of the tidyverse) to identify the start and end locations of the first match of the regular expression '\\d+' in each value of the column col_1.
  • The function str_locate() takes as input:
    • A string value or a vector of string values to look into which in this case is col_1.
    • A regular expression to look for, which in this case is '\\d+' (one or more consecutive digits).
  • The output of str_locate() is a matrix with as many rows as the size of the first input which in this case is the column col_1 and two columns; start and end holding the location of the first character and the last character of the first match of the given regular expression in the string corresponding to the current row.
  • If there is no match, the value of start and end will be NA for that particular row of the output matrix.
  • Usually, when matching regular expressions, we need to hold on to both the start and end locations because the number of characters matched may vary; in this case, the digit sequences captured may be of different lengths.
  • The matrix resulting from str_locate() is passed to as_tibble(), which converts the result to a tibble (a more functional data frame extension in the tidyverse).
  • Since we do not assign the data frame resulting from as_tibble() to any column name, mutate() expands it creating two new columns start and end. See Multi-Value Transformation.
  • The resulting data frame df_2 will have the same number of rows as the original data frame df, with two additional start and end columns containing the locations of the first and last characters, respectively, of the first sequence of digits in the corresponding value of the column col_1.
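
As a rough illustration of why both locations are worth keeping, the sketch below uses a few made-up strings (an assumption, not data from this section) in which the digit sequences have different lengths.

```r
library(stringr)

# Hypothetical input values, chosen only for illustration
x <- c('ab123cd', '7 apples', 'no digits')

locs <- str_locate(x, '\\d+')

# Number of characters matched; 'end' is inclusive, hence the + 1
width <- locs[, 'end'] - locs[, 'start'] + 1
width
# expected: 3 1 NA

# The two locations are enough to recover the matched text
matched <- str_sub(x, locs[, 'start'], locs[, 'end'])
matched
# expected: "123" "7" NA
```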

All Matches

Substring

We wish to obtain the locations of all matches of a given substring in a given string.

In this example, we wish to compute the average separation (in number of characters) between occurrences of the pattern ‘gm’ in each value of the column col_1.

df_2 = df %>% 
  mutate(result = str_locate_all(col_1, fixed('gm'))) %>% 
  mutate(
    avg_sep = 
      map_dbl(result,
              # characters between matches: next start - this end - 1
              ~ (lead(.x[, 'start']) - .x[, 'end'] - 1) %>% 
                mean(na.rm=TRUE))) %>% 
  select(-result)

Here is how this works:

  • We use the str_locate_all() function from the Stringr package (part of the tidyverse) to identify the locations of all matches of the substring 'gm' in each value of the column col_1.
  • str_locate_all() returns a list of matrices where:
    • Each matrix has two columns start and end and as many rows as there are matches of the pattern (which in this case is ‘gm’) in the corresponding value for the column col_1
    • The list has as many matrices as there are elements in the input vector which in this case is col_1
  • In the first call to mutate() we create a new column result where each cell holds the matrix of start and end locations returned by str_locate_all() for the value of col_1 for the current row.
  • In the second call to mutate() we compute the average separation between occurrences by:
    • Subtracting the end location of each occurrence .x[, 'end'] from the start position of the next occurrence lead(.x[, 'start']) and then taking the mean.
    • Using map_dbl() (from the map family of functions from the tidyverse package purrr) to apply this computation to each matrix stored in the column result. See Non-Vectorized Transformation and Working with Lists.
  • The output data frame df_2 will be a copy of the input data frame df with an added column avg_sep holding the average number of characters between occurrences of the substring ‘gm’ for the corresponding value of the column col_1.
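
The arithmetic can be checked by hand on a single made-up string (an assumption for illustration). The sketch uses plain matrix indexing instead of lead() so that it only needs stringr.

```r
library(stringr)

# Hypothetical input: 'gm' occurs at positions 1-2, 5-6 and 8-9
x <- 'gm__gm_gm'

# str_locate_all() returns a list; take the matrix for the only string
m <- str_locate_all(x, fixed('gm'))[[1]]

# Start of each following match minus end of the preceding match
diffs <- m[-1, 'start'] - m[-nrow(m), 'end']
diffs
# expected: 3 2

# Number of characters strictly between consecutive matches
gaps <- diffs - 1
mean(gaps)
# expected: 1.5
```

The raw difference between a start and the preceding end is one larger than the number of characters that sit between the two matches, because both locations are inclusive.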

Regular Expression

We wish to obtain the locations of all matches of a regular expression in a string.

In this example, we wish to compute the average separation (in number of characters) between occurrences of a sequence of integers in each value of the column col_1.

df_2 = df %>% 
  mutate(result = str_locate_all(col_1, '\\d+')) %>% 
  mutate(
    avg_sep = 
      map_dbl(result,
              # characters between matches: next start - this end - 1
              ~ (lead(.x[, 'start']) - .x[, 'end'] - 1) %>% 
                mean(na.rm=TRUE))) %>% 
  select(-result)

Here is how this works:

  • This works just like the Substring case above except that we pass a regular expression to str_locate_all(). Note that str_locate_all() expects a regular expression by default.
  • The output data frame df_2 will be a copy of the input data frame df with an added column avg_sep holding the average number of characters between numbers (digit sequences) for the corresponding value of the column col_1.

Last Match

Substring

We wish to obtain the location of the last match of a substring in a string.

In this example, we wish to create a new column col_2 where each row value holds the location of the first character of the last match of the string 'gm' in the corresponding string value of the column col_1.

library(stringi)

df_2 = df %>%
  mutate(
    col_2 = stri_locate_last(col_1, fixed='gm')[, 'start'])

Here is how this works:

  • There is no function for the last match in Stringr. While we could use str_locate_all() and pick the last match (see alternative below), we think it is easier to use the stri_locate_last() function from the Stringi package (which is the package underlying many of the Stringr functions).
  • We pass to stri_locate_last() :
    • The column whose values we wish to locate matches in; which in this case is col_1.
    • The substring that we wish to locate which we pass to the fixed argument of stri_locate_last(); in this case fixed='gm'.
  • The output is a matrix with as many rows as the length of the column col_1 and two columns start and end holding the start location and end location of the last match of the substring ‘gm’ in the corresponding value of the column col_1. For a bit more detail, see First Match above.
  • In [, 'start'], we extract the start column from the matrix returned by stri_locate_last() which is then assigned to the column col_2 by mutate().
  • To make the match case-insensitive, we can additionally pass case_insensitive=TRUE to stri_locate_last().
  • The output is a new data frame df_2 that is a copy of df with an additional column col_2 containing the starting position of the last occurrence of the pattern 'gm' in each element of col_1.
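
The following sketch shows the behaviour on a few made-up strings (assumed values, not taken from df), with and without the case_insensitive option.

```r
library(stringi)

# Hypothetical input values, chosen only for illustration
x <- c('gm20gm', '10GM', 'none')

# Case-sensitive: only the first string contains 'gm'
last_cs <- stri_locate_last(x, fixed = 'gm')
last_cs[, 'start']
# expected: 5 NA NA

# Case-insensitive: 'GM' in the second string now matches as well
last_ci <- stri_locate_last(x, fixed = 'gm', case_insensitive = TRUE)
last_ci[, 'start']
# expected: 5 3 NA
```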

Regular Expression

We wish to obtain the location of the last match of a regular expression in a string.

In this example, we wish to create two new columns start and end where each row value holds the location of the first character and the last character, respectively, of the last match of a sequence of integers in the corresponding string value of the column col_1.

df_2 = df %>%
  mutate(as_tibble(stri_locate_last(col_1, regex='\\d+')))

Here is how this works:

  • This works just like the Substring case above except that we pass a regular expression to the regex argument of stri_locate_last().
  • The resulting data frame df_2 will have the same number of rows as the original data frame df, with two additional start and end columns containing the locations of the first and last characters, respectively, of the last sequence of digits in the corresponding value of the column col_1.

Alternative: Locate All and Return Last

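df_2 = df %>% 
  mutate(map_dfr(str_locate_all(col_1, '\\d+'), 
                 # return a row of NAs when there is no match
                 ~ if (nrow(.x) > 0) .x[nrow(.x), ] 
                   else c(start = NA_integer_, end = NA_integer_)))

Here is how this works:

  • We use str_locate_all() to locate all matches. See All Matches above.
  • We then use map_dfr() to iterate over the matrices returned by str_locate_all() for the values of the column col_1 and, for each, extract the last row via .x[nrow(.x), ]. If a value has no match, its matrix has zero rows, so we return a row of NA values instead; this way every value of col_1 yields exactly one row. See Non-Vectorized Transformation and Working with Lists.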
  • Since we do not assign the data frame resulting from map_dfr() to any column name, mutate() expands it creating two new columns start and end. See Multi-Value Transformation.
  • The output is the same as the primary solution above.

nth Match

We wish to obtain the location of the nth match of a pattern in a string.

In this example, we wish to create two new columns start and end where each row value holds the location of the first character and the last character, respectively, of the nth match of a sequence of integers in the corresponding string value of the column col_1.

get_nth_match <- function(match_matrix, n) {
  if (between(n, 1, nrow(match_matrix))) {
    return(match_matrix[n, ])
  } else {
    return(c(start=NA, end=NA))
  }
}

str_locate_nth <- function(p_col, p_pat, p_n) {
  map_dfr(str_locate_all(p_col, p_pat), 
          get_nth_match, n = p_n)
}

df_2 = df %>%
  mutate(str_locate_nth(col_1, '\\d+', 2))

Here is how this works:

  • At a high level, to locate the nth occurrence of a pattern in a string, we use str_locate_all() to locate all occurrences and then extract the location of the nth occurrence.
  • We create a custom function str_locate_nth() to which we pass:
    • The column whose values we wish to look into; which in this case is col_1.
    • The pattern we wish to locate; which in this case is '\\d+', which matches one or more consecutive digits.
    • The index of the occurrence we wish to locate; which in this case is 2.
  • The custom function str_locate_nth() operates as follows:
    • It first calls str_locate_all() to obtain a matrix for each value of the column col_1 where there is one row for each occurrence and two columns start and end. See All Matches above for a description of str_locate_all().
    • It then calls another custom function get_nth_match() which uses the function between() (covered in Numerical Operations) to check if the index requested is within the number of occurrences and if so returns the corresponding row in the match matrix. If not, it returns a row of NA values (with the named elements start and end).
    • We use map_dfr() to iterate over the matrices returned by str_locate_all() and return a row of two values, the start and end locations, of the nth occurrence. See Working with Lists.
  • Since we do not assign the data frame resulting from map_dfr() to any column name, mutate() expands it creating two new columns start and end. See Multi-Value Transformation.
  • While in this example, the pattern we are locating is a regular expression, we can just as well use a plain string. We simply pass that to str_locate_all(). See Substring under All Matches above.
  • The resulting data frame df_2 will have the same number of rows as the original data frame df, with two additional start and end columns containing the locations of the first and last characters, respectively, of the 2nd sequence of digits in the corresponding value of the column col_1.
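
The same idea can be tried outside of a data frame. The sketch below is a condensed variant of the approach, applied to a made-up vector; the vector x and the variable n are assumptions for illustration, and map_dfr() needs the dplyr package to be installed.

```r
library(stringr)
library(purrr)

# Hypothetical input values, chosen only for illustration
x <- c('a1b22c333', 'only 7', 'none')
n <- 2

second <- map_dfr(
  str_locate_all(x, '\\d+'),
  # Fall back to NA when a string has fewer than n matches
  ~ if (nrow(.x) >= n) .x[n, ] else c(start = NA_integer_, end = NA_integer_)
)

second$start
# expected: 4 NA NA
second$end
# expected: 5 NA NA
```

Only the first string has a second digit sequence ('22' at positions 4 to 5); the other two strings produce NA for both columns.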

Ignore Case

Substring

We wish to obtain the location of the first match of a substring in a string irrespective of whether the letters are in upper or lower case; i.e. while ignoring case.

In this example, we wish to create a new column col_2 where each row value holds the location of the first character of the first match regardless of case, of the string 'gm' in the corresponding string value of the column col_1.

df_2 = df %>% 
  mutate(
    col_2 = str_locate(col_1, fixed('gm', ignore_case=TRUE))[, 'start']
  )

Here is how this works:

  • To locate the first occurrence of a given substring in a given string, we use str_locate() as described in First Match above.
  • To ignore the case while matching, we wrap the substring in fixed() and pass the parameter ignore_case=TRUE. See Ignore Case under Detecting for more details.

Regular Expression

We wish to obtain the location of the first match of a regular expression in a string irrespective of the case.

In this example, we wish to create two new columns start and end where each row value holds the location of the first character and the last character, respectively, of the first match of a sequence of integers followed by the characters 'gm', regardless of case, in the corresponding string value of the column col_1.

df_2 = df %>%
  mutate(
    as_tibble(
      str_locate(col_1, regex('\\d+gm', ignore_case=TRUE))
    ))

Here is how this works:

  • To locate the first occurrence of a given regular expression in a given string, we use str_locate() as described in First Match above.
  • To ignore the case while matching, we wrap the regular expression in regex() and pass the parameter ignore_case=TRUE. See Ignore Case under Detecting for more details.
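
A quick way to see the effect of ignore_case is to run both variants on a few made-up strings (assumed values for illustration):

```r
library(stringr)

# Hypothetical input values, chosen only for illustration
x <- c('10GM', '5 Gm and 20gm', 'gm only')

# Substring: the first 'gm' in any case
sub_start <- str_locate(x, fixed('gm', ignore_case = TRUE))[, 'start']
sub_start
# expected: 3 3 1

# Regular expression: digits immediately followed by 'gm' in any case
re_locs <- str_locate(x, regex('\\d+gm', ignore_case = TRUE))
re_locs[, 'start']
# expected: 1 10 NA
re_locs[, 'end']
# expected: 4 13 NA
```

In the second string, '5' is followed by a space, so the regular expression only matches '20gm' at positions 10 to 13.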

Complement

In some situations, we wish to obtain the locations of the non-matching segments of the string.

All Matches

We wish to obtain the locations of non-matching segments of a string after locating all occurrences of a pattern in a string.

In this example, we wish to compute the average length (in number of characters) of non-integer character sequences in each value of the column col_1.

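df_2 = df %>% 
  mutate(result = 
           map(str_locate_all(col_1, '\\d+'), invert_match), 
         avg_len = 
           map2_dbl(col_1, result, ~ {
             # lengths of the non-matching segments
             len <- str_length(str_sub(.x, .y[, 'start'], .y[, 'end']))
             # drop the empty segments at the string edges
             mean(len[len > 0])
           })) %>% 
  select(-result)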

Here is how this works:

  • At a high level, the approach is to locate all occurrences of integer sequences and then “invert” that to obtain the locations of the other character sequences in each value of the column col_1.
  • We use str_locate_all(col_1, '\\d+') to locate all integer sequences captured by the regular expression '\\d+'. The output is a list with one matrix per value of the column col_1; each matrix has two columns start and end and as many rows as the number of occurrences of integer sequences in the corresponding value. See All Matches above for a detailed description.
  • We then use map() to iterate over each matrix produced by str_locate_all() and apply the function invert_match() which conveniently returns a matrix of a similar structure but with the start and end locations capturing the inverse. See Working with Lists for coverage of the map() family of functions.
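
To see what invert_match() produces, here is a small sketch on a single made-up string (an assumption for illustration). The inverted locations are intended to be passed to str_sub(); when the string starts or ends with a match, the first or last segment is empty.

```r
library(stringr)

# Hypothetical input that starts and ends with a digit sequence
x <- '1 and 2 and 4 and 456'

num_loc  <- str_locate_all(x, '\\d+')[[1]]
text_loc <- invert_match(num_loc)

# Extract the non-matching segments using the inverted locations
segments <- str_sub(x, text_loc[, 'start'], text_loc[, 'end'])
segments
# expected: "" " and " " and " " and " ""

# Average length of the non-empty segments
lens <- str_length(segments)
mean(lens[lens > 0])
# expected: 5
```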

One Match

We wish to obtain the locations of non-matching segments of a string after locating the first occurrence (or one occurrence) of a pattern in a string.

match_locs = str_locate(df$col_1, '\\d+')
inv_match_locs = 
  map(1:nrow(match_locs),
      ~invert_match(match_locs[.x, , drop = FALSE]))

Here is how this works:

  • invert_match() is designed to work with the output of str_locate_all() not the output of str_locate().
  • To invert the output of str_locate(), we need to:
    • Iterate over the rows of the matrix returned by str_locate() which we do via map() iterating over a counter of as many values as there are rows to simulate a loop
    • Extract each row as a matrix which we do via match_locs[.x, , drop = FALSE]. The argument drop = FALSE specifies that a matrix should be returned and not a vector.
    • Pass the extracted matrix to invert_match().

Pattern Column

We wish to match a vector of strings against a vector of patterns of the same size. This is often needed when we wish to locate the value of a column in the value of another column for each row.

Substring

In this example, we have a data frame df with two columns col_1 and col_2, and we wish to locate the start of the first occurrence of the value of col_2 in col_1.

df_2 = df %>%
  mutate(start = str_locate(col_1, fixed(col_2))[, 'start'])

Here is how this works:

  • str_locate() and str_locate_all() are vectorized over both the string and the pattern and can operate in one of three modes:
    • Check for one pattern in each element in a vector of strings. This is the mode we used in all the above scenarios.
    • Check n patterns against n strings. The two vectors must be the same size for this mode; recent versions of stringr recycle only vectors of size one. This is the mode we use in this solution since we have a vector of strings, the column col_1, and a vector of patterns of the same size, the column col_2.
    • Check multiple patterns against a single string. This is the mode we use in Multiple Patterns below.
  • We pass to str_locate():
    • the strings to look into which in this case is the column col_1 as the first argument
    • and the patterns to look for which in this case is the column col_2 and is naturally of the same size as col_1.
  • str_locate() will return a matrix with two columns start and end and one row for each row in the data frame df holding the start and end locations of the value of the column col_2 in the corresponding value of the column col_1.
  • See First Match above for a more detailed description of this code. We can extend the solutions presented in All Matches, Last Match, and nth Match in the same way.
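
The element-wise pairing of strings and patterns can be seen on two made-up vectors of the same size (assumed values for illustration):

```r
library(stringr)

# Hypothetical strings and the pattern to look for in each of them
strings  <- c('apple pie', 'banana', 'cherry')
patterns <- c('pie', 'na', 'x')

# The i-th pattern is located in the i-th string
starts <- str_locate(strings, fixed(patterns))[, 'start']
starts
# expected: 7 3 NA
```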

Regular Expression

In this example, we have a data frame df with two columns col_1 and col_2, and we wish to locate the start and end of the first occurrence of one or more consecutive repetitions of the value of col_2 in col_1.

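df_2 = df %>%
  mutate(as_tibble(str_locate(col_1, str_c('(', col_2, ')+'))))

Here is how this works:

  • This code works similarly to the substring scenario above except that in this case the pattern is a regular expression.
  • We construct the regular expression by wrapping each value of the column col_2 in a group and appending '+' via str_c('(', col_2, ')+'). The group makes '+' apply to the whole value rather than only to its last character, so the regular expression matches one or more consecutive occurrences of the value of col_2. This assumes that the values of col_2 contain no regular expression special characters.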
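
When the value being repeated has more than one character, it matters whether the quantifier '+' applies to the whole value or only to its last character. A minimal sketch on a made-up string (an assumption for illustration):

```r
library(stringr)

# Hypothetical input: 'ab' repeated three times, starting at position 3
x <- 'xxababab'

# '+' applies only to 'b': matches a single 'ab'
ungrouped <- str_locate(x, 'ab+')
ungrouped
# expected: start 3, end 4

# '+' applies to the group 'ab': matches the whole run
grouped <- str_locate(x, '(ab)+')
grouped
# expected: start 3, end 8
```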

Multiple Patterns

We will cover four scenarios of locating multiple patterns:

  • First Match of Any Pattern where we cover how to obtain the start and end locations of the first occurrence of the first occurring pattern of n patterns.
  • First Match of Each Pattern where we cover how to obtain the start and end locations of the first occurrence of each pattern of n patterns.
  • All Matches in One Matrix where we cover how to obtain the start and end locations of all occurrences of all patterns in the order of their occurrence as a single matrix.
  • All Matches in Separate Matrices where we cover how to obtain the start and end locations of all occurrences of all patterns keeping the occurrences of each pattern in a separate matrix.

Note: In the following examples, the output of string location will be left as a column of nested matrices. Typically, we proceed to do something with those nested matrices, e.g. substitute the matches with a character value, which we cover in Substituting, or more generally operate on them via List Operations.

First Match of Any Pattern

We wish to return the start and end locations of the first occurrence of the first occurring pattern of n patterns. In other words, only one location is returned and that is of whichever pattern occurs first.

df_2 = df %>%
   mutate(as_tibble(str_locate(col_1, '\\d+|gm')))

Here is how this works:

  • We use the or operator | of regular expressions to build a regular expression that captures all the patterns we wish to look for or’ed together. In this case that regular expression is '\\d+|gm'.
  • For each value of col_1, str_locate() will return the start and end locations of whichever pattern occurs first (either '\\d+' or 'gm').
  • The output data frame df_2 will be a copy of the input data frame df with two added columns start and end.
  • See Regular Expression under First Match above for a more detailed description.

First Match of Each Pattern

We wish to return the start and end locations of the first occurrence of each pattern of n patterns. In other words, n locations are returned one for each pattern.

df_2 = df %>% 
  mutate(result = 
           map(col_1, ~str_locate(.x, c('\\d+', 'gm'))))

Here is how this works:

  • We pass to str_locate() a vector of patterns c('\\d+', 'gm').
  • For each value of col_1, str_locate() will return a matrix with two columns start and end holding the start and end locations and one row for the first occurrence of each pattern; in this case two rows for the patterns '\\d+' and 'gm'.
  • In this solution, we use map() to iterate over each value of the column col_1 and pass that to str_locate() along with the vector of multiple patterns. See Non-Vectorized Transformation.
  • The output data frame df_2 will be a copy of the input data frame df with an added column result where each cell holds the matrix returned by str_locate() for the corresponding value of col_1.

All Matches in One Matrix

We wish to return the start and end locations of all occurrences of all patterns in the order of their occurrence as a single matrix.

df_2 = df %>%
   mutate(result = str_locate_all(col_1, '\\d+|gm'))

Here is how this works:

  • We use the or operator | of regular expressions to build a regular expression that captures all the patterns we wish to look for or’ed together. In this case that regular expression is '\\d+|gm'.
  • For each value of col_1, str_locate_all() will return a matrix with two columns start and end holding the start and end locations and one row for each occurrence of each pattern.
  • The output data frame df_2 will be a copy of the input data frame df with an added column result where each cell holds the matrix returned by str_locate_all() for the corresponding value of col_1.

All Matches in Separate Matrices

We wish to return the start and end locations of all occurrences of all patterns keeping the occurrences of each pattern in a separate matrix.

df_2 = df %>% 
  mutate(result = 
           map(col_1, ~str_locate_all(.x, c('\\d+', 'gm'))))

Here is how this works:

  • We pass to str_locate_all() a vector of patterns c('\\d+', 'gm').
  • For each value of col_1, str_locate_all() will return a list of matrices:
    • As many matrices as there are patterns which in this case is 2.
    • Each matrix will have two columns start and end holding the start and end locations and one row for each occurrence of the corresponding pattern.
  • In this solution, we use map() to iterate over each value of the column col_1 and pass that to str_locate_all() along with the vector of multiple patterns. See Non-Vectorized Transformation.
  • The output data frame df_2 will be a copy of the input data frame df with an added column result where each cell holds the list of matrices returned by str_locate_all() for the corresponding value of col_1.
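
The four multiple-pattern scenarios can be compared side by side on a single made-up string. The string x and the variable names below are assumptions for illustration.

```r
library(stringr)

# Hypothetical input: 'gm' at 1-2 and 5-6, digits at 3-4 and 7-9
x <- 'gm10gm200'
patterns <- c('\\d+', 'gm')

# First match of any pattern: one row, here for 'gm' at 1-2
first_any <- str_locate(x, '\\d+|gm')

# First match of each pattern: one row per pattern (3-4, then 1-2)
first_each <- str_locate(x, patterns)

# All matches in one matrix, in order of occurrence (four rows)
all_one <- str_locate_all(x, '\\d+|gm')[[1]]

# All matches in separate matrices: a list with one matrix per pattern
all_sep <- str_locate_all(x, patterns)
```

Here first_each has its rows in the order of the patterns rather than the order of occurrence, while all_one interleaves the matches of both patterns by position.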