We wish to obtain the start and end locations (as integers) of a given pattern in a target string.
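In this section we will cover the following four common string pattern location scenarios:
- First Match: locate the first occurrence of a pattern.
- All Matches: locate every occurrence of a pattern.
- Last Match: locate the last occurrence of a pattern.
- Nth Match: locate the nth occurrence of a pattern.
For each of these four scenarios, we will cover two cases:
- Substring: the pattern is a fixed string.
- Regular Expression: the pattern is a regular expression.
In addition, we cover the following scenarios which can be applied to extend any of the above:
- Ignore Case: locate a pattern regardless of letter case.
- Complement: locate the segments of a string that do not match a pattern.
- Pattern Column: match a vector of strings against a vector of patterns of the same size.
- Multiple Patterns: locate several patterns at once.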
Locating a pattern in a string is often followed by additional steps to achieve a goal. For instance, a common next step after locating patterns in a string is to substitute them with other values. We cover that in Substituting.
First Match
Substring
We wish to obtain the location of the first match of a substring in a string.
In this example, we wish to create a new column col_2 where each row value holds the location of the first character of the first match of the string 'gm' in the corresponding string value of the column col_1.
library(tidyverse)
df_2 = df %>%
  mutate(col_2 = str_locate(col_1, fixed('gm'))[, 'start'])
Here is how this works:
- We use the str_locate() function from the stringr package (part of the tidyverse) to identify the location of the first match of the substring 'gm' in each value of the column col_1.
- str_locate() takes as input:
  - The string to search, which in this case is the column col_1.
  - The pattern to look for, which in this case is the substring 'gm'. We wrap the substring in the helper fixed() because by default str_locate() assumes that the pattern passed is a regular expression. fixed() specifies that the pattern is a fixed string.
- The output of str_locate() is a matrix with as many rows as the size of the first input, which in this case is the column col_1, and two columns, start and end, holding the location of the first character and the last character of the first match of the given pattern in the string corresponding to the current row.
  - Locations are 1-based, so the first character of a string has a start of 1.
  - The end index is that of the last character of the match in the given string (inclusive).
  - If no match is found, start and end will be NA for that particular row of the output matrix.
- Via [, 'start'], we select the first column holding the start location of the substring 'gm' in the corresponding value of col_1. Alternatively, we could use [, 1] to refer to the first column of the matrix.
- The output is a new data frame df_2 that is a copy of df with an additional column col_2 containing the starting position of the first occurrence of the pattern 'gm' in each element of col_1.
Alternative: Convert to Data Frame then Pull
df_2 = df %>%
  mutate(col_2 = str_locate(col_1, 'gm') %>%
           as_tibble() %>%
           pull(start))
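Here is how this works:
- We convert the matrix returned by str_locate(col_1, 'gm') to a tibble via as_tibble().
- We then extract the start column of that tibble as a vector via pull(start).
- mutate() assigns that vector to the new column col_2.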
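As a minimal sketch of the two approaches above, the following runs them on a small made-up data frame (the sample values and the extra column col_3 are assumptions chosen for illustration) and also keeps the full matrix returned by str_locate() so that its start and end columns can be inspected:

```r
library(dplyr)
library(tibble)
library(stringr)

# Hypothetical sample data; the values are made up for illustration.
df <- tibble(col_1 = c("12gm of 30gm", "no match", "gm"))

# Full matrix: one row per string, with columns start and end.
locs <- str_locate(df$col_1, fixed("gm"))

df_2 <- df %>%
  mutate(
    # Approach 1: index the matrix column directly.
    col_2 = str_locate(col_1, fixed("gm"))[, "start"],
    # Approach 2: convert to a tibble, then pull the start column.
    col_3 = str_locate(col_1, fixed("gm")) %>% as_tibble() %>% pull(start))

df_2$col_2  # 3 NA 1
```

Both columns should hold the same values; the second string has no match, so its location is NA.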
Regular Expression
We wish to obtain the location of the first match of a regular expression in a string.
In this example, we wish to create two new columns start and end where each row value holds the location of the first character and the last character, respectively, of the first match of a sequence of integers in the corresponding string value of the column col_1.
df_2 = df %>%
  mutate(as_tibble(str_locate(col_1, '\\d+')))
Here is how this works:
- We use the str_locate() function from the stringr package (part of the tidyverse) to identify the start and end locations of the first match of the regular expression '\\d+' in each value of the column col_1.
- str_locate() takes as input:
  - The string to search, which in this case is the column col_1.
  - The pattern to look for, which in this case is the regular expression '\\d+', which matches one or more consecutive digits.
- The output of str_locate() is a matrix with as many rows as the size of the first input, which in this case is the column col_1, and two columns, start and end, holding the location of the first character and the last character of the first match of the given regular expression in the string corresponding to the current row.
  - If no match is found, start and end will be NA for that particular row of the output matrix.
  - With a regular expression, we typically need both the start and end locations because the number of characters matched may vary; in this case, the digit sequences captured may be of different lengths.
- The output of str_locate() is passed to as_tibble(), which converts the result to a tibble (the tidyverse extension of the data frame).
- Because we do not assign the output of as_tibble() to any column name, mutate() expands it, creating two new columns start and end. See Multi-Value Transformation.
- The output data frame df_2 will have the same number of rows as the original data frame df, with two additional columns, start and end, containing the locations of the first and last characters, respectively, of the first sequence of digits in the corresponding value of the column col_1.
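To make the column expansion concrete, here is a small sketch on made-up data (the sample strings are assumptions for illustration). It shows that passing an unnamed tibble to mutate() adds one column per column of the tibble:

```r
library(dplyr)
library(tibble)
library(stringr)

# Hypothetical sample data; the values are made up for illustration.
df <- tibble(col_1 = c("abc 123 de 45", "7 gm", "none"))

df_2 <- df %>%
  mutate(as_tibble(str_locate(col_1, "\\d+")))

names(df_2)  # "col_1" "start" "end"
df_2$start   # 5 1 NA
df_2$end     # 7 1 NA
```

The first string matches '123' at characters 5 to 7, the second matches the single digit at character 1, and the third has no digits, so both of its locations are NA.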
All Matches
Substring
We wish to obtain the locations of all matches of a given substring in a given string.
In this example, we wish to compute the average separation (in number of characters) between occurrences of the pattern 'gm' in each value of the column col_1.
df_2 = df %>%
  mutate(result = str_locate_all(col_1, fixed('gm'))) %>%
  mutate(
    avg_sep =
      map_dbl(result,
              ~ (lead(.x[, 'start']) - .x[, 'end']) %>%
                mean(na.rm = TRUE))) %>%
  select(-result)
Here is how this works:
- We use the str_locate_all() function from the stringr package (part of the tidyverse) to identify the locations of all matches of the substring 'gm' in each value of the column col_1.
- str_locate_all() returns a list of matrices where:
  - Each matrix has two columns, start and end, and as many rows as there are matches of the pattern (which in this case is 'gm') in the corresponding value of the column col_1.
  - There is one matrix for each value of the column col_1.
- In the first mutate() we create a new column result where each cell holds the matrix of start and end locations returned by str_locate_all() for the value of col_1 for the current row.
- In the second mutate() we compute the average separation between occurrences by:
  - Subtracting the end position of the current occurrence .x[, 'end'] from the start position of the next occurrence lead(.x[, 'start']) and then taking the mean. Note that this difference is one more than the number of characters lying strictly between two consecutive matches.
  - Using map_dbl() (from the map family of functions from the tidyverse package purrr) to iterate over each matrix stored in the column result. See Non-Vectorized Transformation and Working with Lists.
- The output data frame df_2 will be a copy of the input data frame df with an added column avg_sep holding the average separation between occurrences of the substring 'gm' for the corresponding value of the column col_1.
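The computation above can be traced on a small made-up example (the sample strings are assumptions for illustration). The sketch below is a compact variant of the same pipeline:

```r
library(dplyr)
library(purrr)
library(stringr)

# Hypothetical sample data; the values are made up for illustration.
df <- tibble(col_1 = c("1gm 22gm 333gm", "5gm"))

df_2 <- df %>%
  mutate(result = str_locate_all(col_1, fixed("gm"))) %>%
  mutate(
    avg_sep = map_dbl(
      result,
      # start of the next match minus end of the current match
      ~ mean(lead(.x[, "start"]) - .x[, "end"], na.rm = TRUE))) %>%
  select(-result)

df_2$avg_sep  # 4.5 NaN
```

In the first string the matches are at (2, 3), (7, 8) and (13, 14), so the differences are 7 - 3 = 4 and 13 - 8 = 5, with a mean of 4.5. The second string has a single match, so there is no pair to measure and the mean of an empty set of values is NaN.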
Regular Expression
We wish to obtain the locations of all matches of a regular expression in a string.
In this example, we wish to compute the average separation (in number of characters) between occurrences of a sequence of integers in each value of the column col_1.
df_2 = df %>%
  mutate(result = str_locate_all(col_1, '\\d+')) %>%
  mutate(
    avg_sep =
      map_dbl(result,
              ~ (lead(.x[, 'start']) - .x[, 'end']) %>%
                mean(na.rm = TRUE))) %>%
  select(-result)
Here is how this works:
- This works similarly to the Substring scenario above except that we pass the regular expression '\\d+' directly to str_locate_all(). Note that str_locate_all() expects a regular expression by default.
- The output data frame df_2 will be a copy of the input data frame df with an added column avg_sep holding the average separation between numbers (digit sequences) for the corresponding value of the column col_1.
Last Match
Substring
We wish to obtain the location of the last match of a substring in a string.
In this example, we wish to create a new column col_2 where each row value holds the location of the first character of the last match of the string 'gm' in the corresponding string value of the column col_1.
library(stringi)
df_2 = df %>%
  mutate(
    col_2 = stri_locate_last(col_1, fixed = 'gm')[, 'start'])
Here is how this works:
- There is no dedicated function for locating the last match in stringr. While we could use str_locate_all() and pick the last match (see alternative below), we think it is easier to use the stri_locate_last() function from the stringi package (which is the package underlying many of the stringr functions).
- We pass to stri_locate_last():
  - The string to search, which in this case is the column col_1.
  - The substring to look for, via the fixed argument of stri_locate_last(); in this case fixed='gm'.
- The output is a matrix with as many rows as the size of col_1 and two columns start and end holding the start location and end location of the last match of the substring 'gm' in the corresponding value of the column col_1. For a bit more detail, see First Match above.
- Via [, 'start'], we extract the start column from the matrix returned by stri_locate_last(), which is then assigned to the column col_2 by mutate().
- To ignore case, we can set the argument case_insensitive, i.e. pass case_insensitive=TRUE.
- The output is a new data frame df_2 that is a copy of df with an additional column col_2 containing the starting position of the last occurrence of the pattern 'gm' in each element of col_1.
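The following sketch tries stri_locate_last() on made-up data, with and without case_insensitive=TRUE (the sample strings and the extra column col_3 are assumptions for illustration):

```r
library(dplyr)
library(stringi)

# Hypothetical sample data; the values are made up for illustration.
df <- tibble(col_1 = c("12gm of 30gm", "GM only", "none"))

df_2 <- df %>%
  mutate(
    # Case-sensitive search for the last 'gm'.
    col_2 = stri_locate_last(col_1, fixed = "gm")[, "start"],
    # Same search ignoring case; the option is forwarded to the fixed-pattern search.
    col_3 = stri_locate_last(col_1, fixed = "gm",
                             case_insensitive = TRUE)[, "start"])

df_2$col_2  # 11 NA NA
df_2$col_3  # 11  1 NA
```

The last 'gm' in the first string starts at character 11. The second string only matches when case is ignored.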
Regular Expression
We wish to obtain the location of the last match of a regular expression in a string.
In this example, we wish to create two new columns start and end where each row value holds the location of the first character and the last character, respectively, of the last match of a sequence of integers in the corresponding string value of the column col_1.
df_2 = df %>%
  mutate(as_tibble(stri_locate_last(col_1, regex = '\\d+')))
Here is how this works:
- This works similarly to the Substring scenario above except that we pass the regular expression via the regex argument of stri_locate_last().
- The output data frame df_2 will have the same number of rows as the original data frame df, with two additional columns, start and end, containing the locations of the first and last characters, respectively, of the last sequence of digits in the corresponding value of the column col_1.
Alternative: Locate All and Return Last
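df_2 = df %>%
  mutate(map_dfr(str_locate_all(col_1, '\\d+'),
                 # return NA when a string has no match (zero-row matrix)
                 ~ if (nrow(.x) > 0) .x[nrow(.x), ] else c(start = NA, end = NA)))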
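Here is how this works:
- We use str_locate_all() to locate all matches. See All Matches above.
- We use map_dfr() to iterate over the matrices returned by str_locate_all() for the values of the column col_1 and, for each, extract the last row via .x[nrow(.x), ]. Note that a value of col_1 with no match yields a matrix with zero rows, for which there is no last row to extract; such values need to be handled (for instance by returning NA) so that the output keeps one row per input. See Non-Vectorized Transformation and Working with Lists.
- Because we do not assign the output of map_dfr() to any column name, mutate() expands it, creating two new columns start and end. See Multi-Value Transformation.
Nth Match
We wish to obtain the location of the nth match of a pattern in a string.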
In this example, we wish to create two new columns start and end where each row value holds the location of the first character and the last character, respectively, of the nth match of a sequence of integers in the corresponding string value of the column col_1.
get_nth_match <- function(match_matrix, n) {
  if (between(n, 1, nrow(match_matrix))) {
    return(match_matrix[n, ])
  } else {
    return(c(start = NA, end = NA))
  }
}
str_locate_nth <- function(p_col, p_pat, p_n) {
  map_dfr(str_locate_all(p_col, p_pat),
          get_nth_match, n = p_n)
}
df_2 = df %>%
  mutate(str_locate_nth(col_1, '\\d+', 2))
Here is how this works:
- We use str_locate_all() to locate all occurrences and then extract the location of the nth occurrence.
- We define a custom function str_locate_nth() to which we pass:
  - The string to search, which in this case is the column col_1.
  - The pattern to look for, which in this case is the regular expression '\\d+' and matches one or more consecutive digits.
  - The index of the occurrence we wish to locate, which in this case is 2.
- str_locate_nth() operates as follows:
  - It calls str_locate_all() to obtain a matrix for each value of the column col_1 where there is one row for each occurrence and two columns start and end. See All Matches above for a description of str_locate_all().
  - It relies on the helper get_nth_match(), which uses the function between() (covered in Numerical Operations) to check if the index requested is within the number of occurrences and if so returns the corresponding row in the match matrix. If not, it returns a row of NA values (with named elements).
  - It uses map_dfr() to iterate over the matrices returned by str_locate_all() and return a row of two values, the start and end locations, of the nth occurrence. See Working with Lists.
- Because we do not assign the output of map_dfr() to any column name, mutate() expands it, creating two new columns start and end. See Multi-Value Transformation.
- To locate a substring instead of a regular expression, we can wrap the pattern in fixed() before it is passed to str_locate_all(). See Substring under All Matches above.
- The output data frame df_2 will have the same number of rows as the original data frame df, with two additional columns, start and end, containing the locations of the first and last characters, respectively, of the 2nd sequence of digits in the corresponding value of the column col_1.
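As a rough illustration of the nth-match logic, the sketch below uses a compact stand-in for the helper described above (nth_row() is a hypothetical name introduced here, and the sample strings are made up) and requests the 2nd match:

```r
library(dplyr)
library(purrr)
library(stringr)

# Hypothetical helper: nth row of a match matrix, or NA when n is out of range.
nth_row <- function(m, n) {
  if (n >= 1 && n <= nrow(m)) m[n, ] else c(start = NA_integer_, end = NA_integer_)
}

# Hypothetical sample data; the values are made up for illustration.
df <- tibble(col_1 = c("a1 b22 c333", "only 9", "none"))

df_2 <- df %>%
  mutate(map_dfr(str_locate_all(col_1, "\\d+"), nth_row, n = 2))

df_2$start  # 5 NA NA
df_2$end    # 6 NA NA
```

Only the first string has a second digit sequence ('22' at characters 5 to 6). The other two strings have fewer than two matches, so they get NA.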
Ignore Case
Substring
We wish to obtain the location of the first match of a substring in a string irrespective of whether the letters are in upper or lower case; i.e. while ignoring case.
In this example, we wish to create a new column col_2 where each row value holds the location of the first character of the first match, regardless of case, of the string 'gm' in the corresponding string value of the column col_1.
df_2 = df %>%
  mutate(
    col_2 = str_locate(col_1, fixed('gm', ignore_case = TRUE))[, 'start']
  )
Here is how this works:
- We use str_locate() as described in First Match above.
- To ignore case, we wrap the pattern in fixed() and pass the parameter ignore_case=TRUE. See Ignore Case under Detecting for more details.
Regular Expression
We wish to obtain the location of the first match of a regular expression in a string irrespective of the case.
In this example, we wish to create two new columns start and end where each row value holds the location of the first character and the last character, respectively, of the first match of a sequence of integers followed by the characters 'gm', regardless of case, in the corresponding string value of the column col_1.
df_2 = df %>%
  mutate(
    as_tibble(
      str_locate(col_1, regex('\\d+gm', ignore_case = TRUE))
    ))
Here is how this works:
- We use str_locate() as described in First Match above.
- To ignore case, we wrap the pattern in regex() and pass the parameter ignore_case=TRUE. See Ignore Case under Detecting for more details.
Complement
In some situations, we wish to obtain the locations of the non-matching segments of the string.
All Matches
We wish to obtain the locations of non-matching segments of a string after locating all occurrences of a pattern in a string.
In this example, we wish to compute the average length (in number of characters) of non-integer character sequences in each value of the column col_1.
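df_2 = df %>%
  mutate(result =
           map(str_locate_all(col_1, '\\d+'), invert_match),
         # extract each non-matching segment via str_sub() and average the lengths;
         # an empty leading or trailing segment counts as length 0
         avg_len =
           map2_dbl(col_1, result,
                    ~ str_sub(.x, .y[, 'start'], .y[, 'end']) %>%
                      str_length() %>%
                      mean())) %>%
  select(-result)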
Here is how this works:
- We use str_locate_all(col_1, '\\d+') to locate all integer sequences captured by the regular expression '\\d+'. The output for each value of col_1 is a matrix with two columns start and end and as many rows as the number of occurrences of integer sequences in the corresponding value of the column col_1. See All Matches above for a detailed description.
- We use map() to iterate over each matrix produced by str_locate_all() and apply the function invert_match(), which conveniently returns a matrix of a similar structure but with the start and end locations capturing the inverse, i.e. the segments that do not match. See Working with Lists for coverage of the map() family of functions.
One Match
We wish to obtain the locations of non-matching segments of a string after locating the first occurrence (or one occurrence) of a pattern in a string.
match_locs = str_locate(df$col_1, '\\d+')
inv_match_locs =
  map(seq_len(nrow(match_locs)),
      ~ invert_match(match_locs[.x, , drop = FALSE]))
Here is how this works:
- The function invert_match() is designed to work with the output of str_locate_all(), not the output of str_locate().
- To use it with the output of str_locate(), we need to:
  - Use map(), iterating over a counter of as many values as there are rows, to simulate a loop.
  - Extract each row as a single-row matrix via match_locs[.x, , drop = FALSE]. The argument drop=FALSE specifies that a matrix should be returned and not a vector.
  - Pass that single-row matrix to invert_match().
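To see what the inverted locations look like, here is a small sketch on made-up strings (the vector x and the sample values are assumptions for illustration). The last step passes the inverted locations to str_sub(), which is the use they are designed for:

```r
library(purrr)
library(stringr)

# Hypothetical sample strings, made up for illustration.
x <- c("ab12cd", "7up")

# First match of a digit sequence in each string.
match_locs <- str_locate(x, "\\d+")

# Invert each row separately, keeping it as a one-row matrix.
inv_match_locs <- map(seq_len(nrow(match_locs)),
                      ~ invert_match(match_locs[.x, , drop = FALSE]))

# For "ab12cd" the match is (3, 4), so the inverted rows are (0, 2) and (5, -1).
inv_match_locs[[1]]

# Extract the non-matching segments of each string.
segments <- map2(x, inv_match_locs,
                 ~ str_sub(.x, .y[, "start"], .y[, "end"]))
```

A start of 0 and an end of -1 mark the beginning and the end of the string, respectively, so each inverted matrix has one more row than the matrix it was derived from.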
Pattern Column
We wish to match a vector of strings against a vector of patterns of the same size. This is often needed when we wish to locate the value of a column in the value of another column for each row.
Substring
In this example, we have a data frame df with two columns col_1 and col_2, and we wish to locate the start of the first occurrence of the value of col_2 in col_1.
df_2 = df %>%
  mutate(start = str_locate(col_1, fixed(col_2))[, 'start'])
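Here is how this works:
- The functions str_locate() and str_locate_all() are vectorized over both the string and the pattern and can operate in one of three modes:
  - A vector of strings and a single pattern, which is the mode used in the scenarios above.
  - A single string and a vector of patterns.
  - A vector of strings and a vector of patterns of the same size, in which case each string is matched against the pattern in the same position.
- In this case, we have a vector of strings, the column col_1, and a vector of patterns of the same size, the column col_2.
- We pass to str_locate():
  - The column col_1 as the first argument.
  - The column col_2, wrapped in fixed(), as the second argument; it holds the patterns and is naturally of the same size as col_1.
- str_locate() will return a matrix with two columns start and end and one row for each row in the data frame df holding the start and end locations of the value of the column col_2 in the corresponding value of the column col_1.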
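A minimal sketch of the row-wise matching described above, on made-up data (the sample values are assumptions for illustration):

```r
library(dplyr)
library(stringr)

# Hypothetical sample data; each row pairs a string with its own pattern.
df <- tibble(col_1 = c("apple pie", "banana", "cherry"),
             col_2 = c("pie", "na", "x"))

df_2 <- df %>%
  mutate(start = str_locate(col_1, fixed(col_2))[, "start"])

df_2$start  # 7 3 NA
```

Each value of col_1 is searched only for the value of col_2 in the same row: 'pie' starts at character 7 of 'apple pie', 'na' first occurs at character 3 of 'banana', and 'x' does not occur in 'cherry'.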
Regular Expression
In this example, we have a data frame df with two columns col_1 and col_2, and we wish to locate the start and end of the first occurrence of a sequence of repeated col_2 values in col_1.
df_2 = df %>%
  mutate(as_tibble(str_locate(col_1, str_c(col_2, '+'))))
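Here is how this works:
- We append '+' to each value of the column col_2 via str_c(col_2, '+') to specify a regular expression that matches one or more occurrences of the value of col_2. Note that '+' applies only to the element that immediately precedes it, so this matches repetitions of the whole value only when the values of col_2 are single characters. For longer values, the value can be wrapped in a group, e.g. str_c('(?:', col_2, ')+').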
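The sketch below illustrates the grouped variant on made-up data. It assumes that the values of col_2 contain no regular expression metacharacters; values that do would need to be escaped first:

```r
library(dplyr)
library(tibble)
library(stringr)

# Hypothetical sample data; col_2 values are assumed to be free of regex metacharacters.
df <- tibble(col_1 = c("xaaay", "ababab-cd", "zzz"),
             col_2 = c("a", "ab", "q"))

df_2 <- df %>%
  # Wrap each value in a non-capturing group so that '+' repeats the whole value.
  mutate(as_tibble(str_locate(col_1, str_c("(?:", col_2, ")+"))))

df_2$start  # 2 1 NA
df_2$end    # 4 6 NA
```

For the second row the pattern is '(?:ab)+', which matches the run 'ababab' at characters 1 to 6. Without the group, 'ab+' would match only 'ab' at characters 1 to 2.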
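Multiple Patterns
We will cover four scenarios of locating multiple patterns:
- First Match of Any Pattern
- First Match of Each Pattern
- All Matches in One Matrix
- All Matches in Separate Matrices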
Note: In the following examples, the output of string location will be left as a column of nested matrices. Typically, we proceed to do something with those nested matrices e.g. substitute with a character value which we cover in Substituting. Or more generally, operate on them via List Operations.
First Match of Any Pattern
We wish to return the start and end locations of the first occurrence of the first occurring pattern of n patterns. In other words, only one location is returned and that is of whichever pattern occurs first.
df_2 = df %>%
  mutate(as_tibble(str_locate(col_1, '\\d+|gm')))
Here is how this works:
- We use the or operator | of regular expressions to build a regular expression that captures all the patterns we wish to look for, or'ed together. In this case that regular expression is '\\d+|gm'.
- For each value of col_1, str_locate() will return the start and end locations of whichever pattern occurs first (either '\\d+' or 'gm').
- The output data frame df_2 will be a copy of the input data frame df with two added columns start and end.
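A small sketch on made-up data (the sample strings are assumptions for illustration) shows that the location returned belongs to whichever alternative occurs first:

```r
library(dplyr)
library(tibble)
library(stringr)

# Hypothetical sample data; the values are made up for illustration.
df <- tibble(col_1 = c("x gm 12", "123 gm", "neither"))

df_2 <- df %>%
  mutate(as_tibble(str_locate(col_1, "\\d+|gm")))

df_2$start  # 3 1 NA
df_2$end    # 4 3 NA
```

In the first string 'gm' comes first (characters 3 to 4). In the second the digit sequence comes first (characters 1 to 3). The third string matches neither pattern.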
First Match of Each Pattern
We wish to return the start and end locations of the first occurrence of each pattern of n patterns. In other words, n locations are returned one for each pattern.
df_2 = df %>%
  mutate(result =
           map(col_1, ~ str_locate(.x, c('\\d+', 'gm'))))
Here is how this works:
- We pass to str_locate() a vector of patterns c('\\d+', 'gm').
- For each value of col_1, str_locate() will return a matrix with two columns start and end holding the start and end locations and one row for the first occurrence of each pattern; in this case two rows for the patterns '\\d+' and 'gm'.
- We use map() to iterate over each value of the column col_1 and pass that to str_locate() along with the vector of multiple patterns. See Non-Vectorized Transformation.
- The output data frame df_2 will be a copy of the input data frame df with an added column result where each cell holds the matrix returned by str_locate() for the corresponding value of col_1.
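To illustrate the shape of the nested matrices, here is a sketch on made-up data (the sample strings are assumptions for illustration):

```r
library(dplyr)
library(purrr)
library(stringr)

# Hypothetical sample data; the values are made up for illustration.
df <- tibble(col_1 = c("x gm 12", "neither"))

df_2 <- df %>%
  mutate(result = map(col_1, ~ str_locate(.x, c("\\d+", "gm"))))

# One row per pattern, in the order the patterns were given:
# row 1 is the first digit sequence (6, 7), row 2 is the first 'gm' (3, 4).
df_2$result[[1]]
```

The rows follow the order of the patterns, not the order of occurrence in the string. For a string that matches neither pattern, both rows hold NA.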
All Matches in One Matrix
We wish to return the start and end locations of all occurrences of all patterns in the order of their occurrence as a single matrix.
df_2 = df %>%
  mutate(result = str_locate_all(col_1, '\\d+|gm'))
Here is how this works:
- We use the or operator | of regular expressions to build a regular expression that captures all the patterns we wish to look for, or'ed together. In this case that regular expression is '\\d+|gm'.
- For each value of col_1, str_locate_all() will return a matrix with two columns start and end holding the start and end locations and one row for each occurrence of each pattern.
- The output data frame df_2 will be a copy of the input data frame df with an added column result where each cell holds the matrix returned by str_locate_all() for the corresponding value of col_1.
All Matches in Separate Matrices
We wish to return the start and end locations of all occurrences of all patterns keeping the occurrences of each pattern in a separate matrix.
df_2 = df %>%
  mutate(result =
           map(col_1, ~ str_locate_all(.x, c('\\d+', 'gm'))))
Here is how this works:
- We pass to str_locate_all() a vector of patterns c('\\d+', 'gm').
- For each value of col_1, str_locate_all() will return a list of matrices:
  - There is one matrix for each pattern.
  - Each matrix has two columns start and end holding the start and end locations and one row for each occurrence of the corresponding pattern.
- We use map() to iterate over each value of the column col_1 and pass that to str_locate_all() along with the vector of multiple patterns. See Non-Vectorized Transformation.
- The output data frame df_2 will be a copy of the input data frame df with an added column result where each cell holds the list of matrices returned by str_locate_all() for the corresponding value of col_1.
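As a final sketch, the nested structure produced in this scenario can be inspected on made-up data (the sample strings are assumptions for illustration):

```r
library(dplyr)
library(purrr)
library(stringr)

# Hypothetical sample data; the values are made up for illustration.
df <- tibble(col_1 = c("1gm 22gm", "none"))

df_2 <- df %>%
  mutate(result = map(col_1, ~ str_locate_all(.x, c("\\d+", "gm"))))

# Each cell is a list with one matrix per pattern.
digits_1 <- df_2$result[[1]][[1]]  # digit sequences: (1, 1) and (5, 6)
gm_1     <- df_2$result[[1]][[2]]  # 'gm' occurrences: (2, 3) and (7, 8)

# A string with no match yields matrices with zero rows.
nrow(df_2$result[[2]][[1]])  # 0
```

Keeping the matrices separate preserves which pattern produced each location, at the cost of one more level of nesting than in the single-matrix scenario.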