We wish to extract a substring that matches a given pattern from a given string.
Substring extraction via a regular expression is one of the most powerful and frequently used techniques in data manipulation, especially for data cleaning.
We will look at three scenarios of substring extraction via regular expressions:
For each of these scenarios, we will cover two cases:
In addition, we cover the following three scenarios which can be applied to extend any of the above:
First Match
We wish to extract the first match of a regular expression pattern from a given string.
In this example, we wish to extract the first occurrence of a number (a sequence of digits) followed by the letters ‘kg’
with zero or more empty space characters between.
df_2 = df %>%
mutate(col_2 = str_extract(col_1, '\\d+\\s*kg'))
Here is how this works:
str_extract()
from the stringr
package to extract a substring that matches a given regular expression from each value of the column col_1
and return a vector of the same size.'\\d+\\s*kg'
which breaks down as follows:\\d+
matches one or more numeric characters.\\s*
matches zero or more empty space characterskg
matches the string kg
.df_2
will be a copy of the input data frame df
with an added column col_2
holding the extracted substrings from the corresponding value of the column col_1
.All Matches
We wish to extract all matches of a regular expression pattern from a given string.
In this example, we wish to extract all occurrences of a number (a sequence of digits) followed by the letters ‘kg’
with zero or more empty space characters between. We then wish to combine the extracted values into a single string with comma separators.
df_2 = df %>%
mutate(
col_2 = str_extract_all(col_1, '\\d+\\s*kg'),
col_3 = map_chr(col_2, str_flatten, collapse = ', '))
Here is how this works:
str_extract_all()
from the stringr
package to extract all occurrences of a given regular expression from each value of the column col_1
.str_extract_all()
returns a list of character vectors where each vector holds the matches extracted from the corresponding value of the column col_1
.str_flatten()
to reduce each vector of matches to a single string where the extracted matches are separated by commas.map_chr()
to iterate over the vectors of strings extracted from each element of col_1
. See Working with Lists.str_flatten()
.map_*()
for the return type. See Working with Lists.Extension: Nth Match
We wish to extract the nth match of a regular expression pattern from a given string.
df_2 = df %>%
mutate(col_2 = str_extract_all(col_1, '\\d+\\s*kg'),
col_3 = map_chr(col_2, `[`, 2))
Here is how this works:
str_extract_all()
to extract all pattern matches from each element of the column col_1
; as described in the primary solution above.col_2
, which contains the vector of all matching substrings extracted for each element in col_1
, and we select the second element from the list of matches via the list index operator [
with the argument 2
.map_chr()
to iterate over the values of col_2
because we expect the output to be a single character value. See Non-Vectorized Transformation and Working with Lists.Extension: Last Match
We wish to extract the last match of a regular expression pattern from a given string.
df_2 = df %>%
mutate(
col_2 = str_extract_all(col_1, '\\d+\\s*kg'),
col_3 = map_chr(
col_2,
~ ifelse(length(.x) > 0, .x[length(.x)], NA)))
Here is how this works:
str_extract_all()
to extract all pattern matches from each element of the column col_1
; as described in the primary solution above.col_2
, which contains the vector of all matching substrings extracted for each element in col_1
, andcol_2
, the anonymous function will return the last value of the vector of matches and will return a value of NA
if the value of col_2
is an empty vector.First Match
We wish to extract a part, specified by a capture group, of the first match of a regular expression pattern from a given string.
In this example, we wish to extract the first occurrence of a number (a sequence of digits) that is followed by the letters ‘kg’
with zero or more empty space characters between. Note we only wish to extract the number component.
df_2 = df %>%
mutate(
col_2 = str_extract(col_1, "(\\d+)\\s*kg", group = 1))
Here is how this works:
str_extract()
from the stringr
package with the group argument set to group=1
to extract the part of the matching substring that corresponds to the first group defined in the regular expression for each value in the column col_1
.‘(\\d+)\\s*kg’
which works as follows:(\\d+)
defines a capture group that contains a sequence of digits. A capture group is a part of a regular expression that can be referred to.\\s*kg
denotes a pattern where zero or more empty space characters are followed by the string ‘kg’
.'kg'
but we do not wish to include 'kg'
in the output.df_2
will be a copy of the input data frame df
with an added column col_2
holding the extracted substrings (corresponding to the first capture group) from the corresponding value of the column col_1
.Extension: nth Capture Group
df_2 = df %>%
mutate(
col_2 = str_extract(col_1, "(X+)(Y+)", group = 2))
Here is how this works:
group
argument of str_extract()
.col_1
, hence we pass group=2
to str_extract()
.df_2
will be a copy of the input data frame df
with an added column col_2
holding the extracted substrings (corresponding to the second capture group) from the corresponding value of the column col_1
.Alternative: Using str_match()
df_2 = df %>%
mutate(col_2 = str_match(col_1, '(X+)(Y+)')[,3])
Here is how this works:
str_extract()
with a set group
parameter for extracting a capture group, is to use str_match()
.str_match()
is built primarily for substring extraction by capture groups. It returns a matrix where:col_1
'(X+)(Y+)'
.All Matches
We wish to extract a part, specified by a capture group, of each match of a regular expression pattern from a given string.
In this example, we wish to extract and add up all occurrences of a number (a sequence of digits) that is followed by the letters ‘kg’
(with no or more empty space characters between). Note we only wish to extract the number component.
sum_group_matches <- function(p_mat, p_group) {
m_sum = p_mat[, p_group+1] %>%
parse_number() %>%
sum()
}
df_2 = df %>%
mutate(
col_2 = map_dbl(
str_match_all(col_1, "(\\d+)\\s*kg"),
sum_group_matches,
1)
)
Here is how this works:
str_match_all()
from the package stringr
to extract all matches of a regular expression as well as all matches of each capture group from each element of the input column col_1
.str_extract_all()
returns a list of character vectors where each vector holds the matches extracted from the corresponding value of the column col_1
.str_match_all()
is a list of the same length as col_1
containing character matrices; i.e. one character matrix for each value of the column col_1
.map_dbl()
to iterate over the matrices returned by str_match_all()
and apply the custom function sum_group_matches()
to each. See Non-Vectorized Transformation and Working with Lists.sum_group_matches()
operates as follows:p_mat
and p_group
. The argument p_mat
is a matrix that contains the matches of the regular expression pattern in col_1
. The argument p_group
specifies which group of the regular expression should be extracted. In this case, the argument p_group=1
specifies that the first group, which consists of one or more digits, should be extracted.sum_group_matches()
function extracts the specified group from the p_mat
matrix, converts the extracted values to numbers using the parse_number()
function, and calculates their sum using the sum()
function. The resulting sum is then returned as the value for the corresponding element in col_2
.Alternative: Extract and Process in Function
sum_group_matches <- function(p_col, p_pat, p_group) {
m_sum = p_col %>%
str_match_all(p_pat) %>%
map_dbl(
~.x[, p_group+1] %>%
parse_number %>%
sum()
)
}
df_2 = df %>%
mutate(
col_2 = sum_group_matches(col_1, '(\\d+)\\s*kg', 1)
)
Here is how this works:
sum_group_matches()
.sum_group_matches()
is changed as follows:p_col
, p_pat
, and p_group
. The argument p_col
is the column whose values we wish to extract substrings from. The argument p_pat
is the regular expression whose occurrences we wish to extract. The argument p_group
specifies which group of the regular expression should be extracted.str_match_all()
. It then uses map_db()
to iterate and execute an anonymous function that extracts the specified group from the p_mat
matrix, converts the extracted values to numbers using the parse_number()
function, and calculates their sum using the sum()
function.First Match
We wish to extract multiple parts, specified by multiple capture groups, of the first match of a regular expression pattern from a given string.
In this example, we wish to extract two parts of the first occurrence of a string pattern; the first representing a numerical quantity and the second representing the unit from each value of the column col_1
, and store each in its own column.
df_2 = df %>%
mutate(match_res = str_match(col_1, '(\\d+)\\s*([a-zA-Z]{1,4})'),
value = match_res[,2],
unit = match_res[,3]) %>%
select(-match_res)
Here is how this works:
str_match()
to extract the first substrings matching each capture group from each element in the column col_1
.str_match()
is built primarily for substring extraction by capture groups. It returns a matrix where:col_1
'(\\d+)\\s*([a-zA-Z]{1,4})'
which works as follows:(\\d+)
is capture group that matches a sequence of one or more digits.\\s*
a sequence of no or more empty space characters.([a-zA-Z]{1,4})
is a capture group that matches a sequence of one to four lettersstr_match()
is stored in the column match_res
.value
and unit
by extracting the second and third columns from the matrix created by str_match()
and stored in match_res
respectively corresponding to the two capture groups (\\d+)
and ([a-zA-Z]{1,4})
.df_2
is a copy of the input data frame df
with two columns added value
and unit
containing the values captured by the two capture groups in the given regular expression.Extension: Arbitrarily Many Capture Groups
df_2 = df %>%
mutate(
str_match(col_1, '(\\d+)\\s*([a-zA-Z]{1,4})') %>%
as_tibble(.name_repair ="unique"))
Here is how this work:
mutate()
to create one new column for each column of the matrix returned by str_match()
. See Multi-Value Transformation.Alternative: Via str_extract()
df_2 = df %>%
mutate(
str_extract(col_1, '(\\d+)\\s*([a-zA-Z]{1,4})', group = c(1,2)) %>%
as_tibble(.name_repair = ~c('value', 'unit')))
Here is how this works:
str_extract()
while passing the indices of the groups that we wish to have returned. See “First Match” under One Capture Group above for a description of how this works.str_extract()
the argument group = c(1,2)
.str_extract()
in this case is a matrix where:col_1
group = c(1,2)
.as_tibble()
to convert the matrix returned by str_extract()
to a data frame (a tibble). When a data frame is returns to mutate() without being assigned to any column name, the data frame is unpacked to individual columns which is what we are looking for. See Multi-Value Transformation.str_extract()
has no column names, we use the .name_repair
argument of as_tibble()
to specify the names to assign to the columns of the data frame being created. In this case we pass what is called a look-up formula specifying the column name for each column (two columns in this case) as .name_repair = ~c('value', 'unit'))
. See Multi-Value Transformation for a description of this pattern.df_2
will be a copy of the input data frame df
with two columns added called value and unit containing the values of the first match of the first and second capture groups (in the specified regular expression) respectively.All Matches
We wish to extract from a given string all matches of multiple parts, specified as capture groups, of a given regular expression.
In this example, each element of the input column is a string that takes the form "a1 * b1 + a2 * b2 … + an * bn”
where the a1, a2,..
and b1, b2,..
are each a sequence of digits. We wish to parse and evaluate each of these expressions and store the result in a new column col_2
.
sum_prods <- function(p_mat) {
sum(parse_number(p_mat[, 2]) * parse_number(p_mat[, 3]))
}
df_2 = df %>%
mutate(
col_2 = col_1 %>%
str_match_all('(\\d+)\\s*\\*\\s*(\\d+)') %>%
map_dbl(sum_prods)
)
Here is how this work:
str_match_all()
to extract all matches of all capture groups in the given regular expression for each element of the column col_1
as follows;str_match_all()
takes regex with at least one capture group and returns a list of character matrices.str_match()
which we describe in detail under “First Match” in Multiple Capture Group above i.e. the first column has the complete match followed by one column for the output of each capture group.map_dbl()
to iterate over the matrices returned by str_match_all()
and apply the custom function sum_prods()
to each which returns a single numeric value for each matrix.sum_prods()
takes a matrix p_mat
as an input. The function:p_mat
df_2
is a copy of the input data frame df
with an added column col_2
holding the numeric value resulting from processing the corresponding string value of the column col_1
.We wish to extract the first match of a regular expression pattern from a given string while ignoring case.
In this example, we wish to extract the first occurrence of a number (a sequence of digits) followed by the letters ‘kg’
regardless of case with zero or more empty space characters between.
df_2 = df %>%
mutate(
col_2 = str_extract(col_1, regex('\\d+\\s*kg', ignore_case = TRUE)))
Here is how this works:
str_extract()
as described in Entire Pattern above.regex()
and pass the parameter ignore_case=TRUE
. See Ignore Case under Detecting for more details.We wish to match a vector of strings against a vector of patterns of the same size. This is often needed when the regular expression we are matching needs to be different for each row based on the value of a particular column.
In this example, we have a data frame df
with two column col_1
and col_2
, we wish to extract from each value in col_1
a substring composed of a sequence of digits followed by a set of characters representing a unit specified by the corresponding value of the column col_2
.
df_2 = df %>%
mutate(
col_3 = str_extract(col_1, str_c('(\\d+)\\s*', col_2)))
Here is how this works:
str_extract()
is vectorized over both the string and the pattern and can operate in one of three modes:col_1
, and a vector of patterns of the same size.str_extract()
:col_1
as the first argument.col_2
and is naturally of the same size as col_1
.col_2
to the regular expression '(\\d+)\\s*'
for each row. For instance, if the value of the column col_2
is ‘kg’
then the value of the regular expression to extract will be '(\\d+)\\s*kg'
.df_2
will be a copy of the input data frame df but with an additional column col_3
containing the substrings extracted.