We wish to extract a substring that matches a given pattern from a given string.

Substring extraction via a regular expression is one of the most powerful and frequently used techniques in data manipulation, especially for data cleaning.

We will look at three scenarios of substring extraction via regular expressions:

**Entire Pattern**where we cover how to extract a substring from a given string that matches a given regular expression.**One Capture Group**where we cover how to match a given regular expression pattern to a given string and then extract part of the pattern commonly referred to as a capture group.**Multiple Capture Groups**where we cover how to extract multiple parts of a pattern; i.e. multiple capture groups.

For each of these scenarios, we will cover two cases:

- How to extract the
**First Match** - How to extract
**All Matches**

In addition, we cover the following three scenarios which can be applied to extend any of the above:

**Ignore Case**where we cover how to ignore the case (of both pattern and string) while matching.**Pattern Column**where we cover how to match a vector of strings against a vector of patterns of the same size. This is quite often needed when we wish to locate the value of a column in another column for each row.

**First Match**

We wish to extract the first match of a regular expression pattern from a given string.

In this example, we wish to extract the first occurrence of a number (a sequence of digits) followed by the letters `‘kg’`

with zero or more empty space characters between.

```
df_2 = df %>%
mutate(col_2 = str_extract(col_1, '\\d+\\s*kg'))
```

Here is how this works:

- We use the function
`str_extract()`

from the`stringr`

package to extract a substring that matches a given regular expression from each value of the column`col_1`

and return a vector of the same size. - In this example, the regular expression that represents the substrings that we are looking to extract is
`'\\d+\\s*kg'`

which breaks down as follows:`\\d+`

matches one or more numeric characters.`\\s*`

matches zero or more empty space characters`kg`

matches the string`kg`

.

- The output data frame
`df_2`

will be a copy of the input data frame`df`

with an added column`col_2`

holding the extracted substrings from the corresponding value of the column`col_1`

.

**All Matches**

We wish to extract all matches of a regular expression pattern from a given string.

In this example, we wish to extract all occurrences of a number (a sequence of digits) followed by the letters `‘kg’`

with zero or more empty space characters between. We then wish to combine the extracted values into a single string with comma separators.

```
df_2 = df %>%
mutate(
col_2 = str_extract_all(col_1, '\\d+\\s*kg'),
col_3 = map_chr(col_2, str_flatten, collapse = ', '))
```

Here is how this works:

- We use the function
`str_extract_all()`

from the`stringr`

package to extract all occurrences of a given regular expression from each value of the column`col_1`

. - The output of
`str_extract_all()`

returns a list of character vectors where each vector holds the matches extracted from the corresponding value of the column`col_1`

. - In order to reduce each vector of matches to a single string where the extracted matches are separated by commas, we:
- Use
`str_flatten()`

to reduce each vector of matches to a single string where the extracted matches are separated by commas. - Use
`map_chr()`

to iterate over the vectors of strings extracted from each element of`col_1`

. See Working with Lists.

- Use
- In this example, we simply flatten the list of returned matching substrings into one string. To perform, other processing, simply:
- Call the function of choice (build-in or custom) instead of
`str_flatten()`

. - Use the appropriate mapping function
`map_*()`

for the return type. See Working with Lists.

- Call the function of choice (build-in or custom) instead of

*Extension: Nth Match*

We wish to extract the nth match of a regular expression pattern from a given string.

```
df_2 = df %>%
mutate(col_2 = str_extract_all(col_1, '\\d+\\s*kg'),
col_3 = map_chr(col_2, `[`, 2))
```

Here is how this works:

- We use
`str_extract_all()`

to extract all pattern matches from each element of the column`col_1`

; as described in the primary solution above. - In order to extract the second match, we map over
`col_2`

, which contains the vector of all matching substrings extracted for each element in`col_1`

, and we select the second element from the list of matches via the list index operator`[`

with the argument`2`

. - We used
`map_chr()`

to iterate over the values of`col_2`

because we expect the output to be a single character value. See Non-Vectorized Transformation and Working with Lists.

*Extension: Last Match*

We wish to extract the last match of a regular expression pattern from a given string.

```
df_2 = df %>%
mutate(
col_2 = str_extract_all(col_1, '\\d+\\s*kg'),
col_3 = map_chr(
col_2,
~ ifelse(length(.x) > 0, .x[length(.x)], NA)))
```

Here is how this works:

- We use
`str_extract_all()`

to extract all pattern matches from each element of the column`col_1`

; as described in the primary solution above. - In order to extract the last match:
- We map over
`col_2`

, which contains the vector of all matching substrings extracted for each element in`col_1`

, and - for each element in
`col_2`

, the anonymous function will return the last value of the vector of matches and will return a value of`NA`

if the value of`col_2`

is an empty vector.

- We map over

**First Match**

We wish to extract a part, specified by a capture group, of the first match of a regular expression pattern from a given string.

In this example, we wish to extract the first occurrence of a number (a sequence of digits) that is followed by the letters `‘kg’`

with zero or more empty space characters between. Note we only wish to extract the number component.

```
df_2 = df %>%
mutate(
col_2 = str_extract(col_1, "(\\d+)\\s*kg", group = 1))
```

Here is how this works:

- We use the function
`str_extract()`

from the`stringr`

package with the group argument set to`group=1`

to extract the part of the matching substring that corresponds to the first group defined in the regular expression for each value in the column`col_1`

. - In this example the regular expression is
`‘(\\d+)\\s*kg’`

which works as follows:`(\\d+)`

defines a capture group that contains a sequence of digits. A capture group is a part of a regular expression that can be referred to.`\\s*kg`

denotes a pattern where zero or more empty space characters are followed by the string`‘kg’`

.

- The power of capture groups in substring extraction is most pronounced when parts of the regular expression are to localize the desired substring, but we don’t want to include them in the output extraction. For instance, in this case, we wish to capture any integer that comes before the unit
`'kg'`

but we do not wish to include`'kg'`

in the output. - The output data frame
`df_2`

will be a copy of the input data frame`df`

with an added column`col_2`

holding the extracted substrings (corresponding to the first capture group) from the corresponding value of the column`col_1`

.

*Extension: nth Capture Group*

```
df_2 = df %>%
mutate(
col_2 = str_extract(col_1, "(X+)(Y+)", group = 2))
```

Here is how this works:

- If the regular expression pattern we are matching has multiple capture groups, we can refer to and extract any of the groups by passing its index to the
`group`

argument of`str_extract()`

. - In this case, we extract the second capture group from each value of the column
`col_1`

, hence we pass`group=2`

to`str_extract()`

. - The output data frame
`df_2`

will be a copy of the input data frame`df`

with an added column`col_2`

holding the extracted substrings (corresponding to the second capture group) from the corresponding value of the column`col_1`

.

*Alternative: Using* `str_match()`

```
df_2 = df %>%
mutate(col_2 = str_match(col_1, '(X+)(Y+)')[,3])
```

Here is how this works:

- An alternative to
`str_extract()`

with a set`group`

parameter for extracting a capture group, is to use`str_match()`

. `str_match()`

is built primarily for substring extraction by capture groups. It returns a matrix where:- there is one row for each input value, which in this case are the values of the column
`col_1`

- The first column contains the substring matching the entire pattern
- The subsequent columns correspond to capture groups captured within the pattern (one column per capture group). In this case, there will be two columns corresponding to the two capture groups in the pattern
`'(X+)(Y+)'`

.

- there is one row for each input value, which in this case are the values of the column

**All Matches**

We wish to extract a part, specified by a capture group, of each match of a regular expression pattern from a given string.

In this example, we wish to extract and add up all occurrences of a number (a sequence of digits) that is followed by the letters `‘kg’`

(with no or more empty space characters between). Note we only wish to extract the number component.

```
sum_group_matches <- function(p_mat, p_group) {
m_sum = p_mat[, p_group+1] %>%
parse_number() %>%
sum()
}
df_2 = df %>%
mutate(
col_2 = map_dbl(
str_match_all(col_1, "(\\d+)\\s*kg"),
sum_group_matches,
1)
)
```

Here is how this works:

- We use the function
`str_match_all()`

from the package`stringr`

to extract all matches of a regular expression as well as all matches of each capture group from each element of the input column`col_1`

. - The output of
`str_extract_all()`

returns a list of character vectors where each vector holds the matches extracted from the corresponding value of the column`col_1`

. - In this case, the output of
`str_match_all()`

is a list of the same length as`col_1`

containing character matrices; i.e. one character matrix for each value of the column`col_1`

. - Each matrix takes the following form:
- there is one row for each pattern match
- The first column contains the substring matching the entire pattern
- The subsequent columns correspond to capture groups captured within the pattern (one column per capture group).

- After obtaining the pattern matches we need to process them, which in this example is, to obtain the sum of the matches of the first capture group. We do that by using
`map_dbl()`

to iterate over the matrices returned by`str_match_all()`

and apply the custom function`sum_group_matches()`

to each. See Non-Vectorized Transformation and Working with Lists. - The custom function
`sum_group_matches()`

operates as follows:- It takes in two arguments:
`p_mat`

and`p_group`

. The argument`p_mat`

is a matrix that contains the matches of the regular expression pattern in`col_1`

. The argument`p_group`

specifies which group of the regular expression should be extracted. In this case, the argument`p_group=1`

specifies that the first group, which consists of one or more digits, should be extracted. - The
`sum_group_matches()`

function extracts the specified group from the`p_mat`

matrix, converts the extracted values to numbers using the`parse_number()`

function, and calculates their sum using the`sum()`

function. The resulting sum is then returned as the value for the corresponding element in`col_2`

.

- It takes in two arguments:

*Alternative: Extract and Process in Function*

```
sum_group_matches <- function(p_col, p_pat, p_group) {
m_sum = p_col %>%
str_match_all(p_pat) %>%
map_dbl(
~.x[, p_group+1] %>%
parse_number %>%
sum()
)
}
df_2 = df %>%
mutate(
col_2 = sum_group_matches(col_1, '(\\d+)\\s*kg', 1)
)
```

Here is how this works:

- This works similarly to the solution above except that we isolate both the extraction and iteration out of the main chain and into the custom function
`sum_group_matches()`

. - The custom function
`sum_group_matches()`

is changed as follows:- The function takes parameters
`p_col`

,`p_pat`

, and`p_group`

. The argument`p_col`

is the column whose values we wish to extract substrings from. The argument`p_pat`

is the regular expression whose occurrences we wish to extract. The argument`p_group`

specifies which group of the regular expression should be extracted. - The function starts by extracting all pattern matches via
`str_match_all()`

. It then uses`map_db()`

to iterate and execute an anonymous function that extracts the specified group from the`p_mat`

matrix, converts the extracted values to numbers using the`parse_number()`

function, and calculates their sum using the`sum()`

function.

- The function takes parameters
- Whether to isolate iteration out of the main chain or not depends on the situation and preferred coding style.

**First Match**

We wish to extract multiple parts, specified by multiple capture groups, of the first match of a regular expression pattern from a given string.

In this example, we wish to extract two parts of the first occurrence of a string pattern; the first representing a numerical quantity and the second representing the unit from each value of the column `col_1`

, and store each in its own column.

```
df_2 = df %>%
mutate(match_res = str_match(col_1, '(\\d+)\\s*([a-zA-Z]{1,4})'),
value = match_res[,2],
unit = match_res[,3]) %>%
select(-match_res)
```

Here is how this works:

- We use
`str_match()`

to extract the first substrings matching each capture group from each element in the column`col_1`

. `str_match()`

is built primarily for substring extraction by capture groups. It returns a matrix where:- there is one row for each input value, which in this case are the values of the column
`col_1`

- The first column contains the substring matching the entire pattern
- The subsequent columns correspond to capture groups captured within the pattern (one column per capture group. In this case, there will be two columns corresponding to the two capture groups in the pattern (quantity and unit).

- there is one row for each input value, which in this case are the values of the column
- The regular expression we use in this case is
`'(\\d+)\\s*([a-zA-Z]{1,4})'`

which works as follows:`(\\d+)`

is capture group that matches a sequence of one or more digits.`\\s*`

a sequence of no or more empty space characters.`([a-zA-Z]{1,4})`

is a capture group that matches a sequence of one to four letters

- The matrix returned by
`str_match()`

is stored in the column`match_res`

. - We then create two new columns
`value`

and`unit`

by extracting the second and third columns from the matrix created by`str_match()`

and stored in`match_res`

respectively corresponding to the two capture groups`(\\d+)`

and`([a-zA-Z]{1,4})`

. - The output data frame
`df_2`

is a copy of the input data frame`df`

with two columns added`value`

and`unit`

containing the values captured by the two capture groups in the given regular expression.

*Extension: Arbitrarily Many Capture Groups*

```
df_2 = df %>%
mutate(
str_match(col_1, '(\\d+)\\s*([a-zA-Z]{1,4})') %>%
as_tibble(.name_repair ="unique"))
```

Here is how this work:

- This code works similarly to the primary solution above except that we do not. manually extract individual columns, and rather rely on
`mutate()`

to create one new column for each column of the matrix returned by`str_match()`

. See Multi-Value Transformation. - This is a useful approach be when we do not know how many capture groups the regular expression may have; e.g. if the regex is passed dynamically at run time.

*Alternative: Via* `str_extract()`

```
df_2 = df %>%
mutate(
str_extract(col_1, '(\\d+)\\s*([a-zA-Z]{1,4})', group = c(1,2)) %>%
as_tibble(.name_repair = ~c('value', 'unit')))
```

Here is how this works:

- We use
`str_extract()`

while passing the indices of the groups that we wish to have returned. See “First Match” under One Capture Group above for a description of how this works. - In this case, we wish to extract the first and second capture groups, therefore we pass to
`str_extract()`

the argument`group = c(1,2)`

. - The output of
`str_extract()`

in this case is a matrix where:- there is one row for each input value, which in this case are the values of the column
`col_1`

- Two columns corresponding to the two capture groups we selected when we specified
`group = c(1,2)`

.

- there is one row for each input value, which in this case are the values of the column
- See the primary solution above for a description of the regular expression.
- We use
`as_tibble()`

to convert the matrix returned by`str_extract()`

to a data frame (a tibble). When a data frame is returns to mutate() without being assigned to any column name, the data frame is unpacked to individual columns which is what we are looking for. See Multi-Value Transformation. - Since the matrix returns by
`str_extract()`

has no column names, we use the`.name_repair`

argument of`as_tibble()`

to specify the names to assign to the columns of the data frame being created. In this case we pass what is called a look-up formula specifying the column name for each column (two columns in this case) as`.name_repair = ~c('value', 'unit'))`

. See Multi-Value Transformation for a description of this pattern. - The output data frame
`df_2`

will be a copy of the input data frame`df`

with two columns added called value and unit containing the values of the first match of the first and second capture groups (in the specified regular expression) respectively.

**All Matches**

We wish to extract from a given string all matches of multiple parts, specified as capture groups, of a given regular expression.

In this example, each element of the input column is a string that takes the form `"a1 * b1 + a2 * b2 … + an * bn”`

where the `a1, a2,..`

and `b1, b2,..`

are each a sequence of digits. We wish to parse and evaluate each of these expressions and store the result in a new column `col_2`

.

```
sum_prods <- function(p_mat) {
sum(parse_number(p_mat[, 2]) * parse_number(p_mat[, 3]))
}
df_2 = df %>%
mutate(
col_2 = col_1 %>%
str_match_all('(\\d+)\\s*\\*\\s*(\\d+)') %>%
map_dbl(sum_prods)
)
```

Here is how this work:

- First we use
`str_match_all()`

to extract all matches of all capture groups in the given regular expression for each element of the column`col_1`

as follows;`str_match_all()`

takes regex with at least one capture group and returns a list of character matrices.- Each matrix is similar to the output of
`str_match()`

which we describe in detail under “First Match” in Multiple Capture Group above i.e. the first column has the complete match followed by one column for the output of each capture group.

- We then use
`map_dbl()`

to iterate over the matrices returned by`str_match_all()`

and apply the custom function`sum_prods()`

to each which returns a single numeric value for each matrix. - The custom function
`sum_prods()`

takes a matrix`p_mat`

as an input. The function:- Parses the values of the second and third columns of
`p_mat`

- Takes their element wide product
- and finally calculates the sum of the products
- returns the resulting numeric value

- Parses the values of the second and third columns of
- The output data frame
`df_2`

is a copy of the input data frame`df`

with an added column`col_2`

holding the numeric value resulting from processing the corresponding string value of the column`col_1`

.

We wish to extract the first match of a regular expression pattern from a given string while ignoring case.

In this example, we wish to extract the first occurrence of a number (a sequence of digits) followed by the letters `‘kg’`

regardless of case with zero or more empty space characters between.

```
df_2 = df %>%
mutate(
col_2 = str_extract(col_1, regex('\\d+\\s*kg', ignore_case = TRUE)))
```

Here is how this works:

- To extract the first occurrence of a given regular expression in a given string, we use
`str_extract()`

as described in Entire Pattern above. - To ignore case while matching, we wrap the regular expression in
`regex()`

and pass the parameter`ignore_case=TRUE`

. See Ignore Case under Detecting for more details. - We can extend the solutions presented in One Capture Group and Multiple Capture Groups the same way.

We wish to match a vector of strings against a vector of patterns of the same size. This is often needed when the regular expression we are matching needs to be different for each row based on the value of a particular column.

In this example, we have a data frame `df`

with two column `col_1`

and `col_2`

, we wish to extract from each value in `col_1`

a substring composed of a sequence of digits followed by a set of characters representing a unit specified by the corresponding value of the column `col_2`

.

```
df_2 = df %>%
mutate(
col_3 = str_extract(col_1, str_c('(\\d+)\\s*', col_2)))
```

Here is how this works:

`str_extract()`

is vectorized over both the string and the pattern and can operate in one of three modes:- Check for one pattern in each element in a vector of strings. This is the mode we used in all the above scenarios.
- Check n patterns against n strings. The size of both vectors must be the same for this pattern (or a multiple). This is the mode we use in this solution since we have a vector of strings, the column
`col_1`

, and a vector of patterns of the same size. - Check multiple patterns against a single string.

- We pass to
`str_extract()`

:- the strings to look into which in this case is the column
`col_1`

as the first argument. - and the patterns to look for which in this case is a transformation of the column
`col_2`

and is naturally of the same size as`col_1`

.

- the strings to look into which in this case is the column
- We construct the regular expression patterns by appending the value of the column
`col_2`

to the regular expression`'(\\d+)\\s*'`

for each row. For instance, if the value of the column`col_2`

is`‘kg’`

then the value of the regular expression to extract will be`'(\\d+)\\s*kg'`

. - The output data frame
`df_2`

will be a copy of the input data frame df but with an additional column`col_3`

containing the substrings extracted. - See Entire Pattern above for a more detailed description of this code. We can extend the solutions presented in One Capture Group and Multiple Capture Groups the same way.

R