Extracting by Regular Expression

We wish to extract a substring that matches a given pattern from a given string.

Substring extraction via a regular expression is one of the most powerful and frequently used techniques in data manipulation, especially for data cleaning.

We will look at three scenarios of substring extraction via regular expressions:

  • Entire Pattern where we cover how to extract a substring from a given string that matches a given regular expression.
  • One Capture Group where we cover how to match a given regular expression pattern to a given string and then extract part of the pattern commonly referred to as a capture group.
  • Multiple Capture Groups where we cover how to extract multiple parts of a pattern; i.e. multiple capture groups.

For each of these scenarios, we will cover two cases:

  • How to extract the First Match
  • How to extract All Matches

In addition, we cover the following three scenarios which can be applied to extend any of the above:

  • Ignore Case where we cover how to ignore the case (of both pattern and string) while matching.
  • Pattern Column where we cover how to match a vector of strings against a vector of patterns of the same size. This is quite often needed when we wish to locate the value of a column in another column for each row.

Entire Pattern

First Match

We wish to extract the first match of a regular expression pattern from a given string.

In this example, we wish to extract the first occurrence of a number (a sequence of digits) followed by the letters ‘kg’ with zero or more empty space characters between.

df_2 = df %>% 
    mutate(col_2 = str_extract(col_1, '\\d+\\s*kg'))

Here is how this works:

  • We use the function str_extract() from the stringr package to extract a substring that matches a given regular expression from each value of the column col_1 and return a vector of the same size.
  • In this example, the regular expression that represents the substrings that we are looking to extract is '\\d+\\s*kg' which breaks down as follows:
    • \\d+ matches one or more numeric characters.
    • \\s* matches zero or more empty space characters
    • kg matches the string kg.
  • The output data frame df_2 will be a copy of the input data frame df with an added column col_2 holding the extracted substrings from the corresponding value of the column col_1.

All Matches

We wish to extract all matches of a regular expression pattern from a given string.

In this example, we wish to extract all occurrences of a number (a sequence of digits) followed by the letters ‘kg’ with zero or more empty space characters between. We then wish to combine the extracted values into a single string with comma separators.

df_2 = df %>%
  mutate(
    col_2 = str_extract_all(col_1, '\\d+\\s*kg'),
    col_3 = map_chr(col_2, str_flatten, collapse = ', '))

Here is how this works:

  • We use the function str_extract_all() from the stringr package to extract all occurrences of a given regular expression from each value of the column col_1.
  • The output of str_extract_all() returns a list of character vectors where each vector holds the matches extracted from the corresponding value of the column col_1.
  • In order to reduce each vector of matches to a single string where the extracted matches are separated by commas, we:
    • Use str_flatten() to reduce each vector of matches to a single string where the extracted matches are separated by commas.
    • Use map_chr() to iterate over the vectors of strings extracted from each element of col_1. See Working with Lists.
  • In this example, we simply flatten the list of returned matching substrings into one string. To perform, other processing, simply:
    • Call the function of choice (build-in or custom) instead of str_flatten().
    • Use the appropriate mapping function map_*() for the return type. See Working with Lists.

Extension: Nth Match

We wish to extract the nth match of a regular expression pattern from a given string.

df_2 = df %>%
  mutate(col_2 = str_extract_all(col_1, '\\d+\\s*kg'),
         col_3 = map_chr(col_2, `[`, 2))

Here is how this works:

  • We use str_extract_all() to extract all pattern matches from each element of the column col_1; as described in the primary solution above.
  • In order to extract the second match, we map over col_2, which contains the vector of all matching substrings extracted for each element in col_1, and we select the second element from the list of matches via the list index operator [ with the argument 2.
  • We used map_chr() to iterate over the values of col_2 because we expect the output to be a single character value. See Non-Vectorized Transformation and Working with Lists.

Extension: Last Match

We wish to extract the last match of a regular expression pattern from a given string.

df_2 = df %>%
  mutate(
    col_2 = str_extract_all(col_1, '\\d+\\s*kg'),
    col_3 = map_chr(
      col_2,
      ~ ifelse(length(.x) > 0, .x[length(.x)], NA)))

Here is how this works:

  • We use str_extract_all() to extract all pattern matches from each element of the column col_1; as described in the primary solution above.
  • In order to extract the last match:
    • We map over col_2, which contains the vector of all matching substrings extracted for each element in col_1, and
    • for each element in col_2, the anonymous function will return the last value of the vector of matches and will return a value of NA if the value of col_2 is an empty vector.

One Capture Group

First Match

We wish to extract a part, specified by a capture group, of the first match of a regular expression pattern from a given string.

In this example, we wish to extract the first occurrence of a number (a sequence of digits) that is followed by the letters ‘kg’ with zero or more empty space characters between. Note we only wish to extract the number component.

df_2 = df %>% 
    mutate(
        col_2 = str_extract(col_1, "(\\d+)\\s*kg", group = 1))

Here is how this works:

  • We use the function str_extract() from the stringr package with the group argument set to group=1 to extract the part of the matching substring that corresponds to the first group defined in the regular expression for each value in the column col_1.
  • In this example the regular expression is ‘(\\d+)\\s*kg’ which works as follows:
    • (\\d+) defines a capture group that contains a sequence of digits. A capture group is a part of a regular expression that can be referred to.
    • \\s*kg denotes a pattern where zero or more empty space characters are followed by the string ‘kg’.
  • The power of capture groups in substring extraction is most pronounced when parts of the regular expression are to localize the desired substring, but we don’t want to include them in the output extraction. For instance, in this case, we wish to capture any integer that comes before the unit 'kg' but we do not wish to include 'kg' in the output.
  • The output data frame df_2 will be a copy of the input data frame df with an added column col_2 holding the extracted substrings (corresponding to the first capture group) from the corresponding value of the column col_1.

Extension: nth Capture Group

df_2 = df %>% 
  mutate(
    col_2 = str_extract(col_1, "(X+)(Y+)", group = 2))

Here is how this works:

  • If the regular expression pattern we are matching has multiple capture groups, we can refer to and extract any of the groups by passing its index to the group argument of str_extract().
  • In this case, we extract the second capture group from each value of the column col_1, hence we pass group=2 to str_extract().
  • The output data frame df_2 will be a copy of the input data frame df with an added column col_2 holding the extracted substrings (corresponding to the second capture group) from the corresponding value of the column col_1.

Alternative: Using str_match()

df_2 = df %>%
  mutate(col_2 = str_match(col_1, '(X+)(Y+)')[,3])

Here is how this works:

  • An alternative to str_extract() with a set group parameter for extracting a capture group, is to use str_match().
  • str_match() is built primarily for substring extraction by capture groups. It returns a matrix where:
    • there is one row for each input value, which in this case are the values of the column col_1
    • The first column contains the substring matching the entire pattern
    • The subsequent columns correspond to capture groups captured within the pattern (one column per capture group). In this case, there will be two columns corresponding to the two capture groups in the pattern '(X+)(Y+)'.

All Matches

We wish to extract a part, specified by a capture group, of each match of a regular expression pattern from a given string.

In this example, we wish to extract and add up all occurrences of a number (a sequence of digits) that is followed by the letters ‘kg’ (with no or more empty space characters between). Note we only wish to extract the number component.

sum_group_matches <- function(p_mat, p_group) {
  m_sum = p_mat[, p_group+1] %>% 
    parse_number() %>%
    sum()
}

df_2 = df %>%
  mutate(
    col_2 = map_dbl(
      str_match_all(col_1, "(\\d+)\\s*kg"),
      sum_group_matches,
      1)
  )

Here is how this works:

  • We use the function str_match_all() from the package stringr to extract all matches of a regular expression as well as all matches of each capture group from each element of the input column col_1.
  • The output of str_extract_all() returns a list of character vectors where each vector holds the matches extracted from the corresponding value of the column col_1.
  • In this case, the output of str_match_all() is a list of the same length as col_1 containing character matrices; i.e. one character matrix for each value of the column col_1.
  • Each matrix takes the following form:
    • there is one row for each pattern match
    • The first column contains the substring matching the entire pattern
    • The subsequent columns correspond to capture groups captured within the pattern (one column per capture group).
  • After obtaining the pattern matches we need to process them, which in this example is, to obtain the sum of the matches of the first capture group. We do that by using map_dbl() to iterate over the matrices returned by str_match_all() and apply the custom function sum_group_matches() to each. See Non-Vectorized Transformation and Working with Lists.
  • The custom function sum_group_matches() operates as follows:
    • It takes in two arguments: p_matand p_group. The argument p_mat is a matrix that contains the matches of the regular expression pattern in col_1. The argument p_group specifies which group of the regular expression should be extracted. In this case, the argument p_group=1 specifies that the first group, which consists of one or more digits, should be extracted.
    • The sum_group_matches() function extracts the specified group from the p_mat matrix, converts the extracted values to numbers using the parse_number() function, and calculates their sum using the sum() function. The resulting sum is then returned as the value for the corresponding element in col_2.

Alternative: Extract and Process in Function

sum_group_matches <- function(p_col, p_pat, p_group) {
  m_sum = p_col %>% 
    str_match_all(p_pat) %>% 
    map_dbl(
      ~.x[, p_group+1] %>% 
        parse_number %>%
        sum()
    )
}

df_2 = df %>%
  mutate(
    col_2 = sum_group_matches(col_1, '(\\d+)\\s*kg', 1)
  )

Here is how this works:

  • This works similarly to the solution above except that we isolate both the extraction and iteration out of the main chain and into the custom function sum_group_matches().
  • The custom function sum_group_matches() is changed as follows:
    • The function takes parameters p_col, p_pat, and p_group. The argument p_col is the column whose values we wish to extract substrings from. The argument p_pat is the regular expression whose occurrences we wish to extract. The argument p_group specifies which group of the regular expression should be extracted.
    • The function starts by extracting all pattern matches via str_match_all(). It then uses map_db() to iterate and execute an anonymous function that extracts the specified group from the p_mat matrix, converts the extracted values to numbers using the parse_number() function, and calculates their sum using the sum() function.
  • Whether to isolate iteration out of the main chain or not depends on the situation and preferred coding style.

Multiple Capture Groups

First Match

We wish to extract multiple parts, specified by multiple capture groups, of the first match of a regular expression pattern from a given string.

In this example, we wish to extract two parts of the first occurrence of a string pattern; the first representing a numerical quantity and the second representing the unit from each value of the column col_1, and store each in its own column.

df_2 = df %>%
  mutate(match_res = str_match(col_1, '(\\d+)\\s*([a-zA-Z]{1,4})'),
         value = match_res[,2],
         unit = match_res[,3]) %>%
  select(-match_res)

Here is how this works:

  • We use str_match() to extract the first substrings matching each capture group from each element in the column col_1.
  • str_match() is built primarily for substring extraction by capture groups. It returns a matrix where:
    • there is one row for each input value, which in this case are the values of the column col_1
    • The first column contains the substring matching the entire pattern
    • The subsequent columns correspond to capture groups captured within the pattern (one column per capture group. In this case, there will be two columns corresponding to the two capture groups in the pattern (quantity and unit).
  • The regular expression we use in this case is '(\\d+)\\s*([a-zA-Z]{1,4})' which works as follows:
    • (\\d+) is capture group that matches a sequence of one or more digits.
    • \\s* a sequence of no or more empty space characters.
    • ([a-zA-Z]{1,4}) is a capture group that matches a sequence of one to four letters
  • The matrix returned by str_match() is stored in the column match_res.
  • We then create two new columns value and unit by extracting the second and third columns from the matrix created by str_match() and stored in match_res respectively corresponding to the two capture groups (\\d+) and ([a-zA-Z]{1,4}).
  • The output data frame df_2 is a copy of the input data frame df with two columns added value and unit containing the values captured by the two capture groups in the given regular expression.

Extension: Arbitrarily Many Capture Groups

df_2 = df %>%
  mutate(
    str_match(col_1, '(\\d+)\\s*([a-zA-Z]{1,4})') %>% 
      as_tibble(.name_repair ="unique"))

Here is how this work:

  • This code works similarly to the primary solution above except that we do not. manually extract individual columns, and rather rely on mutate() to create one new column for each column of the matrix returned by str_match(). See Multi-Value Transformation.
  • This is a useful approach be when we do not know how many capture groups the regular expression may have; e.g. if the regex is passed dynamically at run time.

Alternative: Via str_extract()

df_2 = df %>%
  mutate(
    str_extract(col_1, '(\\d+)\\s*([a-zA-Z]{1,4})', group = c(1,2)) %>% 
      as_tibble(.name_repair = ~c('value', 'unit')))

Here is how this works:

  • We use str_extract() while passing the indices of the groups that we wish to have returned. See “First Match” under One Capture Group above for a description of how this works.
  • In this case, we wish to extract the first and second capture groups, therefore we pass to str_extract() the argument group = c(1,2).
  • The output of str_extract() in this case is a matrix where:
    • there is one row for each input value, which in this case are the values of the column col_1
    • Two columns corresponding to the two capture groups we selected when we specified group = c(1,2).
  • See the primary solution above for a description of the regular expression.
  • We use as_tibble() to convert the matrix returned by str_extract() to a data frame (a tibble). When a data frame is returns to mutate() without being assigned to any column name, the data frame is unpacked to individual columns which is what we are looking for. See Multi-Value Transformation.
  • Since the matrix returns by str_extract() has no column names, we use the .name_repair argument of as_tibble() to specify the names to assign to the columns of the data frame being created. In this case we pass what is called a look-up formula specifying the column name for each column (two columns in this case) as .name_repair = ~c('value', 'unit')). See Multi-Value Transformation for a description of this pattern.
  • The output data frame df_2 will be a copy of the input data frame df with two columns added called value and unit containing the values of the first match of the first and second capture groups (in the specified regular expression) respectively.

All Matches

We wish to extract from a given string all matches of multiple parts, specified as capture groups, of a given regular expression.

In this example, each element of the input column is a string that takes the form "a1 * b1 + a2 * b2 … + an * bn” where the a1, a2,.. and b1, b2,.. are each a sequence of digits. We wish to parse and evaluate each of these expressions and store the result in a new column col_2.

sum_prods <- function(p_mat) {
  sum(parse_number(p_mat[, 2]) * parse_number(p_mat[, 3]))
}

df_2 = df %>% 
  mutate(
    col_2 = col_1 %>% 
      str_match_all('(\\d+)\\s*\\*\\s*(\\d+)') %>% 
      map_dbl(sum_prods)
  )

Here is how this work:

  • First we use str_match_all() to extract all matches of all capture groups in the given regular expression for each element of the column col_1 as follows;
    • str_match_all() takes regex with at least one capture group and returns a list of character matrices.
    • Each matrix is similar to the output of str_match() which we describe in detail under “First Match” in Multiple Capture Group above i.e. the first column has the complete match followed by one column for the output of each capture group.
  • We then use map_dbl() to iterate over the matrices returned by str_match_all() and apply the custom function sum_prods() to each which returns a single numeric value for each matrix.
  • The custom function sum_prods() takes a matrix p_mat as an input. The function:
    • Parses the values of the second and third columns of p_mat
    • Takes their element wide product
    • and finally calculates the sum of the products
    • returns the resulting numeric value
  • The output data frame df_2 is a copy of the input data frame df with an added column col_2 holding the numeric value resulting from processing the corresponding string value of the column col_1.

Ignore Case

We wish to extract the first match of a regular expression pattern from a given string while ignoring case.

In this example, we wish to extract the first occurrence of a number (a sequence of digits) followed by the letters ‘kg’ regardless of case with zero or more empty space characters between.

df_2 = df %>% 
    mutate(
        col_2 = str_extract(col_1, regex('\\d+\\s*kg', ignore_case = TRUE)))

Here is how this works:

  • To extract the first occurrence of a given regular expression in a given string, we use str_extract() as described in Entire Pattern above.
  • To ignore case while matching, we wrap the regular expression in regex() and pass the parameter ignore_case=TRUE. See Ignore Case under Detecting for more details.
  • We can extend the solutions presented in One Capture Group and Multiple Capture Groups the same way.

Pattern Column

We wish to match a vector of strings against a vector of patterns of the same size. This is often needed when the regular expression we are matching needs to be different for each row based on the value of a particular column.

In this example, we have a data frame df with two column col_1 and col_2, we wish to extract from each value in col_1 a substring composed of a sequence of digits followed by a set of characters representing a unit specified by the corresponding value of the column col_2.

df_2 = df %>%
  mutate(
  col_3 = str_extract(col_1, str_c('(\\d+)\\s*', col_2)))

Here is how this works:

  • str_extract() is vectorized over both the string and the pattern and can operate in one of three modes:
    • Check for one pattern in each element in a vector of strings. This is the mode we used in all the above scenarios.
    • Check n patterns against n strings. The size of both vectors must be the same for this pattern (or a multiple). This is the mode we use in this solution since we have a vector of strings, the column col_1, and a vector of patterns of the same size.
    • Check multiple patterns against a single string.
  • We pass to str_extract():
    • the strings to look into which in this case is the column col_1 as the first argument.
    • and the patterns to look for which in this case is a transformation of the column col_2 and is naturally of the same size as col_1.
  • We construct the regular expression patterns by appending the value of the column col_2 to the regular expression '(\\d+)\\s*' for each row. For instance, if the value of the column col_2 is ‘kg’ then the value of the regular expression to extract will be '(\\d+)\\s*kg'.
  • The output data frame df_2 will be a copy of the input data frame df but with an additional column col_3 containing the substrings extracted.
  • See Entire Pattern above for a more detailed description of this code. We can extend the solutions presented in One Capture Group and Multiple Capture Groups the same way.
R
I/O