Locating

We wish to obtain the start and end locations (as integers) of a given pattern in a target string.

In this section we will cover the following four common string pattern location scenarios:

  • First Match where we cover how to obtain the start and end locations of the first occurrence of a given pattern in a given string.
  • All Matches where we cover how to obtain the start and end locations of all occurrences of a given pattern in a given string.
  • Last Match where we cover how to obtain the start and end locations of the last occurrence of a given pattern in a given string.
  • nth Match where we cover how to obtain the start and end locations of the nth occurrence of a given pattern in a given string.

For each of these four scenarios, we will cover two cases:

  • Substring where the pattern is a plain character sequence
  • Regular Expressions where the pattern is a regular expression

In addition, we cover the following scenarios which can be applied to extend any of the above:

  • Ignore Case where we cover how to ignore the case (of both pattern and string) while matching.
  • Complement where we cover how to obtain the locations of the non-matching segments of the string (the inverse of the matching locations).
  • Pattern Column where we cover how to match a vector of strings against a vector of patterns of the same size. This is often needed when we wish to locate the value of a column in another column for each row.
  • Multiple Patterns where we cover how to extend any of the above scenarios to check against multiple patterns at a time.

Locating a pattern in a string is often followed by additional steps to achieve a goal. For instance, a common next step after locating patterns in a string is to substitute them with other values. We cover that in Substituting.

First Match

Substring

We wish to obtain the location of the first match of a substring in a string.

In this example, we wish to create a new column col_2 where each row value holds the location of the first character of the first match of the string 'gm' in the corresponding string value of the column col_1.

df_2 = df.assign(
    col_2=df['col_1'].str.find('gm')
)

Here is how this works:

  • We use the str.find() method from the str accessor set of string manipulation methods of Pandas Series to identify the location of the first match of the substring 'gm' in each value of the column col_1.
  • The output of str.find() is the index of the first character of the first match of the provided substring.
  • If there is no match, str.find() will return -1.
  • The output is a new data frame df_2 that is a copy of the input data frame df with an additional column col_2 containing the starting position of the first occurrence of the pattern 'gm' in each element of col_1.
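To make the return values concrete, here is a minimal sketch on a made-up data frame (the sample values are assumptions for illustration, not data from this section):

```python
import pandas as pd

# Hypothetical sample data for illustration only.
df = pd.DataFrame({'col_1': ['12gm', 'gm and gm', 'no match']})

df_2 = df.assign(
    col_2=df['col_1'].str.find('gm')
)

# Positions are zero-based; -1 marks rows with no match.
first_positions = df_2['col_2'].tolist()  # [2, 0, -1]
```

Note that only the start position is returned. For a plain substring the end position follows from the length of the substring.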

Regular Expression

We wish to obtain the location (start and end) of the first match of a regular expression in a string.

In this example, we wish to create two new columns start and end where each row value holds the location of the first character and the last character, respectively, of the first match of a sequence of integers in the corresponding string value of the column col_1.

import re

def locate(string, pattern, flags=0):
    match = re.search(pattern, string, flags)
    if match:
        return match.start(), match.end() - 1
    else:
        return -1, -1

df_2 = df.copy()
df_2[['start', 'end']] = \
    df_2.apply(lambda x: locate(x['col_1'], r'\d+'), axis=1, result_type='expand')

Here is how this works:

  • We use the re.search() method from the re package to identify the start and end locations of the first match of the regular expression '\d+' in each value of the column col_1.
  • The function re.search() takes as input:
    • A string value to look into which in this case is col_1 values.
    • A regular expression to look for which in this case is '\d+' (one or more digits).
    • re package flags which can control the way the search will be performed such as re.IGNORECASE.
  • The output of re.search() is a re.Match object which represents the first match of the provided pattern in the string, or None if there is no match.
  • The re.Match object has the following attributes / methods:
    • .string returns the string that was searched (the matched text itself is returned by .group()).
    • .start() returns the position of the first character of the match.
    • .end() returns the position of the last character of the match plus 1.
    • .span() returns the start and end positions of the match as a tuple (start, end).
  • Usually, when matching regular expressions, we need to hold on to both the start and end locations because the number of characters matched may vary; in this case, the digit sequences captured may be of different lengths.
  • We wrap re.search() in a function locate(string, pattern, flags=0) that returns the locations of the first character and the last character of the first match of the pattern in the given string. We subtract 1 from match.end() to get the index of the last character in the match.
  • If there is no match, we return -1 for both start and end locations.
  • We need to use apply() with axis=1 to apply a lambda in a row wise manner to each row of the data frame df. See Non-Vectorized Transformation.
  • The lambda function passed to apply() takes a row of the data frame (represented by the variable x) and calls locate() to get the location of the first match of '\d+' in the value of col_1 for that row.
  • We assign the values returned by locate() to two new columns start and end. See Multi-Value Transformation.
  • The resulting data frame df_2 will have the same number of rows as the original data frame df, with two additional start and end columns containing the locations of the first and last characters, respectively, of the first sequence of digits in the corresponding value of the column col_1.
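The re.Match attributes listed above can be checked on a single string. This is a small sketch using only the standard library; the sample string is an assumption:

```python
import re

match = re.search(r'\d+', 'abc 123 gm')

matched_text = match.group()   # '123', the text that matched
searched = match.string        # 'abc 123 gm', the string that was searched
start = match.start()          # 4
end = match.end()              # 7, one past the last matched character
span = match.span()            # (4, 7)

# Index of the last matched character, as returned by locate().
last_char = match.end() - 1    # 6

# re.search() returns None when nothing matches.
no_match = re.search(r'\d+', 'no digits')
```

Because .end() points one past the match, the slice string[start:end] recovers the matched text directly.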

All Matches

Substring

We wish to obtain the locations of all matches of a given substring in a given string.

In this example, we wish to compute the average separation (in number of characters) between occurrences of the pattern ‘gm’ in each value of the column col_1.

import re

import pandas as pd

def find_all(string, pattern, flags=0):
    matches = [[m.start(), m.end() - 1] for m in re.finditer(re.escape(pattern), string, flags)]
    return pd.DataFrame(matches, columns=['start', 'end'])


def calculate_avg_sep(p_df):
    return (p_df.shift(-1)['start'] - p_df['end']).mean(skipna=True)


df_2 = df.assign(
    avg_sep=df.apply(lambda x: find_all(x['col_1'], 'gm'), axis=1) \
        .apply(calculate_avg_sep))

Here is how this works:

  • We use the re.finditer() method from the re package to identify the locations of all matches of the substring 'gm' in each value of the column col_1
  • The function re.finditer() takes as input:
    • A string value to look into which in this case is col_1 values.
    • A regular expression to look for. In this case we want to match the plain substring 'gm', so we wrap it with re.escape() to escape any special characters. This ensures the pattern is matched literally rather than interpreted as a regular expression.
    • re package flags which can control the way the search will be performed such as re.IGNORECASE.
  • re.finditer() returns an "iterator" through which we can access re.Match objects for all matches of the pattern in the target string.
  • The re.Match object has the following attributes / methods:
    • .string returns the string that was searched (the matched text itself is returned by .group()).
    • .start() returns the position of the first character of the match.
    • .end() returns the position of the last character of the match plus 1.
    • .span() returns the start and end positions of the match as a tuple (start, end).
  • We hold on to both the start and end locations so that the same code also works for regular expressions, where the number of characters matched may vary; for a plain substring such as 'gm' the match length is fixed.
  • Using a list comprehension, we iterate through all the matches and collect the start and end locations of each match as a list. We subtract 1 from m.end() to get the index of the last character in the match. We convert the list of lists to a data frame, so we can easily perform data manipulation operations.
  • We wrap this logic in a function find_all(string, pattern, flags=0) and we return a data frame of locations of the first character and the last character of all matches in the given string.
  • We need to use apply() with axis=1 to apply a lambda in a row wise manner to each row of the data frame df. See Non-Vectorized Transformation.
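  • The lambda function passed to apply() takes a row of the data frame (represented by the variable x) and calls find_all() to get the locations of all matches of 'gm' in the value of col_1 for that row.
  • We use the function calculate_avg_sep() to calculate the average separation (in number of characters) between occurrences. We do this by taking the difference between the next start (p_df.shift(-1)['start']) and the current end for each row and then taking the average using mean(). Measured this way, two directly adjacent matches have a separation of 1. The last match has no next start, so its difference is missing and is skipped by mean(); with a single match there is nothing to average and the result is NaN.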
  • The output data frame df_2 will be a copy of the input data frame df with an added column avg_sep holding the average number of characters between occurrences of the substring ‘gm’ for the corresponding value of the column col_1.
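The arithmetic behind calculate_avg_sep() can be traced on a single string. The following is a sketch with plain lists instead of a data frame; the sample string and the variable names steps and gaps are assumptions for illustration:

```python
import re

string = 'gm12gm345gm'

# Inclusive (start, end) positions, computed the same way as in find_all().
spans = [(m.start(), m.end() - 1) for m in re.finditer(re.escape('gm'), string)]
# spans == [(0, 1), (4, 5), (9, 10)]

# Next start minus current end, the quantity averaged by calculate_avg_sep().
steps = [nxt[0] - cur[1] for cur, nxt in zip(spans, spans[1:])]
# steps == [3, 4]

# Number of characters strictly between matches ('12' and '345').
gaps = [step - 1 for step in steps]
# gaps == [2, 3]

avg_step = sum(steps) / len(steps)  # 3.5
avg_gap = sum(gaps) / len(gaps)     # 2.5
```

Which of the two averages is wanted depends on how "separation" is defined. If the count of characters strictly between matches is needed, subtracting 1 from each difference is one way to get it.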

Regular Expression

We wish to obtain the locations of all matches of a regular expression in a string.

In this example, we wish to compute the average separation (in number of characters) between occurrences of a sequence of integers in each value of the column col_1.

import re

import pandas as pd

def find_all(string, pattern, flags=0):
    matches = [[m.start(), m.end() - 1] for m in re.finditer(pattern, string, flags)]
    return pd.DataFrame(matches, columns=['start', 'end'])


def calculate_avg_sep(p_df):
    return (p_df.shift(-1)['start'] - p_df['end']).mean(skipna=True)


df_2 = df.assign(
    avg_sep=df.apply(lambda x: find_all(x['col_1'], r'\d+'), axis=1) \
        .apply(calculate_avg_sep))

Here is how this works:

  • This works just like the Substring case above except that we don't wrap the pattern with re.escape() to escape any special characters. Note that re.finditer() expects a regular expression by default.
  • The output data frame df_2 will be a copy of the input data frame df with an added column avg_sep holding the average number of characters between numbers (digit sequences) for the corresponding value of the column col_1.
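The practical difference between the Substring and Regular Expression cases comes down to re.escape(). A minimal sketch, with a sample string chosen (as an assumption) to contain a regular expression special character:

```python
import re

string = 'a.b axb a+b'

# Unescaped: '.' matches any character, so 'axb' and 'a+b' also match.
as_regex = [m.group() for m in re.finditer('a.b', string)]
# as_regex == ['a.b', 'axb', 'a+b']

# Escaped: only the literal text 'a.b' matches.
as_literal = [m.group() for m in re.finditer(re.escape('a.b'), string)]
# as_literal == ['a.b']
```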

Last Match

Substring

We wish to obtain the location of the last match of a substring in a string.

In this example, we wish to create a new column col_2 where each row value holds the location of the first character of the last match of the string 'gm' in the corresponding string value of the column col_1.

df_2 = df.assign(col_2=df['col_1'].str.rfind('gm'))

Here is how this works:

  • We use the str.rfind() method from the str accessor set of string manipulation methods of Pandas Series to identify the location of the last match of the substring 'gm' in each value of the column col_1.
  • The output of str.rfind() is the index of the first character of the last match of the provided substring.
  • If there is no match, str.rfind() will return -1.
  • The output is a new data frame df_2 that is a copy of df with an additional column col_2 containing the starting position of the last occurrence of the pattern 'gm' in each element of col_1.
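A short side-by-side of str.find() and str.rfind() on made-up values (the sample Series is an assumption) shows where the two differ and where they agree:

```python
import pandas as pd

# Hypothetical sample data for illustration only.
s = pd.Series(['gm1gm', 'xgm', 'none'])

first = s.str.find('gm').tolist()   # [0, 1, -1]
last = s.str.rfind('gm').tolist()   # [3, 1, -1]
```

When the substring occurs exactly once, both methods return the same position; when it does not occur, both return -1.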

Regular Expression

We wish to obtain the location of the last match of a regular expression in a string.

In this example, we wish to create two new columns start and end where each row value holds the location of the first character and the last character, respectively, of the last match of a sequence of integers in the corresponding string value of the column col_1.

import re

def find_last_match(string, pattern):
    matches = [[m.start(), m.end() - 1] for m in re.finditer(pattern, string)]
    if matches:
        return matches[-1][0], matches[-1][1]
    return -1, -1

df_2 = df.copy()
df_2[['start', 'end']] = \
    df_2.apply(lambda x: find_last_match(x['col_1'], r'\d+'), axis=1, result_type='expand')

Here is how this works:

  • This works just like the All Matches case above except that we return only the last match found by re.finditer(). See Working with Lists.
  • The resulting data frame df_2 will have the same number of rows as the original data frame df, with two additional start and end columns containing the locations of the first and last characters, respectively, of the last sequence of digits in the corresponding value of the column col_1.
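As a possible variation, the last match can be found without building the full list of locations, by iterating over re.finditer() and keeping only the most recent match. This is a sketch of one alternative, not the approach used above; the flags parameter is an added assumption:

```python
import re

def find_last_match(string, pattern, flags=0):
    last = None
    for last in re.finditer(pattern, string, flags):
        pass  # keep overwriting; after the loop, last is the final match
    if last is None:
        return -1, -1
    # Inclusive end, consistent with the other examples in this section.
    return last.start(), last.end() - 1

found = find_last_match('a1 b22 c333', r'\d+')   # (8, 10)
missing = find_last_match('no digits', r'\d+')   # (-1, -1)
```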

nth Match

We wish to obtain the location of the nth match of a pattern in a string.

In this example, we wish to create two new columns start and end where each row value holds the location of the first character and the last character, respectively, of the nth match of a sequence of integers in the corresponding string value of the column col_1.

import re

def find_nth_match(string, pattern, n):
    matches = [[m.start(), m.end() - 1] for m in re.finditer(pattern, string)]
    if len(matches) >= n:
        return matches[n - 1][0], matches[n - 1][1]
    return -1, -1


df_2 = df.copy()
df_2[['start', 'end']] = \
    df_2.apply(lambda x: find_nth_match(x['col_1'], r'\d+', 2), axis=1, result_type='expand')

Here is how this works:

  • At a high level, to locate the nth occurrence of a pattern in a string, we use re.finditer() to locate all occurrences and then extract the location of the nth occurrence.
  • This works just like the All Matches case above except that we keep the matches as a list rather than a data frame and return the nth one (at index n - 1). If there are fewer than n matches, we return -1 for both locations. See Working with Lists.
  • The resulting data frame df_2 will have the same number of rows as the original data frame df, with two additional start and end columns containing the locations of the first and last characters, respectively, of the 2nd sequence of digits in the corresponding value of the column col_1.
  • Note: To match a plain substring, we can wrap it with re.escape() to escape any special characters. This ensures the pattern is matched literally rather than interpreted as a regular expression.
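If the strings are long and only an early match is needed, one option is to stop scanning once the nth match is reached. The sketch below uses itertools.islice for that; it is an alternative under the assumption that n is a positive integer (islice raises ValueError for a negative start), and the flags parameter is an addition:

```python
import re
from itertools import islice

def find_nth_match(string, pattern, n, flags=0):
    # n is 1-based: skip the first n - 1 matches, then take the next one.
    matches = re.finditer(pattern, string, flags)
    match = next(islice(matches, n - 1, None), None)
    if match is None:
        return -1, -1
    return match.start(), match.end() - 1

second = find_nth_match('a1 b22 c333', r'\d+', 2)         # (4, 5)
too_far = find_nth_match('a1 b22 c333', r'\d+', 4)        # (-1, -1)
literal_plus = find_nth_match('1+1+1', re.escape('+'), 2)  # (3, 3)
```

The last call shows the re.escape() usage for a plain substring that contains a special character.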

Ignore Case

Substring

We wish to obtain the location of the first match of a substring in a string irrespective of whether the letters are in upper or lower case; i.e. while ignoring case.

In this example, we wish to create a new column col_2 where each row value holds the location of the first character of the first match regardless of case, of the string 'gm' in the corresponding string value of the column col_1.

df_2 = df.assign(
    col_2=df['col_1'].str.lower().str.find('gm')
)

Here is how this works:

  • To locate the first occurrence of a given substring in a given string, we use str.find() as described in First Match above.
  • To ignore the case while matching, we use str.lower() to convert values in col_1 to lower case, and we pass a lower case expression. See Ignore Case under Detecting for more details.

Regular Expression

We wish to obtain the location of the first match of a regular expression in a string irrespective of the case.

In this example, we wish to create two new columns start and end where each row value holds the location of the first character and the last character, respectively, of the first match of a sequence of integers followed by the characters 'gm', regardless of case, in the corresponding string value of the column col_1.

import re

def locate(string, pattern, flags=0):
    match = re.search(pattern, string, flags)
    if match:
        return match.start(), match.end() - 1
    else:
        return -1, -1


df_2 = df.copy()
df_2[['start', 'end']] = \
    df_2.apply(lambda x: locate(x['col_1'].lower(), r'\d+gm'), axis=1, result_type='expand')

Here is how this works:

  • To locate the first occurrence of a given regular expression in a given string, we use re.search() as described in First Match above.
  • To ignore the case while matching, we use lower() to convert values in col_1 to lower case, and we pass a lower case expression. See Ignore Case under Detecting for more details.

Alternative: Via re.IGNORECASE

import re


def locate(string, pattern, flags=0):
    match = re.search(pattern, string, flags)
    if match:
        return match.start(), match.end() - 1
    else:
        return -1, -1


df_2 = df.copy()
df_2[['start', 'end']] = \
    df_2.apply(lambda x: locate(x['col_1'], r'\d+gm', flags=re.IGNORECASE), axis=1, result_type='expand')

Here is how this works:

  • This code is similar to the code under Regular Expression above except that we pass flags=re.IGNORECASE to locate() to perform case-insensitive matching.
  • We can extend the solutions presented in Substring in the same way.
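The two approaches can be compared on a single string. This sketch uses only the standard library; the sample string is an assumption:

```python
import re

string = 'Weight: 250GM'

# Lower-casing the string works as long as the pattern is lower case too.
via_lower = re.search(r'\d+gm', string.lower()).span()              # (8, 13)

# An upper-case pattern can never match a lower-cased string.
mismatch = re.search(r'\d+GM', string.lower())                      # None

# re.IGNORECASE leaves the string untouched and accepts either case.
via_flag = re.search(r'\d+gm', string, flags=re.IGNORECASE).span()  # (8, 13)
```

Both approaches give the same locations here. The flag has the advantage that the case of the pattern does not need to be normalized by hand.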

Pattern Column

We wish to match a vector of strings against a vector of patterns of the same size. This is often needed when we wish to locate the value of a column in the value of another column for each row.

Substring

In this example, we have a data frame df with two columns col_1 and col_2, and we wish to locate the start of the first occurrence of the value of col_2 in col_1.

df_2 = df.assign(
    col_3=df.apply(lambda x: x['col_1'].find(x['col_2']), axis=1))

Here is how this works:

  • Pandas' str.find() is not vectorized over the pattern, so we use Python's built-in str.find() method, which works on a single string, and call it once per row.
  • We need to use apply() with axis=1 to apply a lambda in a row wise manner to each row of the data frame df. See Non-Vectorized Transformation.
  • We can extend the solutions presented in All Matches, Last Match, and nth Match in the same way.
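As an alternative to apply() with axis=1, the two columns can be paired with zip() in a list comprehension. This is a sketch on made-up data that assumes neither column contains missing values:

```python
import pandas as pd

# Hypothetical sample data for illustration only.
df = pd.DataFrame({
    'col_1': ['abcgm', 'xyz', '10kg'],
    'col_2': ['gm', 'q', 'kg'],
})

df_2 = df.assign(
    # Pair each string with its own pattern and call Python's str.find().
    col_3=[s.find(p) for s, p in zip(df['col_1'], df['col_2'])]
)

positions = df_2['col_3'].tolist()  # [3, -1, 2]
```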

Regular Expression

In this example, we have a data frame df with two columns col_1 and col_2, and we wish to locate the start and end of the first occurrence of a sequence of one or more repetitions of the value of col_2 in col_1.

import re

def locate(string, pattern, flags=0):
    match = re.search(pattern, string, flags)
    if match:
        return match.start(), match.end() - 1
    else:
        return -1, -1


df_2 = df.copy()
df_2[['start', 'end']] = df_2.apply(
    lambda x: locate(x['col_1'], '(?:' + re.escape(x['col_2']) + ')+'),
    axis=1, result_type='expand')

Here is how this works:

  • This code works similarly to the First Match scenario above except that in this case the pattern is built from a column in the data frame.
  • We construct the regular expression for each row via '(?:' + re.escape(x['col_2']) + ')+', which matches one or more consecutive occurrences of the value of col_2. The non-capturing group (?:...) makes '+' apply to the whole value rather than only to its last character, and re.escape() makes any special characters in the value match literally.
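How the '+' quantifier binds matters as soon as the column value has more than one character. A small standard-library check, where the sample string and the variable name value (standing in for one col_2 value) are assumptions:

```python
import re

string = 'xababab'
value = 'ab'  # stands in for a single col_2 value

# 'ab+' means 'a' followed by one or more 'b', so only 'ab' is matched.
plain = re.search(value + '+', string).span()                        # (1, 3)

# Grouping the value makes '+' repeat the whole value.
grouped = re.search('(?:' + re.escape(value) + ')+', string).span()  # (1, 7)
```

For single-character values the two patterns behave the same, unless the character is a special character such as '.', in which case re.escape() is still needed to match it literally.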

Multiple Patterns

We will cover four scenarios of locating multiple patterns:

  • First Match of Any Pattern where we cover how to obtain the start and end locations of the first occurrence of the first occurring pattern of n patterns.
  • First Match of Each Pattern where we cover how to obtain the start and end locations of the first occurrence of each pattern of n patterns.
  • All Matches in One List where we cover how to obtain the start and end locations of all occurrences of all patterns in the order of their occurrence as a single list.
  • All Matches in Separate Lists where we cover how to obtain the start and end locations of all occurrences of all patterns keeping the occurrences of each pattern in a separate list.

Note: In the following examples, the output of string location will be left as a column of nested lists. Typically, we proceed to do something with those nested lists e.g. substitute with a character value which we cover in Substituting. Or more generally, operate on them via List Operations.

First Match of Any Pattern

We wish to return the start and end locations of the first occurrence of the first occurring pattern of n patterns. In other words, only one location is returned and that is of whichever pattern occurs first.

import re

def locate(string, pattern, flags=0):
    match = re.search(pattern, string, flags)
    if match:
        return match.start(), match.end() - 1
    else:
        return -1, -1

df_2 = df.copy()
df_2[['start', 'end']] = \
    df_2.apply(lambda x: locate(x['col_1'], r'\d+|gm'), axis=1, result_type='expand')

Here is how this works:

  • We use the or operator | of regular expressions to build a regular expression that captures all the patterns we wish to look for or’ed together. In this case that regular expression is '\d+|gm'.
  • For each value of col_1, locate() will return the start and end locations of whichever pattern occurs first (either '\d+' or 'gm').
  • The output data frame df_2 will be a copy of the input data frame df with two added columns start and end.
  • See Regular Expression under First Match above for a more detailed description.
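When the patterns are held in a list, the combined expression can be built with '|'.join(). The sketch below wraps each pattern in a non-capturing group so that the alternation cannot interact with the patterns' own operators; the names patterns and combined are assumptions:

```python
import re

patterns = [r'\d+', 'gm']

# Builds '(?:\d+)|(?:gm)'.
combined = '|'.join(f'(?:{p})' for p in patterns)

match = re.search(combined, 'approx gm 250')
location = (match.start(), match.end() - 1)  # (7, 8)
which = match.group()                        # 'gm'

# If two alternatives can match at the same position, the leftmost one wins.
tie = re.search('a|ab', 'ab').group()        # 'a'
```

The last line is worth keeping in mind when one pattern is a prefix of another: listing the longer pattern first avoids the shorter one taking precedence.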

First Match of Each Pattern

We wish to return the start and end locations of the first occurrence of each pattern of n patterns. In other words, n locations are returned one for each pattern.

import re

def locate_patterns(string, patterns, flags=0):
    matches = []
    for pattern in patterns:
        match = re.search(pattern, string, flags)
        if match:
            matches.append([match.start(), match.end() - 1])
        else:
            matches.append([-1, -1])
    return matches

df_2 = df.assign(result=
                 df.apply(lambda x: locate_patterns(x['col_1'], [r'\d+', 'gm']), axis=1))

Here is how this works:

  • We pass to locate_patterns() a list of pattern strings ['\d+', 'gm'].
  • We use re.search() to find the first occurrence of each pattern and append its start and end locations to the matches list. If a pattern has no match, we append [-1, -1].
  • For each value of col_1, locate_patterns() will return a list of lists with one inner list per pattern, in the order the patterns were given. Each inner list has two items holding the start and end locations of the first occurrence of that pattern; in this case there are two inner lists, for the patterns '\d+' and 'gm'.
  • In this solution, we use apply() to iterate over each value of the column col_1 and pass that to locate_patterns() along with the list of multiple patterns. See Non-Vectorized Transformation.
  • The output data frame df_2 will be a copy of the input data frame df with an added column result where each cell holds the list returned by locate_patterns() for the corresponding value of col_1.
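The shape of the returned value is easiest to see on a single string. This sketch repeats locate_patterns() in a condensed form so that it runs on its own; the sample string and the third pattern 'zz' are assumptions added to show the no-match case:

```python
import re

def locate_patterns(string, patterns, flags=0):
    matches = []
    for pattern in patterns:
        match = re.search(pattern, string, flags)
        # One [start, end] pair per pattern; [-1, -1] when there is no match.
        matches.append([match.start(), match.end() - 1] if match else [-1, -1])
    return matches

result = locate_patterns('ab12gm', [r'\d+', 'gm', 'zz'])
# result == [[2, 3], [4, 5], [-1, -1]]
```

The position of each inner list in the result corresponds to the position of its pattern in the input list.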

All Matches in Separate Lists

We wish to return the start and end locations of all occurrences of all patterns keeping the occurrences of each pattern in a separate list.

import re

def find_all(string, patterns):
    matches = []
    for pattern in patterns:
        matches.append([[m.start(), m.end() - 1] for m in re.finditer(pattern, string)])
    return matches

df_2 = df.assign(result=df.apply(lambda x: find_all(x['col_1'], [r'\d+', 'gm']), axis=1))

Here is how this works:

  • We pass to find_all() a list of patterns ['\d+', 'gm'].
  • For each value of col_1, find_all() will return a list of lists:
    • As many lists as there are patterns which in this case is 2.
    • Each list holds zero or more nested lists, where each nested list holds the start and end locations of one occurrence of the pattern in the col_1 value.
  • In this solution, we use apply() to iterate over each value of the column col_1 and pass that to find_all() along with the list of multiple patterns. See Non-Vectorized Transformation.
  • The output data frame df_2 will be a copy of the input data frame df with an added column result where each cell holds the list of lists returned by find_all() for the corresponding value of col_1.
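The All Matches in One List scenario named at the start of Multiple Patterns is not worked through above. One possible sketch, assuming it is acceptable to combine the patterns with the or operator | and let re.finditer() report matches in order of occurrence (the function name find_all_in_order is hypothetical):

```python
import re

def find_all_in_order(string, patterns, flags=0):
    # OR the patterns together; each one is wrapped in a non-capturing group.
    combined = '|'.join(f'(?:{p})' for p in patterns)
    return [[m.start(), m.end() - 1] for m in re.finditer(combined, string, flags)]

result = find_all_in_order('gm 12 gm 345', [r'\d+', 'gm'])
# result == [[0, 1], [3, 4], [6, 7], [9, 11]]
```

Because re.finditer() returns non-overlapping matches, occurrences of different patterns that overlap each other would not all be reported by this approach.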