We wish to obtain the start and end locations (as integers) of a given pattern in a target string.
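In this section we will cover the following four common string pattern location scenarios:
- First Match: the location of the first match of a pattern in a string.
- All Matches: the locations of all matches of a pattern in a string.
- Last Match: the location of the last match of a pattern in a string.
- nth Match: the location of the nth match of a pattern in a string.
For each of these four scenarios, we will cover two cases:
- Substring: the pattern is a plain sequence of characters.
- Regular Expression: the pattern is a regular expression.
In addition, we cover the following scenarios which can be applied to extend any of the above:
- Ignore Case: locating a pattern regardless of letter case.
- Pattern Column: matching a vector of strings against a vector of patterns of the same size.
- Multiple Patterns: locating more than one pattern at a time.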
Locating a pattern in a string is often followed by additional steps to achieve a goal. For instance, a common next step after locating patterns in a string is to substitute them with other values. We cover that in Substituting.
First Match
Substring
We wish to obtain the location of the first match of a substring in a string.
In this example, we wish to create a new column col_2 where each row value holds the location of the first character of the first match of the string 'gm' in the corresponding string value of the column col_1.
df_2 = df.assign(
    col_2=df['col_1'].str.find('gm')
)
Here is how this works:
- We use the str.find() method from the str accessor set of string manipulation methods of Pandas Series to identify the location of the first match of the substring 'gm' in each value of the column col_1.
- The output of str.find() is the (zero-based) index of the first character of the first match of the provided substring.
- If no match is found, str.find() will return -1.
- The output is a new data frame df_2 that is a copy of the input data frame df with an additional column col_2 containing the starting position of the first occurrence of the pattern 'gm' in each element of col_1.
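To make these return values concrete, here is a minimal sketch that applies Python's built-in find(), which follows the same conventions, to a few strings. The sample values are hypothetical and only stand in for col_1.

```python
# Hypothetical sample values standing in for the col_1 column.
values = ['10gm', 'gm and gm', 'none']

# Element-wise equivalent of df['col_1'].str.find('gm').
starts = [value.find('gm') for value in values]

print(starts)  # [2, 0, -1]
```

The second value contains 'gm' twice, yet only the position of the first occurrence is reported; the third value has no match and yields -1.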
Regular Expression
We wish to obtain the location (start and end) of the first match of a regular expression in a string.
In this example, we wish to create two new columns start and end where each row value holds the location of the first character and the last character, respectively, of the first match of a sequence of digits in the corresponding string value of the column col_1.
import re

def locate(string, pattern, flags=0):
    # Return the start and inclusive end index of the first match, or (-1, -1).
    match = re.search(pattern, string, flags)
    if match:
        return match.start(), match.end() - 1
    else:
        return -1, -1

df_2 = df.copy()
df_2[['start', 'end']] = \
    df_2.apply(lambda x: locate(x['col_1'], r'\d+'), axis=1, result_type='expand')
Here is how this works:
- We use the re.search() method from the re package to identify the start and end locations of the first match of the regular expression '\d+' in each value of the column col_1.
- re.search() takes as input:
  - The string to search in, which here is each of the col_1 values.
  - The regular expression to look for, which here is '\d+'.
  - Optionally, re package flags which can control the way the search will be performed such as re.IGNORECASE.
- The output of re.search() is a re.Match object which represents the first match of the provided pattern in the string, or None if there is no match.
- A re.Match object has the following attributes / methods:
  - .string returns the string that was searched (.group() returns the matched text).
  - .start() returns the position of the first character of the match.
  - .end() returns the position of the last character of the match plus 1.
  - .span() returns the start and end positions of the match as a tuple (start, end), where end is the same value as .end().
- We return both the start and end locations because the number of characters matched may vary; in this case, the digit sequences captured may be of different lengths.
- We wrap re.search() in a function locate(string, pattern, flags=0) and we return the locations of the first character and the last character of the first match in the given string. We subtract 1 from match.end() to get the index of the last character in the match. If there is no match, we return -1 for both.
- We use apply() with axis=1 to apply a lambda in a row wise manner to each row of the data frame df. See Non-Vectorized Transformation.
- The lambda function passed to apply() takes a row of the data frame (represented by the variable x) and calls locate() to get the location of the first match of '\d+' in the respective row.
- We use result_type='expand' to expand the tuple returned by locate() to two new columns start and end. See Multi-Value Transformation.
- The output data frame df_2 will have the same number of rows as the original data frame df, with two additional start and end columns containing the locations of the first and last characters, respectively, of the first sequence of digits in the corresponding value of the column col_1.
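The following small sketch inspects a re.Match object directly, to show how .start(), .end(), and .span() relate to the inclusive end index used in this section. The sample string is a hypothetical value, not data from the examples above.

```python
import re

text = 'abc 123 gm'  # hypothetical sample value
match = re.search(r'\d+', text)

start = match.start()            # index of the first matched character
end_exclusive = match.end()      # one past the last matched character
end_inclusive = match.end() - 1  # index of the last matched character

print(match.span())                     # (4, 7)
print(text[start:end_exclusive])        # 123
print(start, end_inclusive)             # 4 6
print(re.search(r'\d+', 'no digits'))   # None
```

Because .end() is exclusive, it can be used as is for slicing, while subtracting 1 gives the position of the last matched character.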
All Matches
Substring
We wish to obtain the locations of all matches of a given substring in a given string.
In this example, we wish to compute the average separation (in number of characters) between occurrences of the pattern 'gm' in each value of the column col_1.
import re
import pandas as pd

def find_all(string, pattern, flags=0):
    # Collect [start, inclusive end] for every match; re.escape() makes the pattern literal.
    matches = [[m.start(), m.end() - 1] for m in re.finditer(re.escape(pattern), string, flags)]
    return pd.DataFrame(matches, columns=['start', 'end'])

def calculate_avg_sep(p_df):
    # Next start minus current end for each pair of consecutive matches, then averaged.
    return (p_df.shift(-1)['start'] - p_df['end']).mean(skipna=True)

df_2 = df.assign(
    avg_sep=df.apply(lambda x: find_all(x['col_1'], 'gm'), axis=1) \
        .apply(calculate_avg_sep))
Here is how this works:
- We use the re.finditer() method from the re package to identify the locations of all matches of the substring 'gm' in each value of the column col_1.
- re.finditer() takes as input:
  - The string to search in, which here is each of the col_1 values.
  - The pattern to look for, which here is 'gm'. We can wrap a substring with re.escape() to escape any special characters. This will enforce passing the pattern as a string and not a regular expression.
  - Optionally, re package flags which can control the way the search will be performed such as re.IGNORECASE.
- re.finditer() returns an "iterator" through which we can access re.Match objects for all matches of the pattern in the target string.
- A re.Match object has the following attributes / methods:
  - .string returns the string that was searched (.group() returns the matched text).
  - .start() returns the position of the first character of the match.
  - .end() returns the position of the last character of the match plus 1.
  - .span() returns the start and end positions of the match as a tuple (start, end), where end is the same value as .end().
- We return both the start and end locations because, in general, the number of characters matched may vary. For a fixed substring such as 'gm', every match has the same length.
- We subtract 1 from match.end() to get the index of the last character in the match. We convert the list of lists to a data frame, so we can easily perform data manipulation operations.
- We wrap this in a function find_all(string, pattern, flags=0) and we return a data frame of locations of the first character and the last character of all matches in the given string.
- We use apply() with axis=1 to apply a lambda in a row wise manner to each row of the data frame df. See Non-Vectorized Transformation.
- The lambda function passed to apply() takes a row of the data frame (represented by the variable x) and calls find_all() to get all matches of 'gm' in the respective row.
- We then apply calculate_avg_sep() to calculate the average separation (in number of characters) between occurrences. We do this by taking the difference between the next start (p_df.shift(-1)['start']) and the current end for each row and then taking the average using mean(). Since end is the index of the last character of a match, this difference is 1 when two matches are adjacent.
- The output data frame df_2 will be a copy of the input data frame df with an added column avg_sep holding the average separation between occurrences of the substring 'gm' for the corresponding value of the column col_1.
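As a worked example of the separation calculation, here is a minimal sketch that uses plain Python lists instead of a data frame. The function name avg_separation and the sample strings are hypothetical and chosen for illustration only.

```python
import re

def avg_separation(string, pattern):
    # (start, inclusive end) of every literal occurrence of pattern.
    spans = [(m.start(), m.end() - 1) for m in re.finditer(re.escape(pattern), string)]
    # Pair each match with the next one: next start minus current end.
    gaps = [nxt[0] - cur[1] for cur, nxt in zip(spans, spans[1:])]
    # With fewer than two matches there is no gap to average.
    return sum(gaps) / len(gaps) if gaps else float('nan')

print(avg_separation('gm12gm345gm', 'gm'))  # 3.5
print(avg_separation('gmgm', 'gm'))         # 1.0
```

In the first string the matches span (0, 1), (4, 5), and (9, 10), so the differences are 3 and 4. The second string shows that adjacent matches give a difference of 1 under this definition.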
Regular Expression
We wish to obtain the locations of all matches of a regular expression in a string.
In this example, we wish to compute the average separation (in number of characters) between occurrences of a sequence of digits in each value of the column col_1.
import re
import pandas as pd

def find_all(string, pattern, flags=0):
    # Collect [start, inclusive end] for every match of the regular expression.
    matches = [[m.start(), m.end() - 1] for m in re.finditer(pattern, string, flags)]
    return pd.DataFrame(matches, columns=['start', 'end'])

def calculate_avg_sep(p_df):
    # Next start minus current end for each pair of consecutive matches, then averaged.
    return (p_df.shift(-1)['start'] - p_df['end']).mean(skipna=True)

df_2 = df.assign(
    avg_sep=df.apply(lambda x: find_all(x['col_1'], r'\d+'), axis=1) \
        .apply(calculate_avg_sep))
Here is how this works:
- This works similarly to the Substring case above, except that we do not use re.escape() to escape any special characters. Note that re.finditer() expects a regular expression by default.
- The output data frame df_2 will be a copy of the input data frame df with an added column avg_sep holding the average separation between numbers (digit sequences) for the corresponding value of the column col_1.
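The difference that re.escape() makes can be seen in a short sketch. The sample string and the pattern 'a.c' are hypothetical; they were picked because '.' is a special character in regular expressions.

```python
import re

text = 'abc a.c'  # hypothetical sample value

# As a regular expression, '.' matches any character.
regex_starts = [m.start() for m in re.finditer('a.c', text)]

# After re.escape(), '.' only matches a literal dot.
literal_starts = [m.start() for m in re.finditer(re.escape('a.c'), text)]

print(regex_starts)    # [0, 4]
print(literal_starts)  # [4]
```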
Last Match
Substring
We wish to obtain the location of the last match of a substring in a string.
In this example, we wish to create a new column col_2 where each row value holds the location of the first character of the last match of the string 'gm' in the corresponding string value of the column col_1.
df_2 = df.assign(col_2=df['col_1'].str.rfind('gm'))
Here is how this works:
- We use the str.rfind() method from the str accessor set of string manipulation methods of Pandas Series to identify the location of the last match of the substring 'gm' in each value of the column col_1.
- The output of str.rfind() is the index of the first character of the last match of the provided substring.
- If no match is found, str.rfind() will return -1.
- The output is a new data frame df_2 that is a copy of df with an additional column col_2 containing the starting position of the last occurrence of the pattern 'gm' in each element of col_1.
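For comparison, this minimal sketch runs Python's built-in find() and rfind() side by side on a few hypothetical strings; the built-in methods follow the same conventions as their str accessor counterparts.

```python
# Hypothetical sample values standing in for the col_1 column.
values = ['gm1gm', '10gm', 'none']

first_starts = [value.find('gm') for value in values]
last_starts = [value.rfind('gm') for value in values]

print(first_starts)  # [0, 2, -1]
print(last_starts)   # [3, 2, -1]
```

The two results differ only where the substring occurs more than once.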
Regular Expression
We wish to obtain the location of the last match of a regular expression in a string.
In this example, we wish to create two new columns start and end where each row value holds the location of the first character and the last character, respectively, of the last match of a sequence of digits in the corresponding string value of the column col_1.
import re

def find_last_match(string, pattern):
    # Collect [start, inclusive end] for every match and keep the last one.
    matches = [[m.start(), m.end() - 1] for m in re.finditer(pattern, string)]
    if matches:
        return matches[-1][0], matches[-1][1]
    return -1, -1

df_2 = df.copy()
df_2[['start', 'end']] = \
    df_2.apply(lambda x: find_last_match(x['col_1'], r'\d+'), axis=1, result_type='expand')
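Here is how this works:
- We use re.finditer() to locate all occurrences, keep their locations in a list, and then extract the location of the last occurrence by selecting the last element of that list. If there is no match, we return -1 for both. See Working with Lists.
- The output data frame df_2 will have the same number of rows as the original data frame df, with two additional start and end columns containing the locations of the first and last characters, respectively, of the last sequence of digits in the corresponding value of the column col_1.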
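If only the last match is needed, the intermediate list is not strictly required. The sketch below is one possible variant, not the approach used above: it walks the iterator and keeps only the most recent match. The name last_match_span and the sample strings are hypothetical.

```python
import re

def last_match_span(string, pattern):
    last = None
    for last in re.finditer(pattern, string):
        pass  # keep overwriting until the iterator is exhausted
    if last is None:
        return -1, -1
    # Inclusive end, consistent with the other examples in this section.
    return last.start(), last.end() - 1

print(last_match_span('a1b22c333', r'\d+'))  # (6, 8)
print(last_match_span('abc', r'\d+'))        # (-1, -1)
```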
nth Match
We wish to obtain the location of the nth match of a pattern in a string.
In this example, we wish to create two new columns start and end where each row value holds the location of the first character and the last character, respectively, of the nth match of a sequence of digits in the corresponding string value of the column col_1.
import re

def find_nth_match(string, pattern, n):
    # Collect [start, inclusive end] for every match and pick the nth (1-based).
    matches = [[m.start(), m.end() - 1] for m in re.finditer(pattern, string)]
    if len(matches) >= n:
        return matches[n - 1][0], matches[n - 1][1]
    return -1, -1

df_2 = df.copy()
df_2[['start', 'end']] = \
    df_2.apply(lambda x: find_nth_match(x['col_1'], r'\d+', 2), axis=1, result_type='expand')
Here is how this works:
- We use re.finditer() to locate all occurrences and then extract the location of the nth occurrence.
- We keep the output of re.finditer() as a list and not a data frame. See Working with Lists.
- The output data frame df_2 will have the same number of rows as the original data frame df, with two additional start and end columns containing the locations of the first and last characters, respectively, of the 2nd sequence of digits in the corresponding value of the column col_1.
- To locate a substring instead of a regular expression, we can wrap the pattern with re.escape() to escape any special characters. This will enforce passing the pattern as a string and not a regular expression.
Ignore Case
Substring
We wish to obtain the location of the first match of a substring in a string irrespective of whether the letters are in upper or lower case; i.e. while ignoring case.
In this example, we wish to create a new column col_2 where each row value holds the location of the first character of the first match, regardless of case, of the string 'gm' in the corresponding string value of the column col_1.
df_2 = df.assign(
    col_2=df['col_1'].str.lower().str.find('gm')
)
Here is how this works:
- We use str.find() as described in First Match above.
- We use str.lower() to convert values in col_1 to lower case, and we pass a lower case expression. See Ignore Case under Detecting for more details.
Regular Expression
We wish to obtain the location of the first match of a regular expression in a string irrespective of the case.
In this example, we wish to create two new columns start and end where each row value holds the location of the first character and the last character, respectively, of the first match of a sequence of digits followed by the characters 'gm', regardless of case, in the corresponding string value of the column col_1.
import re

def locate(string, pattern, flags=0):
    # Return the start and inclusive end index of the first match, or (-1, -1).
    match = re.search(pattern, string, flags)
    if match:
        return match.start(), match.end() - 1
    else:
        return -1, -1

df_2 = df.copy()
df_2[['start', 'end']] = \
    df_2.apply(lambda x: locate(x['col_1'].lower(), r'\d+gm'), axis=1, result_type='expand')
Here is how this works:
- We use re.search() as described in First Match above.
- We use lower() to convert values in col_1 to lower case, and we pass a lower case expression. See Ignore Case under Detecting for more details.
Alternative: Via re.IGNORECASE
import re

def locate(string, pattern, flags=0):
    # Return the start and inclusive end index of the first match, or (-1, -1).
    match = re.search(pattern, string, flags)
    if match:
        return match.start(), match.end() - 1
    else:
        return -1, -1

df_2 = df.copy()
df_2[['start', 'end']] = \
    df_2.apply(lambda x: locate(x['col_1'], r'\d+gm', flags=re.IGNORECASE), axis=1, result_type='expand')
Here is how this works:
- We pass flags=re.IGNORECASE to locate() to perform case-insensitive matching.
Pattern Column
We wish to match a vector of strings against a vector of patterns of the same size. This is often needed when we wish to locate the value of a column in the value of another column for each row.
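The idea of matching two equally sized vectors element by element can be sketched with plain Python lists. The two lists below are hypothetical stand-ins for a column of strings and a column of patterns.

```python
# Hypothetical stand-ins for a column of strings and a column of patterns.
strings = ['10gm', 'abc', 'xyz']
patterns = ['gm', 'c', 'q']

# Pair the i-th string with the i-th pattern and locate one inside the other.
starts = [string.find(pattern) for string, pattern in zip(strings, patterns)]

print(starts)  # [2, 2, -1]
```

Each string is searched only for its own pattern, which is what a row-wise operation over two columns does.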
Substring
In this example, we have a data frame df with two columns col_1 and col_2, and we wish to locate the start of the first occurrence of the value of col_2 in col_1.
df_2 = df.assign(
    col_3=df.apply(lambda x: x['col_1'].find(x['col_2']), axis=1))
Here is how this works:
- The Pandas str.find() is not vectorized over the pattern, so we use Python's built-in find() method, which works on a single string.
- We use apply() with axis=1 to apply a lambda in a row wise manner to each row of the data frame df. See Non-Vectorized Transformation.
Regular Expression
In this example, we have a data frame df with two columns col_1 and col_2, and we wish to locate the start and end of the first occurrence of a sequence of repeated col_2 values in col_1.
import re

def locate(string, pattern, flags=0):
    # Return the start and inclusive end index of the first match, or (-1, -1).
    match = re.search(pattern, string, flags)
    if match:
        return match.start(), match.end() - 1
    else:
        return -1, -1

df_2 = df.copy()
df_2[['start', 'end']] = df_2.apply(
    lambda x: locate(x['col_1'], x['col_2'] + '+'), axis=1, result_type='expand')
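Here is how this works:
- This works like the First Match case above, except that the pattern is built for each row. We append '+' to each value of the column col_2 via x['col_2'] + '+' to specify a regular expression that matches one or more occurrences of the value of col_2. Note that + applies only to the element immediately before it, so this holds as written when the value of col_2 is a single character; a multi-character value needs to be wrapped in a group first, e.g. '(?:' + x['col_2'] + ')+'.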
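Since the + quantifier binds to the element immediately before it, the behaviour for multi-character values is worth checking. The sketch below compares the plain concatenation with a grouped variant; the sample text and value are hypothetical, and the non-capturing group is an illustrative choice rather than part of the example above.

```python
import re

text = 'xgmgmgmx'  # hypothetical col_1 value
value = 'gm'       # hypothetical col_2 value

# 'gm+' means 'g' followed by one or more 'm'.
ungrouped = re.search(value + '+', text)

# '(?:gm)+' means one or more repetitions of the whole value.
grouped = re.search('(?:' + value + ')+', text)

print(ungrouped.span())  # (1, 3)
print(grouped.span())    # (1, 7)
```

If the values of col_2 should be treated as literal text rather than as regular expressions, they can additionally be passed through re.escape() before grouping.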
Multiple Patterns
We will cover four scenarios of locating multiple patterns:
Note: In the following examples, the output of string location will be left as a column of nested lists. Typically, we proceed to do something with those nested lists e.g. substitute with a character value which we cover in Substituting. Or more generally, operate on them via List Operations.
First Match of Any Pattern
We wish to return the start and end locations of the first occurrence of the first occurring pattern of n patterns. In other words, only one location is returned and that is of whichever pattern occurs first.
import re

def locate(string, pattern, flags=0):
    # Return the start and inclusive end index of the first match, or (-1, -1).
    match = re.search(pattern, string, flags)
    if match:
        return match.start(), match.end() - 1
    else:
        return -1, -1

df_2 = df.copy()
df_2[['start', 'end']] = \
    df_2.apply(lambda x: locate(x['col_1'], r'\d+|gm'), axis=1, result_type='expand')
Here is how this works:
- We use the or operator | of regular expressions to build a regular expression that captures all the patterns we wish to look for or'ed together. In this case that regular expression is '\d+|gm'.
- For each value of col_1, locate() will return the start and end locations of whichever pattern occurs first (either '\d+' or 'gm').
- The output data frame df_2 will be a copy of the input data frame df with two added columns start and end.
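A short sketch shows how the alternation picks whichever pattern occurs first in each string. The sample strings are hypothetical; the matched text is included in the result only to make clear which alternative matched.

```python
import re

pattern = r'\d+|gm'
texts = ['12 then gm', 'gm then 12', 'none']  # hypothetical sample values

results = []
for text in texts:
    match = re.search(pattern, text)
    # (matched text, start, inclusive end), or None when nothing matches.
    results.append((match.group(), match.start(), match.end() - 1) if match else None)

print(results)  # [('12', 0, 1), ('gm', 0, 1), None]
```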
First Match of Each Pattern
We wish to return the start and end locations of the first occurrence of each pattern of n patterns. In other words, n locations are returned one for each pattern.
import re

def locate_patterns(string, patterns, flags=0):
    # One [start, inclusive end] pair per pattern; [-1, -1] when a pattern is absent.
    matches = []
    for pattern in patterns:
        match = re.search(pattern, string, flags)
        if match:
            matches.append([match.start(), match.end() - 1])
        else:
            matches.append([-1, -1])
    return matches

df_2 = df.assign(result=
    df.apply(lambda x: locate_patterns(x['col_1'], [r'\d+', 'gm']), axis=1))
Here is how this works:
- We pass to locate_patterns() a list of string patterns ['\d+', 'gm'].
- We use re.search() to find the first occurrence of each pattern and append its location to the matches list.
- For each value of col_1, locate_patterns() will return a list of lists, one for the first occurrence of each pattern, where each inner list has two items holding the start and end locations; in this case two lists for the patterns '\d+' and 'gm'.
- We use apply() to iterate over each value of the column col_1 and pass that to locate_patterns() along with the list of multiple patterns. See Non-Vectorized Transformation.
- The output data frame df_2 will be a copy of the input data frame df with an added column result where each cell holds the list returned by locate_patterns() for the corresponding value of col_1.
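When reading the nested list, it can be hard to tell which location belongs to which pattern. One possible variation, shown here only as a sketch, keys the result by pattern instead. The function name first_match_by_pattern and the sample strings are hypothetical.

```python
import re

def first_match_by_pattern(string, patterns):
    # Map each pattern to the (start, inclusive end) of its first match,
    # or to (-1, -1) when the pattern does not occur.
    located = {}
    for pattern in patterns:
        match = re.search(pattern, string)
        located[pattern] = (match.start(), match.end() - 1) if match else (-1, -1)
    return located

located = first_match_by_pattern('ab12gm34', [r'\d+', 'gm'])
print(located[r'\d+'])  # (2, 3)
print(located['gm'])    # (4, 5)
```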
All Matches in Separate Lists
We wish to return the start and end locations of all occurrences of all patterns keeping the occurrences of each pattern in a separate list.
import re

def find_all(string, patterns):
    # One list per pattern, each holding [start, inclusive end] for every match.
    matches = []
    for pattern in patterns:
        matches.append([[m.start(), m.end() - 1] for m in re.finditer(pattern, string)])
    return matches

df_2 = df.assign(result=df.apply(lambda x: find_all(x['col_1'], [r'\d+', 'gm']), axis=1))
Here is how this works:
- We pass to find_all() a list of patterns ['\d+', 'gm'].
- For each value of col_1, find_all() will return a list of lists: one list for each pattern, holding the start and end locations of all occurrences of that pattern in the col_1 value.
- We use apply() to iterate over each value of the column col_1 and pass that to find_all() along with the list of multiple patterns. See Non-Vectorized Transformation.
- The output data frame df_2 will be a copy of the input data frame df with an added column result where each cell holds the list of lists returned by find_all() for the corresponding value of col_1.
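To show the shape of the nested result, here is a minimal sketch that runs a compact version of find_all() on a single string. The sample string is hypothetical.

```python
import re

def find_all(string, patterns):
    # One list per pattern, each holding [start, inclusive end] for every match.
    return [[[m.start(), m.end() - 1] for m in re.finditer(pattern, string)]
            for pattern in patterns]

result = find_all('ab12gm34gm', [r'\d+', 'gm'])

print(result[0])  # digit sequences: [[2, 3], [6, 7]]
print(result[1])  # 'gm': [[4, 5], [8, 9]]

# Number of occurrences found for each pattern.
counts = [len(spans) for spans in result]
print(counts)     # [2, 2]
```

A pattern with no occurrences contributes an empty list, so the outer list always has one entry per pattern.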