We wish to extract a substring, from a subject string, given the locations of its start and end characters.
We will cover the following scenarios:
We wish to extract a single character from a string given its index.
In this example, we wish to extract the first character from each value of the column col_1
.
df_2 = df.assign(
col_2=df['col_1'].str[0]
)
Here is how this works:
str[]
and pass the index.df_2
will be a copy of the input data frame df
with an added
column col_2
which contains the first character of the corresponding element of the
column col_1
.Alternative: Via Getter Function
df_2 = df.assign(
col_2=df['col_1'].str.get(0)
)
Here is how this works:
This work similarly to the primary solution except we use str.get()
method and pass the index of
the character.
Extension: Right Indexing
We wish to extract a character from a string by indexing relative to the end (right side) of the string.
In this example, we wish to extract the last character from each value of the column col_1
.
df_2 = df.assign(
col_2=df['col_1'].str[-1]
)
Here is how this works:
We can use negative indices to index relative to the right end of the string. In this case we
use -1
to refer to the first character from the end i.e. the last character.
We wish to extract a substring from a string given its start and end location indices.
In this example, we wish to extract the substring starting at the 2nd character and ending at the
5th character from each value of the column col_1
.
df_2 = df.assign(
col_2=df['col_1'].str[1:5]
)
Here is how this works:
str[]
to extract a substring from each element of the column col_1
.str[]
takes two indices start and end. The start index is inclusive while the end index is
exclusive i.e. the character denoted by the start index will be included in the output substring
while the character denoted by the end
index will not be included.str[]
the start
and end
locations of the substring we wish to extract, which in
this case are 1 and 5.df_2
will be a copy of the input data frame df
with an added
column col_2
which contains the extracted substring (characters 2 through to 5) from the
corresponding element of the column col_1
.Extension: nth to End
We wish to drop the first n characters and keep the rest.
df_2 = df.assign(
col_2=df['col_1'].str[2:]
)
Here is how this works:
end
index is last by default, we are passing a value for start
index only.Extension: Start to nth
We wish to keep the first n characters and drop the rest.
df_2 = df.assign(
col_2=df['col_1'].str[:2]
)
Here is how this works:
0
and 2
.0
by default, we are passing a value for end index only which is exclusive.Extension: Start to nth
We wish to drop the last n characters and keep the rest.
df_2 = df.assign(
col_2=df['col_1'].str[:-2]
)
Here is how this works:
0
and index -2
(the second from the end). The end
index here is exclusive, so we pass -2
to drop the last two characters.0
by default, we are passing a value for end only.Extension: nth from End to End
We wish to keep the last n characters and drop the rest.
df_2 = df.assign(
col_2=df['col_1'].str[-2:]
)
Here is how this works:
-2
(the second from the end) and the last
character.We wish to extract multiple substrings given their start and end location indices.
In this example, we wish to extract three substrings given their start and end locations from each
value of the column col_1
. We wish to obtain the extracted substrings as three new columns
named a
, b
, and c
.
start = [0, 1, 2]
end = [2, 3, 3]
df_2 = df.copy()
df_2[['a', 'b', 'c']] = \
df['col_1'].apply(lambda x: pd.Series([x[s:e] for (s, e) in zip(start, end)]))
Here is how this works:
.str[]
here since we are using with a string literal.start
and end
.
See Working with lists. apply()
, the data
frame is unpacked to individual columns which is what we are looking for.
See Multi-Value Transformation.df_2
will be a copy of the input data frame df
with three added
columns 'a', 'b'
, and 'c'
holding the three substrings extracted from the corresponding
elements of the column col_1
.Alternative: Multiple calls to str[]
start = [0, 1, 2]
end = [2, 3, 3]
df_2 = df.assign(
a=df['col_1'].str[start[0]:end[0]],
b=df['col_1'].str[start[1]:end[1]],
c=df['col_1'].str[start[2]:end[2]],
)
Here is how this works:
apply()
and list comprehension, we can make individual calls to str[]
.df_2
that is a copy
of the input data frame df
with three added columns 'a', 'b'
, and 'c'
holding the three
substrings extracted from the corresponding elements of the column col_1
.We wish to extract substrings from each element of a string column by location where the start and / or end locations are provided as values of columns in a data frame.
In this example, we wish to create a new column col_4
where each element is a substring extracted
from the corresponding value of the column col_1
where the start and end location are provided by
the corresponding values of the columns col_2
and col_3
respectively.
df_2 = df.assign(
col_4=df.apply(lambda x: x['col_1'][x['col_2']:x['col_3']],
axis=1)
)
Here is how this works:
df_2
will be a copy of the input data frame df
with an added
column col_4
holding the substring extracted form the corresponding element of col_1
between
start location provided by col_2
and end location provided by col_3
.One Substring
We wish to extract a substring by location from a given string where the locations are obtained in a dynamic manner for said string.
import re
def locate(string, pattern, flags=0):
match = re.search(pattern, string, flags)
if match:
return match.start(), match.end() - 1
return None
def extract(string, locations):
if locations:
return string[locations[0]:locations[1]]
return pd.NA
df_2 = df.assign(
col_2=df.apply(lambda x: extract(x['col_1'], locate(x['col_1'], '(X){2,}')), axis=1)
)
Here is how this works:
re.search()
to obtain the locations of the first sequence of 2 or more ‘X’
characters,
as specified by the regex ‘(X){2,}’
, from each element in the column col_1
.
See Locating.extract()
to extract the corresponding substrings.
See Substring above. Ih there is not match we return NA
.Multiple Substrings
import re
def find_all(string, pattern, flags=0):
matches = [[m.start(), m.end() - 1] for m in re.finditer(pattern, string, flags)]
return matches
def extract(string, locations):
if locations:
return '-'.join([string[location[0]:location[1] + 1] for location in locations])
return pd.NA
df_2 = df.assign(
col_2=df.apply(lambda x: extract(x['col_1'], find_all(x['col_1'], '(X){2,}')), axis=1)
)
Here is how this works:
re.finditer()
to identify the locations of all matching patterns in each element of
the column col_1
and return those as a list of lists with the start and end locations.
See Locating.extract()
function to extract the start
and end
locations
manually, to extract the corresponding substrings. See Multiple Substrings
above.