By Location

We wish to extract a substring, from a subject string, given the locations of its start and end characters.

We will cover the following scenarios:

  • Character: Extract a single character given its index in the parent string.
  • Substring: Extract a substring given its start and end indices in the parent string.
  • Multiple Substrings: Extract multiple substrings given their start and end indices in the parent string.
  • Location Columns: The start and / or end locations are provided as values of columns in a data frame.
  • Dynamic Locations: We first locate a substring (via a pattern match) and then extract it.

Character

We wish to extract a single character from a string given its index.

In this example, we wish to extract the first character from each value of the column col_1.

df_2 = df.assign(
    col_2=df['col_1'].str[0]
)

Here is how this works:

  • We can access individual characters using str[] and pass the index.
  • Since indexing in Python is zero based, the first character is at index 0. If we wanted to refer to say the third character we would use index 2.
  • The output data frame df_2 will be a copy of the input data frame df with an added column col_2 which contains the first character of the corresponding element of the column col_1.

Alternative: Via Getter Function

df_2 = df.assign(
    col_2=df['col_1'].str.get(0)
)

Here is how this works:

This work similarly to the primary solution except we use str.get() method and pass the index of the character.

Extension: Right Indexing

We wish to extract a character from a string by indexing relative to the end (right side) of the string.

In this example, we wish to extract the last character from each value of the column col_1.

df_2 = df.assign(
    col_2=df['col_1'].str[-1]
)

Here is how this works:

We can use negative indices to index relative to the right end of the string. In this case we use -1 to refer to the first character from the end i.e. the last character.

Substring

We wish to extract a substring from a string given its start and end location indices.

In this example, we wish to extract the substring starting at the 2nd character and ending at the 5th character from each value of the column col_1.

df_2 = df.assign(
    col_2=df['col_1'].str[1:5]
)

Here is how this works:

  • We use the function str[] to extract a substring from each element of the column col_1.
  • str[] takes two indices start and end. The start index is inclusive while the end index is exclusive i.e. the character denoted by the start index will be included in the output substring while the character denoted by the end index will not be included.
  • We pass to str[] the start and end locations of the substring we wish to extract, which in this case are 1 and 5.
  • The output data frame df_2 will be a copy of the input data frame df with an added column col_2 which contains the extracted substring (characters 2 through to 5) from the corresponding element of the column col_1.

Extension: nth to End

We wish to drop the first n characters and keep the rest.

df_2 = df.assign(
    col_2=df['col_1'].str[2:]
)

Here is how this works:

  • Our objective is to drop the first n characters (which in this example is 2) and keep the rest; i.e. to capture the characters between third and the last character.
  • Since end index is last by default, we are passing a value for start index only.

Extension: Start to nth

We wish to keep the first n characters and drop the rest.

df_2 = df.assign(
    col_2=df['col_1'].str[:2]
)

Here is how this works:

  • Our objective is to keep the first n characters (which in this example is 2) and drop the rest; i.e. to capture the characters between 0 and 2.
  • Since start index is 0 by default, we are passing a value for end index only which is exclusive.

Extension: Start to nth

We wish to drop the last n characters and keep the rest.

df_2 = df.assign(
    col_2=df['col_1'].str[:-2]
)

Here is how this works:

  • Our objective is to drop the last n characters (which in this example is 2) and keep the rest; i.e. to capture the characters between index 0 and index -2 (the second from the end). The end index here is exclusive, so we pass -2 to drop the last two characters.
  • Since start index is 0 by default, we are passing a value for end only.

Extension: nth from End to End

We wish to keep the last n characters and drop the rest.

df_2 = df.assign(
    col_2=df['col_1'].str[-2:]
)

Here is how this works:

  • Our objective is to keep the last n characters (which in this example is 2) and drop the rest; i.e. to capture the characters between start index -2 (the second from the end) and the last character.
  • Since end index is the last character by default, we are passing a value for start index only.

Multiple Substrings

We wish to extract multiple substrings given their start and end location indices.

In this example, we wish to extract three substrings given their start and end locations from each value of the column col_1. We wish to obtain the extracted substrings as three new columns named a, b, and c.

start = [0, 1, 2]
end = [2, 3, 3]

df_2 = df.copy()
df_2[['a', 'b', 'c']] = \
    df['col_1'].apply(lambda x: pd.Series([x[s:e] for (s, e) in zip(start, end)]))

Here is how this works:

  • We can't use .str[] here since we are using with a string literal.
  • We use the list comprehension to slice the string using two lists of indexes start and end. See Working with lists.
  • When a Pandas Series is returned to apply(), the data frame is unpacked to individual columns which is what we are looking for. See Multi-Value Transformation.
  • The output data frame df_2 will be a copy of the input data frame df with three added columns 'a', 'b', and 'c' holding the three substrings extracted from the corresponding elements of the column col_1.

Alternative: Multiple calls to str[]

start = [0, 1, 2]
end = [2, 3, 3]

df_2 = df.assign(
    a=df['col_1'].str[start[0]:end[0]],
    b=df['col_1'].str[start[1]:end[1]],
    c=df['col_1'].str[start[2]:end[2]],
)

Here is how this works:

  • Instead of using apply() and list comprehension, we can make individual calls to str[].
  • The output will be the same as the primary solution above; i.e. a data frame df_2 that is a copy of the input data frame df with three added columns 'a', 'b', and 'c' holding the three substrings extracted from the corresponding elements of the column col_1.

Location Columns

We wish to extract substrings from each element of a string column by location where the start and / or end locations are provided as values of columns in a data frame.

In this example, we wish to create a new column col_4 where each element is a substring extracted from the corresponding value of the column col_1 where the start and end location are provided by the corresponding values of the columns col_2 and col_3 respectively.

df_2 = df.assign(
    col_4=df.apply(lambda x: x['col_1'][x['col_2']:x['col_3']],
                   axis=1)
)

Here is how this works:

  • We use list slicing to extract a substring given its locations. See Working with lists
  • The output data frame df_2 will be a copy of the input data frame df with an added column col_4 holding the substring extracted form the corresponding element of col_1 between start location provided by col_2 and end location provided by col_3.

Dynamic Locations

One Substring

We wish to extract a substring by location from a given string where the locations are obtained in a dynamic manner for said string.

import re

def locate(string, pattern, flags=0):
    match = re.search(pattern, string, flags)
    if match:
        return match.start(), match.end() - 1
    return None


def extract(string, locations):
    if locations:
        return string[locations[0]:locations[1]]
    return pd.NA


df_2 = df.assign(
    col_2=df.apply(lambda x: extract(x['col_1'], locate(x['col_1'], '(X){2,}')), axis=1)
)

Here is how this works:

  • We use re.search() to obtain the locations of the first sequence of 2 or more ‘X’ characters, as specified by the regex ‘(X){2,}’, from each element in the column col_1. See Locating.
  • We pass the extracted locations to extract() to extract the corresponding substrings. See Substring above. Ih there is not match we return NA.

Multiple Substrings

import re

def find_all(string, pattern, flags=0):
    matches = [[m.start(), m.end() - 1] for m in re.finditer(pattern, string, flags)]
    return matches


def extract(string, locations):
    if locations:
        return '-'.join([string[location[0]:location[1] + 1] for location in locations])
    return pd.NA


df_2 = df.assign(
    col_2=df.apply(lambda x: extract(x['col_1'], find_all(x['col_1'], '(X){2,}')), axis=1)
)

Here is how this works:

  • We use re.finditer() to identify the locations of all matching patterns in each element of the column col_1 and return those as a list of lists with the start and end locations. See Locating.
  • We then use list comprehension in extract() function to extract the start and end locations manually, to extract the corresponding substrings. See Multiple Substrings above.
PYTHON
I/O