Detecting

We wish to check whether a string matches a given pattern and to return a logical value TRUE if that is the case and FALSE otherwise.

In this section we will cover the following four common string pattern detection scenarios:

  • Full Match where we cover how to check if a given string exactly matches a given pattern.
  • Contains where we cover how to check if a given pattern is contained in a given string.
  • Starts With where we cover how to check if a given string starts with a given pattern.
  • Ends With where we cover how to check if a given string ends with a given pattern.

For each of these four scenarios, we will cover two cases:

  • Substring where the pattern is a plain character sequence
  • Regular Expressions where the pattern is a regular expression

In addition, we cover the following scenarios which can be applied to extend any of the above:

  • Ignore Case where we cover how to ignore the case (of both pattern and string) while matching.
  • Complement where we cover how to return the inverse of the outcome of the string pattern matching scenarios described above.
  • Pattern Column where we cover how to match a vector of strings against a vector of patterns of the same size. This is often needed when we wish to check the presence of the value of a column in another column for each row.
  • Multiple Patterns where we cover how to extend any of the above scenarios to check against multiple patterns at a time.

Full Match

String

We wish to check if a string exactly matches another string.

In this example, we wish to filter the rows of the data frame df to retain only rows where the value of the column col_1 exactly equals ‘XXX’.

df_2 = df.loc[df['col_1'] == 'XXX']

Here is how this works:

  • We use the basic equality comparison operator == to check whether two strings are equal.
  • The comparison is between:
    • Each of the values of the column col_1 (== is vectorized)
    • and the character sequence 'XXX'.
  • The output data frame df_2 will have only the rows of the input data frame df where the value of the column col_1 is 'XXX'.

Alternative: Via Function

df_2 = df.loc[df['col_1'].str.fullmatch('XXX')]

Here is how this works:

  • We use the str.fullmatch() method from the str accessor set of string manipulation methods of Pandas Series to check if the provided pattern matches elements in col_1.
  • The str.fullmatch() determines if each string entirely matches a string or a regular expression.
  • The output data frame df_2 will have only the rows of the input data frame df where the value of the column col_1 is 'XXX'.

Regular Expression

We wish to check whether a given string matches a given regular expression and to return TRUE if that is the case and FALSE otherwise.

In this example, we wish to filter the rows of the data frame df to retain only rows where the value of the column col_1 is of the form: ‘x’ followed by digits.

df_2 = df.loc[df['col_1'].str.fullmatch('x\d+')]

Here is how this works:

  • Pandas has a fullmatch() function which returns true if the entire string matches the given string or regular expression.
  • The regular expression in this case is 'x\d+', where
    • x matches the letter x.
    • \d+ matches one or more digits.
  • The output data frame df_2 will have only the rows of the input data frame df where the value of the column col_1 matches the regular expression 'x\d+'.
  • We could have used the generic str.contains() with the regular expression '^\d+g$'; where ^ and $ enforce start and end of string respectively.

Contains

Substring

We wish to check whether a given substring occurs anywhere inside a given string and to return TRUE if there is a match and FALSE otherwise.

In this example, we wish to filter the rows of the data frame df to retain only rows where the value of the column col_1 contains the substring ‘XX’.

df_2 = df.loc[df['col_1'].str.contains('XX', regex=False)]

Here is how this works:

  • We use the str.contains() method from the str accessor set of string manipulation methods of Pandas Series to check for each value of the column col_1 whether that value of the column col_1 contains the substring 'XX'.
  • The str.contains() determines if pattern or regex is contained within a string. We pass regex=False as we are using a substring in this example.
  • The output data frame df_2 will have only the rows of the input data frame df where the value of the column col_1 contains the substring 'XX'.

Regular Expression

We wish to check whether a given regular expression has a match inside a given string and to return TRUE if there is such a match and FALSE otherwise.

In this example, we wish to filter the rows of the data frame df to retain only rows where the value of the column col_1 contains any integers represented by the regular expression '\\d+'.

df_2 = df.loc[df['col_1'].str.contains('\d+')]

Here is how this works:

  • This works similarly to the Substring case above except that we pass a regular expression to str.contains().
  • By default, str.contains() expects the pattern to be a regular expression.
  • The output data frame df_2 will have only the rows of the input data frame df where the value of the column col_1 contains any digits.

Starts With

Substring

We wish to check whether a given substring occurs at the beginning of a given string and to return TRUE if that is the case and FALSE otherwise.

In this example, we wish to filter the rows of the data frame df to retain only rows where the value of the column col_1 starts with the substring ‘XX’.

df_2 = df.loc[df['col_1'].str.startswith('XX')]

Here is how this works:

  • This works similarly to the code under Contains above except that we use str.startswith() instead of str.contains().
  • The function str.startswith() returns TRUE only if the given substring, in this case ‘XX’, occurs at the beginning of the string.
  • The output data frame df_2 will have only the rows of the input data frame df where the value of the column col_1 starts with ‘XX’.

Regular Expression

We wish to check whether a given regular expression occurs at the end of a given string and to return TRUE if that is the case and FALSE otherwise.

In this example, we wish to filter the rows of the data frame df to retain only rows where the value of the column col_1 starts with any integers.

df_2 = df.loc[df['col_1'].str.contains('^\d+')]

Here is how this works:

  • The method str.startswith() works with fixed strings only and not regular expressions.
  • This works similarly to Contains scenario above, but we use a different regular expression.
  • We use the regular expression start anchor ^ to specify that the regular expression must occur at the start of the string.

Alternative: Using the function match()

df_2 = df.loc[df['col_1'].str.match('\d+')]

Here is how this works:

  • str.match() matches the start of a string by default, so we don't need to use the start anchor '^'.
  • Our recommendation is to use just the generic str.contains() function and enforce any start or end constraints via the regular expression anchor symbols for start ^ and end $ as needed.

Ends With

Substring

We wish to check whether a given substring occurs at the end of a given string and to return TRUE if that is the case and FALSE otherwise.

In this example, we wish to filter the rows of the data frame df to retain only rows where the value of the column col_1 ends with the substring ‘XX’.

df_2 = df.loc[df['col_1'].str.endswith('XX')]

Here is how this works:

  • This works similarly to the code under Starts With above except that we use str.endswith() instead of str.startswith().
  • The function str.endswith() returns TRUE only if the given substring, in this case ‘XX’, occurs at the end of the string.
  • The output data frame df_2 will have only the rows of the input data frame df where the value of the column col_1 ends with ‘XX’.

Regular Expression

We wish to check whether a given regular expression occurs at the end of a given string and to return TRUE if that is the case and FALSE otherwise.

In this example, we wish to filter the rows of the data frame df to retain only rows where the value of the column col_1 ends with any integers.

df_2 = df.loc[df['col_1'].str.contains('\d+$')]

Here is how this works:

  • The method str.endsswith() works with fixed strings only and not regular expressions.
  • This works similarly to Contains scenario above, but we use a different regular expression.
  • We use the regular expression end anchor $ to specify that the regular expression must occur at the end of the string.

Ignore Case

Substring

We wish to check whether a given substring occurs anywhere inside a given string while ignoring case and to return TRUE if there is a match and FALSE otherwise.

In this example, we wish to filter the rows of the data frame df to retain only rows where the value of the column col_1 contains the substring ‘XX’ regardless of case.

df_2 = df.loc[df['col_1'].str.lower().str.contains('xx', regex=False)]

Here is how this works:

  • One way to ignore case is to set the case of the string we are looking into and the pattern we are looking for to the same case, say lower case.
  • In str.lower(), we lower the case of all the values of the column col_1 which are the strings we are looking into.
  • … and we pass a lower case expression to test against 'xx'.
  • This solution can be used with any str function.

Alternative: Case Flag

df_2 = df.loc[df['col_1'].str.contains('XX', regex=False, case=False)]

Here is how this works:

  • To check whether a string contains a given pattern, we use str.contains(). See Contains above.
  • When we wish to pass a plain string as a pattern for detection, we set regex=False because str.contains() expect a regular expression by default.
  • In order to ignore case, we set the argument case to case=False.
  • This solution only works with some functions in str, we recommend using the primary solution as it can be used with any string function.

Regular Expression

We wish to check whether a given regular expression has a match inside a given string and to return TRUE if there is such a match and FALSE otherwise.

In this example, we wish to filter the rows of the data frame df to retain only rows where the value of the column col_1 is of the form: ‘x’ followed by digits while ignoring the case.

df_2 = df.loc[df['col_1'].str.lower().str.contains('^x\d+')]

Here is how this works:

  • We didn't pass the argument regex=True because it is the default.
  • One way to ignore case is to set the case of the string we are looking into and the pattern we are looking for to the same case, say lower case.
  • In str.lower(), we lower the case of all the values of the column col_1 which are the strings we are looking into.
  • … and we pass a lower case expression to test against '^x\d+'.
  • We can also achieve the same by setting case=False, by as with substring we recommend using lower() approach.

Complement

We wish to return the inverse of the outcome of the string pattern matching operations described above; i.e. if the outcome is TRUE, return FALSE and vice versa.

In this example, we wish to filter the rows of the data frame df to retain only rows where the value of the column col_1 does not contain any integers.

df_2 = df.loc[~ df['col_1'].str.contains('\d+')]

Here is how this works:

  • To check whether a string contains a given pattern, we use str.contains(). See Contains above.
  • To check for the complement, i.e. to return TRUE when there is no match and FALSE when there is a match, we can use the complement operator ~.
  • The regular expression '\d+' checks for the occurrence of a substring comprised of one or more digits.
  • The output data frame df_2 will have only the rows of the input data frame df where the value of the column col_1 does not contain any digits.

Pattern Column

We wish to match a column of strings against a column of patterns of the same size. This is often needed when we wish to check the presence of the value of a column in another column for each row.

In this example, we wish to filter the rows of the data frame df to retain only rows where the value of the column col_1 contains the value of the column col_2.

df_2 = df.loc[df.apply(lambda x: x['col_2'] in x['col_1'], axis=1)]

Here is how this works:

  • The built-in str.contains() method is not vectorized, therefore we need to use apply() with axis=1 to apply a lambda in a row wise manner to each row of the data frame df. See Non-Vectorized Transformation.
  • The lambda function passed to apply() takes a row of the data frame (represented by the variable x) and uses the python operator in to check if the value in col_2 is contained in col_1.
  • The output data frame df_2 will have only the rows of the input data frame df where the value of the column col_1 contains the value of the column col_2.

Multiple Patterns

OR

We wish to check whether any of a set of patterns occurs in a given string and to return TRUE if that is the case and FALSE otherwise.

In this example, we wish to filter the rows of the data frame df to retain only rows where the value of the column col_1 contains the string ‘XX’ or the string ‘YY’.

df_2 = df.loc[df['col_1'].str.contains('XX', regex=False) |
              df['col_1'].str.contains('YY', regex=False)]

Here is how this works:

  • We call the function str.contains() twice:
    • once to check whether a value of the column col_1 contains 'XX' and
    • another to check whether a value fo the column col_1 contains 'YY'.
  • Each of these calls returns a vector of logical values (TRUE or FALSE) that has as many elements as the size of col_1.
  • We use the or operator | to combine these two logical vectors element wise into one logical vector that is TRUE if either call to str.contains() returns TRUE for that value of col_1. This final logical vector is passed to loc[].
  • The output data frame df_2 will have only the rows of the input data frame df where the value of the column col_1 contains the string ‘XX’ or ‘YY’.

Alternative: Via Regular Expression

df_2 = df.loc[df['col_1'].str.contains('XX|YY', regex=True)]

Here is how this works:

  • An alternative approach that scales well when we have multiple patterns is to use the or operator | of regular expressions to build a regular expression that captures all the patterns we wish to look for or’ed together. In this case that regular expression is 'XX|YY'.
  • We pass that regular expression to str.contains() which then returns TRUE if the string being checked (a value of the column col_1) contains either the substring ‘XX’ or the substring ‘YY’.
  • The output is the same as in the primary solution above.

AND

We wish to check whether each of a set of patterns occurs in a given string and to return TRUE if that is the case and FALSE otherwise.

In this example, we wish to filter the rows of the data frame df to retain only rows where the value of the column col_1 contains the ends with the string ‘x’ and contains a sequence of one or more digits.

df_2 = df.loc[df['col_1'].str.contains('^x', regex=True) &
              df['col_1'].str.contains('\d+', regex=True)]

Here is how this works:

  • We call the function str.contains() twice:
    • once to check whether a value of the column col_1 matches the regular expression '^x' which checks if the string starts with the letter ‘x’.
    • another to check whether a value fo the column col_1 matches the regular expression '\d+' which checks if the string contains a sequence of one or more digits.
  • Each of these calls returns a vector of logical values (TRUE or FALSE) that has as many elements as the size of col_1.
  • We can use the and operator & to combine these two logical vectors element wise into one logical vector that is TRUE if and only if both calls to str.contains() returns TRUE for that value of col_1. This final logical vector is passed to loc[].
  • The output data frame df_2 will have only the rows of the input data frame df where the value of the column col_1 starts with 'x' and contains at least one integer.

Alternative: Via Regular Expression

df_2 = df.loc[df['col_1'].str.contains('^x.*\d+', regex=True)]

Here is how this works:

  • An alternative approach that scales well when we have multiple patterns is to use one regular expression that matches all patterns at once. In this case that regular expression is ^x.*\d+.
  • .* matches any number of characters.
  • We pass that regular expression to str.contains() which then returns TRUE if the string being checked (a value of the column col_1) starts with 'x' and contains at least one integer.
  • The output is the same as in the primary solution above.
PYTHON
I/O