Detecting

We wish to check whether a string matches a given pattern and to return a logical value TRUE if that is the case and FALSE otherwise.

In this section we will cover the following four common string pattern detection scenarios:

  • Full Match where we cover how to check if a given string exactly matches a given pattern.
  • Contains where we cover how to check if a given pattern is contained in a given string.
  • Starts With where we cover how to check if a given string starts with a given pattern.
  • Ends With where we cover how to check if a given string ends with a given pattern.

For each of these four scenarios, we will cover two cases:

  • Substring where the pattern is a plain character sequence
  • Regular Expressions where the pattern is a regular expression

In addition, we cover the following three scenarios which can be applied to extend any of the above:

  • Ignore Case where we cover how to ignore the case (of both pattern and string) while matching.
  • Complement where we cover how to return the inverse of the outcome of the string pattern matching scenarios described above.
  • Pattern Column where we cover how to match a column of strings against a column of patterns of the same size. This is often needed when we wish to check the presence of the value of a column in another column for each row.
  • Multiple Patterns where we cover how to extend any of the above scenarios to check against multiple patterns at a time.

Full Match

String

We wish to check if a string exactly matches another string.

In this example, we wish to filter the rows of the table table_1 to retain only rows where the value of the column col_1 exactly equals ‘XXX’.

SELECT *
FROM table_1
WHERE col_1 = 'XXX';

Here is how this works:

  • We use the basic equality comparison operator = to check whether two strings are equal.
  • The comparison is between:
    • Each of the values of the column col_1.
    • and the character sequence 'XXX'.
  • The output table will have only the rows of the input table table_1 where the value of the column col_1 is 'XXX'.

Regular Expression

We wish to check whether a given string matches a given regular expression and to return TRUE if that is the case and FALSE otherwise.

In this example, we wish to filter the rows of the table table_1 to retain only rows where the value of the column col_1 is of the form: ‘x’ followed by digits.

SELECT *
FROM table_1
WHERE REGEXP_CONTAINS(col_1, '^x\\d+$');

Here is how this works:

  • We use the REGEXP_CONTAINS() function to check, for each value of the column col_1, whether a regular expression is a match.
  • The REGEXP_CONTAINS() function takes the following arguments:
    • The column whose values we wish to check against; which in this case is col_1.
    • A regular expression; which in this case is '^x\\d+$'.
  • The regular expression in this case is '^x\\d+$', where
    • ^ matches the start of a string.
    • x matches the letter x.
    • \\d+ matches one or more digits.
    • $ matches the end of a string.
  • By including the start ^ and end $ of the string in the regular expression we specify that it must be a full match.
  • The output table will have only the rows of the input table table_1 where the value of the column col_1 matches the regular expression '^x\\d+$'.

Contains

Substring

We wish to check whether a given substring occurs anywhere inside a given string and to return TRUE if there is a match and FALSE otherwise.

In this example, we wish to filter the rows of the table table_1 to retain only rows where the value of the column col_1 contains the substring ‘XX’.

SELECT *
FROM table_1
WHERE col_1 LIKE '%XX%';

Here is how this works:

  • We use the LIKE to check for each value of the column col_1, whether that value of the column col_1 contains the substring 'XX'.
  • The X [NOT] LIKE Y operator checks if the string in the first operand X matches a pattern specified by the second operand Y. Expressions can contain these characters:
    • A percent sign % matches any number of characters or bytes. %XX% means that the string contains XX.
    • An underscore _ matches a single character or byte.
    • We can escape \,_, or % using two backslashes. For example, \\%. If we are using raw strings, only a single backslash is required. For example, r'\%'.
  • The output table will have only the rows of the input table table_1 where the value of the column col_1 contains the substring 'XX'.
  • The LIKE operator covers most matching use-cases except those that require regular expressions. We recommend using LIKE operator whenever possible as it is more convenient.

Regular Expression

We wish to check whether a given regular expression has a match inside a given string and to return TRUE if there is such a match and FALSE otherwise.

In this example, we wish to filter the rows of the table table_1 to retain only rows where the value of the column col_1 contains any integers represented by the regular expression '\\d+'.

SELECT *
FROM table_1
WHERE REGEXP_CONTAINS(col_1, '\\d+');

Here is how this works:

  • We use the REGEXP_CONTAINS() function to check, for each value of the column col_1, whether a regular expression is a match.
  • The REGEXP_CONTAINS() function takes the following arguments:
    • The column whose values we wish to check against; which in this case is col_1.
    • A regular expression; which in this case is '\\d+'.
  • The output table will have only the rows of the input table table_1 where the value of the column col_1 contains any digits.

Starts With

Substring

We wish to check whether a given substring occurs at the beginning of a given string and to return TRUE if that is the case and FALSE otherwise.

In this example, we wish to filter the rows of the table table_1 to retain only rows where the value of the column col_1 starts with the substring ‘XX’.

SELECT *
FROM table_1
WHERE col_1 LIKE 'XX%';

Here is how this works:

  • This works similarly to the code under Contains above except that we use 'XX%' instead of '%XX%'.
  • The output table will have only the rows of the input table table_1 where the value of the column col_1 starts with ‘XX’.

Alternative: STARTS_WITH functions

SELECT *
FROM table_1
WHERE STARTS_WITH(col_1, 'XX');

Here is how this works:

  • We use the STARTS_WITH() function to check, for each value of the column col_1, whether it starts with a given substring.
  • The STARTS_WITH() function takes the following arguments:
    • The column whose values we wish to check against; which in this case is col_1.
    • A string value; which in this case is 'XX'.
  • The output table will have only the rows of the input table table_1 where the value of the column col_1 starts with XX.

Regular Expression

We wish to check whether a given regular expression occurs at the beginning of a given string and to return TRUE if that is the case and FALSE otherwise.

In this example, we wish to filter the rows of the table table_1 to retain only rows where the value of the column col_1 starts with any integers represented by the regular expression '^\\d+'.

SELECT *
FROM table_1
WHERE REGEXP_CONTAINS(col_1, '^\\d+');

Here is how this works:

  • This works similarly to the code under Contains above except that we use . However, we use the regular expression start anchor ^ to specify that the regular expression must occur at the start of the string.

Ends With

Substring

We wish to check whether a given substring occurs at the end of a given string and to return TRUE if that is the case and FALSE otherwise.

In this example, we wish to filter the rows of the table table_1 to retain only rows where the value of the column col_1 ends with the substring ‘XX’.

SELECT *
FROM table_1
WHERE col_1 LIKE '%XX';

Here is how this works:

  • This works similarly to the code under Contains above except that we use '%XX' instead of '%XX%'.
  • The output table will have only the rows of the input table table_1 where the value of the column col_1 ends with ‘XX’.

Alternative: ENDS_WITH functions

SELECT *
FROM table_1
WHERE ENDS_WITH(col_1, 'XX');

Here is how this works:

  • We use the ENDS_WITH() function to check, for each value of the column col_1, whether it ends with a given substring.
  • The ENDS_WITH() function takes the following arguments:
    • The column whose values we wish to check against; which in this case is col_1.
    • A string value; which in this case is 'XX'.
  • The output table will have only the rows of the input table table_1 where the value of the column col_1 ends with XX.

Regular Expression

We wish to check whether a given regular expression occurs at the end of a given string and to return TRUE if that is the case and FALSE otherwise.

In this example, we wish to filter the rows of the table table_1 to retain only rows where the value of the column col_1 ends with any integers.

SELECT *
FROM table_1
WHERE REGEXP_CONTAINS(col_1, '\\d+$');

Here is how this works:

  • This works similarly to the code under Contains above except that we use . However, we use the regular expression end anchor $ to specify that the regular expression must occur at the end of the string.

Ignore Case

Substring

We wish to check whether a given substring occurs anywhere inside a given string while ignoring case and to return TRUE if there is a match and FALSE otherwise.

In this example, we wish to filter the rows of the table table_1 to retain only rows where the value of the column col_1 contains the substring ‘XX’ regardless of case.

SELECT *
FROM table_1
WHERE CONTAINS_SUBSTR(col_1, 'XX');

Here is how this works:

  • To check whether a string contains a given pattern, we use CONTAINS_SUBSTR() which ignores the case of the string.
  • The CONTAINS_SUBSTR() function takes the following arguments:
    • The column whose values we wish to check against; which in this case is col_1.
    • A string value; which in this case is 'XX'.

Alternative: Lower Case

SELECT *
FROM table_1
WHERE LOWER(col_1) LIKE '%xx%';

Here is how this works:

  • One way to ignore case is to set the case of the string we are looking into and the pattern we are looking for to the same case, say lower case.
  • In LOWER(col_1), we lower the case of all the values of the column col_1 which are the strings we are looking into.
  • … and we pass a lower case expression to test against 'xx'.

Regular Expression

We wish to check whether a given regular expression has a match inside a given string and to return TRUE if there is such a match and FALSE otherwise.

In this example, we wish to filter the rows of the table table_1 to retain only rows where the value of the column col_1 is of the form: ‘x’ followed by digits while ignoring the case.

SELECT *
FROM table_1
WHERE REGEXP_CONTAINS(LOWER(col_1), '^x\\d+');

Here is how this works:

  • To check whether a string contains a given pattern, we use REGEXP_CONTAINS(). See Contains above.
  • We convert the input column to lower case using LOWER() function.
  • … and we pass a lower case regular expression to test against ^x\\d+.

Complement

We wish to return the inverse of the outcome of the string pattern matching operations described above; i.e. if the outcome is TRUE, return FALSE and vice versa.

In this example, we wish to filter the rows of the table table_1 to retain only rows where the value of the column col_1 does not contain any integers.

SELECT *
FROM table_1
WHERE NOT REGEXP_CONTAINS(col_1, '\\d+');

Here is how this works:

  • To check whether a string contains a given pattern, we use REGEXP_CONTAINS(). See Contains above.
  • To check for the complement, i.e. to return TRUE when there is no match and FALSE when there is a match, we can use the complement operator NOT.
  • The regular expression '\\d+' checks for the occurrence of a substring comprised of one or more digits.
  • The output table will have only the rows of the input table table_1 where the value of the column col_1 does not contain any digits.
  • Note: For LIKE operator we can use either NOT X LIKE Y or X NOT LIKE Y.

Pattern Column

We wish to match a vector of strings against a vector of patterns of the same size. This is often needed when we wish to check the presence of the value of a column in another column for each row.

In this example, we wish to filter the rows of the data frame df to retain only rows where the value of the column col_1 contains the value of the column col_2.

SELECT *
FROM table_1
WHERE REGEXP_CONTAINS(col_1, col_2);

Here is how this works:

  • This works similarly to Contains scenario above except we pass col_2 instead of a constant string.
  • REGEXP_CONTAINS() will return TRUE for rows where the value of col_2 is contained in col_1.
  • The output table will have only the rows of the input table table_1 where the value of the column col_1 contains the value of the column col_2.

Multiple Patterns

OR

We wish to check whether any of a set of patterns occurs in a given string and to return TRUE if that is the case and FALSE otherwise.

In this example, we wish to filter the rows of the table table_1 to retain only rows where the value of the column col_1 contains the string ‘XX’ or the string ‘YY’.

SELECT *
FROM table_1
WHERE col_1 LIKE '%XX%'
   OR col_1 LIKE '%YY%';

Here is how this works:

  • We use the LIKE operator twice:
    • once to check whether a value of the column col_1 contains 'XX' and
    • another to check whether a value fo the column col_1 contains 'YY'.
  • We use the or operator OR which will return TRUE if either col_1 LIKE '%XX%' or col_1 LIKE '%YY%'returns TRUE for that value of col_1.
  • The output table will have only the rows of the input table table_1 where the value of the column col_1 contains the string ‘XX’ or ‘YY’.

Alternative: Via Regular Expression

SELECT *
FROM table_1
WHERE REGEXP_CONTAINS(col_1, 'XX|YY');

Here is how this works:

  • An alternative approach that scales well when we have multiple patterns is to use the or operator | of regular expressions to build a regular expression that captures all the patterns we wish to look for or’ed together. In this case that regular expression is 'XX|YY'.
  • We pass that regular expression to REGEXP_CONTAINS() which then returns TRUE if the string being checked (a value of the column col_1) contains either the substring ‘XX’ or the substring ‘YY’.
  • The output is the same as in the primary solution above.

AND

We wish to check whether each of a set of patterns occurs in a given string and to return TRUE if that is the case and FALSE otherwise.

In this example, we wish to filter the rows of the table table_1 to retain only rows where the value of the column col_1 contains the ends with the string ‘x’ and contains a sequence of one or more digits.

SELECT *
FROM table_1
WHERE REGEXP_CONTAINS(LOWER(col_1), '^x') 
AND REGEXP_CONTAINS(LOWER(col_1), '\\d+');

Here is how this works:

  • We call the function REGEXP_CONTAINS() twice:
    • once to check whether a value of the column col_1 matches the regular expression '^x' which checks if the string starts with the letter ‘x’.
    • another to check whether a value fo the column col_1 matches the regular expression '\\d+' which checks if the string contains a sequence of one or more digits.
  • We use the AND operator which will return TRUE if both calls to REGEXP_CONTAINS() returns TRUE for that value of col_1.
  • The output table will have only the rows of the input table table_1 where the value of the column col_1 starts with 'x' and contains at least one integer.
SQL
I/O