We wish to check whether a string matches a given pattern and to return a logical value TRUE
if
that is the case and FALSE
otherwise.
In this section we will cover the following four common string pattern detection scenarios:
For each of these four scenarios, we will cover two cases:
In addition, we cover the following three scenarios which can be applied to extend any of the above:
String
We wish to check if a string exactly matches another string.
In this example, we wish to filter the rows of the table table_1
to retain only rows where the
value of the column col_1
exactly equals ‘XXX’
.
SELECT *
FROM table_1
WHERE col_1 = 'XXX';
Here is how this works:
=
to check whether two strings are equal.col_1
.'XXX'
.table_1
where the value of
the column col_1
is 'XXX'
.Regular Expression
We wish to check whether a given string matches a given regular expression and to return TRUE
if
that is the case and FALSE
otherwise.
In this example, we wish to filter the rows of the table table_1
to retain only rows where the
value of the column col_1
is of the form: ‘x’
followed by digits.
SELECT *
FROM table_1
WHERE REGEXP_CONTAINS(col_1, '^x\\d+$');
Here is how this works:
REGEXP_CONTAINS()
function to check, for each value of the column col_1
, whether a
regular expression is a match.REGEXP_CONTAINS()
function takes the following arguments:col_1
.'^x\\d+$'
.'^x\\d+$'
, where^
matches the start of a string.x
matches the letter x.\\d+
matches one or more digits.$
matches the end of a string.^
and end $
of the string in the regular expression we specify that it
must be a full match.table_1
where the value
of the column col_1
matches the regular expression '^x\\d+$'
.Substring
We wish to check whether a given substring occurs anywhere inside a given string and to
return TRUE
if there is a match and FALSE
otherwise.
In this example, we wish to filter the rows of the table table_1
to retain only rows where the
value of the column col_1
contains the substring ‘XX’
.
SELECT *
FROM table_1
WHERE col_1 LIKE '%XX%';
Here is how this works:
LIKE
to check for each value of the column col_1
, whether that value of the
column col_1
contains the substring 'XX'
.X [NOT] LIKE Y
operator checks if the string in the first operand X matches a pattern
specified by the second operand Y. Expressions can contain these characters:%
matches any number of characters or bytes. %XX%
means that the string
contains XX
.\
,_
, or %
using two backslashes. For example, \\%
. If we are using raw
strings,
only a single backslash is required. For example, r'\%'
.table_1
where the value
of the column col_1
contains the substring 'XX'
.LIKE
operator covers most matching use-cases except those that require regular expressions.
We recommend using LIKE
operator whenever possible as it is more convenient.Regular Expression
We wish to check whether a given regular expression has a match inside a given string and to
return TRUE
if there is such a match and FALSE
otherwise.
In this example, we wish to filter the rows of the table table_1
to retain only rows where the
value
of the column col_1
contains any integers represented by the regular expression '\\d+'
.
SELECT *
FROM table_1
WHERE REGEXP_CONTAINS(col_1, '\\d+');
Here is how this works:
REGEXP_CONTAINS()
function to check, for each value of the column col_1
, whether a
regular expression is a match.REGEXP_CONTAINS()
function takes the following arguments:col_1
.'\\d+'
.table_1
where the value
of the column col_1
contains any digits.Substring
We wish to check whether a given substring occurs at the beginning of a given string and to
return TRUE
if that is the case and FALSE
otherwise.
In this example, we wish to filter the rows of the table table_1
to retain only rows where the
value of the column col_1
starts with the substring ‘XX’
.
SELECT *
FROM table_1
WHERE col_1 LIKE 'XX%';
Here is how this works:
'XX%'
instead of '%XX%'
.table_1
where the value
of the column col_1
starts with ‘XX’
.Alternative: STARTS_WITH functions
SELECT *
FROM table_1
WHERE STARTS_WITH(col_1, 'XX');
Here is how this works:
STARTS_WITH()
function to check, for each value of the column col_1
, whether it
starts with a given substring.STARTS_WITH()
function takes the following arguments:col_1
.'XX'
.table_1
where the value
of the column col_1
starts with XX
.Regular Expression
We wish to check whether a given regular expression occurs at the beginning of a given string and to
return TRUE
if that is the case and FALSE
otherwise.
In this example, we wish to filter the rows of the table table_1
to retain only rows where the
value of the column col_1
starts with any integers represented by the regular
expression '^\\d+'
.
SELECT *
FROM table_1
WHERE REGEXP_CONTAINS(col_1, '^\\d+');
Here is how this works:
^
to specify that the regular expression must occur
at the start of the string.Substring
We wish to check whether a given substring occurs at the end of a given string and to return TRUE
if that is the case and FALSE
otherwise.
In this example, we wish to filter the rows of the table table_1
to retain only rows where the
value of the column col_1
ends with the substring ‘XX’
.
SELECT *
FROM table_1
WHERE col_1 LIKE '%XX';
Here is how this works:
'%XX'
instead of '%XX%'
.table_1
where the value
of the column col_1
ends with ‘XX’
.Alternative: ENDS_WITH functions
SELECT *
FROM table_1
WHERE ENDS_WITH(col_1, 'XX');
Here is how this works:
ENDS_WITH()
function to check, for each value of the column col_1
, whether it
ends with a given substring.ENDS_WITH()
function takes the following arguments:col_1
.'XX'
.table_1
where the value
of the column col_1
ends with XX
.Regular Expression
We wish to check whether a given regular expression occurs at the end of a given string and to
return TRUE
if that is the case and FALSE
otherwise.
In this example, we wish to filter the rows of the table table_1
to retain only rows where the
value of the column col_1
ends with any integers.
SELECT *
FROM table_1
WHERE REGEXP_CONTAINS(col_1, '\\d+$');
Here is how this works:
$
to specify that the regular expression must occur
at the end of the string.Substring
We wish to check whether a given substring occurs anywhere inside a given string while ignoring case
and to return TRUE
if there is a match and FALSE
otherwise.
In this example, we wish to filter the rows of the table table_1
to retain only rows where the
value of the column col_1
contains the substring ‘XX’
regardless of case.
SELECT *
FROM table_1
WHERE CONTAINS_SUBSTR(col_1, 'XX');
Here is how this works:
CONTAINS_SUBSTR()
which ignores the
case of the string.CONTAINS_SUBSTR()
function takes the following arguments:col_1
.'XX'
.Alternative: Lower Case
SELECT *
FROM table_1
WHERE LOWER(col_1) LIKE '%xx%';
Here is how this works:
LOWER(col_1)
, we lower the case of all the values of the column col_1
which are the
strings we are looking into.'xx'
.Regular Expression
We wish to check whether a given regular expression has a match inside a given string and to
return TRUE
if there is such a match and FALSE
otherwise.
In this example, we wish to filter the rows of the table table_1
to retain only rows where the
value of the column col_1
is of the form: ‘x’
followed by digits while ignoring the case.
SELECT *
FROM table_1
WHERE REGEXP_CONTAINS(LOWER(col_1), '^x\\d+');
Here is how this works:
REGEXP_CONTAINS()
.
See Contains above.LOWER()
function.^x\\d+
.We wish to return the inverse of the outcome of the string pattern matching operations described
above; i.e. if the outcome is TRUE
, return FALSE
and vice versa.
In this example, we wish to filter the rows of the table table_1
to retain only rows where the
value of the column col_1
does not contain any integers.
SELECT *
FROM table_1
WHERE NOT REGEXP_CONTAINS(col_1, '\\d+');
Here is how this works:
REGEXP_CONTAINS()
.
See Contains above.TRUE
when there is no match and FALSE
when there
is a match, we can use the complement operator NOT
.'\\d+'
checks for the occurrence of a substring comprised of one or more
digits.table_1
where the value
of the column col_1
does not contain any digits.LIKE
operator we can use either NOT X LIKE Y
or X NOT LIKE Y
.We wish to match a vector of strings against a vector of patterns of the same size. This is often needed when we wish to check the presence of the value of a column in another column for each row.
In this example, we wish to filter the rows of the data frame df
to retain only rows where the
value of the column col_1
contains the value of the column col_2
.
SELECT *
FROM table_1
WHERE REGEXP_CONTAINS(col_1, col_2);
Here is how this works:
REGEXP_CONTAINS()
will return TRUE
for rows where the value of col_2
is contained
in col_1
.table_1
where the value of the
column col_1
contains the value of the column col_2
.OR
We wish to check whether any of a set of patterns occurs in a given string and to return TRUE
if
that is the case and FALSE
otherwise.
In this example, we wish to filter the rows of the table table_1
to retain only rows where the
value of the column col_1
contains the string ‘XX’
or the string ‘YY’
.
SELECT *
FROM table_1
WHERE col_1 LIKE '%XX%'
OR col_1 LIKE '%YY%';
Here is how this works:
LIKE
operator twice:col_1
contains 'XX'
andcol_1
contains 'YY'
.OR
which will return TRUE
if either col_1 LIKE '%XX%'
or col_1 LIKE '%YY%'
returns TRUE
for that value of col_1
.table_1
where the value
of the column col_1
contains the string ‘XX’
or ‘YY’
.Alternative: Via Regular Expression
SELECT *
FROM table_1
WHERE REGEXP_CONTAINS(col_1, 'XX|YY');
Here is how this works:
|
of regular expressions to build a regular expression that captures all the patterns
we wish to look for or’ed together. In this case that regular expression is 'XX|YY'
.REGEXP_CONTAINS()
which then returns TRUE
if the string
being checked (a value of the column col_1
) contains either the substring ‘XX’
or the
substring ‘YY’
.AND
We wish to check whether each of a set of patterns occurs in a given string and to return TRUE
if
that is the case and FALSE
otherwise.
In this example, we wish to filter the rows of the table table_1
to retain only rows where the
value of the column col_1
contains the ends with the string ‘x’
and contains a sequence of one
or more digits.
SELECT *
FROM table_1
WHERE REGEXP_CONTAINS(LOWER(col_1), '^x')
AND REGEXP_CONTAINS(LOWER(col_1), '\\d+');
Here is how this works:
REGEXP_CONTAINS()
twice:col_1
matches the regular expression '^x'
which checks if the string starts with the letter ‘x’
.col_1
matches the regular expression '\\d+'
which checks if the string contains a sequence of one or more digits.AND
operator which will return TRUE
if both calls to REGEXP_CONTAINS()
returns TRUE
for that value of col_1
.table_1
where the value
of the column col_1
starts with 'x'
and contains at least one integer.