Selecting by Name Pattern

Oftentimes, we wish to select columns whose names fit a particular pattern. This is helpful in situations where we would otherwise need to type in many similar column names explicitly or in situations where we don’t know the exact column names in advance (but we know the pattern they would follow).

Column names are strings. We can therefore apply string matching operations to them. We will cover the string matching operations most commonly used in a column selection context on this page. Please see String Operations for a detailed coverage of string matching.

Starts With

We wish to select columns whose names start with a given prefix.

In this example, we select all columns whose name starts with the prefix ‘col’.

df_2 = df.loc[:, lambda x: x.columns.str.startswith('col')]

Here is how this works:

  • df.columns returns the column names of the data frame as a Series.
  • str.startswith(’col’) returns a Series of booleans where an element is True if the corresponding column’s name starts with the string ‘col’.
  • In addition to its ability to take column names as input, loc[] can take a Series of booleans as input and would select and return the columns where the corresponding input Series element is True.

Ends With

We wish to select columns whose names end with a given suffix.

In this example, we select all columns whose name ends with the suffix ‘_id’.

df_2 = df.loc[:, lambda x: x.columns.str.endswith('_id')]

Here is how this works:

This works similarly to the scenario described above, except that we use endswith() instead of startswith().

Contains

We wish to select columns whose names contain a given substring.

In this example, we select all columns whose name contains the substring ‘col’.

df_2 = df.loc[:, lambda x: x.columns.str.contains('col', regex=False)]

Here is how this works:

  • str.contains() checks if a string contains a given substring and returns True if so and False otherwise.
  • str.contains() assumes a regex pattern by default. To specify that we wish to match a simple string (character sequence), we set the argument regex=False.

Alternatively:

For consistency, we recommend using loc[] and str.contains() as shown above. However, there are two alternative methods that involve significantly fewer keystrokes and may be more fitting in some contexts e.g. interactive inspection.

Alternative: Via a list comprehension

df_2 = df[[c for c in df.columns if 'col' in c]]

The list comprehension uses the Python in operator to identify matching strings and returns a list of column names that satisfy the condition. That list of matching column names is then passed to the bracket operator [] to select those columns.

Alternative: Via filter()

df_2 = df.filter(like='col')

filter() allows us to select columns via column names only and not the data values.

Matches RegEx

We wish to select columns whose names match a given regular expression.

In this example we wish to select all columns whose names end with the digit 3 matched by the regular expression '3$'.

df_2 = df.loc[:, lambda x: x.columns.str.contains('3$')]

Here is how this works:

  • str.contains() assumes a regex pattern by default i.e. regex=True by default.
  • str.contains(regex) returns a Series of booleans where an element is True if the corresponding column’s name contains the passed regular expression regex.

Multiple Tokens

We wish to select columns whose names include any one of a set of substrings.

In this example, we wish to select columns whose names contain any of the words ‘lat’, or ‘long’, or ‘location’ .

df_2 = df.loc[:, lambda x: x.columns.str.contains('lat|long|location')]

Here is how this works:

  • We are using the regex or | character to specify that a match should happen if one or more of the substrings occurs in it.
  • x.columns.str.contains() matches the passed regular expression against each column name and returns True where there is a match and False otherwise.
  • The boolean Series returned by str.contains() is passed to loc[] which selects and returns only the columns corresponding to a value of True.

Extensions:

  • Here we spelt out the pattern explicitly which is okay for short patterns. For longer patterns or when the substrings to be concatenated are defined in a list, a better approach is to use the Python join() function like this: '|'.join(substring_list).
  • Here we are assuming that the substrings to match against have no special characters such as $ and ^ (that we want to match literally). See String Operations for how to deal with (escape) such special characters.
  • Here we covered the or scenario. In String Operations we cover how to execute an “and” in regular expressions (via what is called a Lookahead).
PYTHON
I/O