Oftentimes, we wish to select columns whose names fit a particular pattern. This is helpful in situations where we would otherwise need to type in many similar column names explicitly or in situations where we don’t know the exact column names in advance (but we know the pattern they would follow).
Column names are strings. We can therefore apply string matching operations to them. We will cover the string matching operations most commonly used in a column selection context on this page. Please see String Operations for a detailed coverage of string matching.
We wish to select columns whose names start with a given prefix.
In this example, we select all columns whose name starts with the prefix ‘col’.
df_2 = df.loc[:, lambda x: x.columns.str.startswith('col')]
Here is how this works:
- df.columns returns the column names of the data frame as an Index.
- .str.startswith('col') returns an array of booleans where an element is True if the corresponding column’s name starts with the string ‘col’.
- .loc[] can take an array (or Series) of booleans as input and selects and returns the columns where the corresponding input element is True.
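To make this concrete, here is a minimal, self-contained sketch; the data frame and its column names (col_a, col_b, year) are invented for illustration.
import pandas as pd

# Hypothetical data frame for demonstration
df = pd.DataFrame({
    'col_a': [1, 2],
    'col_b': [3, 4],
    'year': [2020, 2021],
})

# Keep only the columns whose names start with 'col'
df_2 = df.loc[:, lambda x: x.columns.str.startswith('col')]
print(df_2.columns.tolist())  # ['col_a', 'col_b']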
We wish to select columns whose names end with a given suffix.
In this example, we select all columns whose name ends with the suffix ‘_id’.
df_2 = df.loc[:, lambda x: x.columns.str.endswith('_id')]
Here is how this works:
This works similarly to the scenario described above, except that we use endswith() instead of startswith().
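A quick sketch with an invented data frame (the column names user_id, order_id, and amount are assumptions for illustration):
import pandas as pd

df = pd.DataFrame({'user_id': [1, 2], 'order_id': [10, 20], 'amount': [5.0, 7.5]})

# Keep only the columns whose names end with '_id'
df_2 = df.loc[:, lambda x: x.columns.str.endswith('_id')]
print(df_2.columns.tolist())  # ['user_id', 'order_id']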
We wish to select columns whose names contain a given substring.
In this example, we select all columns whose name contains the substring ‘col’.
df_2 = df.loc[:, lambda x: x.columns.str.contains('col', regex=False)]
Here is how this works:
- str.contains() checks if a string contains a given substring and returns True if so and False otherwise.
- str.contains() assumes a regex pattern by default. To specify that we wish to match a simple string (character sequence), we set the argument regex=False.
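A short sketch, again with invented column names, showing that the plain-substring match picks up ‘col’ anywhere in the name:
import pandas as pd

df = pd.DataFrame({'my_col_1': [1], 'my_col_2': [2], 'year': [2020]})

# regex=False means 'col' is matched as a plain substring, not a regular expression
df_2 = df.loc[:, lambda x: x.columns.str.contains('col', regex=False)]
print(df_2.columns.tolist())  # ['my_col_1', 'my_col_2']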
Alternatively:
For consistency, we recommend using loc[] and str.contains() as shown above. However, there are two alternative methods that involve significantly fewer keystrokes and may be more fitting in some contexts, e.g. interactive inspection.
Alternative: Via a list comprehension
df_2 = df[[c for c in df.columns if 'col' in c]]
The list comprehension uses the Python in operator to identify matching strings and returns a list of column names that satisfy the condition. That list of matching column names is then passed to the bracket operator [] to select those columns.
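A sketch with invented column names that shows the intermediate list the comprehension produces before the selection (the helper name matching is arbitrary):
import pandas as pd

df = pd.DataFrame({'my_col_1': [1], 'my_col_2': [2], 'year': [2020]})

# The comprehension builds a plain list of matching column names ...
matching = [c for c in df.columns if 'col' in c]
print(matching)  # ['my_col_1', 'my_col_2']

# ... which the bracket operator then uses to select those columns
df_2 = df[matching]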
Alternative: Via filter()
df_2 = df.filter(like='col')
filter() allows us to select columns based on their names only, not on the data values.
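A short sketch with invented column names; note that filter() also accepts a regular expression via its regex parameter:
import pandas as pd

df = pd.DataFrame({'my_col_1': [1], 'my_col_2': [2], 'year': [2020]})

# like= keeps columns whose name contains the given substring
df_a = df.filter(like='col')    # my_col_1, my_col_2

# regex= keeps columns whose name matches the regular expression
df_b = df.filter(regex='_2$')   # my_col_2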
We wish to select columns whose names match a given regular expression.
In this example, we wish to select all columns whose names end with the digit 3, matched by the regular expression '3$'.
df_2 = df.loc[:, lambda x: x.columns.str.contains('3$')]
Here is how this works:
- str.contains() assumes a regex pattern by default, i.e. regex=True by default.
- str.contains(regex) returns an array of booleans where an element is True if the corresponding column’s name contains a match for the passed regular expression regex.
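A sketch with invented column names (metric_3, metric_13, metric_4) showing that any name whose last character is 3 is selected:
import pandas as pd

df = pd.DataFrame({'metric_3': [1], 'metric_13': [2], 'metric_4': [3]})

# '3$' matches names whose last character is the digit 3
df_2 = df.loc[:, lambda x: x.columns.str.contains('3$')]
print(df_2.columns.tolist())  # ['metric_3', 'metric_13']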
We wish to select columns whose names include any one of a set of substrings.
In this example, we wish to select columns whose names contain any of the words ‘lat’, ‘long’, or ‘location’.
df_2 = df.loc[:, lambda x: x.columns.str.contains('lat|long|location')]
Here is how this works:
- The regular expression 'lat|long|location' uses the alternation character | to specify that a match should happen if any one of the substrings occurs in the column name.
- x.columns.str.contains() matches the passed regular expression against each column name and returns True where there is a match and False otherwise.
- The array of booleans returned by str.contains() is passed to loc[], which selects and returns only the columns corresponding to a value of True.
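A sketch with invented column names illustrating the alternation pattern:
import pandas as pd

df = pd.DataFrame({
    'latitude': [1.0], 'longitude': [2.0], 'store_location': ['NY'], 'name': ['a'],
})

# Any column name containing 'lat', 'long', or 'location' is selected
df_2 = df.loc[:, lambda x: x.columns.str.contains('lat|long|location')]
print(df_2.columns.tolist())  # ['latitude', 'longitude', 'store_location']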
Extensions:
- If we have many substrings to match, we can build the regular expression by concatenating them with the join() function like this: '|'.join(substring_list).
- Some of the substrings may contain special regex characters such as $ and ^ (that we want to match literally). See String Operations for how to deal with (escape) such special characters.
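The following sketch combines both extensions; the substrings and column names are invented for illustration. '|'.join() builds the alternation pattern, and re.escape() from Python's standard library escapes any special characters so they are matched literally:
import re
import pandas as pd

df = pd.DataFrame({'price_usd$': [1.0], 'price_eur': [2.0], 'qty': [3]})

substring_list = ['usd$', 'eur']

# re.escape() neutralizes special regex characters (here the '$'),
# and '|'.join() combines the substrings into one alternation pattern
pattern = '|'.join(re.escape(s) for s in substring_list)

df_2 = df.loc[:, lambda x: x.columns.str.contains(pattern)]
print(df_2.columns.tolist())  # ['price_usd$', 'price_eur']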