We wish to pass the names of the columns to be selected dynamically.
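We will cover the following: specifying the column names as a variable in the environment, passing them as an argument to a function, and flexible matching, where names that do not match any column are ignored.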
We wish to specify the names of the columns to be selected as a variable in the environment.
In this example, we specify the names of the columns we wish to select as a variable cols_to_select and then use that variable for column selection.
cols_to_select = ['col_1','col_2','col_3']
df_2 = df.loc[:, cols_to_select]
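The snippet above assumes that a data frame df already exists. For a self-contained run, the sketch below builds a small hypothetical data frame (the data and the extra column col_4 are assumptions made for illustration) and also shows what happens when the list contains a name that is not a column.

```python
import pandas as pd

# Hypothetical sample data; only the column names mirror the example above.
df = pd.DataFrame({
    'col_1': [1, 2, 3],
    'col_2': ['a', 'b', 'c'],
    'col_3': [0.1, 0.2, 0.3],
    'col_4': [True, False, True],
})

cols_to_select = ['col_1', 'col_2', 'col_3']
df_2 = df.loc[:, cols_to_select]

# Only the requested columns remain, in the order given by the list.
print(list(df_2.columns))

# A name that is not a column makes .loc[] raise a KeyError.
try:
    df.loc[:, ['col_1', 'col_x']]
    missing_raises = False
except KeyError:
    missing_raises = True
print(missing_raises)
```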
Here is how this works:
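We store the names of the columns we wish to select in a list, cols_to_select.
We pass cols_to_select to .loc[] to select those columns. We cover loc[] in detail in Section 1.2.
If any element of the list does not match a column name, Pandas raises a KeyError. To ignore list elements that do not match any of the data frame’s column names, see “Flexible Matching” below.

We wish to pass the names of the columns to be selected as an argument to a function. The actual column selection happens inside the function.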
In this example, column selection happens inside the function pipeline(), which takes the names of the columns to be selected as an argument cols_to_select.
def pipeline(df, cols_to_select):
    df = df\
        .loc[:, cols_to_select]
    return df

df_2 = df \
    .pipe(pipeline, ['col_1','col_2','col_3'])
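As a runnable sketch of the same pattern, the block below uses a small hypothetical data frame (the data and the names cols, df_piped, df_direct and df_kw are assumptions for illustration). It compares chaining with .pipe() against calling the function directly.

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({
    'col_1': [1, 2, 3],
    'col_2': ['a', 'b', 'c'],
    'col_3': [0.1, 0.2, 0.3],
    'col_4': [True, False, True],
})

def pipeline(df, cols_to_select):
    # Column selection happens here; the caller only supplies the names.
    return df.loc[:, cols_to_select]

cols = ['col_1', 'col_2', 'col_3']

# .pipe() passes the data frame as the first argument of pipeline().
df_piped = df.pipe(pipeline, cols)

# Calling the function directly gives the same result.
df_direct = pipeline(df, cols)

# Additional arguments can also be passed to .pipe() by keyword.
df_kw = df.pipe(pipeline, cols_to_select=cols)

print(df_piped.equals(df_direct))
```

Passing the list by keyword can make longer chains easier to read, since the argument name documents what the list is for.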
Here is how this works:
We define a function pipeline() that performs the column selection. In real scenarios this would usually be a more elaborate chain of data transformations.
The function takes the names of the columns to be selected as the argument cols_to_select.
We use the .pipe() function from Pandas to chain together operations (like the %>% operator in R). See Chapter X for more on chaining and the pipe() function. We could also do without it like so: df_2 = pipeline(df, ['col_1','col_2','col_3'])
If any element of the list does not match a column name, Pandas raises a KeyError. To ignore list elements that do not match any of the data frame’s column names, see “Flexible Matching” below.

We wish to select any columns whose names are in a list of strings, where the list may contain strings that do not match any column names. We wish to select the columns with matching names and ignore the non-matching strings (without raising an error).
possible_cols = ['col_1','col_2','col_x']
df_2 = df.loc[:, lambda x: x.columns.isin(possible_cols)]
Here is how this works:
We use isin() (a method of Pandas Index objects, also available on Series objects) that returns True for each column whose name exists in the list possible_cols, resulting in an array of boolean values.
loc[] accepts an array of boolean values and returns the columns where the corresponding boolean value is True. See Section 1.2 for more on loc[].
We use a lambda function inside loc[] so it is robust to use in a chain where columns may have been altered in previous steps. If that is not the case, we can refer to the data frame, here df, directly without a lambda function like so: df.loc[:, df.columns.isin(possible_cols)]
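To illustrate why the lambda function matters inside a chain, the sketch below uses a hypothetical data frame in which one column is renamed earlier in the chain (the data, the column name c1 and the rename step are assumptions for illustration). A mask computed from the original df does not see the renamed column, while the lambda is evaluated on the data frame as it exists at that step.

```python
import pandas as pd

# Hypothetical sample data; 'c1' is renamed to 'col_1' inside the chain.
df = pd.DataFrame({
    'c1': [1, 2, 3],
    'col_2': ['a', 'b', 'c'],
    'col_4': [True, False, True],
})

possible_cols = ['col_x', 'col_2', 'col_1']

# isin() on the column index returns one boolean per column.
mask = df.columns.isin(possible_cols)
print(mask.tolist())  # 'c1' has not been renamed yet, so only 'col_2' matches

# Inside a chain, the lambda receives the data frame as it is at that step,
# so the renamed column 'col_1' is matched as well.
df_2 = (
    df
    .rename(columns={'c1': 'col_1'})
    .loc[:, lambda x: x.columns.isin(possible_cols)]
)
print(list(df_2.columns))  # ['col_1', 'col_2']
```

Note that the selected columns come back in the data frame's own order, not in the order of possible_cols, and the non-matching name col_x is silently ignored.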