Dynamic Column Specification

We wish to pass the names of the columns to be selected dynamically.

We will cover the following:

  1. As Environment Variable: The names of the columns to be selected are specified as a variable in the environment. This is useful for instance when structuring a script and wish to have column specification as part of a configuration section at the beginning of the script separate from the logic.
  2. As Function Argument: Column selection happens inside a function and the names of the columns to be selected are passed to the function as an argument. This is useful when we wish to write our data manipulation pipeline as a reusable function.
  3. Flexible Matching: We see how to ignore elements in the passed list of column names that do not match any actual column names.

As Environment Variable

We wish to specify the names of the columns to be selected as a variable in the environment.

In this example, we specify the names of the columns we wish to select as a variable cols_to_select and then use that variable for column selection.

cols_to_select = ['col_1','col_2','col_3']
df_2 = df.loc[:, cols_to_select] 

Here is how this works:

  • We specify the columns we wish to select as a variable; here cols_to_select.
  • We then pass that variable to loc[]. We cover loc[] in detail in Section 1.2.
  • If the list of column names passed may contain elements that do not match any column names, execution will fail and return a KeyError. To ignore list elements that do not match any of the data frame’s column names, see “Flexible Matching” below.

As Function Argument

We wish to pass the names of the columns to be selected as ab argument to a function. The actual column selection happens inside the function.

In this example, column selection happens inside the function pipeline() which takes the names of the columns to be selected as an argument cols_to_select.

def pipeline(df, cols_to_select):
    df = df\
        .loc[:, cols_to_select]
    return df

df_2 = df \
    .pipe(pipeline, ['col_1','col_2','col_3'])

Here is how this works:

  • Column selection happens inside the custom function pipeline(). In real scenarios this would usually be a more elaborate chain of data transformations.
  • We pass the column names to select to the function via the argument cols_to_select.
  • We use the pipe() function from Pandas to chain together operations (like the %>% operation in R). See Chapter X for more on chaining and the pipe() function. We could also do with out it like so: df_2 = pipeline(df, ['col_1','col_2','col_3'])
  • If the list of column names passed may contain elements that do not match any column names, execution will fail and return a KeyError. To ignore list elements that do not match any of the data frame’s column names, see “Flexible Matching” below.

Flexible Matching

We wish to select any columns whose names are in a list of strings where the list may have strings that do not match any column names. We wish to select columns with matching names and ignore the non matching strings (do not throw an error).

possible_cols = ['col_1','col_2','col_x']
df_2 = df.loc[:, lambda x: x.columns.isin(possible_cols)] 

Here is how this works:

  • We use isin() (a method of Pandas Series objects) that returns True for each column whose name exists in the list possible_cols resulting in a Series of boolean values.
  • loc[] accepts a Series of boolean values and returns the columns where the corresponding boolean value is True. See Section 1.2 for more on loc[].
  • We use a lambda function inside loc[] so it is robust to use in a chain where columns may have been altered in previous steps. If that is not the case we can refer to the data frame, here df, directly without a lambda function like so df.loc[:, df.columns.isin(possible_cols)].
PYTHON
I/O