Dynamic Function Specification

We wish to specify the function to use to select columns dynamically.

In Implicit Selection, we covered how to use a predicate function, i.e. a function that evaluates a column and returns a boolean True or False value indicating whether that column should be selected. In this section, we cover how to pass such a function dynamically.

We will cover three scenarios:

  1. As Function Argument, where selection happens inside a function A to which we pass a column selection function B as an argument.
  2. As Reference Variable, where the function to use for column selection is held in a variable in the current environment.
  3. As String Variable, where the name of the function to use for column selection is held as a string in a variable in the current environment.

In this section, we use an example function that returns True if the proportion of missing values in a column is less than 10% and False otherwise.

As Function Argument

We wish to pass the function to use to select columns as an argument to another function where the actual column selection takes place.

Named Function

In this example, we wish to pass a named function, col_select_fun(), which selects the columns of a data frame that meet certain criteria, to pipeline(), the function where the actual column selection happens.

def col_select_fun(df):
    # Receives the whole data frame; returns one boolean per column.
    return df.isna().mean().lt(0.1)

def pipeline(df, fun):
    df = df \
        .loc[:, lambda x: fun(x)]
    return df

df_2 = df \
    .pipe(pipeline, col_select_fun)

Here is how this works:

  • Column selection happens inside the custom function pipeline(). In real scenarios this would usually be a more elaborate chain of data transformations.
  • We pass the column selection function col_select_fun() to the pipeline() function via its fun argument.
  • loc[] passes the data frame df through the lambda function to our custom column selection function col_select_fun().
  • col_select_fun() takes the entire data frame at once and returns a boolean Series, indexed by column name, that is True for columns with less than 10% missing values and False otherwise.
  • We use the pipe() function to chain operations together (like the %>% operator in R). We could also do without it like so: df_2 = pipeline(df, col_select_fun)
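To make the steps above concrete, here is a minimal runnable sketch. The sample data frame and its column names are made up for illustration and are not part of the original example; the two functions mirror the ones defined above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: 'col_b' is 50% missing, the other columns have no gaps.
df = pd.DataFrame({
    'col_a': [1, 2, 3, 4],
    'col_b': [np.nan, np.nan, 1.0, 2.0],
    'col_c': ['w', 'x', 'y', 'z'],
})

def col_select_fun(df):
    # Proportion of missing values per column, compared against 10%.
    return df.isna().mean().lt(0.1)

def pipeline(df, fun):
    # loc[] calls the lambda with the data frame and uses the boolean result
    # to decide which columns to keep.
    return df.loc[:, lambda x: fun(x)]

mask = col_select_fun(df)
df_2 = df.pipe(pipeline, col_select_fun)

print(mask.to_dict())
print(list(df_2.columns))
```

With this data, only 'col_b' fails the 10% threshold, so df_2 keeps 'col_a' and 'col_c'.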

Lambda Function

In this example, we wish to pass a lambda function that selects the columns of a data frame that meet certain criteria (here, columns whose names contain the string 'col') to pipeline(), the function where the actual column selection happens.

def pipeline(df, fun):
    df_2 = df.loc[:, fun(df)]
    return df_2

df_2 = df\
    .pipe(pipeline,
          lambda x: x.columns.str.contains('col', regex=False))

Here is how this works:

  • We pass the column selection lambda function to the pipeline() function via its fun argument.
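  • Inside pipeline(), we call the lambda function referred to by fun directly on the data frame df. It returns a boolean array with one value per column, which loc[] then uses to select columns.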
  • The rest of the code works similarly to the “Named Function” scenario above.
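As a quick check of what the lambda returns, the sketch below runs the same pipeline() on a small made-up data frame (the column names are assumptions for illustration only).

```python
import pandas as pd

# Hypothetical data: two column names contain 'col', one does not.
df = pd.DataFrame({'col_a': [1, 2], 'other': [3, 4], 'col_b': [5, 6]})

def pipeline(df, fun):
    # fun(df) is evaluated first; loc[] receives the resulting boolean array.
    return df.loc[:, fun(df)]

name_has_col = lambda x: x.columns.str.contains('col', regex=False)

mask = name_has_col(df)
df_2 = df.pipe(pipeline, name_has_col)

print(list(mask))
print(list(df_2.columns))
```

The mask has one entry per column, in column order, so 'other' is dropped while 'col_a' and 'col_b' are kept.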

As Reference Variable

We wish to specify the function to use for column selection via a variable in the current environment that holds a reference to the function.

def col_select_fun(df):
    # Receives the whole data frame; returns one boolean per column.
    return df.isna().mean().lt(0.1)

fun = col_select_fun

df_2 = df \
    .loc[:, lambda x: fun(x)]

Here is how this works:

  • The variable fun holds a reference to the function that we wish to use for column selection. Here that is col_select_fun().
  • loc[] passes the data frame through the lambda function to the function referred to by the variable fun, which is our custom column selection function col_select_fun().
  • col_select_fun() takes the entire data frame at once and returns a boolean Series, indexed by column name, that is True for columns with less than 10% missing values and False otherwise.
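The benefit of a reference variable is that the selection code stays the same while the function it points to changes. The sketch below illustrates this under some assumptions: the sample data and the two predicate functions few_missing() and no_missing() are hypothetical and exist only for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'col_a' has no gaps, 'col_b' is 5% missing, 'col_c' is 50% missing.
df = pd.DataFrame({
    'col_a': range(20),
    'col_b': [np.nan] + [1.0] * 19,
    'col_c': [np.nan] * 10 + [1.0] * 10,
})

def few_missing(df):
    # True for columns with less than 10% missing values.
    return df.isna().mean().lt(0.1)

def no_missing(df):
    # True for columns with no missing values at all.
    return df.isna().sum().eq(0)

# Point the variable at one function, then at the other;
# the selection expression itself is identical in both cases.
fun = few_missing
df_2 = df.loc[:, lambda x: fun(x)]

fun = no_missing
df_3 = df.loc[:, lambda x: fun(x)]

print(list(df_2.columns))
print(list(df_3.columns))
```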

As String Variable

We wish to specify the function to use for column selection via a variable in the current environment that holds the name of the function as a string.

def col_select_fun(df):
    # Receives the whole data frame; returns one boolean per column.
    return df.isna().mean().lt(0.1)

fun = 'col_select_fun'

df_2 = df \
    .loc[:, lambda x: globals()[fun](x)]

Here is how this works:

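  • globals() and locals() return a dictionary representing the current global or local (e.g. inside a function body) symbol table, mapping names to objects. Indexing that dictionary with the string held in fun, as in globals()[fun], returns the function object, which we then call on the data frame.
  • If the function is an attribute of a module or class rather than simply defined in the current environment, we can use Python's getattr() function, which takes two arguments: the object to which the function belongs (e.g. pd.Series) and the function's name as a string.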
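As an illustration of the getattr() approach, the sketch below looks up a predicate by name in the pd.api.types module. The choice of is_numeric_dtype() and the sample data are assumptions made for this example; any function reachable as an attribute of a module or class could be looked up the same way.

```python
import pandas as pd

# Hypothetical data: two numeric columns and one string column.
df = pd.DataFrame({
    'col_a': [1, 2, 3],
    'col_b': [0.5, 1.5, 2.5],
    'col_c': ['x', 'y', 'z'],
})

# Name of the function as a string, e.g. read from a config file.
fun = 'is_numeric_dtype'

# getattr(module, name) returns the function object stored under that name.
predicate = getattr(pd.api.types, fun)

# Apply the predicate to each column's dtype to build a boolean list for loc[].
df_2 = df.loc[:, lambda x: [predicate(dtype) for dtype in x.dtypes]]

print(list(df_2.columns))
```

If the string does not name an attribute of the module, getattr() raises an AttributeError, so it can be worth validating the name before using it.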