We wish to specify the function to use to select columns dynamically.
In Implicit Selection, we covered how to use a function that evaluates a column and returns a boolean True or False value indicating whether a column should be selected or not (aka a predicate function). In this section, we cover how to pass such a function dynamically.
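We will cover three scenarios as follows:
- Passing the function as an argument to another function where the actual column selection takes place.
- Specifying the function via an environment variable that holds a reference to the function.
- Specifying the function via an environment variable that holds the name of the function as a string.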
In this section, we use an example function that returns True if the proportion of missing values in a column is less than 10% and False otherwise.
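To make the example predicate concrete, here is a minimal sketch of what it returns. The data frame and its column names (col_a, col_b, other) are made up for illustration and are not part of the original example.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: 20 rows, each column with a different share of missing values.
df = pd.DataFrame({
    'col_a': np.arange(20, dtype=float),       # no missing values
    'col_b': [np.nan] + list(range(19)),       # 1 of 20 missing (5%)
    'other': [np.nan] * 5 + list(range(15)),   # 5 of 20 missing (25%)
})

def col_select_fun(col):
    # Despite the argument name, this receives the whole data frame:
    # isna().mean() gives the proportion of missing values per column,
    # and lt(0.1) turns that into one boolean per column.
    return col.isna().mean().lt(0.1)

mask = col_select_fun(df)
print(mask)
```

The result is a boolean Series indexed by column name, which is the shape loc[] expects for selecting columns.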
We wish to pass the function to use to select columns as an argument to another function where the actual column selection takes place.
Named Function
In this example, we wish to pass a named function, col_select_fun(), that selects the columns of a data frame that meet certain criteria, to pipeline(), the function where the actual column selection happens.
def col_select_fun(col):
    # Receives the whole data frame and returns one boolean per column.
    return col.isna().mean().lt(0.1)

def pipeline(df, fun):
    df = df \
        .loc[:, lambda x: fun(x)]
    return df

df_2 = df \
    .pipe(pipeline, col_select_fun)
Here is how this works:
- The column selection takes place in the function pipeline(). In real scenarios this would usually be a more elaborate chain of data transformations.
- We pass the function col_select_fun() to the pipeline() function via its fun argument.
- loc[] passes the data frame df through the lambda function to our custom column selection function col_select_fun().
- col_select_fun() takes the entire data frame at once and returns a Series of logical values where values corresponding to columns with less than 10% missing values are True and False otherwise.
- We use the pipe() function to chain together operations (like the %>% operation in R). We could also do without it like so: df_2 = pipeline(df, col_select_fun)
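The following sketch runs the named-function approach end to end and checks that the pipe() call and the direct call give the same result. The sample data frame and the variable names df_piped and df_direct are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with 0%, 5% and 25% missing values per column.
df = pd.DataFrame({
    'col_a': np.arange(20, dtype=float),
    'col_b': [np.nan] + list(range(19)),
    'other': [np.nan] * 5 + list(range(15)),
})

def col_select_fun(col):
    # One boolean per column: True where less than 10% of values are missing.
    return col.isna().mean().lt(0.1)

def pipeline(df, fun):
    # loc[] calls the lambda with the data frame, which forwards it to fun.
    return df.loc[:, lambda x: fun(x)]

df_piped = df.pipe(pipeline, col_select_fun)   # chained style
df_direct = pipeline(df, col_select_fun)       # plain function call

print(list(df_piped.columns))
print(df_piped.equals(df_direct))
```

Only the columns below the 10% threshold survive, and both call styles produce identical data frames, so the choice between them is a matter of how the rest of the chain is written.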
Lambda Function
In this example, we wish to pass a lambda function that selects the columns of a data frame that meet certain criteria to pipeline(), the function where the actual column selection happens.
def pipeline(df, fun):
    df_2 = df.loc[:, fun(df)]
    return df_2

df_2 = df \
    .pipe(pipeline,
          lambda x: x.columns.str.contains('col', regex=False))
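For a version of this call that can be run as is, here is a minimal sketch on a small made-up data frame. The column names and the second lambda (a prefix match) are assumptions added to show that the selection rule can be swapped without touching pipeline().

```python
import pandas as pd

# Hypothetical sample data; only the column names matter here.
df = pd.DataFrame({'col_a': [1, 2], 'col_b': [3, 4], 'other': [5, 6]})

def pipeline(df, fun):
    # fun(df) yields one boolean per column, which loc[] uses as a column mask.
    return df.loc[:, fun(df)]

# Select columns whose name contains the literal substring 'col'.
df_2 = df.pipe(pipeline, lambda x: x.columns.str.contains('col', regex=False))

# Same pipeline, different rule: select columns whose name starts with 'o'.
df_3 = df.pipe(pipeline, lambda x: x.columns.str.startswith('o'))

print(list(df_2.columns))
print(list(df_3.columns))
```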
Here is how this works:
- We pass a lambda function to the pipeline() function via its fun argument.
- Inside pipeline(), the data frame df is passed to the lambda function referred to by fun, and loc[] uses the boolean array it returns to select the columns.
We wish to specify the function to use for column selection via an environment variable that holds a reference to the function.
def col_select_fun(col):
    return col.isna().mean().lt(0.1)

fun = col_select_fun

df_2 = df \
    .loc[:, lambda x: fun(x)]
Here is how this works:
- The environment variable fun holds the function that we wish to use for column selection. Here that is col_select_fun().
- loc[] passes the data frame through the lambda function to the function referred to by the environment variable fun, which is our custom column selection function col_select_fun().
- col_select_fun() takes the entire data frame at once and returns a Series of logical values where values corresponding to columns with less than 10% missing values are True and False otherwise.
We wish to specify the function to use for column selection via an environment variable that holds the name of the function as a string.
def col_select_fun(col):
    return col.isna().mean().lt(0.1)

fun = 'col_select_fun'

df_2 = df \
    .loc[:, lambda x: globals()[fun](x)]
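Here is how this works:
- globals() and locals() return a dictionary representing the current global or local (in a function body, for instance) symbol table. We index that dictionary with the string held in fun to obtain the function itself.
- If the function belongs to a module or a class rather than to the current symbol table, we can use the getattr() function, which takes two arguments: the module or class to which the function belongs (e.g. pd.Series) and the function's name as a string.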
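The two lookups described above can be sketched as follows. The sample data, the merged namespace dictionary, the callable() check and the choice of nunique as the method looked up with getattr() are all assumptions made for illustration, not part of the original example.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with 0%, 5% and 25% missing values per column.
df = pd.DataFrame({
    'col_a': np.arange(20, dtype=float),
    'col_b': [np.nan] + list(range(19)),
    'other': [np.nan] * 5 + list(range(15)),
})

def col_select_fun(col):
    return col.isna().mean().lt(0.1)

# 1) Resolve a function defined in the current scope from its name.
fun = 'col_select_fun'
# Merging both symbol tables lets the lookup work at module level
# and inside a function body alike.
namespace = {**globals(), **locals()}
selector = namespace[fun]
if not callable(selector):
    raise TypeError(f'{fun!r} does not refer to a function')
df_2 = df.loc[:, lambda x: selector(x)]

# 2) Resolve a function that belongs to a class from its name.
method_name = 'nunique'
method = getattr(pd.Series, method_name)   # same object as pd.Series.nunique
n_unique = method(df['col_a'])             # equivalent to df['col_a'].nunique()

print(list(df_2.columns))
print(n_unique)
```

Because the name arrives as a string, a misspelled name only fails at lookup time (a KeyError for the dictionary, an AttributeError for getattr()), so validating it early, as the callable() check does here, can make such errors easier to trace.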