We wish to dynamically specify the predicate function(s) that will be applied to filter the rows of a data frame. A predicate function is one that returns a logical value: True or False.
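We will cover how to dynamically specify functions in each of the following scenarios:
- Named Function: such as isna().
- Comparison Operator: such as <, >, or ==.
- Lambda Function: such as lambda x: x > 5.
- Multiple Functions: such as [pd.isna, np.isinf].
Named Function
We wish to dynamically specify a named function to be used to filter the rows of a data frame.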
As Function Variable
We wish to use a function referred to via a variable to filter the rows of a data frame.
In this example, we wish to apply a function referred to via a variable fun to the column 'col_1' to filter the rows of the data frame df.
fun = pd.isna
df_2 = df.loc[fun(df['col_1'])]
Here is how this works:
- The variable fun holds a reference to the Pandas function isna().
- We use fun just like we would use isna(). In other words, in this case fun(df['col_1']) is equivalent to pd.isna(df['col_1']), which is equivalent to df['col_1'].isna().
- loc[] returns the rows of the data frame df for which the function referred to by fun returns True.
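The following is a minimal sketch of the idea above on a small, made-up data frame (the sample values are assumptions for illustration only). It shows that rebinding fun changes the filter without touching the filtering expression.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; only the column name 'col_1' comes from the text.
df = pd.DataFrame({'col_1': [1.0, np.nan, 3.0, np.nan],
                   'col_2': ['a', 'b', 'c', 'd']})

# The predicate is chosen at run time by binding it to a variable.
fun = pd.isna
df_missing = df.loc[fun(df['col_1'])]

# Rebinding the variable swaps the predicate; the filter expression is unchanged.
fun = pd.notna
df_present = df.loc[fun(df['col_1'])]

# The function form and the method form produce the same boolean mask.
mask_function = pd.isna(df['col_1'])
mask_method = df['col_1'].isna()
```

Because fun is an ordinary Python object, it can be chosen by any run-time logic, such as a configuration value or a branch in the calling code.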
As Function Argument
We wish to pass a function as an argument to another function, which then uses it to filter the rows of a data frame.
In this example, we wish to filter the rows of the data frame df via a custom function that takes a predicate function as input and applies it to the column 'col_1'.
def m_filter(df, fun):
    df_2 = df.loc[fun(df['col_1'])]
    return df_2

df_2 = df.pipe(m_filter, pd.isna)
Here is how this works:
- We define a custom function m_filter() which takes a data frame df and a function fun as inputs and uses the function fun to filter the rows of the data frame df.
- In df.pipe(m_filter, pd.isna), we use pipe() to pass to m_filter() the data frame df and the named function pd.isna.
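As a hedged variation on the function above, the sketch below adds a hypothetical col parameter (not part of the original m_filter()) so that the column to test can also be chosen by the caller. It relies on pipe() forwarding extra positional and keyword arguments to the piped function. The sample data is made up.

```python
import numpy as np
import pandas as pd

def m_filter_col(df, fun, col='col_1'):
    """Keep the rows of df for which fun(df[col]) is True.

    The 'col' parameter is an illustrative addition, not part of the
    m_filter() function shown in the text.
    """
    return df.loc[fun(df[col])]

# Hypothetical sample data.
df = pd.DataFrame({'col_1': [1.0, np.nan, 3.0],
                   'col_2': [np.nan, 5.0, 6.0]})

# pipe() passes df as the first argument and forwards the rest.
df_2 = df.pipe(m_filter_col, pd.isna)                 # tests 'col_1'
df_3 = df.pipe(m_filter_col, pd.notna, col='col_2')   # tests 'col_2'
```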
As String Variable
We wish to specify the function to use for row filtering via a variable that holds the name of the function as a string.
In this example, we wish to apply a function whose name is available as a string variable fun to the column 'col_1' to filter the rows of the data frame df.
mod = 'pd'
fun = 'isna'
df_2 = df.loc[getattr(globals()[mod], fun)(df['col_1'])]
Here is how this works:
- In getattr(globals()[mod], fun), the expression globals()[mod] looks up the name 'pd' in the global symbol table and returns the Pandas module, so the whole expression is equivalent to getattr(pd, fun).
- In getattr(pd, fun), we use Python's getattr() function which takes two arguments: the module to which the function belongs, to which we pass pd, and the function's name as a string, to which we pass the string variable fun which holds the string 'isna'.
- We use the function returned by getattr() just like we would use isna(). In other words, in this case getattr(pd, fun)(df['col_1']) is equivalent to pd.isna(df['col_1']), which is equivalent to df['col_1'].isna().
- We use getattr() in this case because we wish to obtain a function that is part of a module. If we wish to obtain a custom function defined in the current environment we can use globals() or locals(), which return a dictionary representing the current global or local symbol table. See the “As String Variable” scenario under “Multiple Functions” below for an example.
Comparison Operator
We wish to dynamically specify a comparison operator, e.g. greater than, to be used to filter the rows of a data frame.
In this example, we wish to filter the rows of the data frame df via a custom function that takes a comparison operator as input and uses it to compare the values of the column 'col_1' with the integer 0.
def m_filter(df, fun):
    df_2 = df.loc[fun(df['col_1'], 0)]
    return df_2

df_2 = df.pipe(m_filter, pd.Series.gt)
Here is how this works:
- Pandas provides function equivalents to the common comparison operators such as gt() for greater than >, lt() for less than <, and eq() for equal == (see Numerical Operations).
- In fun(df['col_1'], 0), we pass to the comparison function referred to by fun the column whose values we wish to compare, df['col_1'], and the threshold we wish to compare against, 0.
- In this case, fun(df['col_1'], 0) is equivalent to all the following: pd.Series.gt(df['col_1'], 0), df['col_1'].gt(0), and df['col_1'] > 0.
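If the operator arrives as a symbol such as '>' rather than as a function, one possible approach is to map symbols to the functions in Python's standard operator module, which work on a Series because a Series implements the comparison operators. The sketch below assumes a hypothetical lookup table OPS and a hypothetical threshold parameter; neither is part of the original m_filter().

```python
import operator

import pandas as pd

# Hypothetical lookup table from operator symbols to standard-library functions.
OPS = {'<': operator.lt, '>': operator.gt, '==': operator.eq}

def m_filter(df, fun, threshold=0):
    # 'threshold' is an illustrative addition; the text compares against 0.
    return df.loc[fun(df['col_1'], threshold)]

# Hypothetical sample data.
df = pd.DataFrame({'col_1': [-2, 0, 3, 7]})

df_gt = df.pipe(m_filter, OPS['>'])       # rows where col_1 > 0
df_eq = df.pipe(m_filter, OPS['=='])      # rows where col_1 == 0
df_pd = df.pipe(m_filter, pd.Series.gt)   # Pandas equivalent of OPS['>']
```

Here operator.gt(df['col_1'], 0) evaluates df['col_1'] > 0, so it selects the same rows as pd.Series.gt.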
Lambda Function
We wish to dynamically specify a lambda function to be used to filter the rows of a data frame.
As Function Variable
In this example, we wish to apply a lambda function referred to via a variable fun to the column 'col_1' to filter the rows of the data frame df.
fun = lambda x: x > x.mean()
df_2 = df.loc[fun(df['col_1'])]
Here is how this works:
- The variable fun holds a reference to the lambda function.
- We use fun just like we would use a named function (see the “Named Function” scenario above).
As Function Argument
In this example, we wish to pass a lambda function that contains the row filtering logic to a function m_filter() where the actual row filtering happens.
def m_filter(df, fun):
    df_2 = df.loc[fun(df['col_1'])]
    return df_2

df_2 = df\
    .pipe(m_filter,
          lambda x: x > x.mean())
Here is how this works:
- The lambda function, lambda x: x > x.mean(), expects a Series of numerical values and returns a Series of the same length where a value is True if the corresponding element in the input has a value greater than the mean.
- We use pipe() to pass the data frame df and the lambda function to the m_filter() function.
As String Variable
fun = "lambda x: x > x.mean()"
df_2 = df.loc[eval(fun)(df['col_1'])]
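Here is how this works:
- In fun = "lambda x: x > x.mean()", we create a variable holding a lambda function as a string.
- In eval(fun), we evaluate the string to create a lambda function.
- We then pass df['col_1'] to the lambda function.
Multiple Functions
We wish to dynamically specify multiple predicate functions to be used to filter the rows of a data frame.
As Function Variable: Explicit Application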
funs = [pd.isna, np.isinf, lambda x: x < 0]
df_2 = df.loc[funs[0](df['col_1'])
              | funs[1](df['col_2'])
              | funs[2](df['col_3'])]
Here is how this works:
- We store the predicate functions in a list funs.
- We refer to each function in funs via its index. Thus, funs[0] returns the function at position 0 in the list.
As Function Variable: Implicit Application
We wish to apply multiple predicate functions to the same columns.
In this example, we wish to apply three predicate functions, isna(), isinf(), and a lambda function that checks if the value is less than zero, lambda x: x < 0, to each column whose name contains the string 'cvr_' and to return any row for which any of the function-column pairs returns True.
funs = [pd.isna, np.isinf, lambda x: x < 0]
df_2 = df.loc[(df
               .loc[:, df.columns.str.contains('cvr_', regex=False)]
               .apply(funs)
               .any(axis=1))]
Here is how this works:
- In loc[:, df.columns.str.contains('cvr_', regex=False)], we select all columns whose name contains the string 'cvr_'. See Column Selection by Name Pattern.
- We pass to apply() a list of multiple functions.
- Each function passed to apply() will be applied separately to each of the columns (the argument axis of apply() is axis=0 by default) of the data frame that apply() is called on.
- The first function is pd.isna() from Pandas, which checks if a value (of a data frame or a Series) is missing (see Missing Values).
- The second function is np.isinf() from NumPy, which checks if a value (of an array-like object such as a data frame or a Series) is infinite (see Numerical Operations).
- The third function, lambda x: x < 0, is a lambda function that checks if the value is less than zero.
- any(axis=1) returns True for each row in which at least one function-column pair returned True.
As Function Argument: Explicit Application
def m_filter(df, funs):
    df_2 = df.loc[funs[0](df['col_1'])
                  | funs[1](df['col_2'])
                  | funs[2](df['col_3'])]
    return df_2

df_2 = df.pipe(m_filter, [pd.isna, np.isinf, lambda x: x < 0])
Here is how this works:
- We use pipe() to pass the data frame df and a list of functions to the custom function m_filter().
- Inside m_filter(), the individual functions, e.g. funs[0], are applied to realize the desired custom logic. See “As Function Variable” above.
As Function Argument: Implicit Application
We wish to create a function that accepts a column selection function, multiple row filtering predicate functions and a function that determines if the results of applying each predicate function to each column are AND’ed or OR’ed.
def m_filter(df, select_fn, filter_fn, rel_fn):
    selected_cols = select_fn(df)
    selected_rows = rel_fn((df
                            .loc[:, selected_cols]
                            .apply(filter_fn)), axis=1)
    df_2 = df.loc[selected_rows]
    return df_2

df_2 = df\
    .pipe(m_filter,
          lambda x: x.columns.str.contains('cvr_', regex=False),
          [pd.isna, np.isinf, lambda x: x < 0],
          pd.DataFrame.any)
Here is how this works:
- We define a custom function m_filter() that takes four arguments:
  - df: A data frame.
  - select_fn: A function to apply to df to obtain the columns to apply the filtering logic to (see Dynamic Selection Function Specification).
  - filter_fn: A list of predicate functions (those that return True or False) to apply to each of the selected columns.
  - rel_fn: The function to use to combine the results of the filter_fn functions for each row (can be any() or all()).
- We use the pipe() method to pass df to m_filter() in a chained manner, together with the following arguments:
  - The lambda function lambda x: x.columns.str.contains('cvr_', regex=False), which returns a boolean array with the same number of elements as the number of columns and which has a value of True for the columns that satisfy the selection criteria.
  - The list of predicate functions [pd.isna, np.isinf, lambda x: x < 0].
  - The combining function pd.DataFrame.any.
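To make the role of rel_fn concrete, here is a sketch on made-up data that switches between pd.DataFrame.any and pd.DataFrame.all. Note that this is a variant, not the function above: the hypothetical m_filter_concat() builds the boolean results with pd.concat() instead of passing a list to apply(), so each predicate is applied to the whole selection at once.

```python
import numpy as np
import pandas as pd

def m_filter_concat(df, select_fn, filter_fns, rel_fn):
    # Restrict the data frame to the columns chosen by select_fn.
    selected = df.loc[:, select_fn(df)]
    # Each predicate returns a boolean data frame with the same shape as
    # the selection; place these results side by side.
    results = pd.concat([fn(selected) for fn in filter_fns], axis=1)
    # Collapse every row to a single boolean with any() or all().
    return df.loc[rel_fn(results, axis=1)]

# Hypothetical sample data.
df = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'cvr_a': [0.1, np.nan, -0.5, 0.3],
    'cvr_b': [0.2, 0.4, np.inf, 0.6],
})

select_fn = lambda x: x.columns.str.contains('cvr_', regex=False)
filter_fns = [pd.isna, np.isinf, lambda x: x < 0]

# OR: keep a row if any function-column pair is True.
df_any = m_filter_concat(df, select_fn, filter_fns, pd.DataFrame.any)
# AND: keep a row only if every function-column pair is True.
df_all = m_filter_concat(df, select_fn, filter_fns, pd.DataFrame.all)
```

With these three predicates the AND case selects nothing, since a value cannot be missing, infinite and negative at the same time; all() is more useful with predicates that can hold together.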
As String Variable: Explicit Application
We wish to filter the rows of a data frame by using functions whose names are specified as a list of string values.
def fun_1(col):
    return col.isna()

def fun_2(col):
    return np.isinf(col)

def fun_3(col):
    return col.lt(0)

funs = ['fun_1', 'fun_2', 'fun_3']
df_2 = df.loc[globals()[funs[0]](df['col_1'])
              | globals()[funs[1]](df['col_2'])
              | globals()[funs[2]](df['col_3'])]
Here is how this works:
- In globals()[funs[0]], we use the global symbol table returned by globals() to obtain a reference to a function given its name as a string.
- In this case, globals()[funs[0]](df['col_1']) is equivalent to fun_1(df['col_1']).
- If we were to use a function that is part of a module, e.g. isna() from Pandas, we would use the Python getattr() function to obtain a reference to the function. See the “As String Variable” scenario under “Named Function” above for an example.
As String Variable: Implicit Application
We wish to filter the rows of a data frame by using functions whose names are specified as a list of string values. We wish to apply each function to each of a set of selected columns and return rows for which any column-function pair returns True.
def fun_1(col):
    return col.isna()

def fun_2(col):
    return np.isinf(col)

def fun_3(col):
    return col.lt(0)

funs = ['fun_1', 'fun_2', 'fun_3']
df_2 = df.loc[(df
               .loc[:, df.columns.str.contains('cvr_', regex=False)]
               .apply([globals()[x] for x in funs])
               .any(axis=1))]
Here is how this works:
- In [globals()[x] for x in funs], we use a list comprehension to iterate over the list of string function names and use the global symbol table returned by globals() to obtain a reference to each function given its name as a string.
- apply() applies each function in the list of functions we created via the list comprehension to each of the selected columns.
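When the function names come from outside the program, one alternative to looking them up in globals() is an explicit dictionary that lists the names that may be requested. The sketch below assumes a hypothetical registry PREDICATES and helper resolve(); the names and sample data are illustrative. To stay independent of how apply() handles a list of functions, the mask is built with pd.concat().

```python
import numpy as np
import pandas as pd

# Hypothetical registry: only the names listed here can be requested.
PREDICATES = {
    'is_missing': pd.isna,
    'is_infinite': np.isinf,
    'is_negative': lambda col: col < 0,
}

def resolve(names):
    """Translate a list of predicate names into a list of functions."""
    unknown = [n for n in names if n not in PREDICATES]
    if unknown:
        raise KeyError(f"Unknown predicate name(s): {unknown}")
    return [PREDICATES[n] for n in names]

# Hypothetical sample data.
df = pd.DataFrame({'cvr_1': [0.5, np.nan, 0.2],
                   'cvr_2': [0.1, 0.3, -1.0]})

funs = resolve(['is_missing', 'is_negative'])
cols = df.loc[:, df.columns.str.contains('cvr_', regex=False)]
# Apply every predicate to the selected columns and OR the results per row.
mask = pd.concat([fn(cols) for fn in funs], axis=1).any(axis=1)
df_2 = df.loc[mask]
```

A misspelled or unexpected name then fails with a clear KeyError instead of silently resolving to some other object in the global namespace.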