Dynamic Function Specification

We wish to dynamically specify the predicate function(s) that will be applied to filter the rows of a data frame.

A predicate function is one that returns a logical value: True or False.

We will cover how to dynamically specify functions in each of the following scenarios:

  • Named Function such as isna().
  • Comparison Operation such as <, >, or ==.
  • Lambda Function (an anonymous function) such as lambda x: x > 5.
  • Multiple Functions specified as a list such as [pd.isna, np.isinf].

Named Function

We wish to dynamically specify a named function to be used to filter the rows of a data frame.

As Function Variable

We wish to use a function referred to via a variable to filter the rows of a data frame.

In this example, we wish to apply a function referred to via the variable fun to the column 'col_1' to filter the rows of the data frame df.

fun = pd.isna

df_2 = df.loc[fun(df['col_1'])]

Here is how this works:

  • The variable fun holds a reference to the Pandas function isna().
  • We can use fun just like we would use isna(). In other words, in this case fun(df['col_1']) is equivalent to pd.isna(df['col_1']), which is equivalent to df['col_1'].isna().
  • The loc[] indexer returns the rows of the data frame df for which the function referred to by fun returns True.
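
The following is a minimal runnable sketch of the steps above. The sample data frame is hypothetical and exists only to make the behaviour visible; the variable names mask, same_as_function and same_as_method are introduced for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the column name follows the example above.
df = pd.DataFrame({'col_1': [1.0, np.nan, 3.0, np.nan]})

fun = pd.isna                # a reference to the function, not a call

mask = fun(df['col_1'])      # boolean Series: True where col_1 is missing
df_2 = df.loc[mask]          # keeps only the rows where the mask is True

# The three spellings described above produce the same mask.
same_as_function = mask.equals(pd.isna(df['col_1']))
same_as_method = mask.equals(df['col_1'].isna())
```

Because fun is an ordinary variable, assigning a different predicate to it (for example pd.notna) changes the filter without touching the filtering line.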

As Function Argument

We wish to pass a function as an argument to another function, which then uses it to filter the rows of a data frame.

In this example, we wish to filter the rows of the data frame df via a custom function that takes a predicate function as input and applies it to the column ‘col_1'.

def m_filter(df, fun):
    df_2 = df.loc[fun(df['col_1'])]
    return df_2

df_2 = df.pipe(m_filter, pd.isna)

Here is how this works:

  • We have a custom function m_filter() which takes a data frame df and a function fun as inputs and uses the function fun to filter the rows of the data frame df.
  • In df.pipe(m_filter, pd.isna), we use pipe() to pass to m_filter() the data frame df and the named function pd.isna.
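
As a small sketch of why passing the predicate as an argument is useful, the same m_filter() can be reused with different predicates. The sample data is hypothetical, and pd.notna is used here only as a second predicate for contrast.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({'col_1': [1.0, np.nan, 3.0]})

def m_filter(df, fun):
    # Apply the predicate to col_1 and keep the rows where it is True.
    return df.loc[fun(df['col_1'])]

missing = df.pipe(m_filter, pd.isna)    # rows where col_1 is missing
present = df.pipe(m_filter, pd.notna)   # rows where col_1 is not missing

# pipe() simply calls m_filter(df, pd.isna) for us.
same_as_direct_call = missing.equals(m_filter(df, pd.isna))
```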

As String Variable

We wish to specify the function to use for row filtering via a variable that holds the name of the function as a string.

In this example, we wish to apply a function whose name is available as a string variable fun to the column 'col_1' to filter the rows of the data frame df.

mod = 'pd'
fun = 'isna'

df_2 = df.loc[getattr(globals()[mod], fun)(df['col_1'])]

Here is how this works:

  • In globals()[mod], we look up the module by its name in the global symbol table. Since mod holds the string 'pd', this returns the module pd.
  • In getattr(globals()[mod], fun), we use Python's getattr() function, which takes two arguments: the object to which the function belongs, here the module pd, and the function's name as a string, here the string variable fun which holds the string 'isna'.
  • We can use the output of getattr() just like we would use isna(). In other words, in this case getattr(globals()[mod], fun)(df['col_1']) is equivalent to pd.isna(df['col_1']), which is equivalent to df['col_1'].isna().
  • We used getattr() in this case because we wish to obtain a function that is part of a module. If we wish to obtain a custom function defined in the current environment we can use globals() or locals() which return a dictionary representing the current global or local symbol table. See the “As String Variable” scenario under “Multiple Functions” below for an example.
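
The lookup via globals() only works if the module was imported under the alias held in the string. As a hedged alternative, the sketch below resolves the module from its full name with importlib.import_module() from the standard library, which does not depend on any alias. The variable names mod_name and fun_name and the sample data are assumptions made for illustration.

```python
import importlib

import numpy as np
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({'col_1': [1.0, np.nan, 3.0]})

mod_name = 'pandas'   # the full module name, not the alias 'pd'
fun_name = 'isna'

module = importlib.import_module(mod_name)  # returns the module object
fun = getattr(module, fun_name)             # returns the function object

df_2 = df.loc[fun(df['col_1'])]
```

If the strings come from user input, it is prudent to check them against a list of allowed names before resolving them.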

Comparison Operation

We wish to dynamically specify a comparison operator, e.g. greater than, to be used to filter the rows of a data frame.

In this example, we wish to filter the rows of the data frame df via a custom function that takes a comparison operator as input and uses it to compare the values of the column 'col_1' with the integer 0.

def m_filter(df, fun):
    df_2 = df.loc[fun(df['col_1'], 0)]
    return df_2

df_2 = df.pipe(m_filter, pd.Series.gt)

Here is how this works:

  • Pandas provides function equivalents to the common comparison operators, such as gt() for greater than >, lt() for less than <, and eq() for equal == (see Numerical Operations).
  • In fun(df['col_1'], 0), we pass two arguments to the comparison function referred to by fun: the column whose values we wish to compare, df['col_1'], and the threshold we wish to compare against, 0.
  • fun(df['col_1'], 0) is equivalent to all the following: pd.Series.gt(df['col_1'], 0), df['col_1'].gt(0), and df['col_1'] > 0.
  • Thanks to those function equivalents of comparison operators, the “Comparison Operation” scenario reduces to the “Named Function” scenario above for all three sub-scenarios: “Function Variable”, “Function Argument”, and “String Variable”.
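
To illustrate how the comparison scenario reduces to the named function scenario, the sketch below passes the operator in three ways: as pd.Series.gt, as a function obtained from its name via getattr(), and as a function from Python's standard operator module looked up by its symbol. The OPS dictionary and the sample data are hypothetical and introduced only for this example.

```python
import operator

import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({'col_1': [-2, 0, 3, 5]})

def m_filter(df, fun):
    # Compare col_1 with 0 using whichever comparison function is passed.
    return df.loc[fun(df['col_1'], 0)]

# Hypothetical lookup table from operator symbols to functions.
OPS = {'>': operator.gt, '<': operator.lt, '==': operator.eq}

df_a = df.pipe(m_filter, pd.Series.gt)              # function argument
df_b = df.pipe(m_filter, getattr(pd.Series, 'gt'))  # name given as a string
df_c = df.pipe(m_filter, OPS['>'])                  # symbol given as a string
df_zero = df.pipe(m_filter, OPS['=='])              # rows where col_1 == 0
```

The functions in the operator module work here because operator.gt(a, b) is the same as a > b, which Pandas evaluates element by element for a Series.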

Lambda Function

We wish to dynamically specify a lambda function to be used to filter the rows of a data frame.

As Function Variable

In this example, we wish to apply a lambda function referred to via the variable fun to the column 'col_1' to filter the rows of the data frame df.

fun = lambda x: x > x.mean()

df_2 = df.loc[fun(df['col_1'])]

Here is how this works:

  • The variable fun holds a reference to the lambda function.
  • We can use fun just like we would use a named function (see the “Named Function” scenario above).

As Function Argument

In this example, we wish to pass a lambda function that contains the row filtering logic to a function m_filter() where actual row filtering happens.

def m_filter(df, fun):
    df_2 = df.loc[fun(df['col_1'])]
    return df_2

df_2 = df\
    .pipe(m_filter,
          lambda x: x > x.mean())

Here is how this works:

  • The lambda function lambda x: x > x.mean() expects a Series of numerical values and returns a boolean Series of the same length where a value is True if the corresponding element in the input is greater than the mean.
  • We use pipe() to pass the data frame df and the lambda function to the m_filter() function.

As String Variable

fun = "lambda x: x > x.mean()"

df_2 = df.loc[eval(fun)(df['col_1'])]

Here is how this works:

  • In fun = "lambda x: x > x.mean()", we create a variable holding the source code of a lambda function as a string.
  • In eval(fun), we evaluate the string to create a lambda function.
  • We then pass df['col_1'] to the lambda function.
  • Please note that eval() executes whatever code the string contains, so it should only be used on strings that come from a trusted source.
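
When the set of possible filters is known in advance, one way to avoid eval() is to keep the lambda functions in a dictionary and select one by a string key. This is a sketch of that alternative, not the approach used above; the PREDICATES dictionary, its keys, and the sample data are hypothetical.

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({'col_1': [1, 2, 3, 10]})

# Hypothetical lookup table of allowed predicates, keyed by name.
PREDICATES = {
    'above_mean': lambda x: x > x.mean(),
    'negative': lambda x: x < 0,
}

key = 'above_mean'    # the string that selects the filtering logic

# Only keys present in the dictionary can be selected, so no
# arbitrary code is executed.
df_2 = df.loc[PREDICATES[key](df['col_1'])]
```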

Multiple Functions

As Function Variable: Explicit Application

funs = [pd.isna, np.isinf, lambda x: x < 0]

df_2 = df.loc[funs[0](df['col_1'])
              | funs[1](df['col_2'])
              | funs[2](df['col_3'])]

Here is how this works:

  • We define a list of functions funs.
  • We refer to each function in the list funs via its index. Thus, funs[0] returns the function at position 0 in the list.
  • The rest of the code works as described under “Named Function” above.

As Function Variable: Implicit Application

We wish to apply multiple predicate functions to the same columns.

In this example, we wish to apply three predicate functions, isna(), isinf(), and a lambda function lambda x: x < 0 that checks if the value is less than zero, to each column whose name contains the string ‘cvr_’, and to return any row for which any of the function-column pairs returns True.

funs = [pd.isna, np.isinf, lambda x: x < 0]

df_2 = df.loc[(df
               .loc[:, df.columns.str.contains('cvr_', regex=False)]
               .apply(funs)
               .any(axis=1))]

Here is how this works:

  • In loc[:, df.columns.str.contains('cvr_', regex=False)], we select all columns whose name contains the string ‘cvr_’. See Column Selection by Name Pattern.
  • We can pass to apply() a list of multiple functions.
  • Each of the functions in the list passed to apply() will be applied separately to each of the columns (the argument axis of apply() is axis=0 by default) of the data frame that apply() is called on.
  • In this example, we pass three functions:
    • pd.isna() from Pandas, which checks if a value (of a data frame or a Series) is missing (see Missing values).
    • np.isinf() from NumPy, which checks if a value (of an array-like object such as a data frame or a Series) is infinite (see Numerical Operations).
    • lambda x: x < 0, a lambda function that checks if the value is less than zero.
  • See Implicit Filtering for coverage of filtering rows implicitly.
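
To make the logic of “any function-column pair returns True” concrete, the sketch below computes the same kind of row mask step by step: each predicate is applied to the whole block of selected columns, the resulting boolean data frames are OR'ed together, and any(axis=1) reduces them to one value per row. This is a hedged, more explicit alternative to passing a list to apply(), not a description of what apply() does internally. The sample data and the names cols, masks and combined are assumptions for illustration.

```python
from functools import reduce
import operator

import numpy as np
import pandas as pd

# Hypothetical sample data: row 0 is clean, row 1 has a missing value,
# row 2 has an infinite value and row 3 has a negative value.
df = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'cvr_1': [0.1, np.nan, 0.3, 0.4],
    'cvr_2': [0.5, 0.6, np.inf, -0.8],
})

funs = [pd.isna, np.isinf, lambda x: x < 0]

# Select the columns whose name contains 'cvr_'.
cols = df.loc[:, df.columns.str.contains('cvr_', regex=False)]

# One boolean data frame per predicate, each with the shape of cols.
masks = [fun(cols) for fun in funs]

# Element-wise OR across predicates, then OR across columns per row.
combined = reduce(operator.or_, masks)
df_2 = df.loc[combined.any(axis=1)]
```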

As Function Argument: Explicit Application

def m_filter(df, funs):
    df_2 = df.loc[funs[0](df['col_1'])
                  | funs[1](df['col_2'])
                  | funs[2](df['col_3'])]
    return df_2

df_2 = df.pipe(m_filter, [pd.isna, np.isinf, lambda x: x < 0])

Here is how this works:

  • We use pipe() to pass the data frame df and a list of functions to the custom function m_filter().
  • Inside m_filter(), the individual functions, e.g. funs[0], are applied to realize the desired custom logic. See “As Function Variable” above.

As Function Argument: Implicit Application

We wish to create a function that accepts a column selection function, multiple row filtering predicate functions and a function that determines if the results of applying each predicate function to each column are AND’ed or OR’ed.

def m_filter(df, select_fn, filter_fn, rel_fn):
    selected_cols = select_fn(df)
    selected_rows = rel_fn((df
                            .loc[:, selected_cols]
                            .apply(filter_fn)), axis=1)
    df_2 = df.loc[selected_rows]
    return df_2

df_2 = df\
    .pipe(m_filter,
          lambda x: x.columns.str.contains('cvr_', regex=False),
          [pd.isna, np.isinf, lambda x: x < 0],
          pd.DataFrame.any)

Here is how this works:

  • We isolate the implicit filtering logic into the custom function m_filter().
  • The function takes as input
    1. df: A data frame
    2. select_fn: A function to apply to df to obtain the columns to apply the filtering logic to (see Dynamic Selection Function Specification).
    3. filter_fn: A list of predicate functions (those that return True or False) to apply to each of the selected columns.
    4. rel_fn: The function to use to combine the results of filter_fn for each row (can be pd.DataFrame.any or pd.DataFrame.all).
  • We use the pipe() method to pass to m_filter()
    • the data frame df in a chained manner
    • a lambda function lambda x: x.columns.str.contains('cvr_', regex=False) that returns a boolean array with one element per column, which is True for the columns that satisfy the selection criteria.
    • the predicate functions to apply [pd.isna, np.isinf, lambda x: x < 0].
    • the logical combination function pd.DataFrame.any
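
The effect of the rel_fn argument can be seen by calling the same function once with pd.DataFrame.any and once with pd.DataFrame.all. To keep the sketch small, it passes a single predicate (pd.isna) as filter_fn rather than a list; that simplification, the sample data, and the name select_cvr are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def m_filter(df, select_fn, filter_fn, rel_fn):
    selected_cols = select_fn(df)                         # boolean column mask
    checked = df.loc[:, selected_cols].apply(filter_fn)   # predicate per column
    selected_rows = rel_fn(checked, axis=1)               # combine per row
    return df.loc[selected_rows]

# Hypothetical sample data: row 0 is missing in both cvr columns,
# row 1 in one of them, row 2 in none.
df = pd.DataFrame({
    'id': [1, 2, 3],
    'cvr_1': [np.nan, np.nan, 0.3],
    'cvr_2': [np.nan, 0.6, 0.7],
})

select_cvr = lambda x: x.columns.str.contains('cvr_', regex=False)

# OR: keep rows where at least one selected column is missing.
any_missing = df.pipe(m_filter, select_cvr, pd.isna, pd.DataFrame.any)

# AND: keep rows where every selected column is missing.
all_missing = df.pipe(m_filter, select_cvr, pd.isna, pd.DataFrame.all)
```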

As String Variable: Explicit Application

We wish to filter the rows of a data frame by using functions whose names are specified as a list of string values.

def fun_1(col):
    return col.isna()

def fun_2(col):
    return np.isinf(col)

def fun_3(col):
    return col.lt(0)

funs = ['fun_1', 'fun_2', 'fun_3']

df_2 = df.loc[globals()[funs[0]](df['col_1'])
              | globals()[funs[1]](df['col_2'])
              | globals()[funs[2]](df['col_3'])]

Here is how this works:

  • In globals()[funs[0]], we use the global symbol table returned by globals() to obtain a reference to a function given its name as a string.
  • Therefore, in this case, globals()[funs[0]](df['col_1']) is equivalent to fun_1(df['col_1']).
  • Had a function been from a module, e.g. isna() from Pandas, we would use the Python getattr() function to obtain a reference to the function. See the “As String Variable” scenario under “Named Function” above for an example.
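
Since globals() exposes every name defined in the module, a narrower alternative is to build an explicit dictionary from the custom functions and look names up there. The sketch below assumes this variation; the REGISTRY dictionary and the sample data are hypothetical and not part of the example above.

```python
import numpy as np
import pandas as pd

def fun_1(col):
    return col.isna()

def fun_2(col):
    return np.isinf(col)

def fun_3(col):
    return col.lt(0)

# Hypothetical lookup table: function name -> function object.
REGISTRY = {f.__name__: f for f in (fun_1, fun_2, fun_3)}

funs = ['fun_1', 'fun_2', 'fun_3']

# Hypothetical sample data: rows 0, 1 and 2 each trip one predicate.
df = pd.DataFrame({
    'col_1': [np.nan, 1.0, 1.0, 1.0],
    'col_2': [1.0, np.inf, 1.0, 1.0],
    'col_3': [1.0, 1.0, -1.0, 1.0],
})

df_2 = df.loc[REGISTRY[funs[0]](df['col_1'])
              | REGISTRY[funs[1]](df['col_2'])
              | REGISTRY[funs[2]](df['col_3'])]
```

A name that is not in the dictionary raises a KeyError, which makes a misspelled or unexpected function name fail early.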

As String Variable: Implicit Application

We wish to filter the rows of a data frame by using functions whose names are specified as a list of string values. We wish to apply each function to each of a set of selected columns and return the rows for which any function-column pair returns True.

def fun_1(col):
    return col.isna()

def fun_2(col):
    return np.isinf(col)

def fun_3(col):
    return col.lt(0)

funs = ['fun_1', 'fun_2', 'fun_3']

df_2 = df.loc[(df
               .loc[:, df.columns.str.contains('cvr_', regex=False)]
               .apply([globals()[x] for x in funs])
               .any(axis=1))]

Here is how this works:

  • In [globals()[x] for x in funs], we use a list comprehension to iterate over the list of function names and use the global symbol table returned by globals() to obtain a reference to each function given its name as a string.
  • apply() applies each function in the list of functions we created via the list comprehension to each of the selected columns.
  • The rest of the code works as described in “As Function Variable: Implicit Application” above.