Function Specification

We wish to specify one or more logical expression(s) or predicate function(s) (functions that return TRUE or FALSE) to apply to each of the selected columns in an implicit filtering context.

In this section, we cover the following function specification scenarios:

  • A named function which may be a built-in function or a custom function.
  • A lambda function (an anononymous function).
  • Multiple functions each of which is applied separately to each of the selected columns.
  • A Non Vectorized Function i.e. one that acts on one row at a time.

This section is complemented by

  • Column Selection where we cover how to select the columns to each of which we will apply filtering logic.
  • Relationship Specification where we cover how to combine the results of applying the specified function(s) to the specified column(s) in either an AND manner or an OR manner.

Named Function

We wish to filter rows of a data frame by applying a named predicate function to each of a selected set of columns and then taking a logical combination of the results.

In this example, we wish to filter the rows of the data frame df for which the value of any column whose name contains the string ‘cvr’ is missing (NA).

df_2 = df.loc[(df
               .loc[:, df.columns.str.contains('cvr_', regex=False)]
               .apply(pd.isna)
               .any(axis=1))]

Here is how this works:

  • We use df.columns.str.contains('cvr_', regex=False) to select all columns whose name starts with the substring ‘cvr_’. See Column Selection.
  • We pass a named function pd.isna() to apply() to execute on the data frame of selected columns.
  • In df.apply(pd.isna) we go over each value of the data frame df and check if it is missing. The output is a data frame of the same size as df and where a value is True if the corresponding value in the data frame df is missing (see Missing Values).
  • We use any() with axis=1 to logically combine the values for each row of the data frame of logical True or False values produced by apply(pd.isna). The output is True for a row if the filtering condition evaluates to True for any column.
  • Extensions:
    • The counterpart of any() is all() which requires that the logical expression evaluates to TRUE for all columns. See Relationship Selection.
    • We get a patchy index i.e. rows have their original index value. To get a fresh index, we add .reset_index(drop=True). See Patchy Index under Common Scenarios.
    • If we are filtering rows as part of a chain of data manipulation operations and need to refer to a row that was created earlier in the same chain, we need to refer to it via a lambda function. See New Columns under Common Scenarios.

Anonymous function

We wish to filter rows of a data frame by applying an anonymous predicate function to each of a selected set of columns and then taking a logical combination of the results.

In this example, we wish to filter the rows of the data frame df for which the value of any column whose name contains the string ‘cvr’ is less than 0.1.

df_2 = df.loc[(df
               .loc[:, df.columns.str.contains('cvr_', regex=False)]
               .apply(lambda x: x < 0.1)
               .any(axis=1))]

Here is how this works:

  • We pass to apply() a lambda function lambda x: x < 0.1 that checks whether a value is less than 0.1.
  • The lambda function passed to apply() will be applied separately to each of the columns (axis=0 by default) of the data frame that apply() is called on.
  • The rest of the code works as described in the “All Columns” scenario above.

Multiple functions

We wish to filter rows of a data frame by applying multiple predicate function to each of a selected set of columns and then taking a logical combination of the results.

In this example, we wish to filter the rows of the data frame df for which the value of any column whose name contains the string ‘cvr’ is missing (NA) or infinite (Inf).

df_2 = df.loc[(df
               .loc[:, df.columns.str.contains('cvr_', regex=False)]
               .apply([pd.isna, np.isinf])
               .any(axis=1))]

Here is how this works:

  • We can pass to apply() a list of multiple functions.
  • Each of the functions in the list passed to apply() will be applied separately to each of the columns (axis=0 by default) of the data frame that apply() is called on.
  • In this example, we pass two named functions:
    • pd.isna() from Pandas and which checks if a value (of data frame or a Series) is missing (see Missing values).
    • np.ising() from NumPy and which checks if a value (of an array like object like a data frame or Series) is infinite (see Numerical Operations).
  • The rest of the code works as described in the “All Columns” scenario above.

Non Vectorized Function

We wish to filter rows of a data frame by a logical expression that involves applying a non-vectorized function (i.e. one that acts on one row at a time) to a set of selected columns.

In this example, we wish to filter the rows of the data frame df for which the mean of the values of the columns whose name contains the string ‘cvr’ is less than 0.1.

df_2 = df.loc[(df
               .loc[:, df.columns.str.contains('cvr_', regex=False)]
               .apply(np.mean, axis=1)
               .lt(0.1))]

Here is how this works:

  • In loc[:, df.columns.str.contains('cvr_', regex=False)], we select all columns whose name contains the string ‘cvr’. Our purpose is to compute the mean of the values of those columns for each row.
  • In order to have apply() apply a particular function to the rows of a data frame (the default is columns) we set the argument axis to axis=1 (instead of the default axis=0). Therefore, apply(np.mean, axis=1) computes the mean of the selected columns for each row returning one numerical value for each row.
  • We use the function lt() (which is a function form of the simple less than operator <) to check whether the numerical value returned for each row is less than 0.1. See Numerical Operations.
  • See Non Vectorized Transformation and Implicit Non Vectorized Transformation for a deeper coverage of non vectorized operations. All the scenarios covered there can also be applied for filtering.
PYTHON
I/O