Dynamic Column Specification

We wish to dynamically specify the columns to which filtering logic will be applied.

We will cover two scenarios:

  1. As Environment Variable: The columns to which row filtering logic will be applied are specified dynamically via an environment variable. This is useful, for instance, when structuring a script and we wish to keep the column specification in a configuration section at the beginning of the script, separate from the logic.
  2. As Function Argument: Row filtering happens inside a function and the names of the columns to which row filtering logic will be applied are passed to the function as an argument. This is useful when we wish to write our data manipulation pipeline as a reusable function.

As Environment Variable

One Column

We wish to dynamically specify a column via an environment variable to use as part of a logical expression to filter the rows of a data frame.

In this example, we wish to filter the rows of a data frame df where the value of a dynamically specified column has a missing value NA.

col = 'col_1'

df_2 = df.loc[df[col].isna()]

Here is how this works:

  • In col = 'col_1', we specify the name of the column to which filtering logic will be applied as an environment variable.
  • In df[col], we select the column whose name is in the variable col.
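As a self-contained sketch, assuming a small hypothetical data frame:

```python
import numpy as np
import pandas as pd

# Hypothetical data frame for illustration
df = pd.DataFrame({'col_1': [1.0, np.nan, 3.0],
                   'col_2': ['a', 'b', 'c']})

# The column to filter on, held in a variable
col = 'col_1'

# Keep only rows where the dynamically specified column is NA
df_2 = df.loc[df[col].isna()]
```

Only the second row survives, since it is the only row where `col_1` is missing.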

Multiple Columns: Explicit Application

We wish to dynamically specify a set of columns via an environment variable to use as part of a logical expression to filter the rows of a data frame.

In this example, we wish to filter the rows of a data frame df by the values of three dynamically specified columns. In particular, we wish to return rows where the value of the first column is not NA and the values of the second and third columns are equal.

cols = ['col_5','col_3','col_6']

df_2 = df.loc[(~df[cols[0]].isna()) &
              (df[cols[1]] == df[cols[2]])]

Here is how this works:

  • We specify the columns to which filtering logic will be applied as a list cols.
  • We write our logical expression inside loc[] referring to each column name in the list cols via its index; e.g. cols[0] returns the column name at position 0 in the list.
  • If the list of column names contains elements that do not match any of the data frame’s column names, execution fails with a KeyError. To ignore list elements that do not match any of the data frame’s column names, see “Flexible Matching” in Dynamic Column Selection.
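As a self-contained sketch, assuming a small hypothetical data frame. Note that each condition is wrapped in parentheses: `&` binds tighter than comparison operators such as `==`, so unparenthesized conditions would be evaluated in the wrong order.

```python
import numpy as np
import pandas as pd

# Hypothetical data frame for illustration
df = pd.DataFrame({'col_3': [1, 2, 2],
                   'col_5': [1.0, np.nan, 3.0],
                   'col_6': [1, 0, 2]})

cols = ['col_5', 'col_3', 'col_6']

# Keep rows where cols[0] is not NA and cols[1] equals cols[2];
# parentheses are required because `&` binds tighter than `==`
df_2 = df.loc[(~df[cols[0]].isna()) &
              (df[cols[1]] == df[cols[2]])]
```

Rows 0 and 2 survive: both have a non-missing `col_5` and equal values in `col_3` and `col_6`.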

Multiple Columns: Implicit Application

We wish to dynamically specify a set of columns via an environment variable. We then wish to apply a logical condition (predicate function) to each of these columns and then filter the rows where the logical condition returns True for either all or any of the columns.

In this example, we wish to return the rows of a data frame df where any of a set of dynamically specified columns has a missing value NA.

cols = ['col_5','col_3','col_6']

df_2 = df.loc[(df
               .loc[:, cols]
               .isna()
               .any(axis=1))]

Here is how this works:

  • We specify the columns to which filtering logic will be applied as a list cols.
  • In df.loc[:, cols], we select the columns to which filtering logic will be applied, which returns a data frame.
  • We then apply isna() to the data frame returned by df.loc[:, cols] resulting in a data frame of the same dimensions where an element is True if the corresponding value in the input data frame is NA.
  • Finally, we use any(axis=1) to reduce the data frame of logical values to a Series where a value is True if the corresponding row has any value that is True.
  • See Implicit Filtering for coverage of filtering rows implicitly.
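As a self-contained sketch, assuming a small hypothetical data frame:

```python
import numpy as np
import pandas as pd

# Hypothetical data frame for illustration
df = pd.DataFrame({'col_3': [1.0, np.nan, 3.0],
                   'col_5': [np.nan, 2.0, 3.0],
                   'col_6': [1.0, 2.0, 3.0]})

cols = ['col_5', 'col_3', 'col_6']

# Keep rows where any of the selected columns is NA
df_2 = df.loc[(df
               .loc[:, cols]
               .isna()
               .any(axis=1))]
```

Rows 0 and 1 survive because each has at least one missing value among the selected columns; row 2 is dropped.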

As Function Argument

One Column

We wish to pass to a function the name of a column to which row filtering logic will be applied.

In this example, we wish to create a function that filters the rows of a data frame df where the value of a dynamically specified column has a missing value NA.

def m_filter(df, col):
    df_2 = df.loc[df[col].isna()]
    return df_2

df_2 = df.pipe(m_filter, 'col_1')

Here is how this works:

  • We use pipe() to pass to the custom function m_filter() the data frame df and the name of the column to which filtering logic will be applied, 'col_1'.
  • The rest of the code works similarly to the “One Column” scenario under “As Environment Variable” above.
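As a self-contained sketch, assuming a small hypothetical data frame:

```python
import numpy as np
import pandas as pd

def m_filter(df, col):
    # Keep rows where the dynamically specified column is NA
    return df.loc[df[col].isna()]

# Hypothetical data frame for illustration
df = pd.DataFrame({'col_1': [1.0, np.nan],
                   'col_2': [1, 2]})

# pipe() passes df as the first argument, 'col_1' as the second
df_2 = df.pipe(m_filter, 'col_1')
```

Only the second row survives, since it is the only row where `col_1` is missing.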

Multiple Columns: Explicit Application

We wish to pass to a function the names of the columns to which row filtering logic will be applied.

In this example, we wish to create a function that filters the rows of a data frame df by the values of three dynamically specified columns. In particular, we wish to return rows where the value of the first column is not NA and the values of the second and third columns are equal.

def m_filter(df, cols):
    df_2 = df.loc[(~df[cols[0]].isna()) &
                  (df[cols[1]] == df[cols[2]])]
    return df_2

df_2 = df.pipe(m_filter, ['col_1','col_2','col_3'])

Here is how this works:

  • We use pipe() to pass to the custom function m_filter() the data frame df and the names of the columns to which filtering logic will be applied as a list ['col_1','col_2','col_3'].
  • The rest of the code works similarly to the “Multiple Columns: Explicit Application” scenario under “As Environment Variable” above.
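As a self-contained sketch, assuming a small hypothetical data frame. As in the environment-variable version, each condition is wrapped in parentheses because `&` binds tighter than `==`:

```python
import numpy as np
import pandas as pd

def m_filter(df, cols):
    # Keep rows where cols[0] is not NA and cols[1] equals cols[2]
    return df.loc[(~df[cols[0]].isna()) &
                  (df[cols[1]] == df[cols[2]])]

# Hypothetical data frame for illustration
df = pd.DataFrame({'col_1': [1.0, np.nan, 3.0],
                   'col_2': [1, 2, 2],
                   'col_3': [1, 2, 5]})

df_2 = df.pipe(m_filter, ['col_1', 'col_2', 'col_3'])
```

Only row 0 survives: row 1 has a missing `col_1`, and row 2 has unequal values in `col_2` and `col_3`.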

Multiple Columns: Implicit Application

We wish to create a function that accepts a column selection function, a row filtering predicate function, and a function that determines whether the results of applying the predicate function to each column are AND’ed or OR’ed.

In this example, we wish to return the rows of a data frame df where any of a set of dynamically specified columns has a missing value NA.

def m_filter(df, select_fn, filter_fn, rel_fn):
    selected_cols = select_fn(df)
    selected_rows = rel_fn((df
                            .loc[:, selected_cols]
                            .apply(filter_fn)), axis=1)
    df_2 = df.loc[selected_rows]
    return df_2

df_2 = df\
    .pipe(m_filter,
          lambda x: x.columns[x.columns.str.contains('col', regex=False)],
          pd.isna,
          pd.DataFrame.any)

Here is how this works:

  • We isolate the implicit filtering logic into the custom function m_filter() (See Implicit Filtering).
  • The function takes as input
    1. df: A data frame
    2. select_fn: A function to apply to df to obtain the columns to which the filtering logic will be applied (See Dynamic Selection Function Specification).
    3. filter_fn: A predicate function (one that returns True or False) to apply to each of the selected columns.
    4. rel_fn: The function used to combine the results of filter_fn for each row (can be pd.DataFrame.any or pd.DataFrame.all).
  • We use the pipe() method to pass to m_filter()
    • the data frame df in a chained manner
    • a lambda function lambda x: x.columns[x.columns.str.contains('col', regex=False)] that returns the names of the columns that satisfy the selection criteria; here, those whose name contains the literal substring 'col'.
    • the predicate function to apply pd.isna
    • the logical combination function pd.DataFrame.any
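As a self-contained sketch, assuming a small hypothetical data frame:

```python
import numpy as np
import pandas as pd

def m_filter(df, select_fn, filter_fn, rel_fn):
    # Obtain the columns to filter on
    selected_cols = select_fn(df)
    # Apply the predicate to each selected column, then combine row-wise
    selected_rows = rel_fn((df
                            .loc[:, selected_cols]
                            .apply(filter_fn)), axis=1)
    return df.loc[selected_rows]

# Hypothetical data frame for illustration
df = pd.DataFrame({'col_1': [1.0, np.nan, 3.0],
                   'col_2': [np.nan, 2.0, 3.0],
                   'other': [1, 2, 3]})

df_2 = (df
        .pipe(m_filter,
              lambda x: x.columns[x.columns.str.contains('col', regex=False)],
              pd.isna,
              pd.DataFrame.any))
```

Rows 0 and 1 survive: each has a missing value in one of the columns whose name contains 'col'; the `other` column is never inspected.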