We wish to dynamically specify the columns to which filtering logic would be applied.
We will cover two scenarios:
One Column
We wish to dynamically specify a column via an environment variable to use as part of a logical expression to filter the rows of a data frame.
In this example, we wish to filter the rows of a data frame df
where the value of a dynamically specified column has a missing value NA
.
col = 'col_1'
df_2 = df.loc[df[col].isna()]
Here is how this works:
col = 'col_1'
, we specify the name of the column to which filtering logic will be applied as an environment variable.df[col]
, we select the column whose name is in the variable col.Multiple Columns: Explicit Application
We wish to dynamically specify a set of column via an environment variable to use as part of a logical expression to filter the rows of a data frame.
In this example, we wish to filter the rows of a data frame df
by the values of three dynamically specified columns. In particular, we wish to return rows where the value of the first column is not NA
and the values of the second and third columns are equal.
cols = ['col_5','col_3','col_6']
df_2 = df.loc[~ df[cols[0]].isna() &
df[cols[1]] == df[cols[2]]]
Here is how this works:
cols
.loc[]
referring to each column name in the vector cols
via it’s index; e.g. cols[0]
returns the column name at position 0 in the list.KeyError
. To ignore list elements that do not match any of the data frame’s column names, see “Flexible Matching” in Dynamic Column Selection.Multiple Columns: Implicit Application
We wish to dynamically specify a set of column via an environment variable. We then wish to apply a logical condition (predicate function) to each of these columns and then filter the rows where the logical condition returns True
for either all or any of the columns.
In this example, we wish to return the rows of a data frame df
where any of a set of dynamically specified columns has a missing value NA
.
cols = ['col_5','col_3','col_6']
df_2 = df.loc[(df
.loc[:, cols]
.isna
.any(axis=1))]
Here is how this works:
cols
.df.loc[:, cols]
, we select the columns to which filtering logic will be applied which returns a data frame.isna()
to the data frame returned by df.loc[:, cols]
resulting in a data frame of the same dimensions where an element is True
if the corresponding value is the input data frame is NA
.any(axis=1)
to reduce the data frame of logical values to a Series
where a value is True
if the corresponding row has any value that is True
.One Column
We wish to pass to a function the name of a column to which row filtering logic will be applied.
In this example, we wish to create a function that filters the rows of a data frame df
where the value of a dynamically specified column has a missing value NA
.
def m_filter(df, col):
df_2 = df.loc[df[col].isna()]
return df_2
df_2 = df.pipe(m_filter, 'col_1')
Here is how this works:
pipe()
to pass to the custom function m_filter()
the data frame df
and the name of the column to which filtering logic will be applied ‘col_1’
.Multiple Columns: Explicit Application
We wish to pass to a function the names of the columns to which row filtering logic will be applied.
In this example, we wish to create a function that filters the rows of a data frame df
by the values of three dynamically specified columns. In particular, we wish to return rows where the value of the first column is not NA
and the values of the second and third columns are equal.
def m_filter(df, cols):
df_2 = df.loc[~ df[cols[0]].isna() &
df[cols[1]] == df[cols[2]]]
return df_2
df_2 = df.pipe(m_filter, ['col_1','col_2','col_3'])
Here is how this works:
pipe()
to pass to the custom function m_filter()
the data frame df
and the name of the columns to which filtering logic will be applied as a list ['col_1','col_2','col_3']
.Multiple Columns: Implicit Application
We wish to create a function that accepts a column selection function, a row filtering predicate functions and a function that determines if the results of applying each predicate function to each column are AND’ed or OR’ed.
In this example, we wish to return the rows of a data frame df
where any of a set of dynamically specified columns has a missing value NA
.
def m_filter(df, select_fn, filter_fn, rel_fn):
selected_cols = select_fn(df)
selected_rows = rel_fn((df
.loc[:, selected_cols]
.apply(filter_fn)), axis=1)
df_2 = df.loc[selected_rows]
return df_2
df_2 = df\
.pipe(m_filter,
lambda x: x.columns[x.columns.str.contains('col', regex=False)],
pd.isna,
pd.DataFrame.any)
Here is how this works:
m_filter()
(See Implicit Filtering).df
: A data frame select_fn
: A function to apply to df
obtain the columns to apply the filtering logic to (See Dynamic Selection Function Specification).apply_fn
: A predicate function (one that returns True of False) to apply to each of the selected functions.rel_fn
: The function to use to combine the results of the apply_fn
for each row (can be any()
or all()
).pipe()
method to pass to m_filter()
df
in a chained mannerlambda
function lambda x: x.columns.str.contains('cvr_', regex=False)
that returns a logical Series with the same number of elements as the number of columns and which has a value of True
for the columns that satisfy the selection criteria.pd.isna
pd.DataFrame.any