We wish to specify one or more logical expression(s) or predicate function(s) (functions that return TRUE
or FALSE
) to apply to each of the selected columns in an implicit filtering context.
In this section, we cover the following function specification scenarios:
lambda
function (an anononymous function).This section is complemented by
We wish to filter rows of a data frame by applying a named predicate function to each of a selected set of columns and then taking a logical combination of the results.
In this example, we wish to filter the rows of the data frame df
for which the value of any column whose name contains the string ‘cvr’
is missing (NA
).
df_2 = df.loc[(df
.loc[:, df.columns.str.contains('cvr_', regex=False)]
.apply(pd.isna)
.any(axis=1))]
Here is how this works:
df.columns.str.contains('cvr_', regex=False)
to select all columns whose name starts with the substring ‘cvr_’
. See Column Selection.pd.isna()
to apply()
to execute on the data frame of selected columns.df.apply(pd.isna)
we go over each value of the data frame df
and check if it is missing. The output is a data frame of the same size as df
and where a value is True
if the corresponding value in the data frame df
is missing (see Missing Values).any()
with axis=1
to logically combine the values for each row of the data frame of logical True
or False
values produced by apply(pd.isna)
. The output is True
for a row if the filtering condition evaluates to True
for any column.any()
is all()
which requires that the logical expression evaluates to TRUE
for all columns. See Relationship Selection..reset_index(drop=True)
. See Patchy Index under Common Scenarios.lambda
function. See New Columns under Common Scenarios.We wish to filter rows of a data frame by applying an anonymous predicate function to each of a selected set of columns and then taking a logical combination of the results.
In this example, we wish to filter the rows of the data frame df
for which the value of any column whose name contains the string ‘cvr’
is less than 0.1
.
df_2 = df.loc[(df
.loc[:, df.columns.str.contains('cvr_', regex=False)]
.apply(lambda x: x < 0.1)
.any(axis=1))]
Here is how this works:
apply()
a lambda
function lambda x: x < 0.1
that checks whether a value is less than 0.1.apply()
will be applied separately to each of the columns (axis=0
by default) of the data frame that apply()
is called on.We wish to filter rows of a data frame by applying multiple predicate function to each of a selected set of columns and then taking a logical combination of the results.
In this example, we wish to filter the rows of the data frame df
for which the value of any column whose name contains the string ‘cvr’
is missing (NA
) or infinite (Inf
).
df_2 = df.loc[(df
.loc[:, df.columns.str.contains('cvr_', regex=False)]
.apply([pd.isna, np.isinf])
.any(axis=1))]
Here is how this works:
apply()
a list of multiple functions.apply()
will be applied separately to each of the columns (axis=0
by default) of the data frame that apply()
is called on.pd.isna()
from Pandas and which checks if a value (of data frame or a Series) is missing (see Missing values).np.ising()
from NumPy and which checks if a value (of an array like object like a data frame or Series) is infinite (see Numerical Operations).We wish to filter rows of a data frame by a logical expression that involves applying a non-vectorized function (i.e. one that acts on one row at a time) to a set of selected columns.
In this example, we wish to filter the rows of the data frame df
for which the mean of the values of the columns whose name contains the string ‘cvr’
is less than 0.1
.
df_2 = df.loc[(df
.loc[:, df.columns.str.contains('cvr_', regex=False)]
.apply(np.mean, axis=1)
.lt(0.1))]
Here is how this works:
loc[:, df.columns.str.contains('cvr_', regex=False)]
, we select all columns whose name contains the string ‘cvr’
. Our purpose is to compute the mean of the values of those columns for each row.apply()
apply a particular function to the rows of a data frame (the default is columns) we set the argument axis
to axis=1
(instead of the default axis=0
). Therefore, apply(np.mean, axis=1) computes the mean of the selected columns for each row returning one numerical value for each row.lt()
(which is a function form of the simple less than operator <
) to check whether the numerical value returned for each row is less than 0.1. See Numerical Operations.