In some situations, the filtering logic we wish to carry out cannot be applied in a vectorized manner column-wise; rather, it needs to be applied in a non-vectorized manner to each row individually.
In this example, we wish to filter rows where the mean of the values of the columns 'col_1' and 'col_2' is greater than 0.
df_2 = df.loc[(df
.apply(lambda x: np.mean([x['col_1'], x['col_2']]), axis=1)
.gt(0))]
Here is how this works:
- We can pass any expression to loc[] to subset rows so long as it results in a Series of logical True or False values with the same length as the number of rows in the data frame.
- We use apply() while setting axis=1 to compute the mean of the values of 'col_1' and 'col_2' for each row.
- We compare the output of apply() with 0 via gt(0) (a function form of the greater than operator covered in more detail in Numerical Operations) to get a logical Series that has a value of True where the mean of the values of the columns 'col_1' and 'col_2' is greater than 0 and False otherwise.
- The resulting Series is then passed to loc[] to return the rows corresponding to values of True.

Alternatively,
df_2 = df.loc[(df[['col_1', 'col_2']]
.apply(np.mean, axis=1)
.gt(0))]
Here is how this works:
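- For each row, the values of the columns 'col_1' and 'col_2' are passed as a Series to the np.mean() function.
- To do so, we select the two columns via df[['col_1', 'col_2']] and then use .apply() to call the np.mean() function while setting axis=1 to specify that apply() should act on rows one at a time.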
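To see both approaches side by side, here is a minimal sketch that runs them on a small data frame. The sample values and the extra column 'col_3' are assumptions made up for illustration, and the names mask_1, mask_2, df_2a and df_2b are hypothetical helpers that do not appear in the snippets above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the values are illustrative only.
df = pd.DataFrame({
    'col_1': [1.0, -2.0, 3.0, -4.0],
    'col_2': [2.0, 1.0, -5.0, 6.0],
    'col_3': ['a', 'b', 'c', 'd'],
})

# Approach 1: a lambda that picks the two values out of each row.
mask_1 = df.apply(lambda x: np.mean([x['col_1'], x['col_2']]), axis=1).gt(0)
df_2a = df.loc[mask_1]

# Approach 2: select the two columns first, then apply np.mean row by row.
mask_2 = df[['col_1', 'col_2']].apply(np.mean, axis=1).gt(0)
df_2b = df.loc[mask_2]

# Row means are 1.5, -0.5, -1.0 and 1.0, so the rows at index 0 and 3 are kept.
print(df_2a)

# Both approaches should select the same rows.
same_rows = df_2a.equals(df_2b)
print(same_rows)
```

In this sketch, each mask is a logical Series with one value per row of df, which is what loc[] needs in order to subset rows.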