Non Vectorized Filtering

In some situations, the filtering logic we wish to carry out cannot be applied in a vectorized manner to whole columns; instead, it needs to be applied in a non-vectorized manner to each row individually.

In this example, we wish to keep the rows where the mean of the values of the columns 'col_1' and 'col_2' is greater than 0.

df_2 = df.loc[(df
        .apply(lambda x: np.mean([x['col_1'], x['col_2']]), axis=1)
        .gt(0))]

Here is how this works:

  • As covered in Custom Filtering, we can run a function inside loc[] to subset rows so long as it results in a Series of logical True or False values with the same length as the number of rows in the data frame.
  • We use apply() while setting axis=1 to compute the mean of the values of ‘col_1’ and ‘col_2’ for each row.
  • We compare the output of apply() with 0 via gt(0) (a function form of the greater than operator, covered in more detail in Numerical Operations) to get a logical Series that has a value of True where the mean of the values of the columns 'col_1' and 'col_2' is greater than 0 and False otherwise.
  • The resulting logical Series is then passed to loc[] to return the rows corresponding to values of True.
  • See Non Vectorized Transformation for deeper coverage of non-vectorized operations. All the scenarios covered there can also be applied to filtering.
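
The steps above can be traced on a small data frame. The following is a minimal sketch, assuming made-up sample values (the data is hypothetical; only the column names 'col_1' and 'col_2' come from the example). It splits the chained expression into named intermediate steps so that each one can be inspected.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, chosen only for illustration.
df = pd.DataFrame({
    'col_1': [1.0, -2.0, 3.0, -4.0],
    'col_2': [2.0, 1.0, -5.0, 6.0],
})

# Step 1: compute the mean of 'col_1' and 'col_2' one row at a time.
# With axis=1, the lambda receives each row as a Series.
row_mean = df.apply(lambda x: np.mean([x['col_1'], x['col_2']]), axis=1)

# Step 2: turn the row means into a logical Series (True where mean > 0).
mask = row_mean.gt(0)

# Step 3: pass the logical Series to loc[] to keep only the True rows.
df_2 = df.loc[mask]

print(mask.tolist())        # [True, False, False, True]
print(df_2.index.tolist())  # [0, 3]
```

The row means here are 1.5, -0.5, -1.0 and 1.0, so only the first and last rows are kept.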

Alternatively,

df_2 = df.loc[(df[['col_1', 'col_2']]
        .apply(np.mean, axis=1)
        .gt(0))]

Here is how this works:

  • In this particular case we can pass the values of 'col_1' and 'col_2' as a Series to the np.mean() function.
  • To do so, we select the columns we wish to pass to the function via basic indexing in df[['col_1', 'col_2']].
  • We then apply() the np.mean() function while setting axis=1 to specify that apply() should act on rows one at a time.
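
As a rough illustration of why the column selection matters, the sketch below adds a hypothetical third column, 'col_3', that should not take part in the mean. The sample values and the name 'col_3' are assumptions made for this example only.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; 'col_3' is an extra column that must be ignored.
df = pd.DataFrame({
    'col_1': [1.0, -2.0, 3.0, -4.0],
    'col_2': [2.0, 1.0, -5.0, 6.0],
    'col_3': [-100.0, -100.0, -100.0, -100.0],
})

# Select only the two columns of interest, then apply np.mean to each row.
# With axis=1, each row of the selection is passed to np.mean as a Series.
mask = df[['col_1', 'col_2']].apply(np.mean, axis=1).gt(0)

# loc[] filters the full data frame, so 'col_3' is still present in the result.
df_2 = df.loc[mask]

print(df_2.index.tolist())    # [0, 3]
print(df_2.columns.tolist())  # ['col_1', 'col_2', 'col_3']
```

Because the selection happens before apply(), 'col_3' has no effect on the mask, yet the filtered result still contains all three columns.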