Filtering

This section is concerned with extracting a subset of rows by column values which we will refer to going forward as filtering.

The output of the filtering operation is a subset of the rows of the input Data Frame appearing in the same order and with the columns of the original data frame unchanged.

The input to the filtering operation is essentially a logical vector with as many elements as there are rows in the data frame or more commonly a logical expression on the data frame columns that evaluates to True or False for each row. Rows for which the corresponding value is True are retained (included in the output).

This section is organized as follows:

  • Basic Filtering: We cover the basic yet very common filtering scenarios e.g. filter rows where a numerical column has a particular value.
  • Custom Filtering: We look at scenarios where the filtering logic is more involved than a simple comparison usually involving transformations or aggregations of the original data e.g. filter rows where a numerical column has a value that is higher than the mean value.
  • Grouped Filtering: Where we apply filters within groups e.g. for each group, filter the row where a numerical column has the highest value.
  • Implicit Filtering: The columns to which we wish to apply the filtering logic are not specified explicitly rather by criteria they meet e.g. their name starts with a particular sub string.
  • Dynamic Filtering: The columns to filter or the functions to apply for filtering are passed dynamically as function arguments or variables in the environment and applied at “run time”.
  • Filtering Joins: Filtering joins allow us to filter rows from a data frame df_1 based on the presence or absence of matches in another data frame df_2.
PYTHON
I/O