Filtering Joins

Filtering joins allow us to filter rows from a data frame based on the presence or absence of matches in another data frame.

There are two types of filtering joins which are:

  1. Semi Join which allows us to filter (return) rows of a data frame that have a match in another data frame.
  2. Anti Join which allows us to filter (return) rows of a data frame that have no match in another data frame.

These operations are useful in some filtering scenarios and also for debugging when carrying out data frame join operations (see Joining).

Semi Join

We wish to return rows of a data frame that have a match in another data frame.

In this example, we wish to return rows in data frame df_1 that match rows in another data frame df_2. A row in df_1 is considered a match to a row in df_2 if the values of columns col_1 and col_2 in both data frames are a match.

df_3 = df_1 %>% semi_join(df_2, by = c("col_1", "col_2"))

Here is how this works:

  • We use the semi_join() function from the dplyr package.
  • semi_join() expects two data frames and a set of join columns.
  • We pass the data frame that we wish to filter df_1 to semi_join() via the pipe %>%. Of course, we could pass df_1 as the first argument to semi_join() without the pipe.
  • We then pass to semi_join() the data frame to use for filtering df_2 as well as the join columns c("col_1", "col_2").
  • semi_join() returns the rows of data frame df_1 where the combination of values of col_1 and col_2 appears in the corresponding columns in the data frame df_2.
  • In this case the join columns have the same names in both data frames. Had they been named differently we would use a named vector as follows by = c("col_1" = "col_a", "col_2" = "col_b").

Anti Join

We wish to return rows of a data frame that do not have a match in another data frame.

In this example, we have two data frames df_1 and df_2 and we wish to retain rows that are in df_1 but not in df_2. A row in df_1 is considered a match to a row in df_2 if the values of columns col_1 and col_2 in both data frames are a match.

df_3 = df_1 %>%  anti_join(df_2, by = c("col_1", "col_2"))

Here is how this works:

  • We use the anti_join() function from the dplyr package.
  • anti_join() expects two data frames and a set of join columns.
  • We pass the data frame that we wish to filter df_1 to anti_join() via the pipe %>%. Of course, we could pass df_1 as the first argument to anti_join() without the pipe.
  • We then pass to anti_join() the data frame to use for filtering df_2 as well as the join columns c("col_1", "col_2").
  • anti_join() returns the rows of data frame df_1 where the combination of values of col_1 and col_2 do not appear in the corresponding columns in the data frame df_2.
  • In this case the join columns have the same names in both data frames. Had they been named differently we would use a named vector as follows by = c("col_1" = "col_a", "col_2" = "col_b").
R
I/O