Filtering joins allow us to filter rows from a data frame based on the presence or absence of matches in another data frame.
There are two types of filtering joins which are:
These operations are useful in some filtering scenarios and also for debugging when carrying out data frame join operations (see Joining).
We wish to return rows of a data frame that have a match in another data frame.
In this example, we wish to return rows in data frame df_1
that match rows in another data frame df_2
. A row in df_1
is considered a match to a row in df_2
if the values of columns col_1
and col_2
in both data frames are a match.
df_3 = df_1 %>% semi_join(df_2, by = c("col_1", "col_2"))
Here is how this works:
semi_join()
function from the dplyr
package.semi_join()
expects two data frames and a set of join columns.df_1
to semi_join()
via the pipe %>%
. Of course, we could pass df_1
as the first argument to semi_join()
without the pipe.semi_join()
the data frame to use for filtering df_2
as well as the join columns c("col_1", "col_2")
.semi_join()
returns the rows of data frame df_1
where the combination of values of col_1
and col_2
appears in the corresponding columns in the data frame df_2
.by = c("col_1" = "col_a", "col_2" = "col_b")
.We wish to return rows of a data frame that do not have a match in another data frame.
In this example, we have two data frames df_1
and df_2
and we wish to retain rows that are in df_1
but not in df_2
. A row in df_1
is considered a match to a row in df_2
if the values of columns col_1
and col_2
in both data frames are a match.
df_3 = df_1 %>% anti_join(df_2, by = c("col_1", "col_2"))
Here is how this works:
anti_join()
function from the dplyr
package.anti_join()
expects two data frames and a set of join columns.df_1
to anti_join()
via the pipe %>%
. Of course, we could pass df_1
as the first argument to anti_join()
without the pipe.anti_join()
the data frame to use for filtering df_2
as well as the join columns c("col_1", "col_2")
.anti_join()
returns the rows of data frame df_1
where the combination of values of col_1
and col_2
do not appear in the corresponding columns in the data frame df_2
.by = c("col_1" = "col_a", "col_2" = "col_b")
.