Filtering joins allow us to filter rows from a data frame based on the presence or absence of matches in another data frame.
There are two types of filtering joins which are:
These operations are useful in some filtering scenarios and also for debugging when carrying out data frame join operations (see Joining).
We wish to return rows of a data frame that have a match in another data frame.
In this example, we wish to return rows in data frame df_1
that match rows in another data frame df_2
. A row in df_1
is considered a match to a row in df_2
if the combination of values of columns col_1
and col_2
in both data frames are a match.
df_t = df_1.merge(df_2, how='inner', on=['col_1', 'col_2'])
df_3 = df_1.merge(df_t[['col_1', 'col_2']], how='inner')
Here is how this works:
- There is no dedicated semi join function in Pandas.
- We apply a series of join operations (see Joining).
- We first use merge()
execute an inner join between df_1
and df_2
on the specified join columns ['col_1', 'col_2']
.
- We then use merge()
again to execute a second inner join between the output of the first inner join df_t
and df_1
.
- We include only the join columns of df_t
in the second join so the final output df_3
has only columns of df_1
unaltered.
- Note that the resulting data frame will have a fresh index starting at 0 and until n-1 (where n is the number of rows returned). If we wish to retain the index of the original data frame df_1
, we can use the alternative solution below.
Alternatively:
One Join Column
df_3 = df_1.loc[df_1['col_1'].isin(df_2['col_1'])]
Here is how this works:
isin()
method to identify rows in the data frame df_1
that have a match in the data frame df_2
. isin()
returns a Series
of the same size as the column df_1['col_1']
that has the value True
if the corresponding value of df_1['col_1']
appears in df_2['col_1']
. For more on isin()
, see List Operations.loc[]
to return the rows of df_1
for which df_1['col_1'].isin(df_2['col_1'])
is True
i.e. rows of df_1
that have a match in df_2
. df_1['col_1'].isin(df_2['col_a'])
.Many Join Columns
df_3 = df_1.loc[df_1[['col_1', 'col_2']].apply(tuple, axis=1).isin(
df_2[['col_1', 'col_2']].apply(tuple, axis=1))]
Here is how this works:
isin()
to identify matching rows like we did in the "One Join Column" scenario described above. isin()
to identify matches defined by multiple columns, we convert the values of the join columns for each row into a tuple
. We do so via apply(tuple, axis=1)
. tuples can then be matched by isin()
as if they are single elements.loc[]
to return the rows of df_1
for which isin()
returns True
i.e. rows of df_1
that have a match in df_2
.df_1
. See Patchy Index in Basic Filtering.We wish to return rows of a data frame that do not have a match in another data frame.
In this example, we have two data frames df_1
and df_2
and we wish to retain rows that are in df_1
but not in df_2
. A row in df_1
is considered a match to a row in df_2
if the values of columns col_1
and col_2
in both data frames are a match.
df_x = df_1.merge(df_2, how='left', on=['col_1', 'col_2'], indicator=True)
df_y = df_x.loc[df_x['_merge'] == 'left_only', ['col_1', 'col_2']]
df_3 = df_1.merge(df_y, how='inner')
Here is how this works:
merge()
execute a left between df_1
and df_2
on the specified join columns ['col_1', 'col_2']
. merge()
has a parameter indicator
which we set to indicator=True
to add a column called _merge
which takes the value left_only
to tell us that the row is coming from the left data frame (df_1
) only or the value both
to tell us that the row (as defined by the join columns) exists in both data frames.df_x['_merge'] == 'left_only'
i.e. rows that exist in the data frame df_1
but not in df_2
.df_x
so the final output df_3
has only columns of df_1
unaltered.merge()
again to execute an inner join between the output of the filtering operation df_y
and df_1
.df_1
, we can use the alternative solution below.Alternatively:
One Join Column
df_3 = df_1.loc[~df_1['col_1'].isin(df_2['col_1'])]
Here is how this works:
~
) operator to take the complement.Many Join Columns
df_3 = df_1.loc[~df_1[['col_1', 'col_2']].apply(tuple, axis=1).isin(
df_2[['col_1', 'col_2']].apply(tuple, axis=1))]
Here is how this works:
~
) operator to take the complement.