Filtering Joins

Filtering joins allow us to filter rows from a data frame based on the presence or absence of matches in another data frame.

There are two types of filtering joins which are:

  1. Semi Join which allows us to filter (return) rows of a data frame that have a match in another data frame.
  2. Anti Join which allows us to filter (return) rows of a data frame that have no match in another data frame.

These operations are useful in some filtering scenarios and also for debugging when carrying out data frame join operations (see Joining).

Semi Join

We wish to return rows of a data frame that have a match in another data frame.

In this example, we wish to return rows in data frame df_1 that match rows in another data frame df_2. A row in df_1 is considered a match to a row in df_2 if the combination of values of columns col_1 and col_2 in both data frames are a match.

df_t = df_1.merge(df_2, how='inner', on=['col_1', 'col_2'])
df_3 = df_1.merge(df_t[['col_1', 'col_2']], how='inner')

Here is how this works: - There is no dedicated semi join function in Pandas. - We apply a series of join operations (see Joining). - We first use merge() execute an inner join between df_1 and df_2 on the specified join columns ['col_1', 'col_2']. - We then use merge() again to execute a second inner join between the output of the first inner join df_t and df_1. - We include only the join columns of df_t in the second join so the final output df_3 has only columns of df_1 unaltered. - Note that the resulting data frame will have a fresh index starting at 0 and until n-1 (where n is the number of rows returned). If we wish to retain the index of the original data frame df_1, we can use the alternative solution below.

Alternatively:

One Join Column

df_3 = df_1.loc[df_1['col_1'].isin(df_2['col_1'])]

Here is how this works:

  • The approach we use here involves using the isin() method to identify rows in the data frame df_1 that have a match in the data frame df_2.
  • isin() returns a Series of the same size as the column df_1['col_1'] that has the value True if the corresponding value of df_1['col_1'] appears in df_2['col_1']. For more on isin(), see List Operations.
  • We use loc[] to return the rows of df_1 for which df_1['col_1'].isin(df_2['col_1']) is True i.e. rows of df_1 that have a match in df_2.
  • In this case the join columns have the same names in both data frames. Had they been named differently we simply use the corresponding column names for each data frame e.g. df_1['col_1'].isin(df_2['col_a']).

Many Join Columns

df_3 = df_1.loc[df_1[['col_1', 'col_2']].apply(tuple, axis=1).isin(
    df_2[['col_1', 'col_2']].apply(tuple, axis=1))]

Here is how this works:

  • We use isin() to identify matching rows like we did in the "One Join Column" scenario described above.
  • In order to get isin() to identify matches defined by multiple columns, we convert the values of the join columns for each row into a tuple. We do so via apply(tuple, axis=1). tuples can then be matched by isin() as if they are single elements.
  • We use loc[] to return the rows of df_1 for which isin() returns True i.e. rows of df_1 that have a match in df_2.
  • The resulting data frame has the same index as the input data frame df_1. See Patchy Index in Basic Filtering.
  • In this case the join columns have the same names in both data frames. Had they been named differently we would simply use the corresponding column names for each data frame.

Anti Join

We wish to return rows of a data frame that do not have a match in another data frame.

In this example, we have two data frames df_1 and df_2 and we wish to retain rows that are in df_1 but not in df_2. A row in df_1 is considered a match to a row in df_2 if the values of columns col_1 and col_2 in both data frames are a match.

df_x = df_1.merge(df_2, how='left', on=['col_1', 'col_2'], indicator=True)
df_y = df_x.loc[df_x['_merge'] == 'left_only', ['col_1', 'col_2']]
df_3 = df_1.merge(df_y, how='inner')

Here is how this works:

  • There is no dedicated anti join function in Pandas.
  • We apply a series of join operations (see Joining).
  • We first use merge() execute a left between df_1 and df_2 on the specified join columns ['col_1', 'col_2'].
  • merge() has a parameter indicator which we set to indicator=True to add a column called _merge which takes the value left_only to tell us that the row is coming from the left data frame (df_1) only or the value both to tell us that the row (as defined by the join columns) exists in both data frames.
  • We then filter the output of the left join df_x to retain only rows where df_x['_merge'] == 'left_only' i.e. rows that exist in the data frame df_1 but not in df_2.
  • We retain only the join columns of df_x so the final output df_3 has only columns of df_1 unaltered.
  • We then use merge() again to execute an inner join between the output of the filtering operation df_y and df_1.
  • Note that the resulting data frame will have a fresh index starting at 0 and until n-1 (where n is the number of rows returned). If we wish to retain the index of the original data frame df_1, we can use the alternative solution below.

Alternatively:

One Join Column

df_3 = df_1.loc[~df_1['col_1'].isin(df_2['col_1'])]

Here is how this works:

  • This is the same as the alternative semi join solution described above.
  • We add a not (~) operator to take the complement.

Many Join Columns

df_3 = df_1.loc[~df_1[['col_1', 'col_2']].apply(tuple, axis=1).isin(
    df_2[['col_1', 'col_2']].apply(tuple, axis=1))]

Here is how this works:

  • This is the same as the alternative semi join solution described above.
  • We add a not (~) operator to take the complement.
PYTHON
I/O