Join Special Cases

Joining data frames is a common task in data analysis and manipulation, but it can also present some special cases that we need to be mindful of and handle as appropriate for the purpose at hand.

In this section, we will discuss some of these special cases, including:

Duplicate Column Names: Join two data frames with duplicate column names. ****
Duplicate Join Values: Join two data frames with duplicate values in the join columns.
Missing Join Values: Join two data frames with missing values in the join columns.
Different Join Data Types: Join two data frames on join columns with different data types.

This section is complemented by:

Join Types: where we cover how to specify the join type.
Join Columns: where we cover how to specify join columns.

Duplicate Column Names

We wish to rename columns that have the same name in both data frames being joined.

Suffix

We wish to add a suffix per data frame to columns with the same name in both data frames.

In this example, we wish to rename col_2 coming from df_1 to col_2_1 and rename col_2 coming from df_2 to col_2_2.

df = df_1.merge(df_2, how='inner', on="col_1", suffixes=("_1", "_2"))

Here is how this works:

We use the merge() function in pandas to combine df_1 and df_2 based on the common column col_1.
We specify the join type by passing inner as the how argument.
By default, merge() will join the two data frames on all the columns with the same name in both data frames. Here we specify that we wish to join on col_1 only by using on="col_1". See Join Columns for coverage of different scenarios.
By default, merge() function will use _x and _y as suffixes to resolve duplicate column names. We can change these suffixes by passing a tuple of suffixes ("_1", "_2") in this example.
The output data frame df will contain the rows that have the same value for col_1 in both df_1 and df_2 with three columns col_1, col_2_1, and col_2_2.

Rename

We wish to rename columns with the same name in both data frames.

In this example, we wish to rename col_3 coming from df_1 to col_2 and rename col_4 coming from df_2 to col_3 and col_5 to col_4.

df = df_1.rename(columns={'col_3': 'col_2'}). \
    merge(df_2.rename(columns={'col_4': 'col_3', 'col_5': 'col_4'}),
          how='inner')

Here is how this works:

We use the merge() function in pandas to combine df_1 and df_2 based on the common column col_1.
We specify the join type by passing inner as the how argument.
By default, merge() function will use _x and _y as suffixes to resolve duplicate column names. We can change these suffixes by passing a tuple of suffixes ("_1", "_2") in this example.
We use the data frame method rename() to change column names from their current values to the desired values.
To rename columns, rename() expects a Python dictionary mapping between the current column names (as dictionary keys) and desired column names (as dictionary values); i.e. {current_name : desired_name}. The name-mapping dictionary in this case is {'col_a': 'col_1', 'col_b': 'col_2'}.
We only need to include the dictionary mapping for the columns we wish to rename.
See Renaming for a coverage of different renaming scenarios.

Alternative: Rename After Join

df = df_1.merge(df_2, how='inner', on='col_1').rename(
    columns={'col_4_x': 'col_2', 'col_4_y': 'col_3', 'col_5': 'col_4'})

Here is how this works:

This works similarly to the primary solution except we rename the columns after performing the join operation.
The pandas merge function automatically add _x for columns coming from df_1 and _y suffix for columns coming from df2 if they have the same name.

Duplicate Join Values

We wish to join two data frames and handle duplicate values in the join columns by removing them. If the desired output is a Cartesian product, then there is no need to remove duplicate values.

In this example, we wish to drop duplicates in df_1 and then perform the join operation.

df = df_1.drop_duplicates(subset=["col_1"]).merge(df_2, how='inner')

Here is how this works:

We use the drop_duplicates() function to drop duplicate rows in df_1.
We pass the subset of columns we wish to use to check for duplicates, col_1 in this example.
The output data frame df will contain the rows that have the same value for col_1 in both df_1 and df_2 (except second row from df_1) and all columns in both data frames with only one col_1.

We wish to join two data frames with missing values in the join columns. NA values should not be matched because NA values are not comparable. Unfortunately, Pandas has a bug in which this is not the current behavior where NA values will be matched. We describe the needed step to get the expected output based on the join type.

Inner Join

We wish to perform an inner join between two data frames with missing values in the join column. Any rows with missing values in the join column will be excluded which means that we need to drop NA values from either data frame.

In this example, we need to drop rows with NA values in the join column col_1.

df = df_1.dropna(subset=["col_1"]).merge(df_2, how='inner')

Here is how this works:

We use the merge() function in pandas to combine df_1 and df_2 based on the common column col_1.
We specify the join type by passing inner as the how argument.
By default, merge() will join the two data frames on all the columns with the same name in both data frames. See Join Columns for coverage of join columns.
Pandas will match the forth row with NA value in col_1. We use dropna() function to drop rows with missing values, and we pass col_1 as the subset of columns we wish to use.
The output data frame df will contain the rows that have the same value for col_1 in both df_1 and df_2 excluding any rows with NA values in col_1 and all columns in both data frames with only one col_1.

Left Join

We wish to perform a left join between two data frames with missing values in the join column. The resulting data frame will include all rows from the left data frame, along with any matching rows from the right data frame. Rows with missing values in the join column will be included in the resulting data frame and will have NA values for the columns from the right data frame.

In this example, we need to drop rows with NA in the values in the join column col_1 in df_2.

df = df_1.merge(df_2.dropna(subset=["col_1"]), how='left')

Here is how this works:

This works similarly to the scenario above, except we drop rows with NA values in the join from the right data frame, and we use a left join instead of an inner join.
Right Join works in a similar manner except we drop rows with NA values in the join column from the left data frame.

Full Join

We wish to perform a full join between two data frames with missing values in the join column. The resulting data frame will include all rows from both data frames. Rows with no matching or missing values in the join column will be included in the resulting data frame and will have NA values for the columns from the other data frame.

df = df_1.dropna(subset=["col_1"]).merge(df_2.dropna(subset=["col_1"]),
                                         how='outer')
df = pd.concat([df,
                df_1.loc[df_1['col_1'].isnull()],
                df_2.loc[df_2['col_1'].isnull()]
                ])

Here is how this works:

We use merge() function to perform a full join between df_1 and df_2.
We first drop rows with NA values on col_1 from both data frames. Rows that don’t match the join condition will have NA for all columns coming from the other data frame.
We then append rows with NA values in col_1 to df using concat() function. See Binding for a coverage of binding.
We filter rows with missing values in col_1 using df_1.loc[df_1['col_1'].isnull()]. See Filtering for a coverage of filtering.
The output data frame df will contain all rows from df_1 and df_2 where rows that have the same value for col_1 in both df_1 and df_2 are combined. Rows that don’t match the join condition will have NA for all columns coming from the other data frame.

Different Join Data Types

We wish to join two data frames on a common column with a different data type.

In this example we wish to join df_1 and df_2 on col_1 which is of type integer in df_1 and of type string in df_2.

df = df_1.merge(df_2.astype({'col_1': int}), how='inner', on='col_1')

Here is how this works:

We cast col_1 to integer before joining using astype() function. See Type Casting.

Optima.io Reference beta

Join Special Cases

Duplicate Column Names

Duplicate Join Values

Missing Join Values

Different Join Data Types