Join Columns

We wish to join two data frames based on specific columns.

This section is organized as follows:

  • All Common Columns: We wish to join two data frames based on all common columns between them. This is often referred to as a Natural Join.
  • One Column: We wish to join two data frames based on one column.
  • Multiple Columns: We wish to join two data frames based on multiple columns.
  • Index Join: We wish to join two data frames based on their index.
  • Column and Index Join: We wish to join two data frames based on the index from one data frame and a column from the other data frame.

This section is complemented by:

  • Join Types: where we cover how to specify the join type.
  • Join Special Cases: where we cover how to deal with special cases when performing a join operation.

All Common Columns

We wish to join two data frames based on all common columns between them. This is often referred to as a Natural Join.

In this example, we wish to join df_1 and df_2 based on the columns named col_1 and col_2 in both data frames.

df = df_1.merge(df_2, how='inner')

Here is how this works:

  • We use the merge() function in pandas to combine df_1 and df_2 based on the common column col_1.
  • We specify the join type by passing inner as the how argument. See Join Types.
  • By default, merge() will join the two data frames on all the columns with the same name in both data frames.
  • The output data frame df will contain the rows that have the same value for both col_1 and col_2 in both df_1 and df_2 and all columns in both data frames with only one col_1 and col_2.

One Column

We wish to join two data frames based on one column.

Common Column Name

We wish to join two data frames based on one column that have the same name in the data frames to be joined.

In this example, we wish to join df_1 and df_2 based on a column named col_1 in both data frames.

df = df_1.merge(df_2, how='inner', on='col_1')

Here is how this works:

  • We use the merge() function in pandas to combine df_1 and df_2 based on the common column col_1.
  • We specify the join type by passing inner as the how argument. See Join Types.
  • By default, merge() will join the two data frames on all the columns with the same name in both data frames. To change this behavior, we specify the column we wish to join on by passing its name as a string to the on argument; which in this example is on=’col_1’.
  • The output data frame df will contain the rows that have the same value for col_1 in both df_1 and df_2 and all columns in both data frames with only one col_1.

Different Column Name

We wish to join two data frames based on one column that has a different name in the data frames.

In this example, we wish to join df_1 and df_2 based on col_1_x from df_1 and col_1_y in df_2 only.

df = df_1.merge(df_2, how='inner', left_on='col_1_x', right_on='col_1_y')

Here is how this works:

  • We use the merge() function in pandas to combine df_1 and df_2 based on col_1_x and col_1_y.
  • We specify the join type by passing inner as the how argument. See Join Types.
  • We use left_on and right_on arguments to specify the column we wish to use in the join condition. In this example, df_1 is the left data frame and df_2 is the right one.
  • The output data frame df will contain the rows that have the same value for col_1_x and col_1_y and all columns in both data frames.

Multiple Columns

We wish to join two data frames based on multiple columns.

Common Columns Names

We wish to join two tables based on multiple columns that have the same name in the data frames to be joined.

In this example, we wish to join df_1 and df_2 based on the columns named col_1 and col_2 in both data frames.

df = df_1.merge(df_2, how='inner', on=['col_1', 'col_2'])

Here is how this works:

  • This works similarly to the One Column scenario above except we pass a list of columns to the on argument.
  • The output data frame df will contain the rows that have the same value for col_1 and col_2 in both df_1 and df_2 and all columns in both data frames with one instance of each of col_1 and col_2.

Different Columns Names

We wish to join two data frames based on multiple columns, while each of the join columns has a different name in either data frame.

In this example, we wish to join df_1 and df_2 based on col_1_x and col_2_x from df_1 and col_1_y and col_2_y in df_2.

df = df_1.merge(df_2, how='inner',
                left_on=['col_1_x', 'col_2_x'],
                right_on=['col_1_y', 'col_2_y'])

Here is how this works:

  • This works similarly to the One Column scenario above except we pass a list of columns to the left_on and right_on arguments.
  • The output data frame df will contain the rows that have the same value for all the specified join columns. All columns from both data frames will be in the output data frame df.

Index Join

We wish to join two data frames based on their index.

In this example, we wish to join df_1 and df_2 based on their index.

df = df_1.merge(df_2, how='inner', left_index=True, right_index=True)

Here is how this works:

  • We use the merge() function in pandas to combine df_1 and df_2 based on the index.
  • We specify the join type by passing inner as the how argument. See Join Types.
  • We set both left_index and right_index arguments to True which means the two data frames will be joined based on the index. In this example, df_1 is the left data frame and df_2 is the right one.
  • The output data frame df will contain the rows that have the same index value in df_1 and df_2.

Extension: Join on MultiIndex

df = df_1.merge(df_2, how='inner', left_index=True, right_index=True)

Here is how this works: This works similarly to the scenario above except here the data frames have a MultiIndex.

Extension: Join a Single Index to a MultiIndex

df = df_1.merge(df_2, how='inner', left_index=True, right_index=True)

Here is how this works:

  • This works similarly to the scenario above except here the right data data frame have a single index.
  • Note: indices must be named and share the same name.

Column and Index Join

We wish to join two data frames based on the index from one data frame and a column from the other data frame.

df = df_1.merge(df_2, how='inner', left_index=True, right_on='col_1')

Here is how this works:

  • We use the merge() function in pandas to combine df_1 and df_2 based on the index.
  • We specify the join type by passing inner as the how argument. See Join Types.
  • To use the index of the left data frame df_1 as a join column we both left_index to True and we pass col_1 from df_2 as the right_on argument.
  • The output data frame df will contain the rows that have the same value as an index in df_1 and col_1 in df_2.

Extension: Join on MultiIndex

df = df_1.merge(df_2, how='inner',
                left_index=True,
                right_on=['col_3', 'col_4'])

Here is how this works:

  • This works similarly to the scenario above except here the data frame df_1 has MultiIndex.
  • We pass the list of columns from df_2 that maps to the MultiIndex hierarchy to the right_on argument.
PYTHON
I/O