Join Columns

We wish to join two data frames based on specific columns.

This section is organized as follows:

  • All Common Columns: We wish to join two data frames based on all common columns between them. This is often referred to as a Natural Join.
  • One Column: We wish to join two data frames based on one column.
  • Multiple Columns: We wish to join two data frames based on multiple columns.

This section is complemented by:

  • Join Types: where we cover how to specify the join type.
  • Join Special Cases: where we cover how to deal with special cases when performing a join operation.

All Common Columns

We wish to join two data frames based on all common columns between them. This is often referred to as a Natural Join.

In this example, we wish to join df_1 and df_2 based on the columns named col_1 and col_2 in both data frames.

df = df_1 %>% inner_join(df_2)

Here is how this works:

  • We use the inner_join() function from the dplyr package to combine df_1 and df_2 based on the common columns col_1 and col_2.
  • By default, all joining functions in dplyr will join the two data frames on all the columns with the same name in both data frames.
  • The output data frame df will contain the rows that have the same value for both col_1 and col_2 in both df_1 and df_2 and all columns in both data frames with only one col_1 and col_2.

One Column

We wish to join two data frames based on one column.

Common Column Name

We wish to join two data frames based on one column that have the same name in the data frames to be joined.

In this example, we wish to join df_1 and df_2 based on a column named col_1 in both data frames.

df = df_1 %>% inner_join(df_2, by = "col_1")

Here is how this works:

  • We use the inner_join() function from the dplyr package to combine df_1 and df_2 based on the common column col_1.
  • By default, inner_join() will join the two data frames on all the columns with the same name in both data frames. To change this behavior, we specify the column we wish to join on by passing its name as a string to the by argument; which in this example is by=”col_1”.
  • The output data frame df will contain the rows that have the same value for col_1 in both df_1 and df_2 and all columns in both data frames with only one col_1.
  • The same applies to other join types. See * Join Types*

Different Column Name

We wish to join two data frames based on one column that has a different name in the data frames.

In this example, we wish to join df_1 and df_2 based on col_1_x from df_1 and col_1_y in df_2 only.

df = df_1 %>% inner_join(df_2, by = c("col_1_x" = "col_1_y"))

Here is how this works:

  • We use the inner_join() function from the dplyr package to combine df_1 and df_2 based on the col_1_x and col_1_y.
  • We use by = c("col_1_x" = "col_1_y") to specify the mapping between the names of the column we wish to join on.
  • The output data frame df will contain the rows that have the same value for col_1_x and col_1_y and all columns in both data frames except col_1_y.
  • The same applies to other join types. See Join Types

Multiple Columns

We wish to join two data frames based on multiple columns.

Common Columns Names

We wish to join two tables based on multiple columns that have the same name in the data frames to be joined.

In this example, we wish to join df_1 and df_2 based on the columns named col_1 and col_2 in both data frames.

df = df_1 %>% inner_join(df_2, by = c("col_1", "col_2"))

Here is how this works:

  • This works similarly to the One Column scenario above except we pass a list of columns to the by argument.
  • The output data frame df will contain the rows that have the same value for col_1 and col_2 in both df_1 and df_2 and all columns in both data frames with one instance of each of col_1 and col_2.

Different Columns Names

We wish to join two data frames based on multiple columns, while each of the join columns has a different name in either data frame.

In this example, we wish to join df_1 and df_2 based on col_1_x and col_2_x from df_1 and col_1_y and col_2_y in df_2.

df = df_1 %>% inner_join(df_2, by = c("col_1_x" = "col_1_y",
                                      "col_2_x" = "col_2_y"))

Here is how this works:

  • This works similarly to the One Column scenario above except we pass a list of columns to the by argument.
  • Each item in the list is mapping each column to its counterpart in the other data frame.
  • The output data frame df will contain the rows that have the same value for all the specified join columns. All columns from both data frames will be in the output data frame df except join columns in the right data frame which are col_1_y and col_2_y in this example.
R
I/O