Join Types

We wish to join two or more data frames based on a join condition. In this section, we will cover the most common types of joins, including:

Inner Join returns only the rows that meet the join condition in both data frames.
Left Join returns all rows from the left data frame and the rows that meet the join condition from the right data frame.
Right Join returns all rows from the right data frame and the rows that meet the join condition from the left data frame.
Full Join returns all rows from both data frames, including the rows that do not have matching values in the other data frame. Any rows that do not have matching values in the other data frame will be completed with NA.
Cross Join returns the Cartesian product of the two data frames, which means it returns every possible combination of rows from the two data frames.

In addition, we cover the following two scenarios which can be applied to extend any of the above:

Multiple Joins where we cover how to join more than 2 data frames and use different join types.
Selecting Columns where we cover how to select a subset of the columns of the data frames that are to be joined.

This section is complemented by:

Join Columns: Specify the columns we wish to use in the join condition.
Join Special Cases: We cover different challenges we might face when performing a join operation.

Inner Join

We wish to join two data frames based on the values of common columns and return only rows that meet the join condition and a union of the columns of both tables.

In this example, we wish to return only rows where the values of col_1 are the same in both df_1 and df_2.

df = df_1 %>% inner_join(df_2)

Here is how this works:

We use the inner_join() function from the dplyr package to combine df_1 and df_2 based on the common column, which here is col_1.
By default, inner_join() will join the two data frames on all the columns with the same name in both data frames. See Join Columns to see how to change this behavior.
The output data frame df will contain the rows that have the same value for col_1 in both df_1 and df_2 and all columns in both data frames with only one col_1.

We wish to join two data frames based on the values of the common columns and return all rows from the left data frame and the rows that meet the join condition from the right data frame and a union of the columns of both tables.

In this example, we wish to return all rows from df_1 and only rows from df_2 where the values of col_1 are the same in both df_1 and df_2.

df = df_1 %>% left_join(df_2)

Here is how this works:

This works similarly to the Inner Join scenario above except we use the left_join() function from the dplyr package.
Rows coming from df_1 that don’t match the join condition will have NA for all columns coming from df_2.
The output data frame df will contain all rows from df_1 and rows from df_2 that have the same value for col_1 in both df_1 and df_2. All columns coming from df_2 will have NA for rows of df_1 that do not match the join condition.

Right Join

We wish to join two data frames based on the values of the common columns and return all rows from the right data frame and the rows that meet the join condition from the left data frame and a union of the columns of both tables.

In this example, we wish to return all rows from df_2 and only rows from df_1 where the values of col_1 are the same in both df_1 and df_2.

df = df_1 %>% right_join(df_2)

Here is how this works:

This works similarly to the Left Join scenario above except we use the right_join() function from the dplyr package.
Rows coming from df_2 that don’t match the join condition will have NA for all columns coming from df_1.
The output data frame df will contain all rows from df_2 and rows from df_1 that have the same value for col_1 in both df_1 and df_2 . All columns coming from df_1 will have NA for rows that do not match the join condition.
A right join is just a swapped left join and the only difference is the order of the columns in the output data frame. If the order of the columns doesn't matter, we recommend to always using a left join as it is easier to understand.

Full Join

We wish to join two data frames based on the common columns and return all rows from both data frames, including the rows that do not have matching values in the other data frame. Any rows that do not have matching values in the other data frame will be filled with NA.

A full join is useful when we want to combine data from two data frames and retrieve all rows where the intersection is shown once plus the difference of both data frames

In this example, we wish to return all rows from df_1 and df_2 and combine rows that have matching values for col_1 in both data frames.

df = df_1 %>% full_join(df_2)

Here is how this works:

This works similarly to the Left Join scenario above except we use the full_join() function from the dplyr package.
Rows that don’t match the join condition will have NA for all columns coming from the other data frame.
The output data frame df will contain all rows from df_1 and df_2 where rows that have the same value for col_1 in both df_1 and df_2 are combined.

Cross Join

We wish to join two data frames and return all possible combinations of rows from the two data frames. The number of rows in the result data frame is equal to the number of rows in the first data frame multiplied by the number of rows in the second data frame. For example, we might use a cross-join to combine a data frame of products with a data frame of price-ranges to create a table of every possible product-price-range combination.

In this example, we wish to return all possible combinations of rows from df_1 and df_2.

df = df_1 %>% full_join(df_2, by = character())

Here is how this works:

This works similarly to the Full Join scenario above except we by = character() to do a cross join.
The output data frame df will contain all possible combinations of rows from df_1 and df_2 and all columns from both data frames. The number of rows will be 4 * 4 = 16.
Since col_1 exists in both data frames, the output data frame will have two columns col_1_x ( coming from df_1) and col_1_y (coming from df_2) to resolve the duplicate names.

Multiple Joins

We wish to join more than two data frames using one of the above-mentioned join types.

In this example, we wish to join df_1, df_2, and df_3 based on the common column col_1 as follows

Return rows that match in df_1 and df_2 (inner join) and
enrich the rows from the inner join of df_1 and df_2 with rows from df_3 that match the join condition (left join)

library(dplyr)

df_1 = data.frame(col_1 = c(1, 2, 3, 4),
                  col_2 = c('A', 'B', 'C', 'D'),
                  col_3 = c(25, 30, 35, 40))

df_2 = data.frame(col_1 = c(2, 3, 4, 5),
                  col_4 = c('W', 'X', 'Y', 'Z'))

df_3 = data.frame(col_1 = c(3, 4, 5, 6),
                  col_5 = c('10', '20', '30', '40'))

df = df_1 %>% inner_join(df_2) %>% left_join(df_3)

Here is how this works:

This works similarly to the Inner Join scenario above with an additional join operation.
The output data frame df will contain all rows that have the same value for col_1 in df_1 and df_2 and a union of the columns of all three data frames df_1, df_2, and df_3.

Selecting Columns

We wish to select the columns we wish to keep from the data frames being joined.

In this example, we wish to select col_1, col_2, and col_4.

df = df_1 %>%
  select(col_1, col_2) %>%
  inner_join(df_2 %>% select(col_1, col_4), by = "col_1")

Here is how this works:

select() selects columns by name. See Selecting By Name for coverage of selecting.
The output data frame df will contain all rows that have the same value for col_1 in df_1 and df_2 and the three selected columns (col_1, col_2, and col_4).

Extension: Keep Join Columns

We wish to keep the join columns in the output data frame.

df = df_1 %>% inner_join(df_2, keep = TRUE)

Here is how this works:

We set keep to TURE to keep join column in the output.
The output data frame df will contain the rows that have the same value for col_1 in both df_1 and df_2 and all columns in both data frames with col_1 coming from df_1 as col_1.x and col_2 from df_2 as col_1.y.

Optima.io Reference beta