We wish to join two or more data frames based on a join condition. In this section, we will cover the most common types of joins, including:
NA
.In addition, we cover the following two scenarios which can be applied to extend any of the above:
This section is complemented by:
We wish to join two data frames based on the values of common columns and return only rows that meet the join condition and a union of the columns of both tables.
In this example, we wish to return only rows where the values of col_1
are the same in both df_1
and df_2
.
df = df_1 %>% inner_join(df_2)
Here is how this works:
inner_join()
function from the dplyr
package to combine df_1
and df_2
based on
the common column, which here is col_1
.inner_join()
will join the two data frames on all the columns with the same name in
both data frames. See Join Columns
to see how to change this behavior.df
will contain the rows that have the same value for col_1
in
both df_1
and df_2
and all columns in both data frames with only one col_1
.We wish to join two data frames based on the values of the common columns and return all rows from the left data frame and the rows that meet the join condition from the right data frame and a union of the columns of both tables.
In this example, we wish to return all rows from df_1
and only rows from df_2
where the values
of col_1
are the same in both df_1
and df_2
.
df = df_1 %>% left_join(df_2)
Here is how this works:
left_join()
function from the dplyr
package.df_1
that don’t match the join condition will have NA
for all columns coming
from df_2
.df
will contain all rows from df_1
and rows from df_2
that have the
same value for col_1
in both df_1
and df_2
. All columns coming from df_2
will have NA
for rows of df_1
that do not match the join condition.We wish to join two data frames based on the values of the common columns and return all rows from the right data frame and the rows that meet the join condition from the left data frame and a union of the columns of both tables.
In this example, we wish to return all rows from df_2
and only rows from df_1
where the values
of col_1
are the same in both df_1
and df_2
.
df = df_1 %>% right_join(df_2)
Here is how this works:
right_join()
function from the dplyr
package.df_2
that don’t match the join condition will have NA
for all columns coming
from df_1
.df
will contain all rows from df_2
and rows from df_1
that have the
same value for col_1
in both df_1
and df_2
. All columns coming from df_1
will have NA
for rows that do not match the join condition.We wish to join two data frames based on the common columns and return all rows from both data
frames, including the rows that do not have matching values in the other data frame. Any rows that
do not have matching values in the other data frame will be filled with NA
.
A full join is useful when we want to combine data from two data frames and retrieve all rows where the intersection is shown once plus the difference of both data frames
In this example, we wish to return all rows from df_1
and df_2
and combine rows that have
matching values for col_1
in both data frames.
df = df_1 %>% full_join(df_2)
Here is how this works:
full_join()
function from the dplyr
package.NA
for all columns coming from the other data
frame.df
will contain all rows from df_1
and df_2
where rows that have the
same value for col_1
in both df_1
and df_2
are combined.We wish to join two data frames and return all possible combinations of rows from the two data frames. The number of rows in the result data frame is equal to the number of rows in the first data frame multiplied by the number of rows in the second data frame. For example, we might use a cross-join to combine a data frame of products with a data frame of price-ranges to create a table of every possible product-price-range combination.
In this example, we wish to return all possible combinations of rows from df_1
and df_2
.
df = df_1 %>% full_join(df_2, by = character())
Here is how this works:
by = character()
to
do a cross join.df
will contain all possible combinations of rows from df_1
and df_2
and all columns from both data frames. The number of rows will be 4 * 4 = 16
.col_1
exists in both data frames, the output data frame will have two columns col_1_x
(
coming from df_1
) and col_1_y
(coming from df_2
) to resolve the duplicate names.We wish to join more than two data frames using one of the above-mentioned join types.
In this example, we wish to join df_1
, df_2
, and df_3
based on the common column col_1
as
follows
df_1
and df_2
(inner join) anddf_1
and df_2
with rows from df_3
that match the join
condition (left join)library(dplyr)
df_1 = data.frame(col_1 = c(1, 2, 3, 4),
col_2 = c('A', 'B', 'C', 'D'),
col_3 = c(25, 30, 35, 40))
df_2 = data.frame(col_1 = c(2, 3, 4, 5),
col_4 = c('W', 'X', 'Y', 'Z'))
df_3 = data.frame(col_1 = c(3, 4, 5, 6),
col_5 = c('10', '20', '30', '40'))
df = df_1 %>% inner_join(df_2) %>% left_join(df_3)
Here is how this works:
df
will contain all rows that have the same value for col_1
in df_1
and df_2
and a union of the columns of all three data frames df_1
, df_2
, and df_3
.We wish to select the columns we wish to keep from the data frames being joined.
In this example, we wish to select col_1
, col_2
, and col_4
.
df = df_1 %>%
select(col_1, col_2) %>%
inner_join(df_2 %>% select(col_1, col_4), by = "col_1")
Here is how this works:
select()
selects columns by name.
See Selecting By Name for coverage of selecting.df
will contain all rows that have the same value for col_1
in df_1
and df_2
and the three selected columns (col_1
, col_2
, and col_4
).Extension: Keep Join Columns
We wish to keep the join columns in the output data frame.
df = df_1 %>% inner_join(df_2, keep = TRUE)
Here is how this works:
TURE
to keep join column in the output.df
will contain the rows that have the same value for col_1
in
both df_1
and df_2
and all columns in both data frames with col_1
coming from df_1
as col_1.x
and col_2
from df_2
as col_1.y
.