We wish to join two data frames based on specific columns.
This section is organized as follows:
This section is complemented by:
We wish to join two data frames based on all common columns between them. This is often referred to as a Natural Join.
In this example, we wish to join df_1
and df_2
based on the columns named col_1
and col_2
in
both data frames.
df = df_1 %>% inner_join(df_2)
Here is how this works:
inner_join()
function from the dplyr
package to combine df_1
and df_2
based on
the common columns col_1
and col_2
.dplyr
will join the two data frames on all the columns with
the same name in both data frames.df
will contain the rows that have the same value for both col_1
and col_2
in both df_1
and df_2
and all columns in both data frames with only one col_1
and col_2
.We wish to join two data frames based on one column.
Common Column Name
We wish to join two data frames based on one column that have the same name in the data frames to be joined.
In this example, we wish to join df_1
and df_2
based on a column named col_1
in both
data frames.
df = df_1 %>% inner_join(df_2, by = "col_1")
Here is how this works:
inner_join()
function from the dplyr
package to combine df_1
and df_2
based on
the common column col_1
.inner_join()
will join the two data frames on all the columns with the same name in
both data frames. To change this behavior, we specify the column we wish to join on by passing its
name as a string to the by
argument; which in this example is by=”col_1”
.df
will contain the rows that have the same value for col_1
in
both df_1
and df_2
and all columns in both data frames with only one col_1
.Different Column Name
We wish to join two data frames based on one column that has a different name in the data frames.
In this example, we wish to join df_1
and df_2
based on col_1_x
from df_1
and col_1_y
in df_2
only.
df = df_1 %>% inner_join(df_2, by = c("col_1_x" = "col_1_y"))
Here is how this works:
inner_join()
function from the dplyr
package to combine df_1
and df_2
based on
the col_1_x
and col_1_y
.by = c("col_1_x" = "col_1_y")
to specify the mapping between the names of the column we
wish to join on.df
will contain the rows that have the same value for col_1_x
and col_1_y
and all columns in both data frames except col_1_y
.We wish to join two data frames based on multiple columns.
Common Columns Names
We wish to join two tables based on multiple columns that have the same name in the data frames to be joined.
In this example, we wish to join df_1
and df_2
based on the columns named col_1
and col_2
in both data frames.
df = df_1 %>% inner_join(df_2, by = c("col_1", "col_2"))
Here is how this works:
by
argument.df
will contain the rows that have the same value for col_1
and col_2
in both df_1
and df_2
and all columns in both data frames with one instance of each of col_1
and col_2
.Different Columns Names
We wish to join two data frames based on multiple columns, while each of the join columns has a different name in either data frame.
In this example, we wish to join df_1
and df_2
based on col_1_x
and col_2_x
from df_1
and col_1_y
and col_2_y
in df_2
.
df = df_1 %>% inner_join(df_2, by = c("col_1_x" = "col_1_y",
"col_2_x" = "col_2_y"))
Here is how this works:
by
argument.df
will contain the rows that have the same value for all the specified
join columns. All columns from both data frames will be in the output data frame df
except join
columns in the right data frame which are col_1_y
and col_2_y
in this example.