Joining data frames is a common task in data analysis and manipulation, but it can also present some special cases that we need to be mindful of and handle as appropriate for the purpose at hand.
In this section, we will discuss some of these special cases, including:
This section is complemented by:
We wish to rename columns that have the same name in both data frames being joined.
Suffix
We wish to add a suffix per data frame to columns with the same name in both data frames.
In this example, we wish to rename col_2
coming from df_1
to col_2_1
and rename col_2
coming
from df_2
to col_2_2
.
df = df_1%>% inner_join(df_2, by="col_1", suffix=c("_1", "_2"))
Here is how this works:
inner_join()
function from the dplyr
package to combine df_1
and df_2
based on
the common column col_1
.inner_join()
function will use .x
and .y
as suffixes to resolve duplicate column
names. We can change these suffixes by passing a vector of suffixes ("_1", "_2")
in this
example.df
will contain the rows that have the same value for col_1
in
both df_1
and df_2
with three columns col_1
, col_2_1
, and col_2_2
.Rename
We wish to rename columns with the same name in both data frames.
In this example, we wish to rename col_3
coming from df_1
to col_2
and rename col_4
coming
from df_2
to col_3
and col_5
to col_4
.
df = df_1%>%
rename(col_2 = col_3)%>%
inner_join(df_2%>% rename(col_3 = col_4, col_4 = col_5), by = "col_1")
Here is how this works:
inner_join()
function from the dplyr
package to combine df_1
and df_2
based on
the common columns col_1
and col_2
.rename()
from dplyr
to change column names from their current values to
the desired values.rename()
expects a mapping between the current column names and the desired
column names of the form desired_name = current_name
which in this case is col_3 = col_4
for df_1
and col_3 = col_4, col_4 = col_5
for df_2.
rename()
will retain their current names.Alternative: Rename After Join
df = df_1 %>%
inner_join(df_2, by = "col_1") %>%
rename(col_2 = col_3.x, col_3 = col_3.y, col_4 = col_5)
Here is how this works:
dplyr
join functions automatically add .x
for columns coming from df_1
and .y
suffix
for columns coming from df2
if they have the same name.We wish to join two data frames and handle duplicate values in the join columns by removing them. If the desired output is a Cartesian product, then there is no need to remove duplicate values.
In this example, we wish to drop rows with duplicate values in col_1
in df_1
and then perform
the join operation.
df = df_1%>%
distinct(col_1, .keep_all = TRUE)%>%
inner_join(df_2)
Here is how this works:
distinct()
function to drop rows with duplicate values from the column col_1
of the
data frame df_1
. See Uniqueness..keep_all = TRUE
to keep all columns from df_1
after selecting distinct rows based
on col_1
.col_1
in this example.df
will contain the rows that have the same value for col_1
in
both df_1
and df_2
(except the second row from df_1
) and all columns in both data frames
with only one col_1
.We wish to join two data frames with missing values in the join columns. NA
values should not be
matched because NA
values are not comparable. That said, we should be mindful that ignoring NA
values will have different outcomes based on the join type.
NA
values for the columns from the right data
frame.NA
values for the columns from the left data
frame.NA
values for the columns from the other data frame.In this example, we wish to ignore rows with NA
values in the join column col_1.
df = df_1%>% inner_join(df_2, na_matches = "never")
Here is how this works:
inner_join()
function from the dplyr
package to combine df_1
and df_2
based on
the common column col_1
.inner_join()
treats two NA or NaN values as equal. We set na_matches=”never”
to
specify that two NA
or NAN
values should be treated as different.df
will contain the rows that have the same value for col_1
in
both df_1
and df_2
excluding any rows with NA
values in col_1
and all columns in both data
frames with only one col_1
.We wish to join two data frames on a common column with a different data type.
In this example, we wish to join df_1
and df_2
on col_1
which is of type integer in df_1
and
of type string in df_2
.
df = df_1%>% inner_join(df_2%>% mutate(col_1 = as.integer(col_1)))
Here is how this works:
We cast col_1
to integer before joining using as.integer(col_1)
function.
See Type Casting.