Joining data frames is a common task in data analysis and manipulation, but it can also present some special cases that we need to be mindful of and handle as appropriate for the purpose at hand.
In this section, we will discuss some of these special cases, including:
This section is complemented by:
We wish to rename columns that have the same name in both data frames being joined.
Suffix
We wish to add a suffix per data frame to columns with the same name in both data frames.
In this example, we wish to rename col_2
coming from df_1
to col_2_1
and rename col_2
coming
from df_2
to col_2_2
.
df = df_1.merge(df_2, how='inner', on="col_1", suffixes=("_1", "_2"))
Here is how this works:
merge()
function in pandas to combine df_1
and df_2
based on the common
column col_1
.inner
as the how
argument.merge()
will join the two data frames on all the columns with the same name in both
data frames. Here we specify that we wish to join on col_1
only by using on="col_1"
.
See Join Columns for coverage
of different scenarios.merge()
function will use _x
and _y
as suffixes to resolve duplicate column
names. We can change these suffixes by passing a tuple of suffixes ("_1", "_2")
in this
example.df
will contain the rows that have the same value for col_1
in
both df_1
and df_2
with three columns col_1
, col_2_1
, and col_2_2
.Rename
We wish to rename columns with the same name in both data frames.
In this example, we wish to rename col_3
coming from df_1
to col_2
and rename col_4
coming
from df_2
to col_3
and col_5
to col_4
.
df = df_1.rename(columns={'col_3': 'col_2'}). \
merge(df_2.rename(columns={'col_4': 'col_3', 'col_5': 'col_4'}),
how='inner')
Here is how this works:
merge()
function in pandas to combine df_1
and df_2
based on the common
column col_1
.inner
as the how
argument.("_1", "_2")
in this example.rename()
to change column names from their current values to the
desired values.rename()
expects a Python dictionary mapping between the current column
names (as dictionary keys) and desired column names (as dictionary values);
i.e. {current_name : desired_name}
. The name-mapping dictionary in this case
is {'col_a': 'col_1', 'col_b': 'col_2'}
.Alternative: Rename After Join
df = df_1.merge(df_2, how='inner', on='col_1').rename(
columns={'col_4_x': 'col_2', 'col_4_y': 'col_3', 'col_5': 'col_4'})
Here is how this works:
pandas
merge function automatically add _x
for columns coming from df_1
and _y
suffix
for columns coming from df2
if they have the same name.We wish to join two data frames and handle duplicate values in the join columns by removing them. If the desired output is a Cartesian product, then there is no need to remove duplicate values.
In this example, we wish to drop duplicates in df_1
and then perform the join operation.
df = df_1.drop_duplicates(subset=["col_1"]).merge(df_2, how='inner')
Here is how this works:
drop_duplicates()
function to drop duplicate rows in df_1
.col_1
in this example.df
will contain the rows that have the same value for col_1
in
both df_1
and df_2
(except second row from df_1
) and all columns in both data frames with
only one col_1
.We wish to join two data frames with missing values in the join columns. NA
values should not be
matched because NA
values are not comparable. Unfortunately, Pandas has a bug in which this is not
the current behavior where NA
values will be matched. We describe the needed step to get the
expected output based on the join type.
Inner Join
We wish to perform an inner join between two data frames with missing values in the join column. Any rows with missing values in the join column will be excluded which means that we need to drop NA values from either data frame.
In this example, we need to drop rows with NA
values in the join column col_1
.
df = df_1.dropna(subset=["col_1"]).merge(df_2, how='inner')
Here is how this works:
merge()
function in pandas to combine df_1
and df_2
based on the common
column col_1
.inner
as the how
argument.merge()
will join the two data frames on all the columns with the same name in both
data frames. See Join Columns
for coverage of join columns.NA
value in col_1
. We use dropna()
function to drop
rows with missing values, and we pass col_1
as the subset of columns we wish to use.df
will contain the rows that have the same value for col_1
in
both df_1
and df_2
excluding any rows with NA
values in col_1
and all columns in both data
frames with only one col_1
.Left Join
We wish to perform a left join between two data frames with missing values in the join column. The
resulting data frame will include all rows from the left data frame, along with any matching rows
from the right data frame. Rows with missing values in the join column will be included in the
resulting data frame and will have NA
values for the columns from the right data frame.
In this example, we need to drop rows with NA
in the values in the join column col_1
in df_2
.
df = df_1.merge(df_2.dropna(subset=["col_1"]), how='left')
Here is how this works:
NA
values in the join from
the right data frame, and we use a left join instead of an inner join.NA
values in the join column from
the left data frame.Full Join
We wish to perform a full join between two data frames with missing values in the join column. The
resulting data frame will include all rows from both data frames. Rows with no matching or missing
values in the join column will be included in the resulting data frame and will have NA
values for
the columns from the other data frame.
df = df_1.dropna(subset=["col_1"]).merge(df_2.dropna(subset=["col_1"]),
how='outer')
df = pd.concat([df,
df_1.loc[df_1['col_1'].isnull()],
df_2.loc[df_2['col_1'].isnull()]
])
Here is how this works:
merge()
function to perform a full join between df_1
and df_2
.NA
values on col_1
from both data frames. Rows that don’t match the
join condition will have NA
for all columns coming from the other data frame.NA
values in col_1
to df using concat()
function.
See Binding for a coverage of
binding.col_1
using df_1.loc[df_1['col_1'].isnull()]
.
See Filtering for a coverage of filtering.df
will contain all rows from df_1
and df_2
where rows that have the
same value for col_1
in both df_1
and df_2
are combined. Rows that don’t match the join
condition will have NA
for all columns coming from the other data frame.We wish to join two data frames on a common column with a different data type.
In this example we wish to join df_1 and df_2 on col_1 which is of type integer in df_1
and of
type string in df_2
.
df = df_1.merge(df_2.astype({'col_1': int}), how='inner', on='col_1')
Here is how this works:
We cast col_1
to integer before joining using astype()
function.
See Type Casting.