Join Special Cases

Joining tables is a common task in data analysis and manipulation, but it can also present some special cases that we need to be mindful of and handle as appropriate for the purpose at hand.

In this section, we will discuss some of these special cases, including:

Duplicate Column Names: Join two tables with duplicate column names.
Duplicate Join Values: Join two tables with duplicate values in the join columns.
Missing Join Values: Join two tables with missing values in the join columns.
Different Join Data Types: Join two tables on join columns with different data types.

This section is complemented by:

Join Types: where we cover how to the different join types and their meaning.
Join Columns: Specify the columns we wish to use in the join condition.

Duplicate Column Names

We wish to rename columns that have the same name in both tables being joined.

In this example, we wish to rename col_2 coming from table_1 to col_2_x and rename col_2 coming from table_2 to col_2_y.

SELECT t1.col_1,
       t1.col_2 AS col_2_x,
       t1.col_3,
       t2.col_2 AS col_2_y

FROM table_1 AS t1
         INNER JOIN table_2 AS t2 ON
    t1.col_1 = t2.col_1;

Here is how this works:

We use the INNER JOIN operation in SQL to combine table_1 and table_2 based on the common column col_1.
We give each table an alias using AS, in this example table_1 alias is t1 and table_2 alias is t_2. We recommend using table alias always when performing a join, so we can easily select the columns we wish to keep from each table and write clear join conditions.
We pass the join condition to the ON clause, in this example t1.col_1 = t2.col_1;. See Join Columns to learn how to change this behavior.
By default, SQL will use _1 and _2 as suffixes to resolve duplicate column names if we select all columns using .* notation. See Selecting By Name
We specify the list of columns we wish to keep in the SELECT statement, and we rename the columns using AS alias. See Renaming.
The output table will contain the rows that have the same value for col_1 in both table_1 and table_2 and the selected columns after renaming.
The same applies to other join types. See Join Types.

Duplicate Join Values

We wish to join two tables and handle duplicate values in the join columns by removing them. If the desired output is a Cartesian product, then there is no need to remove duplicate values.

In this example, we wish to drop rows with duplicate values in col_1 in table_1 and then perform the join operation.

WITH table_1_distinct AS
         (SELECT *
          FROM table_1
          QUALIFY ROW_NUMBER() OVER (PARTITION BY col_1) = 1)


SELECT t1.col_1,
       t1.col_2,
       t1.col_3,
       t2.col_4
FROM table_1_distinct AS t1
         INNER JOIN table_2 AS t2 ON t1.col_1 = t2.col_1;

Here is how this works:

We use the QUALIFY ROW_NUMBER() function to drop rows with duplicate values from the column col_1 of the table table_1. See Uniqueness.
The output table will contain the rows that have the same value for col_1 in both table_1 and table_2 after dropping rows with duplicate values in table_1.

We wish to join two tables with missing values in the join columns. NULL values should not be matched because NULL values are not comparable. That said, we should be mindful that ignoring NULL values will have different outcomes based on the join type.

Inner Join: Any rows with missing values in the join column will be excluded.
Left Join: The resulting table will include all rows from the left table, along with any matching rows from the right table. Rows with missing values in the join column will be included in the resulting table and will have NULL values for the columns from the right table.
Right Join: Similar to left join but rows with missing values in the join column will be included in the resulting table and will have NULL values for the columns from the left table.
Full Join: The resulting table will include all rows from both tables. Rows with no matching or missing values in the join column will be included in the resulting table and will have NULL values for the columns from the other table.

This is the default behoving in SQL and can be changed by adding a specific join condition to check for NULL values.

In this example, we wish to join table_1 and table_2 based on col_1 and consider NULL values as equal.

SELECT t1.col_1,
       t1.col_2,
       t1.col_3,
       t2.col_4
FROM table_1 AS t1
         INNER JOIN table_2 AS t2 ON
            t1.col_1 = t2.col_1 OR
            (t1.col_1 IS NULL AND t2.col_1 IS NULL);

Here is how this works:

We use the INNER JOIN operation in SQL to combine table_1 and table_2 based on the common column col_1.
Rows with an equal value in both tables for col_1 will be combined. Sice NULL will be ignored in SQL, we use an OR condition to check if col_1 is NULL in both tables (t1.col_1 IS NULL AND t2.col_1 IS NULL).
The output table will contain the rows that have the same value for col_1 in both table_1 and table_2 or rows that have NULL value in col_1 in both tables. Note that this will generate a cartesian product between rows with NULL values in col_1.

Different Join Data Types

We wish to join two tables on a column with a different data type.

In this example, we wish to join table_1 and table_2 on col_1 which is of type integer in table_1 and of type string in table_2.

SELECT t1.col_1,
       t1.col_2,
       t1.col_3,
       t2.col_4
FROM table_1 AS t1
         INNER JOIN table_2 AS t2 ON
    t1.col_1 = SAFE_CAST(t2.col_1 AS int64)

Here is how this works:

We cast col_1 to integer before joining using SAFE_CAST(t2.col_1 AS int64) function. See Type Casting.

Optima.io Reference beta

Join Special Cases

Duplicate Column Names

Duplicate Join Values

Missing Join Values

Different Join Data Types