How to Remove Duplicate Columns on Join in a Spark DataFrame

Published Jan 18, 2022  ∙  Updated May 2, 2022

How can we perform a join between two Spark DataFrames without any duplicate columns?

Example scenario

Suppose we have two DataFrames, df1 and df2, that both have a column named col.

We want to join df1 and df2 on col, so we might write the join with a column expression. The result then contains col twice, once from each DataFrame:

joined = df1.join(df2, df1.col == df2.col)

Join DataFrames without duplicate columns

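To prevent duplicate columns, we can pass the join column by name, either as a list of column names or as a single string. Spark then treats the column as shared and keeps only one copy of it in the result.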

joined = df1.join(df2, ["col"])
# OR
joined = df1.join(df2, "col")