How to Get Distinct Combinations of Multiple Columns in a PySpark DataFrame

Published Jan 5, 2022  ∙  Updated May 2, 2022

How can we get all unique combinations of multiple columns in a PySpark DataFrame?

Suppose we have a DataFrame df with columns col1 and col2.

We can easily return all distinct values for a single column using distinct().

# Returns a list of Row objects
df.select('col1').distinct().collect()
# OR, to get a flat list of plain values instead of Row objects
df.select('col1').distinct().rdd.map(lambda r: r[0]).collect()

How can we get only distinct pairs of values in these two columns?

Get distinct pairs

We can simply pass the second column name as an additional argument to select(), then call distinct() on the result.

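# Returns a list of Row objects, one per distinct (col1, col2) pair
df.select('col1','col2').distinct().collect()
# OR, to get a list of plain (col1, col2) tuples
df.select('col1','col2').distinct().rdd.map(lambda r: (r[0], r[1])).collect()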

Get distinct combinations for all columns

We can also get the unique combinations across all columns in the DataFrame by selecting every column with the asterisk *.

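# Returns a list of Row objects, one per distinct row
df.select('*').distinct().collect()
# OR, to get a list of plain tuples containing every column
df.select('*').distinct().rdd.map(lambda r: tuple(r)).collect()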