How to Get Distinct Combinations of Multiple Columns in a PySpark DataFrame

Published Jan 5, 2022  ∙  Updated May 2, 2022

How can we get all unique combinations of multiple columns in a PySpark DataFrame?

Suppose we have a DataFrame df with columns col1 and col2.

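We can easily return all distinct values for a single column by selecting it and calling distinct(). The first form below returns a list of Row objects; the second unwraps each Row into a plain value.

df.select('col1').distinct().collect()
# OR
df.select('col1').distinct().rdd.map(lambda r: r[0]).collect()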

How can we get only distinct pairs of values in these two columns?

Get distinct pairs

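We can simply pass the second column name as an additional argument to select() before calling distinct(). In the second form, each Row is mapped to a tuple so that both values of the pair are kept.

df.select('col1', 'col2').distinct().collect()
# OR
df.select('col1', 'col2').distinct().rdd.map(lambda r: (r[0], r[1])).collect()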

Get distinct combinations for all columns

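We can also get the unique combinations for all columns in the DataFrame by passing the asterisk * to select(). In the second form, each Row is converted to a tuple of all its values.

df.select('*').distinct().collect()
# OR
df.select('*').distinct().rdd.map(lambda r: tuple(r)).collect()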