What is the difference between sort() and orderBy() in Spark?

Published Jan 7, 2022

What is the difference between sort() and orderBy() in the Spark API?

SORT BY and ORDER BY are different in Spark SQL

The SORT BY clause returns the result rows sorted within each partition in the user-specified order. When there is more than one partition, SORT BY may return a result that is only partially ordered.

The SORT BY clause can be found in the Spark SQL documentation here.

The ORDER BY clause returns the result rows sorted in the user-specified order. Unlike the SORT BY clause, it guarantees a total order in the output.

The ORDER BY clause can be found in the Spark SQL documentation here.
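To make the difference concrete, here is a minimal plain-Python sketch. It is not Spark code: the nested lists stand in for partitions, and the helper names sort_by and order_by are hypothetical, chosen only to mirror the two clauses.

```python
# Plain-Python model (not Spark): each inner list stands in for one partition.
partitions = [[5, 1, 4], [2, 3, 0]]


def sort_by(parts):
    """Model of SORT BY: sort each partition independently of the others."""
    return [sorted(p) for p in parts]


def order_by(parts):
    """Model of ORDER BY: sort all rows across every partition."""
    return sorted(row for p in parts for row in p)


# Read the partitions back one after another, as a client would see them.
sort_by_rows = [row for p in sort_by(partitions) for row in p]
order_by_rows = order_by(partitions)

print(sort_by_rows)   # [1, 4, 5, 0, 2, 3] -> ordered within each partition only
print(order_by_rows)  # [0, 1, 2, 3, 4, 5] -> totally ordered
```

With a single partition the two results would coincide, which is why the difference only shows up when there is more than one partition.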

sort() and orderBy() are the same in the DataFrame API

So, if SORT BY and ORDER BY are different in Spark SQL, how are they the same in the Spark DataFrame API?

Let’s first look at how the two methods are defined in three of the languages Spark supports.

  • In Python, orderBy() is an alias of sort(), as seen in the source here.
  • In Scala, orderBy() is an alias of sort(), as seen in the source here.
  • In Java, orderBy() is an alias of sort(), as seen in the documentation here.

sort() and orderBy() both produce a total ordering of the dataset, like ORDER BY.

sortWithinPartitions() sorts the rows within each partition only, like SORT BY.