How to Create a New Column From Another Column Based on Multiple Conditions in PySpark

Published Jan 8, 2022

How can we create a column based on another column in PySpark with multiple conditions?

For instance, suppose we have a PySpark DataFrame df with a time column, containing an integer representing the hour of the day from 0 to 23.

We want to create a new column day_or_night that follows these criteria:

  • If time is between [0, 8], then day_or_night is Night
  • If time is between [9, 18], then day_or_night is Day
  • If time is between [19, 23], then day_or_night is Night

This can be simplified down to: everything between 9am and 7pm should be considered Day.

Anything else is considered Night.

Create new columns using withColumn()

We can easily create new columns based on other columns using the DataFrame’s withColumn() method.

For example, if the column num is of type double, we can create a new column num_div_10 like so:

df = df.withColumn('num_div_10', df['num'] / 10)

But now, we want to set values for our new column based on certain conditions.

Create conditions using when() and otherwise()

We can create a proper if-then-else structure using when() and otherwise() in PySpark.

The when() method functions as our if statement. The first argument is our condition, and the second argument is the value assigned to the new column when that condition is true.

Any when() method chained after the first when() is essentially an else if statement.

Naturally, otherwise() is our else statement.

from pyspark.sql.functions import when, col
conditions = when(col("time") < 9, "Night") \
    .when(col("time") > 18, "Night") \
    .otherwise("Day")
df = df.withColumn("day_or_night", conditions)

As we can see, when() allows us to chain multiple if statements together.

We can also combine conditions with the OR (|) and AND (&) operators to simplify the logic. Note that each comparison must be wrapped in parentheses, since these operators bind more tightly than comparisons in Python.

from pyspark.sql.functions import when, col
conditions = when((col("time") < 9) | (col("time") > 18), "Night").otherwise("Day")
df = df.withColumn("day_or_night", conditions)

We can further simplify using between() (both lower and upper bounds of between() are inclusive).

from pyspark.sql.functions import when, col
conditions = when(col("time").between(9, 18), "Day").otherwise("Night")
df = df.withColumn("day_or_night", conditions)

Read more about between() in the PySpark documentation.