pyspark.sql.DataFrame.join
- DataFrame.join(other, on=None, how=None)
Joins with another DataFrame, using the given join expression.
New in version 1.3.0.
Changed in version 3.4.0: Supports Spark Connect.
- Parameters
- other : DataFrame
  Right side of the join.
- on : str, list or Column, optional
  A string for the join column name, a list of column names, a join expression (Column), or a list of Columns. If on is a string or a list of strings indicating the name of the join column(s), the column(s) must exist on both sides, and this performs an equi-join.
- how : str, optional
  Default inner. Must be one of: inner, cross, outer, full, fullouter, full_outer, left, leftouter, left_outer, right, rightouter, right_outer, semi, leftsemi, left_semi, anti, leftanti and left_anti.
- Returns
- DataFrame
  Joined DataFrame.
Examples
The following examples demonstrate various join types among df, df2, and df3.
>>> import pyspark.sql.functions as sf
>>> from pyspark.sql import Row
>>> df = spark.createDataFrame([Row(name="Alice", age=2), Row(name="Bob", age=5)])
>>> df2 = spark.createDataFrame([Row(name="Tom", height=80), Row(name="Bob", height=85)])
>>> df3 = spark.createDataFrame([
...     Row(name="Alice", age=10, height=80),
...     Row(name="Bob", age=5, height=None),
...     Row(name="Tom", age=None, height=None),
...     Row(name=None, age=None, height=None),
... ])
Inner join on columns (default)
>>> df.join(df2, "name").show() +----+---+------+ |name|age|height| +----+---+------+ | Bob| 5| 85| +----+---+------+
>>> df.join(df3, ["name", "age"]).show() +----+---+------+ |name|age|height| +----+---+------+ | Bob| 5| NULL| +----+---+------+
Outer join on a single column with an explicit join condition.
When the join condition is explicitly stated as df.name == df2.name, this will produce all records where the names match, as well as those that don't (since it's an outer join). If there are names in df2 that are not present in df, they will appear with NULL in the name column of df, and vice versa for df2.
>>> joined = df.join(df2, df.name == df2.name, "outer").sort(sf.desc(df.name))
>>> joined.show()
+-----+----+----+------+
| name| age|name|height|
+-----+----+----+------+
|  Bob|   5| Bob|    85|
|Alice|   2|NULL|  NULL|
| NULL|NULL| Tom|    80|
+-----+----+----+------+
To unambiguously select output columns, specify the dataframe along with the column name:
>>> joined.select(df.name, df2.height).show()
+-----+------+
| name|height|
+-----+------+
|  Bob|    85|
|Alice|  NULL|
| NULL|    80|
+-----+------+
However, in self-joins, direct column references can cause ambiguity:
>>> df.join(df, df.name == df.name, "outer").select(df.name).show()
Traceback (most recent call last):
...
pyspark.errors.exceptions.captured.AnalysisException: Column name#0 are ambiguous...
A better approach is to assign aliases to the dataframes, and then reference the output columns from the join operation using these aliases:
>>> df.alias("a").join( ... df.alias("b"), sf.col("a.name") == sf.col("b.name"), "outer" ... ).sort(sf.desc("a.name")).select("a.name", "b.age").show() +-----+---+ | name|age| +-----+---+ | Bob| 5| |Alice| 2| +-----+---+
Outer join on a single column with implicit join condition using column name
When you provide the column name directly as the join condition, Spark will treat both name columns as one, and will not produce separate columns for df.name and df2.name. This avoids having duplicate columns in the output.
>>> df.join(df2, "name", "outer").sort(sf.desc("name")).show() +-----+----+------+ | name| age|height| +-----+----+------+ | Tom|NULL| 80| | Bob| 5| 85| |Alice| 2| NULL| +-----+----+------+
Outer join on multiple columns
>>> df.join(df3, ["name", "age"], "outer").sort("name", "age").show() +-----+----+------+ | name| age|height| +-----+----+------+ | NULL|NULL| NULL| |Alice| 2| NULL| |Alice| 10| 80| | Bob| 5| NULL| | Tom|NULL| NULL| +-----+----+------+
Left outer join on columns
>>> df.join(df2, "name", "left_outer").show() +-----+---+------+ | name|age|height| +-----+---+------+ |Alice| 2| NULL| | Bob| 5| 85| +-----+---+------+
Right outer join on columns
>>> df.join(df2, "name", "right_outer").show() +----+----+------+ |name| age|height| +----+----+------+ | Tom|NULL| 80| | Bob| 5| 85| +----+----+------+
Left semi join on columns
>>> df.join(df2, "name", "left_semi").show() +----+---+ |name|age| +----+---+ | Bob| 5| +----+---+
Left anti join on columns
>>> df.join(df2, "name", "left_anti").show() +-----+---+ | name|age| +-----+---+ |Alice| 2| +-----+---+