pyspark.sql.DataFrame.join
- DataFrame.join(other, on=None, how=None)
Joins with another DataFrame, using the given join expression.
New in version 1.3.0.
Changed in version 3.4.0: Supports Spark Connect.
- Parameters
- other : DataFrame
  Right side of the join.
- on : str, list or Column, optional
  A string for the join column name, a list of column names, a join expression (Column), or a list of Columns. If on is a string or a list of strings indicating the name of the join column(s), the column(s) must exist on both sides, and this performs an equi-join.
- how : str, optional
  Default inner. Must be one of: inner, cross, outer, full, fullouter, full_outer, left, leftouter, left_outer, right, rightouter, right_outer, semi, leftsemi, left_semi, anti, leftanti and left_anti.
- Returns
- DataFrame
  Joined DataFrame.
Examples
The following examples demonstrate various join types among df, df2, and df3.
>>> import pyspark.sql.functions as sf
>>> from pyspark.sql import Row
>>> df = spark.createDataFrame([Row(name="Alice", age=2), Row(name="Bob", age=5)])
>>> df2 = spark.createDataFrame([Row(name="Tom", height=80), Row(name="Bob", height=85)])
>>> df3 = spark.createDataFrame([
...     Row(name="Alice", age=10, height=80),
...     Row(name="Bob", age=5, height=None),
...     Row(name="Tom", age=None, height=None),
...     Row(name=None, age=None, height=None),
... ])
Inner join on columns (default)
>>> df.join(df2, "name").show() +----+---+------+ |name|age|height| +----+---+------+ | Bob| 5| 85| +----+---+------+
>>> df.join(df3, ["name", "age"]).show() +----+---+------+ |name|age|height| +----+---+------+ | Bob| 5| NULL| +----+---+------+
Outer join on a single column with an explicit join condition.
When the join condition is explicitly stated as df.name == df2.name, this will produce all records where the names match, as well as those that don't (since it's an outer join). If there are names in df2 that are not present in df, they will appear with NULL in the name column of df, and vice versa for df2.
>>> joined = df.join(df2, df.name == df2.name, "outer").sort(sf.desc(df.name))
>>> joined.show()
+-----+----+----+------+
| name| age|name|height|
+-----+----+----+------+
|  Bob|   5| Bob|    85|
|Alice|   2|NULL|  NULL|
| NULL|NULL| Tom|    80|
+-----+----+----+------+
To unambiguously select output columns, specify the dataframe along with the column name:
>>> joined.select(df.name, df2.height).show()
+-----+------+
| name|height|
+-----+------+
|  Bob|    85|
|Alice|  NULL|
| NULL|    80|
+-----+------+
However, in self-joins, direct column references can cause ambiguity:
>>> df.join(df, df.name == df.name, "outer").select(df.name).show()
Traceback (most recent call last):
...
pyspark.errors.exceptions.captured.AnalysisException: Column name#0 are ambiguous...
A better approach is to assign aliases to the dataframes, and then reference the output columns from the join operation using these aliases:
>>> df.alias("a").join( ... df.alias("b"), sf.col("a.name") == sf.col("b.name"), "outer" ... ).sort(sf.desc("a.name")).select("a.name", "b.age").show() +-----+---+ | name|age| +-----+---+ | Bob| 5| |Alice| 2| +-----+---+
Outer join on a single column with implicit join condition using column name
When you provide the column name directly as the join condition, Spark will treat both name columns as one, and will not produce separate columns for df.name and df2.name. This avoids having duplicate columns in the output.
>>> df.join(df2, "name", "outer").sort(sf.desc("name")).show() +-----+----+------+ | name| age|height| +-----+----+------+ | Tom|NULL| 80| | Bob| 5| 85| |Alice| 2| NULL| +-----+----+------+
Outer join on multiple columns
>>> df.join(df3, ["name", "age"], "outer").sort("name", "age").show() +-----+----+------+ | name| age|height| +-----+----+------+ | NULL|NULL| NULL| |Alice| 2| NULL| |Alice| 10| 80| | Bob| 5| NULL| | Tom|NULL| NULL| +-----+----+------+
Left outer join on columns
>>> df.join(df2, "name", "left_outer").show() +-----+---+------+ | name|age|height| +-----+---+------+ |Alice| 2| NULL| | Bob| 5| 85| +-----+---+------+
Right outer join on columns
>>> df.join(df2, "name", "right_outer").show() +----+----+------+ |name| age|height| +----+----+------+ | Tom|NULL| 80| | Bob| 5| 85| +----+----+------+
Left semi join on columns
>>> df.join(df2, "name", "left_semi").show() +----+---+ |name|age| +----+---+ | Bob| 5| +----+---+
Left anti join on columns
>>> df.join(df2, "name", "left_anti").show() +-----+---+ | name|age| +-----+---+ |Alice| 2| +-----+---+