public class Dataset<T>
extends java.lang.Object
implements org.apache.spark.sql.execution.Queryable, scala.Serializable
Dataset is a strongly typed collection of objects that can be transformed in parallel using functional or relational operations.

A Dataset differs from an RDD in the following ways:

- Internally, a Dataset is represented by a Catalyst logical plan, and the data is stored in an encoded form. This representation allows for additional logical operations and enables many operations (sorting, shuffling, etc.) to be performed without deserializing the data back into objects.
- Creating a Dataset requires an explicit Encoder that can serialize the objects into a binary format. Encoders are also capable of mapping the schema of a given object to the Spark SQL type system. In contrast, RDDs rely on runtime reflection based serialization. Operations that change the type of object stored in the Dataset also need an encoder for the new type.

A Dataset can be thought of as a specialized DataFrame, where the elements map to a specific JVM object type instead of to a generic Row container. A DataFrame can be transformed into a specific Dataset by calling df.as[ElementType]. Similarly, you can transform a strongly-typed Dataset into a generic DataFrame by calling ds.toDF().

COMPATIBILITY NOTE: Long term we plan to make DataFrame extend Dataset[Row]. However, making this change to the class hierarchy would break the function signatures for the existing functional operations (map, flatMap, etc.). As such, this class should be considered a preview of the final API. Changes will be made to the interface after Spark 1.6.
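As an illustration of the conversions described above, here is a minimal sketch in Scala, assuming a Spark 1.6 session where sqlContext and its implicits are available; the Person case class and the sample rows are hypothetical:

```scala
import org.apache.spark.sql.{DataFrame, Dataset}
import sqlContext.implicits._

case class Person(name: String, age: Long)   // hypothetical element type

// An untyped DataFrame with columns `name` and `age`.
val df = Seq(("Alice", 30L), ("Bob", 15L)).toDF("name", "age")

// DataFrame -> Dataset[Person]: columns are matched to the fields of Person by name.
val people: Dataset[Person] = df.as[Person]

// Strongly typed, functional transformation over JVM objects.
val adults: Dataset[Person] = people.filter(_.age >= 18)

// Dataset -> DataFrame: back to generic, Row-based operations.
val adultsDF: DataFrame = adults.toDF()
```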
Modifier and Type | Method and Description |
---|---|
<U> Dataset<U> | as(Encoder<U> evidence$1): Returns a new Dataset where each record has been mapped on to the specified type. |
Dataset<T> | as(java.lang.String alias): Applies a logical alias to this Dataset that can be used to disambiguate columns that have the same name after two Datasets have been joined. |
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder<T> | boundTEncoder(): The encoder where the expressions used to construct an object from an input row have been bound to the ordinals of the given schema. |
Dataset<T> | cache(): Persist this Dataset with the default storage level (MEMORY_AND_DISK). |
Dataset<T> | coalesce(int numPartitions): Returns a new Dataset that has exactly numPartitions partitions. |
java.lang.Object | collect(): Returns an array that contains all the elements in this Dataset. |
java.util.List<T> | collectAsList(): Returns a list that contains all the elements in this Dataset. |
long | count(): Returns the number of elements in the Dataset. |
Dataset<T> | distinct(): Returns a new Dataset that contains only the unique elements of this Dataset. |
void | explain(): Prints the physical plan to the console for debugging purposes. |
void | explain(boolean extended): Prints the plans (logical and physical) to the console for debugging purposes. |
Dataset<T> | filter(FilterFunction<T> func) |
Dataset<T> | filter(scala.Function1<T,java.lang.Object> func) |
T | first(): Returns the first element in this Dataset. |
<U> Dataset<U> | flatMap(FlatMapFunction<T,U> f, Encoder<U> encoder) |
<U> Dataset<U> | flatMap(scala.Function1<T,scala.collection.TraversableOnce<U>> func, Encoder<U> evidence$4) |
void | foreach(ForeachFunction<T> func): (Java-specific) Runs func on each element of this Dataset. |
void | foreach(scala.Function1<T,scala.runtime.BoxedUnit> func): (Scala-specific) Runs func on each element of this Dataset. |
void | foreachPartition(ForeachPartitionFunction<T> func): (Java-specific) Runs func on each partition of this Dataset. |
void | foreachPartition(scala.Function1<scala.collection.Iterator<T>,scala.runtime.BoxedUnit> func): (Scala-specific) Runs func on each partition of this Dataset. |
GroupedDataset<Row,T> | groupBy(Column... cols): Returns a GroupedDataset where the data is grouped by the given Column expressions. |
<K> GroupedDataset<K,T> | groupBy(scala.Function1<T,K> func, Encoder<K> evidence$5): (Scala-specific) Returns a GroupedDataset where the data is grouped by the given key func. |
<K> GroupedDataset<K,T> | groupBy(MapFunction<T,K> func, Encoder<K> encoder): (Java-specific) Returns a GroupedDataset where the data is grouped by the given key func. |
GroupedDataset<Row,T> | groupBy(scala.collection.Seq<Column> cols): Returns a GroupedDataset where the data is grouped by the given Column expressions. |
Dataset<T> | intersect(Dataset<T> other) |
<U> Dataset<scala.Tuple2<T,U>> | joinWith(Dataset<U> other, Column condition): Joins this Dataset with another Dataset using an inner equi-join, returning a Tuple2 for each pair where condition evaluates to true. |
<U> Dataset<scala.Tuple2<T,U>> | joinWith(Dataset<U> other, Column condition, java.lang.String joinType) |
<U> Dataset<U> | map(scala.Function1<T,U> func, Encoder<U> evidence$2): (Scala-specific) Returns a new Dataset that contains the result of applying func to each element. |
<U> Dataset<U> | map(MapFunction<T,U> func, Encoder<U> encoder): (Java-specific) Returns a new Dataset that contains the result of applying func to each element. |
<U> Dataset<U> | mapPartitions(scala.Function1<scala.collection.Iterator<T>,scala.collection.Iterator<U>> func, Encoder<U> evidence$3): (Scala-specific) Returns a new Dataset that contains the result of applying func to each partition. |
<U> Dataset<U> | mapPartitions(MapPartitionsFunction<T,U> f, Encoder<U> encoder): (Java-specific) Returns a new Dataset that contains the result of applying func to each partition. |
Dataset<T> | persist(): Persist this Dataset with the default storage level (MEMORY_AND_DISK). |
Dataset<T> | persist(StorageLevel newLevel): Persist this Dataset with the given storage level. |
void | printSchema(): Prints the schema of the underlying Dataset to the console in a nice tree format. |
org.apache.spark.sql.execution.QueryExecution | queryExecution() |
RDD<T> | rdd(): Converts this Dataset to an RDD. |
T | reduce(scala.Function2<T,T,T> func): (Scala-specific) Reduces the elements of this Dataset using the specified binary function. |
T | reduce(ReduceFunction<T> func): (Java-specific) Reduces the elements of this Dataset using the specified binary function. |
Dataset<T> | repartition(int numPartitions): Returns a new Dataset that has exactly numPartitions partitions. |
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder<T> | resolvedTEncoder(): The encoder for this Dataset that has been resolved to its output schema. |
Dataset<T> | sample(boolean withReplacement, double fraction): Returns a new Dataset by sampling a fraction of records, using a random seed. |
Dataset<T> | sample(boolean withReplacement, double fraction, long seed): Returns a new Dataset by sampling a fraction of records. |
StructType | schema(): Returns the schema of the encoded form of the objects in this Dataset. |
protected DataFrame | select(Column... cols): Returns a new DataFrame by selecting a set of column based expressions. |
protected DataFrame | select(scala.collection.Seq<Column> cols): Returns a new DataFrame by selecting a set of column based expressions. |
<U1> Dataset<U1> | select(TypedColumn<T,U1> c1, Encoder<U1> evidence$6) |
<U1,U2> Dataset<scala.Tuple2<U1,U2>> | select(TypedColumn<T,U1> c1, TypedColumn<T,U2> c2) |
<U1,U2,U3> Dataset<scala.Tuple3<U1,U2,U3>> | select(TypedColumn<T,U1> c1, TypedColumn<T,U2> c2, TypedColumn<T,U3> c3) |
<U1,U2,U3,U4> Dataset<scala.Tuple4<U1,U2,U3,U4>> | select(TypedColumn<T,U1> c1, TypedColumn<T,U2> c2, TypedColumn<T,U3> c3, TypedColumn<T,U4> c4) |
<U1,U2,U3,U4,U5> Dataset<scala.Tuple5<U1,U2,U3,U4,U5>> | select(TypedColumn<T,U1> c1, TypedColumn<T,U2> c2, TypedColumn<T,U3> c3, TypedColumn<T,U4> c4, TypedColumn<T,U5> c5) |
protected Dataset<?> | selectUntyped(scala.collection.Seq<TypedColumn<?,?>> columns): Internal helper function for building typed selects that return tuples. |
void | show(): Displays the top 20 rows of the Dataset in a tabular form. |
void | show(boolean truncate): Displays the top 20 rows of the Dataset in a tabular form. |
void | show(int numRows): Displays the content of this Dataset in a tabular form. |
void | show(int numRows, boolean truncate): Displays the Dataset in a tabular form. |
SQLContext | sqlContext() |
Dataset<T> | subtract(Dataset<T> other): Returns a new Dataset where any elements present in other have been removed. |
java.lang.Object | take(int num): Returns the first num elements of this Dataset as an array. |
java.util.List<T> | takeAsList(int num): Returns the first num elements of this Dataset as a list. |
DataFrame | toDF(): Converts this strongly typed collection of data to a generic DataFrame. |
Dataset<T> | toDS(): Returns this Dataset. |
<U> Dataset<U> | transform(scala.Function1<Dataset<T>,Dataset<U>> t): Concise syntax for chaining custom transformations. |
Dataset<T> | union(Dataset<T> other) |
Dataset<T> | unpersist(): Mark the Dataset as non-persistent, and remove all blocks for it from memory and disk. |
Dataset<T> | unpersist(boolean blocking): Mark the Dataset as non-persistent, and remove all blocks for it from memory and disk. |
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder<T> | unresolvedTEncoder(): An unresolved version of the internal encoder for the type of this Dataset. |
public GroupedDataset<Row,T> groupBy(Column... cols)
Returns a GroupedDataset where the data is grouped by the given Column expressions.
cols - (undocumented)

protected DataFrame select(Column... cols)
Returns a new DataFrame by selecting a set of column based expressions.
df.select($"colA", $"colB" + 1)
cols - (undocumented)

public SQLContext sqlContext()
Specified by: sqlContext in interface org.apache.spark.sql.execution.Queryable

public org.apache.spark.sql.execution.QueryExecution queryExecution()
Specified by: queryExecution in interface org.apache.spark.sql.execution.Queryable
public org.apache.spark.sql.catalyst.encoders.ExpressionEncoder<T> unresolvedTEncoder()
An unresolved version of the internal encoder for the type of this Dataset. This one is marked implicit so that we can use it when constructing new Dataset objects that have the same object type (which will possibly be resolved to a different schema).

public org.apache.spark.sql.catalyst.encoders.ExpressionEncoder<T> resolvedTEncoder()
The encoder for this Dataset that has been resolved to its output schema.

public org.apache.spark.sql.catalyst.encoders.ExpressionEncoder<T> boundTEncoder()
The encoder where the expressions used to construct an object from an input row have been bound to the ordinals of the given schema.
public StructType schema()
Returns the schema of the encoded form of the objects in this Dataset.
Specified by: schema in interface org.apache.spark.sql.execution.Queryable

public void printSchema()
Prints the schema of the underlying Dataset to the console in a nice tree format.
Specified by: printSchema in interface org.apache.spark.sql.execution.Queryable

public void explain(boolean extended)
Prints the plans (logical and physical) to the console for debugging purposes.
Specified by: explain in interface org.apache.spark.sql.execution.Queryable
extended - (undocumented)

public void explain()
Prints the physical plan to the console for debugging purposes.
Specified by: explain in interface org.apache.spark.sql.execution.Queryable
public <U> Dataset<U> as(Encoder<U> evidence$1)
Returns a new Dataset where each record has been mapped on to the specified type. The method used to map columns depends on the type of U:
- When U is a class, fields for the class will be mapped to columns of the same name (case sensitivity is determined by spark.sql.caseSensitive).
- When U is a tuple, the columns will be mapped by ordinal (i.e. the first column will be assigned to _1).
- When U is a primitive type (i.e. String, Int, etc.), the first column of the DataFrame will be used.

If the schema of the DataFrame does not match the desired U type, you can use select along with alias or as to rearrange or rename as required.
evidence$1 - (undocumented)
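A brief sketch of the three mapping modes described above, assuming sqlContext.implicits._ is in scope; the column names and the Person class are hypothetical:

```scala
import org.apache.spark.sql.{DataFrame, Dataset}
import sqlContext.implicits._

case class Person(name: String, age: Long)   // hypothetical class

val df = Seq(("Alice", 30L), ("Bob", 25L)).toDF("name", "age")

val byName:    Dataset[Person]         = df.as[Person]                  // class: fields matched by column name
val byOrdinal: Dataset[(String, Long)] = df.as[(String, Long)]          // tuple: columns matched by ordinal
val firstCol:  Dataset[String]         = df.select($"name").as[String]  // primitive: the (first) column is used

// If the schema does not line up, rearrange or rename with select/alias first.
val renamed = Seq(("Alice", 30L)).toDF("full_name", "age")
val fixed: Dataset[Person] = renamed.select($"full_name".as("name"), $"age").as[Person]
```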
public Dataset<T> as(java.lang.String alias)
Applies a logical alias to this Dataset that can be used to disambiguate columns that have the same name after two Datasets have been joined.
alias - (undocumented)

public DataFrame toDF()
Converts this strongly typed collection of data to a generic DataFrame, whose rows are generic Row objects that allow fields to be accessed by ordinal or name.

public long count()
Returns the number of elements in the Dataset.
public void show(int numRows)
Displays the content of this Dataset in a tabular form. Strings longer than 20 characters will be truncated, and all cells will be aligned right. For example:
year  month  AVG('Adj Close)  MAX('Adj Close)
1980  12     0.503218         0.595103
1981  01     0.523289         0.570307
1982  02     0.436504         0.475256
1983  03     0.410516         0.442194
1984  04     0.450090         0.483521
numRows - Number of rows to show

public void show()
Displays the top 20 rows of the Dataset in a tabular form. Strings longer than 20 characters will be truncated, and all cells will be aligned right.

public void show(boolean truncate)
Displays the top 20 rows of the Dataset in a tabular form.
truncate - Whether to truncate long strings. If true, strings longer than 20 characters will be truncated and all cells will be aligned right.

public void show(int numRows, boolean truncate)
Displays the Dataset in a tabular form. For example:
year  month  AVG('Adj Close)  MAX('Adj Close)
1980  12     0.503218         0.595103
1981  01     0.523289         0.570307
1982  02     0.436504         0.475256
1983  03     0.410516         0.442194
1984  04     0.450090         0.483521
numRows - Number of rows to show
truncate - Whether to truncate long strings. If true, strings longer than 20 characters will be truncated and all cells will be aligned right.
public Dataset<T> repartition(int numPartitions)
Returns a new Dataset that has exactly numPartitions partitions.
numPartitions - (undocumented)

public Dataset<T> coalesce(int numPartitions)
Returns a new Dataset that has exactly numPartitions partitions. Similar to coalesce defined on an RDD, this operation results in a narrow dependency, e.g. if you go from 1000 partitions to 100 partitions, there will not be a shuffle; instead each of the 100 new partitions will claim 10 of the current partitions.
numPartitions - (undocumented)
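A small sketch of the difference between the two, assuming a spark-shell session with sqlContext available; the partition counts are illustrative:

```scala
import sqlContext.implicits._

// Start from 1000 partitions, mirroring the example above.
val ds = sqlContext.range(0, 1000000).as[Long].repartition(1000)

val narrow = ds.coalesce(100)     // narrow dependency: each new partition claims 10 old ones, no shuffle
val wide   = ds.repartition(200)  // full shuffle to reach exactly 200 partitions

// The resulting partition count can be inspected through the underlying RDD.
println(narrow.rdd.partitions.length)   // 100
```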
public <U> Dataset<U> transform(scala.Function1<Dataset<T>,Dataset<U>> t)
Concise syntax for chaining custom transformations. For example:
def featurize(ds: Dataset[T]) = ...
dataset
  .transform(featurize)
  .transform(...)
t - (undocumented)
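A concrete variant of the snippet above, with hypothetical helper transformations; it assumes sqlContext.implicits._ is in scope:

```scala
import org.apache.spark.sql.Dataset
import sqlContext.implicits._

// Hypothetical reusable transformations, each Dataset => Dataset.
def dropEmpty(ds: Dataset[String]): Dataset[String] = ds.filter(_.nonEmpty)
def normalize(ds: Dataset[String]): Dataset[String] = ds.map(_.trim.toLowerCase)

val cleaned = Seq(" Spark ", "", "SQL").toDS()
  .transform(dropEmpty)    // equivalent to dropEmpty(ds), but keeps the pipeline readable
  .transform(normalize)
```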
public Dataset<T> filter(scala.Function1<T,java.lang.Object> func)
func - (undocumented)

public Dataset<T> filter(FilterFunction<T> func)
func - (undocumented)

public <U> Dataset<U> map(scala.Function1<T,U> func, Encoder<U> evidence$2)
(Scala-specific) Returns a new Dataset that contains the result of applying func to each element.
func - (undocumented)
evidence$2 - (undocumented)

public <U> Dataset<U> map(MapFunction<T,U> func, Encoder<U> encoder)
(Java-specific) Returns a new Dataset that contains the result of applying func to each element.
func - (undocumented)
encoder - (undocumented)

public <U> Dataset<U> mapPartitions(scala.Function1<scala.collection.Iterator<T>,scala.collection.Iterator<U>> func, Encoder<U> evidence$3)
(Scala-specific) Returns a new Dataset that contains the result of applying func to each partition.
func - (undocumented)
evidence$3 - (undocumented)

public <U> Dataset<U> mapPartitions(MapPartitionsFunction<T,U> f, Encoder<U> encoder)
(Java-specific) Returns a new Dataset that contains the result of applying func to each partition.
f - (undocumented)
encoder - (undocumented)

public <U> Dataset<U> flatMap(scala.Function1<T,scala.collection.TraversableOnce<U>> func, Encoder<U> evidence$4)
Returns a new Dataset by first applying a function to all elements of this Dataset, and then flattening the results.
func - (undocumented)
evidence$4 - (undocumented)

public <U> Dataset<U> flatMap(FlatMapFunction<T,U> f, Encoder<U> encoder)
Returns a new Dataset by first applying a function to all elements of this Dataset, and then flattening the results.
f - (undocumented)
encoder - (undocumented)
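A minimal sketch of the Scala-specific functional operators above, assuming sqlContext.implicits._ provides the needed encoders:

```scala
import org.apache.spark.sql.Dataset
import sqlContext.implicits._

val lines = Seq("spark sql", "typed datasets").toDS()

val words: Dataset[String] = lines.flatMap(_.split(" "))   // flatten each line into words
val upper: Dataset[String] = words.map(_.toUpperCase)      // transform every element
val short: Dataset[String] = words.filter(_.length <= 5)   // keep only matching elements
```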
public void foreach(scala.Function1<T,scala.runtime.BoxedUnit> func)
(Scala-specific) Runs func on each element of this Dataset.
func - (undocumented)

public void foreach(ForeachFunction<T> func)
(Java-specific) Runs func on each element of this Dataset.
func - (undocumented)

public void foreachPartition(scala.Function1<scala.collection.Iterator<T>,scala.runtime.BoxedUnit> func)
(Scala-specific) Runs func on each partition of this Dataset.
func - (undocumented)

public void foreachPartition(ForeachPartitionFunction<T> func)
(Java-specific) Runs func on each partition of this Dataset.
func - (undocumented)

public T reduce(scala.Function2<T,T,T> func)
(Scala-specific) Reduces the elements of this Dataset using the specified binary function. The given func must be commutative and associative or the result may be non-deterministic.
func - (undocumented)

public T reduce(ReduceFunction<T> func)
(Java-specific) Reduces the elements of this Dataset using the specified binary function. The given func must be commutative and associative or the result may be non-deterministic.
func - (undocumented)
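For example, summation is commutative and associative, so it is a safe reduce function; this sketch assumes the usual implicits are in scope:

```scala
import sqlContext.implicits._

val nums = Seq(1, 2, 3, 4).toDS()

// The result is the same regardless of how the data is partitioned.
val total: Int = nums.reduce(_ + _)   // 10
```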
public <K> GroupedDataset<K,T> groupBy(scala.Function1<T,K> func, Encoder<K> evidence$5)
(Scala-specific) Returns a GroupedDataset where the data is grouped by the given key func.
func - (undocumented)
evidence$5 - (undocumented)

public GroupedDataset<Row,T> groupBy(scala.collection.Seq<Column> cols)
Returns a GroupedDataset where the data is grouped by the given Column expressions.
cols - (undocumented)

public <K> GroupedDataset<K,T> groupBy(MapFunction<T,K> func, Encoder<K> encoder)
(Java-specific) Returns a GroupedDataset where the data is grouped by the given key func.
func - (undocumented)
encoder - (undocumented)
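A sketch of grouping by a computed key, assuming the Spark 1.6 GroupedDataset.mapGroups signature and sqlContext.implicits._ in scope; Sale is a hypothetical record type:

```scala
import org.apache.spark.sql.{Dataset, GroupedDataset}
import sqlContext.implicits._

case class Sale(city: String, amount: Double)   // hypothetical record type

val sales = Seq(Sale("NYC", 10.0), Sale("NYC", 5.0), Sale("SF", 2.5)).toDS()

// (Scala-specific) group by a key derived from each element.
val byCity: GroupedDataset[String, Sale] = sales.groupBy(_.city)

// Aggregate each group into a (key, total) pair.
val totals: Dataset[(String, Double)] =
  byCity.mapGroups { (city, group) => (city, group.map(_.amount).sum) }
```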
protected DataFrame select(scala.collection.Seq<Column> cols)
Returns a new DataFrame by selecting a set of column based expressions.
df.select($"colA", $"colB" + 1)
cols - (undocumented)

public <U1> Dataset<U1> select(TypedColumn<T,U1> c1, Encoder<U1> evidence$6)
Returns a new Dataset by computing the given Column expression for each element.
val ds = Seq(1, 2, 3).toDS()
val newDS = ds.select(expr("value + 1").as[Int])
c1 - (undocumented)
evidence$6 - (undocumented)
protected Dataset<?> selectUntyped(scala.collection.Seq<TypedColumn<?,?>> columns)
Internal helper function for building typed selects that return tuples.
columns - (undocumented)

public <U1,U2> Dataset<scala.Tuple2<U1,U2>> select(TypedColumn<T,U1> c1, TypedColumn<T,U2> c2)
c1 - (undocumented)
c2 - (undocumented)

public <U1,U2,U3> Dataset<scala.Tuple3<U1,U2,U3>> select(TypedColumn<T,U1> c1, TypedColumn<T,U2> c2, TypedColumn<T,U3> c3)
c1 - (undocumented)
c2 - (undocumented)
c3 - (undocumented)

public <U1,U2,U3,U4> Dataset<scala.Tuple4<U1,U2,U3,U4>> select(TypedColumn<T,U1> c1, TypedColumn<T,U2> c2, TypedColumn<T,U3> c3, TypedColumn<T,U4> c4)
c1 - (undocumented)
c2 - (undocumented)
c3 - (undocumented)
c4 - (undocumented)

public <U1,U2,U3,U4,U5> Dataset<scala.Tuple5<U1,U2,U3,U4,U5>> select(TypedColumn<T,U1> c1, TypedColumn<T,U2> c2, TypedColumn<T,U3> c3, TypedColumn<T,U4> c4, TypedColumn<T,U5> c5)
c1 - (undocumented)
c2 - (undocumented)
c3 - (undocumented)
c4 - (undocumented)
c5 - (undocumented)
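These overloads combine several TypedColumns into a Dataset of tuples. A brief sketch, assuming org.apache.spark.sql.functions.expr and the implicits are in scope and using a hypothetical Person type:

```scala
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.functions.expr
import sqlContext.implicits._

case class Person(name: String, age: Long)   // hypothetical

val people = Seq(Person("Alice", 30L), Person("Bob", 25L)).toDS()

// Each Column is turned into a TypedColumn with .as[U]; the result is a Dataset of tuples.
val pairs: Dataset[(String, Long)] =
  people.select(expr("name").as[String], expr("age + 1").as[Long])
```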
public Dataset<T> sample(boolean withReplacement, double fraction, long seed)
Returns a new Dataset by sampling a fraction of records.
withReplacement - (undocumented)
fraction - (undocumented)
seed - (undocumented)

public Dataset<T> sample(boolean withReplacement, double fraction)
Returns a new Dataset by sampling a fraction of records, using a random seed.
withReplacement - (undocumented)
fraction - (undocumented)

public Dataset<T> distinct()
Returns a new Dataset that contains only the unique elements of this Dataset. Note that equality checking is performed directly on the encoded representation of the data and thus is not affected by a custom equals function defined on T.

public Dataset<T> intersect(Dataset<T> other)
Returns a new Dataset that contains only the elements of this Dataset that are also present in other. Note that equality checking is performed directly on the encoded representation of the data and thus is not affected by a custom equals function defined on T.
other - (undocumented)

public Dataset<T> union(Dataset<T> other)
Returns a new Dataset that contains the elements of both this and the other Dataset combined. Note that this function is not a typical set union operation, in that it does not eliminate duplicate items. As such, it is analogous to UNION ALL in SQL.
other - (undocumented)

public Dataset<T> subtract(Dataset<T> other)
Returns a new Dataset where any elements present in other have been removed. Note that equality checking is performed directly on the encoded representation of the data and thus is not affected by a custom equals function defined on T.
other - (undocumented)
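A small sketch of how the set-like operators behave, assuming the implicits are in scope; note that the element order of the results is not defined:

```scala
import sqlContext.implicits._

val a = Seq(1, 2, 2).toDS()
val b = Seq(2, 3).toDS()

a.distinct().collect()    // elements 1, 2
a.intersect(b).collect()  // element 2
a.union(b).collect()      // elements 1, 2, 2, 2, 3 (duplicates kept, like UNION ALL)
a.subtract(b).collect()   // element 1
```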
public <U> Dataset<scala.Tuple2<T,U>> joinWith(Dataset<U> other, Column condition, java.lang.String joinType)
Joins this Dataset with another Dataset, returning a Tuple2 for each pair where condition evaluates to true.

This is similar to the relational join function with one important difference in the result schema. Since joinWith preserves objects present on either side of the join, the result schema is similarly nested into a tuple under the column names _1 and _2.

This type of join can be useful both for preserving type-safety with the original object types as well as working with relational data where either side of the join has column names in common.
other - Right side of the join.
condition - Join expression.
joinType - One of: inner, outer, left_outer, right_outer, leftsemi.

public <U> Dataset<scala.Tuple2<T,U>> joinWith(Dataset<U> other, Column condition)
Joins this Dataset with another Dataset using an inner equi-join, returning a Tuple2 for each pair where condition evaluates to true.
other - Right side of the join.
condition - Join expression.
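A sketch of a typed join, assuming the implicits are in scope; Person and Department are hypothetical types, and the column names are chosen to be unambiguous across the two sides:

```scala
import org.apache.spark.sql.Dataset
import sqlContext.implicits._

case class Person(name: String, deptId: Int)      // hypothetical
case class Department(id: Int, deptName: String)  // hypothetical

val people = Seq(Person("Alice", 1), Person("Bob", 2)).toDS()
val depts  = Seq(Department(1, "Eng"), Department(2, "Ops")).toDS()

// Each result element pairs the matching objects from both sides under _1 and _2.
val joined: Dataset[(Person, Department)] =
  people.joinWith(depts, $"deptId" === $"id")
```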
public T first()
Returns the first element in this Dataset.

public java.lang.Object collect()
Returns an array that contains all the elements in this Dataset.

Running collect requires moving all the data into the application's driver process, and doing so on a very large Dataset can crash the driver process with OutOfMemoryError.

For the Java API, use collectAsList.

public java.util.List<T> collectAsList()
Returns a Java list that contains all the elements in this Dataset.

Running collect requires moving all the data into the application's driver process, and doing so on a very large Dataset can crash the driver process with OutOfMemoryError.

public java.lang.Object take(int num)
Returns the first num elements of this Dataset as an array.

Running take requires moving data into the application's driver process, and doing so with a very large num can crash the driver process with OutOfMemoryError.
num - (undocumented)

public java.util.List<T> takeAsList(int num)
Returns the first num elements of this Dataset as a list.

Running take requires moving data into the application's driver process, and doing so with a very large num can crash the driver process with OutOfMemoryError.
num - (undocumented)
public Dataset<T> persist()
Persist this Dataset with the default storage level (MEMORY_AND_DISK).

public Dataset<T> cache()
Persist this Dataset with the default storage level (MEMORY_AND_DISK).

public Dataset<T> persist(StorageLevel newLevel)
Persist this Dataset with the given storage level.
newLevel - One of: MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER, DISK_ONLY, MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.
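A short caching sketch, assuming the implicits are in scope; the storage level is just an example:

```scala
import org.apache.spark.storage.StorageLevel
import sqlContext.implicits._

val ds = Seq(1, 2, 3).toDS()

ds.persist(StorageLevel.MEMORY_ONLY)   // pin the encoded data in memory
ds.count()                             // the first action materializes the cached blocks
ds.unpersist(blocking = true)          // remove the blocks, waiting until deletion completes
```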
public Dataset<T> unpersist(boolean blocking)
Mark the Dataset as non-persistent, and remove all blocks for it from memory and disk.
blocking - Whether to block until all blocks are deleted.