public class Dataset<T>
extends java.lang.Object
implements org.apache.spark.sql.execution.Queryable, scala.Serializable
Dataset is a strongly typed collection of objects that can be transformed in parallel using functional or relational operations.

A Dataset differs from an RDD in the following ways:

- Internally, a Dataset is represented by a Catalyst logical plan, and the data is stored in an encoded form. This representation allows for additional logical operations and enables many operations (sorting, shuffling, etc.) to be performed without deserializing the data back into objects.
- Creating a Dataset requires an explicit Encoder that can serialize the objects into a binary format. Encoders are also capable of mapping the schema of a given object to the Spark SQL type system. In contrast, RDDs rely on runtime reflection based serialization. Operations that change the type of object stored in the Dataset also need an encoder for the new type.

A Dataset can be thought of as a specialized DataFrame, where the elements map to a specific JVM object type instead of to a generic Row container. A DataFrame can be transformed into a specific Dataset by calling df.as[ElementType]. Similarly, you can transform a strongly-typed Dataset into a generic DataFrame by calling ds.toDF().

COMPATIBILITY NOTE: Long term we plan to make DataFrame extend Dataset[Row]. However, making this change to the class hierarchy would break the function signatures for the existing functional operations (map, flatMap, etc.). As such, this class should be considered a preview of the final API. Changes will be made to the interface after Spark 1.6.
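As an illustration of the conversions described above, here is a minimal sketch in Scala, assuming a Spark 1.6 session where sqlContext and its implicits are available; the Person case class and the sample rows are hypothetical:

```scala
import org.apache.spark.sql.{DataFrame, Dataset}
import sqlContext.implicits._

case class Person(name: String, age: Long)   // hypothetical element type

// An untyped DataFrame with columns `name` and `age`.
val df = Seq(("Alice", 30L), ("Bob", 15L)).toDF("name", "age")

// DataFrame -> Dataset[Person]: columns are matched to the fields of Person by name.
val people: Dataset[Person] = df.as[Person]

// Strongly typed, functional transformation over JVM objects.
val adults: Dataset[Person] = people.filter(_.age >= 18)

// Dataset -> DataFrame: back to generic, Row-based operations.
val adultsDF: DataFrame = adults.toDF()
```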
Modifier and Type | Method and Description |
---|---|
<U> Dataset<U> | as(Encoder<U> evidence$1): Returns a new Dataset where each record has been mapped on to the specified type. |
Dataset<T> | as(java.lang.String alias): Applies a logical alias to this Dataset that can be used to disambiguate columns that have the same name after two Datasets have been joined. |
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder<T> | boundTEncoder(): The encoder where the expressions used to construct an object from an input row have been bound to the ordinals of the given schema. |
Dataset<T> | cache(): Persist this Dataset with the default storage level (MEMORY_AND_DISK). |
Dataset<T> | coalesce(int numPartitions): Returns a new Dataset that has exactly numPartitions partitions. |
java.lang.Object | collect(): Returns an array that contains all the elements in this Dataset. |
java.util.List<T> | collectAsList(): Returns a list that contains all the elements in this Dataset. |
long | count(): Returns the number of elements in the Dataset. |
Dataset<T> | distinct(): Returns a new Dataset that contains only the unique elements of this Dataset. |
void | explain(): Prints the physical plan to the console for debugging purposes. |
void | explain(boolean extended): Prints the plans (logical and physical) to the console for debugging purposes. |
Dataset<T> | filter(FilterFunction<T> func) |
Dataset<T> | filter(scala.Function1<T,java.lang.Object> func) |
T | first(): Returns the first element in this Dataset. |
<U> Dataset<U> | flatMap(FlatMapFunction<T,U> f, Encoder<U> encoder) |
<U> Dataset<U> | flatMap(scala.Function1<T,scala.collection.TraversableOnce<U>> func, Encoder<U> evidence$4) |
void | foreach(ForeachFunction<T> func): (Java-specific) Runs func on each element of this Dataset. |
void | foreach(scala.Function1<T,scala.runtime.BoxedUnit> func): (Scala-specific) Runs func on each element of this Dataset. |
void | foreachPartition(ForeachPartitionFunction<T> func): (Java-specific) Runs func on each partition of this Dataset. |
void | foreachPartition(scala.Function1<scala.collection.Iterator<T>,scala.runtime.BoxedUnit> func): (Scala-specific) Runs func on each partition of this Dataset. |
GroupedDataset<Row,T> | groupBy(Column... cols): Returns a GroupedDataset where the data is grouped by the given Column expressions. |
<K> GroupedDataset<K,T> | groupBy(scala.Function1<T,K> func, Encoder<K> evidence$5): (Scala-specific) Returns a GroupedDataset where the data is grouped by the given key func. |
<K> GroupedDataset<K,T> | groupBy(MapFunction<T,K> func, Encoder<K> encoder): (Java-specific) Returns a GroupedDataset where the data is grouped by the given key func. |
GroupedDataset<Row,T> | groupBy(scala.collection.Seq<Column> cols): Returns a GroupedDataset where the data is grouped by the given Column expressions. |
Dataset<T> | intersect(Dataset<T> other) |
<U> Dataset<scala.Tuple2<T,U>> | joinWith(Dataset<U> other, Column condition): Joins this Dataset with another Dataset using an inner equi-join, returning a Tuple2 for each pair where condition evaluates to true. |
<U> Dataset<scala.Tuple2<T,U>> | joinWith(Dataset<U> other, Column condition, java.lang.String joinType) |
<U> Dataset<U> | map(scala.Function1<T,U> func, Encoder<U> evidence$2): (Scala-specific) Returns a new Dataset that contains the result of applying func to each element. |
<U> Dataset<U> | map(MapFunction<T,U> func, Encoder<U> encoder): (Java-specific) Returns a new Dataset that contains the result of applying func to each element. |
<U> Dataset<U> | mapPartitions(scala.Function1<scala.collection.Iterator<T>,scala.collection.Iterator<U>> func, Encoder<U> evidence$3): (Scala-specific) Returns a new Dataset that contains the result of applying func to each partition. |
<U> Dataset<U> | mapPartitions(MapPartitionsFunction<T,U> f, Encoder<U> encoder): (Java-specific) Returns a new Dataset that contains the result of applying func to each partition. |
Dataset<T> | persist(): Persist this Dataset with the default storage level (MEMORY_AND_DISK). |
Dataset<T> | persist(StorageLevel newLevel): Persist this Dataset with the given storage level. |
void | printSchema(): Prints the schema of the underlying Dataset to the console in a nice tree format. |
org.apache.spark.sql.execution.QueryExecution | queryExecution() |
RDD<T> | rdd(): Converts this Dataset to an RDD. |
T | reduce(scala.Function2<T,T,T> func): (Scala-specific) Reduces the elements of this Dataset using the specified binary function. |
T | reduce(ReduceFunction<T> func): (Java-specific) Reduces the elements of this Dataset using the specified binary function. |
Dataset<T> | repartition(int numPartitions): Returns a new Dataset that has exactly numPartitions partitions. |
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder<T> | resolvedTEncoder(): The encoder for this Dataset that has been resolved to its output schema. |
Dataset<T> | sample(boolean withReplacement, double fraction): Returns a new Dataset by sampling a fraction of records, using a random seed. |
Dataset<T> | sample(boolean withReplacement, double fraction, long seed): Returns a new Dataset by sampling a fraction of records. |
StructType | schema(): Returns the schema of the encoded form of the objects in this Dataset. |
protected DataFrame | select(Column... cols): Returns a new DataFrame by selecting a set of column based expressions. |
protected DataFrame | select(scala.collection.Seq<Column> cols): Returns a new DataFrame by selecting a set of column based expressions. |
<U1> Dataset<U1> | select(TypedColumn<T,U1> c1, Encoder<U1> evidence$6) |
<U1,U2> Dataset<scala.Tuple2<U1,U2>> | select(TypedColumn<T,U1> c1, TypedColumn<T,U2> c2) |
<U1,U2,U3> Dataset<scala.Tuple3<U1,U2,U3>> | select(TypedColumn<T,U1> c1, TypedColumn<T,U2> c2, TypedColumn<T,U3> c3) |
<U1,U2,U3,U4> Dataset<scala.Tuple4<U1,U2,U3,U4>> | select(TypedColumn<T,U1> c1, TypedColumn<T,U2> c2, TypedColumn<T,U3> c3, TypedColumn<T,U4> c4) |
<U1,U2,U3,U4,U5> Dataset<scala.Tuple5<U1,U2,U3,U4,U5>> | select(TypedColumn<T,U1> c1, TypedColumn<T,U2> c2, TypedColumn<T,U3> c3, TypedColumn<T,U4> c4, TypedColumn<T,U5> c5) |
protected Dataset<?> | selectUntyped(scala.collection.Seq<TypedColumn<?,?>> columns): Internal helper function for building typed selects that return tuples. |
void | show(): Displays the top 20 rows of the Dataset in a tabular form. |
void | show(boolean truncate): Displays the top 20 rows of the Dataset in a tabular form. |
void | show(int numRows): Displays the content of this Dataset in a tabular form. |
void | show(int numRows, boolean truncate): Displays the Dataset in a tabular form. |
SQLContext | sqlContext() |
Dataset<T> | subtract(Dataset<T> other): Returns a new Dataset where any elements present in other have been removed. |
java.lang.Object | take(int num): Returns the first num elements of this Dataset as an array. |
java.util.List<T> | takeAsList(int num): Returns the first num elements of this Dataset as a list. |
DataFrame | toDF(): Converts this strongly typed collection of data to a generic DataFrame. |
Dataset<T> | toDS(): Returns this Dataset. |
<U> Dataset<U> | transform(scala.Function1<Dataset<T>,Dataset<U>> t): Concise syntax for chaining custom transformations. |
Dataset<T> | union(Dataset<T> other) |
Dataset<T> | unpersist(): Mark the Dataset as non-persistent, and remove all blocks for it from memory and disk. |
Dataset<T> | unpersist(boolean blocking): Mark the Dataset as non-persistent, and remove all blocks for it from memory and disk. |
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder<T> | unresolvedTEncoder(): An unresolved version of the internal encoder for the type of this Dataset. |
public GroupedDataset<Row,T> groupBy(Column... cols)
Returns a GroupedDataset where the data is grouped by the given Column expressions.
cols - (undocumented)

protected DataFrame select(Column... cols)
Returns a new DataFrame by selecting a set of column based expressions.
df.select($"colA", $"colB" + 1)
cols - (undocumented)

public SQLContext sqlContext()
Specified by: sqlContext in interface org.apache.spark.sql.execution.Queryable

public org.apache.spark.sql.execution.QueryExecution queryExecution()
Specified by: queryExecution in interface org.apache.spark.sql.execution.Queryable
public org.apache.spark.sql.catalyst.encoders.ExpressionEncoder<T> unresolvedTEncoder()
An unresolved version of the internal encoder for the type of this Dataset. This one is marked implicit so that we can use it when constructing new Dataset objects that have the same object type (which will possibly be resolved to a different schema).

public org.apache.spark.sql.catalyst.encoders.ExpressionEncoder<T> resolvedTEncoder()
The encoder for this Dataset that has been resolved to its output schema.

public org.apache.spark.sql.catalyst.encoders.ExpressionEncoder<T> boundTEncoder()
The encoder where the expressions used to construct an object from an input row have been bound to the ordinals of the given schema.
public StructType schema()
Returns the schema of the encoded form of the objects in this Dataset.
Specified by: schema in interface org.apache.spark.sql.execution.Queryable

public void printSchema()
Prints the schema of the underlying Dataset to the console in a nice tree format.
Specified by: printSchema in interface org.apache.spark.sql.execution.Queryable

public void explain(boolean extended)
Prints the plans (logical and physical) to the console for debugging purposes.
Specified by: explain in interface org.apache.spark.sql.execution.Queryable
extended - (undocumented)

public void explain()
Prints the physical plan to the console for debugging purposes.
Specified by: explain in interface org.apache.spark.sql.execution.Queryable
public <U> Dataset<U> as(Encoder<U> evidence$1)
Returns a new Dataset where each record has been mapped on to the specified type. The method used to map columns depends on the type of U:
- When U is a class, fields for the class will be mapped to columns of the same name (case sensitivity is determined by spark.sql.caseSensitive).
- When U is a tuple, the columns will be mapped by ordinal (i.e. the first column will be assigned to _1).
- When U is a primitive type (i.e. String, Int, etc.), the first column of the DataFrame will be used.

If the schema of the DataFrame does not match the desired U type, you can use select along with alias or as to rearrange or rename as required.
evidence$1 - (undocumented)
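A brief sketch of the three mapping modes described above, assuming sqlContext.implicits._ is in scope; the column names and the Person class are hypothetical:

```scala
import org.apache.spark.sql.{DataFrame, Dataset}
import sqlContext.implicits._

case class Person(name: String, age: Long)   // hypothetical class

val df = Seq(("Alice", 30L), ("Bob", 25L)).toDF("name", "age")

val byName:    Dataset[Person]         = df.as[Person]                  // class: fields matched by column name
val byOrdinal: Dataset[(String, Long)] = df.as[(String, Long)]          // tuple: columns matched by ordinal
val firstCol:  Dataset[String]         = df.select($"name").as[String]  // primitive: the (first) column is used

// If the schema does not line up, rearrange or rename with select/alias first.
val renamed = Seq(("Alice", 30L)).toDF("full_name", "age")
val fixed: Dataset[Person] = renamed.select($"full_name".as("name"), $"age").as[Person]
```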
public Dataset<T> as(java.lang.String alias)
Applies a logical alias to this Dataset that can be used to disambiguate columns that have the same name after two Datasets have been joined.
alias - (undocumented)

public DataFrame toDF()
Converts this strongly typed collection of data to a generic DataFrame, whose rows are generic Row objects that allow fields to be accessed by ordinal or name.

public long count()
Returns the number of elements in the Dataset.
public void show(int numRows)
Displays the content of this Dataset in a tabular form. Strings longer than 20 characters will be truncated, and all cells will be aligned right. For example:
year  month  AVG('Adj Close)  MAX('Adj Close)
1980  12     0.503218         0.595103
1981  01     0.523289         0.570307
1982  02     0.436504         0.475256
1983  03     0.410516         0.442194
1984  04     0.450090         0.483521
numRows - Number of rows to show

public void show()
Displays the top 20 rows of the Dataset in a tabular form. Strings longer than 20 characters will be truncated, and all cells will be aligned right.

public void show(boolean truncate)
Displays the top 20 rows of the Dataset in a tabular form.
truncate - Whether to truncate long strings. If true, strings longer than 20 characters will be truncated and all cells will be aligned right.

public void show(int numRows, boolean truncate)
Displays the Dataset in a tabular form. For example:
year  month  AVG('Adj Close)  MAX('Adj Close)
1980  12     0.503218         0.595103
1981  01     0.523289         0.570307
1982  02     0.436504         0.475256
1983  03     0.410516         0.442194
1984  04     0.450090         0.483521
numRows - Number of rows to show
truncate - Whether to truncate long strings. If true, strings longer than 20 characters will be truncated and all cells will be aligned right.
public Dataset<T> repartition(int numPartitions)
Returns a new Dataset that has exactly numPartitions partitions.
numPartitions - (undocumented)

public Dataset<T> coalesce(int numPartitions)
Returns a new Dataset that has exactly numPartitions partitions. Similar to coalesce defined on an RDD, this operation results in a narrow dependency, e.g. if you go from 1000 partitions to 100 partitions, there will not be a shuffle; instead each of the 100 new partitions will claim 10 of the current partitions.
numPartitions - (undocumented)
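A small sketch of the difference between the two, assuming a spark-shell session with sqlContext available; the partition counts are illustrative:

```scala
import sqlContext.implicits._

// Start from 1000 partitions, mirroring the example above.
val ds = sqlContext.range(0, 1000000).as[Long].repartition(1000)

val narrow = ds.coalesce(100)     // narrow dependency: each new partition claims 10 old ones, no shuffle
val wide   = ds.repartition(200)  // full shuffle to reach exactly 200 partitions

// The resulting partition count can be inspected through the underlying RDD.
println(narrow.rdd.partitions.length)   // 100
```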
public <U> Dataset<U> transform(scala.Function1<Dataset<T>,Dataset<U>> t)
Concise syntax for chaining custom transformations. For example:
def featurize(ds: Dataset[T]) = ...
dataset
  .transform(featurize)
  .transform(...)
t - (undocumented)
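A concrete variant of the snippet above, with hypothetical helper transformations; it assumes sqlContext.implicits._ is in scope:

```scala
import org.apache.spark.sql.Dataset
import sqlContext.implicits._

// Hypothetical reusable transformations, each Dataset => Dataset.
def dropEmpty(ds: Dataset[String]): Dataset[String] = ds.filter(_.nonEmpty)
def normalize(ds: Dataset[String]): Dataset[String] = ds.map(_.trim.toLowerCase)

val cleaned = Seq(" Spark ", "", "SQL").toDS()
  .transform(dropEmpty)    // equivalent to dropEmpty(ds), but keeps the pipeline readable
  .transform(normalize)
```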
public Dataset<T> filter(scala.Function1<T,java.lang.Object> func)
func - (undocumented)

public Dataset<T> filter(FilterFunction<T> func)
func - (undocumented)

public <U> Dataset<U> map(scala.Function1<T,U> func, Encoder<U> evidence$2)
(Scala-specific) Returns a new Dataset that contains the result of applying func to each element.
func - (undocumented)
evidence$2 - (undocumented)

public <U> Dataset<U> map(MapFunction<T,U> func, Encoder<U> encoder)
(Java-specific) Returns a new Dataset that contains the result of applying func to each element.
func - (undocumented)
encoder - (undocumented)

public <U> Dataset<U> mapPartitions(scala.Function1<scala.collection.Iterator<T>,scala.collection.Iterator<U>> func, Encoder<U> evidence$3)
(Scala-specific) Returns a new Dataset that contains the result of applying func to each partition.
func - (undocumented)
evidence$3 - (undocumented)

public <U> Dataset<U> mapPartitions(MapPartitionsFunction<T,U> f, Encoder<U> encoder)
(Java-specific) Returns a new Dataset that contains the result of applying func to each partition.
f - (undocumented)
encoder - (undocumented)

public <U> Dataset<U> flatMap(scala.Function1<T,scala.collection.TraversableOnce<U>> func, Encoder<U> evidence$4)
Returns a new Dataset by first applying a function to all elements of this Dataset, and then flattening the results.
func - (undocumented)
evidence$4 - (undocumented)

public <U> Dataset<U> flatMap(FlatMapFunction<T,U> f, Encoder<U> encoder)
Returns a new Dataset by first applying a function to all elements of this Dataset, and then flattening the results.
f - (undocumented)
encoder - (undocumented)
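A minimal sketch of the Scala-specific functional operators above, assuming sqlContext.implicits._ provides the needed encoders:

```scala
import org.apache.spark.sql.Dataset
import sqlContext.implicits._

val lines = Seq("spark sql", "typed datasets").toDS()

val words: Dataset[String] = lines.flatMap(_.split(" "))   // flatten each line into words
val upper: Dataset[String] = words.map(_.toUpperCase)      // transform every element
val short: Dataset[String] = words.filter(_.length <= 5)   // keep only matching elements
```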
public void foreach(scala.Function1<T,scala.runtime.BoxedUnit> func)
(Scala-specific) Runs func on each element of this Dataset.
func - (undocumented)

public void foreach(ForeachFunction<T> func)
(Java-specific) Runs func on each element of this Dataset.
func - (undocumented)

public void foreachPartition(scala.Function1<scala.collection.Iterator<T>,scala.runtime.BoxedUnit> func)
(Scala-specific) Runs func on each partition of this Dataset.
func - (undocumented)

public void foreachPartition(ForeachPartitionFunction<T> func)
(Java-specific) Runs func on each partition of this Dataset.
func - (undocumented)

public T reduce(scala.Function2<T,T,T> func)
(Scala-specific) Reduces the elements of this Dataset using the specified binary function. The given func must be commutative and associative or the result may be non-deterministic.
func - (undocumented)

public T reduce(ReduceFunction<T> func)
(Java-specific) Reduces the elements of this Dataset using the specified binary function. The given func must be commutative and associative or the result may be non-deterministic.
func - (undocumented)
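For example, summation is commutative and associative, so it is a safe reduce function; this sketch assumes the usual implicits are in scope:

```scala
import sqlContext.implicits._

val nums = Seq(1, 2, 3, 4).toDS()

// The result is the same regardless of how the data is partitioned.
val total: Int = nums.reduce(_ + _)   // 10
```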
public <K> GroupedDataset<K,T> groupBy(scala.Function1<T,K> func, Encoder<K> evidence$5)
(Scala-specific) Returns a GroupedDataset where the data is grouped by the given key func.
func - (undocumented)
evidence$5 - (undocumented)

public GroupedDataset<Row,T> groupBy(scala.collection.Seq<Column> cols)
Returns a GroupedDataset where the data is grouped by the given Column expressions.
cols - (undocumented)

public <K> GroupedDataset<K,T> groupBy(MapFunction<T,K> func, Encoder<K> encoder)
(Java-specific) Returns a GroupedDataset where the data is grouped by the given key func.
func - (undocumented)
encoder - (undocumented)
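A sketch of grouping by a computed key, assuming the Spark 1.6 GroupedDataset.mapGroups signature and sqlContext.implicits._ in scope; Sale is a hypothetical record type:

```scala
import org.apache.spark.sql.{Dataset, GroupedDataset}
import sqlContext.implicits._

case class Sale(city: String, amount: Double)   // hypothetical record type

val sales = Seq(Sale("NYC", 10.0), Sale("NYC", 5.0), Sale("SF", 2.5)).toDS()

// (Scala-specific) group by a key derived from each element.
val byCity: GroupedDataset[String, Sale] = sales.groupBy(_.city)

// Aggregate each group into a (key, total) pair.
val totals: Dataset[(String, Double)] =
  byCity.mapGroups { (city, group) => (city, group.map(_.amount).sum) }
```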
protected DataFrame select(scala.collection.Seq<Column> cols)
Returns a new DataFrame by selecting a set of column based expressions.
df.select($"colA", $"colB" + 1)
cols - (undocumented)

public <U1> Dataset<U1> select(TypedColumn<T,U1> c1, Encoder<U1> evidence$6)
Returns a new Dataset by computing the given Column expression for each element.
val ds = Seq(1, 2, 3).toDS()
val newDS = ds.select(expr("value + 1").as[Int])
c1 - (undocumented)
evidence$6 - (undocumented)
protected Dataset<?> selectUntyped(scala.collection.Seq<TypedColumn<?,?>> columns)
Internal helper function for building typed selects that return tuples.
columns - (undocumented)

public <U1,U2> Dataset<scala.Tuple2<U1,U2>> select(TypedColumn<T,U1> c1, TypedColumn<T,U2> c2)
c1 - (undocumented)
c2 - (undocumented)

public <U1,U2,U3> Dataset<scala.Tuple3<U1,U2,U3>> select(TypedColumn<T,U1> c1, TypedColumn<T,U2> c2, TypedColumn<T,U3> c3)
c1 - (undocumented)
c2 - (undocumented)
c3 - (undocumented)

public <U1,U2,U3,U4> Dataset<scala.Tuple4<U1,U2,U3,U4>> select(TypedColumn<T,U1> c1, TypedColumn<T,U2> c2, TypedColumn<T,U3> c3, TypedColumn<T,U4> c4)
c1 - (undocumented)
c2 - (undocumented)
c3 - (undocumented)
c4 - (undocumented)

public <U1,U2,U3,U4,U5> Dataset<scala.Tuple5<U1,U2,U3,U4,U5>> select(TypedColumn<T,U1> c1, TypedColumn<T,U2> c2, TypedColumn<T,U3> c3, TypedColumn<T,U4> c4, TypedColumn<T,U5> c5)
c1 - (undocumented)
c2 - (undocumented)
c3 - (undocumented)
c4 - (undocumented)
c5 - (undocumented)
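These overloads combine several TypedColumns into a Dataset of tuples. A brief sketch, assuming org.apache.spark.sql.functions.expr and the implicits are in scope and using a hypothetical Person type:

```scala
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.functions.expr
import sqlContext.implicits._

case class Person(name: String, age: Long)   // hypothetical

val people = Seq(Person("Alice", 30L), Person("Bob", 25L)).toDS()

// Each Column is turned into a TypedColumn with .as[U]; the result is a Dataset of tuples.
val pairs: Dataset[(String, Long)] =
  people.select(expr("name").as[String], expr("age + 1").as[Long])
```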
public Dataset<T> sample(boolean withReplacement, double fraction, long seed)
Returns a new Dataset by sampling a fraction of records.
withReplacement - (undocumented)
fraction - (undocumented)
seed - (undocumented)

public Dataset<T> sample(boolean withReplacement, double fraction)
Returns a new Dataset by sampling a fraction of records, using a random seed.
withReplacement - (undocumented)
fraction - (undocumented)

public Dataset<T> distinct()
Returns a new Dataset that contains only the unique elements of this Dataset. Note that equality checking is performed directly on the encoded representation of the data and thus is not affected by a custom equals function defined on T.

public Dataset<T> intersect(Dataset<T> other)
Returns a new Dataset that contains only the elements of this Dataset that are also present in other. Note that equality checking is performed directly on the encoded representation of the data and thus is not affected by a custom equals function defined on T.
other - (undocumented)

public Dataset<T> union(Dataset<T> other)
Returns a new Dataset that contains the elements of both this and the other Dataset combined. Note that this function is not a typical set union operation, in that it does not eliminate duplicate items. As such, it is analogous to UNION ALL in SQL.
other - (undocumented)

public Dataset<T> subtract(Dataset<T> other)
Returns a new Dataset where any elements present in other have been removed. Note that equality checking is performed directly on the encoded representation of the data and thus is not affected by a custom equals function defined on T.
other - (undocumented)
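A small sketch of how the set-like operators behave, assuming the implicits are in scope; note that the element order of the results is not defined:

```scala
import sqlContext.implicits._

val a = Seq(1, 2, 2).toDS()
val b = Seq(2, 3).toDS()

a.distinct().collect()    // elements 1, 2
a.intersect(b).collect()  // element 2
a.union(b).collect()      // elements 1, 2, 2, 2, 3 (duplicates kept, like UNION ALL)
a.subtract(b).collect()   // element 1
```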
public <U> Dataset<scala.Tuple2<T,U>> joinWith(Dataset<U> other, Column condition, java.lang.String joinType)
Joins this Dataset with another Dataset, returning a Tuple2 for each pair where condition evaluates to true.

This is similar to the relational join function with one important difference in the result schema. Since joinWith preserves objects present on either side of the join, the result schema is similarly nested into a tuple under the column names _1 and _2.

This type of join can be useful both for preserving type-safety with the original object types as well as working with relational data where either side of the join has column names in common.
other - Right side of the join.
condition - Join expression.
joinType - One of: inner, outer, left_outer, right_outer, leftsemi.

public <U> Dataset<scala.Tuple2<T,U>> joinWith(Dataset<U> other, Column condition)
Joins this Dataset with another Dataset using an inner equi-join, returning a Tuple2 for each pair where condition evaluates to true.
other - Right side of the join.
condition - Join expression.
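A sketch of a typed join, assuming the implicits are in scope; Person and Department are hypothetical types, and the column names are chosen to be unambiguous across the two sides:

```scala
import org.apache.spark.sql.Dataset
import sqlContext.implicits._

case class Person(name: String, deptId: Int)      // hypothetical
case class Department(id: Int, deptName: String)  // hypothetical

val people = Seq(Person("Alice", 1), Person("Bob", 2)).toDS()
val depts  = Seq(Department(1, "Eng"), Department(2, "Ops")).toDS()

// Each result element pairs the matching objects from both sides under _1 and _2.
val joined: Dataset[(Person, Department)] =
  people.joinWith(depts, $"deptId" === $"id")
```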
public T first()
Returns the first element in this Dataset.

public java.lang.Object collect()
Returns an array that contains all the elements in this Dataset.

Running collect requires moving all the data into the application's driver process, and doing so on a very large Dataset can crash the driver process with OutOfMemoryError.

For the Java API, use collectAsList.

public java.util.List<T> collectAsList()
Returns a Java list that contains all the elements in this Dataset.

Running collect requires moving all the data into the application's driver process, and doing so on a very large Dataset can crash the driver process with OutOfMemoryError.

public java.lang.Object take(int num)
Returns the first num elements of this Dataset as an array.

Running take requires moving data into the application's driver process, and doing so with a very large num can crash the driver process with OutOfMemoryError.
num - (undocumented)

public java.util.List<T> takeAsList(int num)
Returns the first num elements of this Dataset as a list.

Running take requires moving data into the application's driver process, and doing so with a very large num can crash the driver process with OutOfMemoryError.
num - (undocumented)
public Dataset<T> persist()
Persist this Dataset with the default storage level (MEMORY_AND_DISK).

public Dataset<T> cache()
Persist this Dataset with the default storage level (MEMORY_AND_DISK).

public Dataset<T> persist(StorageLevel newLevel)
Persist this Dataset with the given storage level.
newLevel - One of: MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER, DISK_ONLY, MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.
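A short caching sketch, assuming the implicits are in scope; the storage level is just an example:

```scala
import org.apache.spark.storage.StorageLevel
import sqlContext.implicits._

val ds = Seq(1, 2, 3).toDS()

ds.persist(StorageLevel.MEMORY_ONLY)   // pin the encoded data in memory
ds.count()                             // the first action materializes the cached blocks
ds.unpersist(blocking = true)          // remove the blocks, waiting until deletion completes
```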
public Dataset<T> unpersist(boolean blocking)
Mark the Dataset as non-persistent, and remove all blocks for it from memory and disk.
blocking - Whether to block until all blocks are deleted.