As we look forward to Spark 2.0, we plan some exciting improvements to Datasets, specifically: Custom encoders – while we currently autogenerate encoders for a wide variety of types, we’d like to open up an API for custom objects.

and attempts to store a custom type in a Dataset lead to an error like the following:

Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing sqlContext.implicits._ Support for serializing other types will be added in future releases

java.lang.UnsupportedOperationException: No Encoder found for ....

Are there any existing workarounds?

Note this question exists only as an entry point for a Community Wiki answer. Feel free to update / improve both question and answer.

Update

This answer is still valid and informative, although things are now better since 2.2/2.3, which add built-in encoder support for Set, Seq, Map, Date, Timestamp, and BigDecimal. If you stick to making types with only case classes and the usual Scala types, you should be fine with just the implicit in SQLImplicits.

Unfortunately, virtually nothing has been added to help with this. Searching for @since 2.0.0 in Encoders.scala or SQLImplicits.scala finds things mostly to do with primitive types (and some tweaking of case classes). So, the first thing to say: there is currently no really good support for custom class encoders. With that out of the way, what follows are some tricks that do as good a job as we can hope for, given what we currently have at our disposal. As an upfront disclaimer: this won't work perfectly, and I'll do my best to make all limitations clear.

What exactly is the problem

When you want to make a dataset, Spark "requires an encoder (to convert a JVM object of type T to and from the internal Spark SQL representation) that is generally created automatically through implicits from a SparkSession, or can be created explicitly by calling static methods on Encoders" (taken from the docs on createDataset). An encoder will take the form Encoder[T] where T is the type you are encoding. The first suggestion is to add import spark.implicits._ (which gives you these implicit encoders) and the second suggestion is to explicitly pass in the implicit encoder using this set of encoder related functions.

There is no encoder available for regular classes, so

import spark.implicits._
class MyObj(val i: Int)
// ...
val d = spark.createDataset(Seq(new MyObj(1),new MyObj(2),new MyObj(3)))

will give you the following implicit related compile time error:

Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing sqlContext.implicits._ Support for serializing other types will be added in future releases

However, if you wrap whatever type you just used to get the above error in some class that extends Product, the error confusingly gets delayed to runtime, so

import spark.implicits._
case class Wrap[T](unwrap: T)
class MyObj(val i: Int)
// ...
val d = spark.createDataset(Seq(Wrap(new MyObj(1)),Wrap(new MyObj(2)),Wrap(new MyObj(3))))

Compiles just fine, but fails at runtime with

java.lang.UnsupportedOperationException: No Encoder found for MyObj

The reason for this is that the encoders Spark creates with the implicits are actually only made at runtime (via Scala reflection). In this case, all Spark checks at compile time is that the outermost class extends Product (which all case classes do), and it only realizes at runtime that it still doesn't know what to do with MyObj (the same problem occurs if I try to make a Dataset[(Int,MyObj)]: Spark waits until runtime to barf on MyObj). These are central problems that are in dire need of being fixed:

  • some classes that extend Product compile despite always crashing at runtime and
  • there is no way of passing in custom encoders for nested types (I have no way of feeding Spark an encoder for just MyObj such that it then knows how to encode Wrap[MyObj] or (Int,MyObj)).
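To see why the first problem slips past the compiler, here is a minimal Spark-free sketch (the object name ProductCheck is made up for illustration). It only shows that the outer wrapper satisfies a Product bound while the nested MyObj does not, which is all that a compile-time check on the outermost type can observe.

```scala
object ProductCheck {
  case class Wrap[T](unwrap: T)
  class MyObj(val i: Int)

  // Both of these compile: every case class and every tuple is a Product,
  // whatever its type arguments are.
  val wrapped: Product = Wrap(new MyObj(1))
  val tupled: Product = (1, new MyObj(2))

  // The nested value is a plain class, so it is not a Product itself.
  val innerIsProduct: Boolean = wrapped.productElement(0).isInstanceOf[Product]
}
```

Nothing here involves Spark; it only illustrates why a check on the outer type says nothing about whether the inner type can be encoded.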
    Just use kryo

    The solution everyone suggests is to use the kryo encoder.

    import spark.implicits._
    class MyObj(val i: Int)
    implicit val myObjEncoder = org.apache.spark.sql.Encoders.kryo[MyObj]
    // ...
    val d = spark.createDataset(Seq(new MyObj(1),new MyObj(2),new MyObj(3)))
    

    This gets pretty tedious fast though. Especially if your code is manipulating all sorts of datasets, joining, grouping etc. You end up racking up a bunch of extra implicits. So, why not just make an implicit that does this all automatically?

    import scala.reflect.ClassTag
    implicit def kryoEncoder[A](implicit ct: ClassTag[A]) = 
      org.apache.spark.sql.Encoders.kryo[A](ct)
    

    And now, it seems like I can do almost anything I want (the example below won't work in the spark-shell where spark.implicits._ is automatically imported)

    class MyObj(val i: Int)
    val d1 = spark.createDataset(Seq(new MyObj(1),new MyObj(2),new MyObj(3)))
    val d2 = d1.map(d => (d.i+1,d)).alias("d2") // mapping works fine and ..
    val d3 = d1.map(d => (d.i,  d)).alias("d3") // .. deals with the new type
    val d4 = d2.joinWith(d3, $"d2._1" === $"d3._1") // Boom!
    

    Or almost. The problem is that using kryo leads to Spark just storing every row in the dataset as a flat binary object. For map, filter, foreach that is enough, but for operations like join, Spark really needs these to be separated into columns. Inspecting the schema for d2 or d3, you see there is just one binary column:

    d2.printSchema
    // root
    //  |-- value: binary (nullable = true)
    

    Partial solution for tuples

    So, using the magic of implicits in Scala (more in 6.26.3 Overloading Resolution), I can make myself a series of implicits that will do as good a job as possible, at least for tuples, and will work well with existing implicits:

    import org.apache.spark.sql.{Encoder,Encoders}
    import scala.reflect.ClassTag
    import spark.implicits._  // we can still take advantage of all the old implicits
    implicit def single[A](implicit c: ClassTag[A]): Encoder[A] = Encoders.kryo[A](c)
    implicit def tuple2[A1, A2](
      implicit e1: Encoder[A1],
               e2: Encoder[A2]
    ): Encoder[(A1,A2)] = Encoders.tuple[A1,A2](e1, e2)
    implicit def tuple3[A1, A2, A3](
      implicit e1: Encoder[A1],
               e2: Encoder[A2],
               e3: Encoder[A3]
    ): Encoder[(A1,A2,A3)] = Encoders.tuple[A1,A2,A3](e1, e2, e3)
    // ... you can keep making these
    

    Then, armed with these implicits, I can make my example above work, albeit with some column renaming

    class MyObj(val i: Int)
    val d1 = spark.createDataset(Seq(new MyObj(1),new MyObj(2),new MyObj(3)))
    val d2 = d1.map(d => (d.i+1,d)).toDF("_1","_2").as[(Int,MyObj)].alias("d2")
    val d3 = d1.map(d => (d.i  ,d)).toDF("_1","_2").as[(Int,MyObj)].alias("d3")
    val d4 = d2.joinWith(d3, $"d2._1" === $"d3._1")
    

    I haven't yet figured out how to get the expected tuple names (_1, _2, ...) by default without renaming them - if someone else wants to play around with this, this is where the name "value" gets introduced and this is where the tuple names are usually added. However, the key point is that I now have a nice structured schema:

    d4.printSchema
    // root
    //  |-- _1: struct (nullable = false)
    //  |    |-- _1: integer (nullable = true)
    //  |    |-- _2: binary (nullable = true)
    //  |-- _2: struct (nullable = false)
    //  |    |-- _1: integer (nullable = true)
    //  |    |-- _2: binary (nullable = true)
    

    So, in summary, this workaround:

  • allows us to get separate columns for tuples (so we can join on tuples again, yay!)
  • we can again just rely on the implicits (so no need to be passing in kryo all over the place)
  • is almost entirely backwards compatible with import spark.implicits._ (with some renaming involved)
  • does not let us join on the kryo serialized binary columns, let alone on fields those may have
  • has the unpleasant side-effect of renaming some of the tuple columns to "value" (if necessary, this can be undone by converting .toDF, specifying new column names, and converting back to a dataset - and the schema names seem to be preserved through joins, where they are most needed).
    Partial solution for classes in general

    This one is less pleasant and has no good solution. However, now that we have the tuple solution above, I have a hunch the implicit conversion solution from another answer will be a bit less painful too since you can convert your more complex classes to tuples. Then, after creating the dataset, you'd probably rename the columns using the dataframe approach. If all goes well, this is really an improvement since I can now perform joins on the fields of my classes. If I had just used one flat binary kryo serializer that wouldn't have been possible.

    Here is an example that does a bit of everything: I have a class MyObj which has fields of types Int, java.util.UUID, and Set[String]. The first takes care of itself. The second, although I could serialize using kryo would be more useful if stored as a String (since UUIDs are usually something I'll want to join against). The third really just belongs in a binary column.

    class MyObj(val i: Int, val u: java.util.UUID, val s: Set[String])
    // alias for the type to convert to and from
    type MyObjEncoded = (Int, String, Set[String])
    // implicit conversions
    implicit def toEncoded(o: MyObj): MyObjEncoded = (o.i, o.u.toString, o.s)
    implicit def fromEncoded(e: MyObjEncoded): MyObj =
      new MyObj(e._1, java.util.UUID.fromString(e._2), e._3)
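As a quick sanity check of the conversions above, here is a minimal Spark-free sketch of the round trip. The wrapper object EncodedRoundTrip and the fixed UUID are assumptions added for illustration; the class, type alias, and conversions mirror the ones in this answer.

```scala
import java.util.UUID
import scala.language.implicitConversions

object EncodedRoundTrip {
  class MyObj(val i: Int, val u: UUID, val s: Set[String])
  type MyObjEncoded = (Int, String, Set[String])

  implicit def toEncoded(o: MyObj): MyObjEncoded = (o.i, o.u.toString, o.s)
  implicit def fromEncoded(e: MyObjEncoded): MyObj =
    new MyObj(e._1, UUID.fromString(e._2), e._3)

  val original = new MyObj(1, UUID.fromString("123e4567-e89b-12d3-a456-426614174000"), Set("foo"))
  val encoded: MyObjEncoded = original // applies toEncoded implicitly
  val decoded: MyObj = encoded         // applies fromEncoded implicitly
}
```

The tuple form is what Spark would actually store; the UUID survives as a String, which is what makes it usable as a join key.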
    

    Now, I can create a dataset with a nice schema using this machinery:

    val d = spark.createDataset(Seq[MyObjEncoded](
      new MyObj(1, java.util.UUID.randomUUID, Set("foo")),
      new MyObj(2, java.util.UUID.randomUUID, Set("bar"))
    )).toDF("i","u","s").as[MyObjEncoded]
    

    And the schema shows me columns with the right names, and the first two are both things I can join against.

    d.printSchema
    // root
    //  |-- i: integer (nullable = false)
    //  |-- u: string (nullable = true)
    //  |-- s: binary (nullable = true)
                    @AlexeyS I don't think so. But why would you want that? Why can you not get away with the last solution I propose? If you can put your data in JSON, you should be able to extract the fields and put them in a case class...
    – Alec
                    Jan 20, 2017 at 2:13
                    @combinatorist My understanding is that Datasets and Dataframes (but not RDDs, since they don't need encoders!) are equivalent from a performance perspective. Don't under-estimate the type-safety of Datasets! Just because Spark internally uses a ton of reflection, casts, etc. does not mean you shouldn't care about the type-safety of the interface that is exposed. But it does make me feel better about creating my own Dataset-based type-safe functions that use Dataframes under the hood.
    – Alec
                    Oct 30, 2017 at 22:53
                    I have a solution that works as a charm. It consists in:  1. Define a sparkSql UDT over the custom class 2. register it 3. Enclose your class in a Product (Tuple1 for example)
    – tmnd91
                    Feb 2, 2018 at 12:25
    
  • Using generic encoders.

    There are two generic encoders available for now, kryo and javaSerialization, where the latter is explicitly described as:

    extremely inefficient and should only be used as the last resort.

    Assuming following class

    class Bar(i: Int) {
      override def toString = s"bar $i"
      def bar = i
    }
    

    you can use these encoders by adding an implicit encoder:

    object BarEncoders {
      implicit def barEncoder: org.apache.spark.sql.Encoder[Bar] =
        org.apache.spark.sql.Encoders.kryo[Bar]
    }
    

    which can be used together as follows:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object Main {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("local", "test", new SparkConf())
        val sqlContext = new SQLContext(sc)
        import sqlContext.implicits._
        import BarEncoders._
        val ds = Seq(new Bar(1)).toDS
        ds.show
        sc.stop()
      }
    }
    

    It stores objects as a binary column, so when converted to a DataFrame you get the following schema:

    |-- value: binary (nullable = true)

    It is also possible to encode tuples using a kryo encoder for a specific field:

    val longBarEncoder = Encoders.tuple(Encoders.scalaLong, Encoders.kryo[Bar])
    spark.createDataset(Seq((1L, new Bar(1))))(longBarEncoder)
    // org.apache.spark.sql.Dataset[(Long, Bar)] = [_1: bigint, _2: binary]
    

    Please note that we don't depend on implicit encoders here but pass the encoder explicitly, so this most likely won't work with the toDS method.

  • Using implicit conversions:

    Provide implicit conversions between a representation that can be encoded and the custom class, for example:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.SQLContext

    object BarConversions {
      implicit def toInt(bar: Bar): Int = bar.bar
      implicit def toBar(i: Int): Bar = new Bar(i)
    }

    object Main {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("local", "test", new SparkConf())
        val sqlContext = new SQLContext(sc)
        import sqlContext.implicits._
        import BarConversions._
        type EncodedBar = Int
        val bars: RDD[EncodedBar] = sc.parallelize(Seq(new Bar(1)))
        val barsDS = bars.toDS
        barsDS.show
        barsDS.map(_.bar).show
        sc.stop()
      }
    }
    

    Related questions:

  • How to create encoder for Option type constructor, e.g. Option[Int]?
                    Solution 1 does not seem to work for typed collections (at least Set) I get Exception in thread "main" java.lang.UnsupportedOperationException: No Encoder found for Set[Bar].
    – Victor P.
                    Jul 4, 2016 at 13:20
                    @VictorP. It is expected I am afraid In case like this you'll need an encoder for specific type (kryo[Set[Bar]]). The same way if class contains a field Bar you need encoder for a whole object. These are very crude methods.
    – zero323
                    Jul 4, 2016 at 13:48
                    @zero323 I'm facing the same issue. Can you put a code example of how to encode the whole project? Many Thanks!
    – Rock
                    Aug 10, 2016 at 4:19
                    @zero323 per your comment, "if class contains a field Bar you need encoder for a whole object". my question was how to encode this "whole project"?
    – Rock
                    Aug 11, 2016 at 16:26

    You can use UDTRegistration and then Case Classes, Tuples, etc... all work correctly with your User Defined Type!

    Say you want to use a custom Enum:

    trait CustomEnum { def value:String }
    case object Foo extends CustomEnum  { val value = "F" }
    case object Bar extends CustomEnum  { val value = "B" }
    object CustomEnum {
      def fromString(str:String) = Seq(Foo, Bar).find(_.value == str).get
    }
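Since the UDT in this answer stores each enum as its value string and rebuilds it with fromString, it may be worth noting that .get throws on an unknown string. A minimal Spark-free sketch of a lookup that surfaces this explicitly; fromStringOpt and the wrapper object EnumLookup are hypothetical names, not part of the answer:

```scala
object EnumLookup {
  trait CustomEnum { def value: String }
  case object Foo extends CustomEnum { val value = "F" }
  case object Bar extends CustomEnum { val value = "B" }

  private val all: Seq[CustomEnum] = Seq(Foo, Bar)

  // Returns None instead of throwing NoSuchElementException for unknown codes.
  def fromStringOpt(str: String): Option[CustomEnum] = all.find(_.value == str)
}
```

Whether to fail fast or return an Option inside a deserializer is a design choice; the sketch only makes the failure mode visible.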
    

    Register it like this:

    import org.apache.spark.sql.types.{DataType, UDTRegistration, UserDefinedType}

    // First define a UDT class for it:
    class CustomEnumUDT extends UserDefinedType[CustomEnum] {
      override def sqlType: DataType = org.apache.spark.sql.types.StringType
      override def serialize(obj: CustomEnum): Any = org.apache.spark.unsafe.types.UTF8String.fromString(obj.value)
      // Note that this will be a UTF8String type
      override def deserialize(datum: Any): CustomEnum = CustomEnum.fromString(datum.toString)
      override def userClass: Class[CustomEnum] = classOf[CustomEnum]
    }
    // Then Register the UDT Class!
    // NOTE: you have to put this file into the org.apache.spark package!
    UDTRegistration.register(classOf[CustomEnum].getName, classOf[CustomEnumUDT].getName)
    

    Then USE IT!

    case class UsingCustomEnum(id:Int, en:CustomEnum)
    val seq = Seq(
      UsingCustomEnum(1, Foo),
      UsingCustomEnum(2, Bar),
      UsingCustomEnum(3, Foo)
    ).toDS()
    seq.filter(_.en == Foo).show()
    seq.collect().foreach(println) // print each element rather than the array reference
    

    Say you want to use a Polymorphic Record:

    trait CustomPoly
    case class FooPoly(id:Int) extends CustomPoly
    case class BarPoly(value:String, secondValue:Long) extends CustomPoly
    

    ... and then use it like this:

    case class UsingPoly(id:Int, poly:CustomPoly)
    val polySeq = Seq(
      UsingPoly(1, new FooPoly(1)),
      UsingPoly(2, new BarPoly("Blah", 123)),
      UsingPoly(3, new FooPoly(1))
    ).toDS
    polySeq.filter(_.poly match {
      case FooPoly(value) => value == 1
      case _ => false
    }).show()
    

    You can write a custom UDT that encodes everything to bytes (I'm using java serialization here but it's probably better to instrument Spark's Kryo context).

    First define the UDT class:

    import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
    import org.apache.spark.sql.types.{DataType, UserDefinedType}

    class CustomPolyUDT extends UserDefinedType[CustomPoly] {
      override def sqlType: DataType = org.apache.spark.sql.types.BinaryType
      // Java serialization: write the object into an in-memory byte array
      override def serialize(obj: CustomPoly): Any = {
        val bos = new ByteArrayOutputStream()
        val oos = new ObjectOutputStream(bos)
        oos.writeObject(obj)
        oos.close()
        bos.toByteArray
      }
      // Read the object back from the stored bytes
      override def deserialize(datum: Any): CustomPoly = {
        val bis = new ByteArrayInputStream(datum.asInstanceOf[Array[Byte]])
        val ois = new ObjectInputStream(bis)
        val obj = ois.readObject()
        obj.asInstanceOf[CustomPoly]
      }
      override def userClass: Class[CustomPoly] = classOf[CustomPoly]
    }
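The serialize and deserialize methods above are plain Java serialization, so the byte round trip can be checked outside Spark. A minimal sketch, assuming nothing beyond the JDK; the helper names toBytes and fromBytes and the wrapper object PolyBytes are made up for illustration:

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

object PolyBytes {
  trait CustomPoly
  case class FooPoly(id: Int) extends CustomPoly
  case class BarPoly(value: String, secondValue: Long) extends CustomPoly

  // Serialize with Java serialization; case classes are Serializable by default.
  def toBytes(obj: CustomPoly): Array[Byte] = {
    val bos = new ByteArrayOutputStream()
    val oos = new ObjectOutputStream(bos)
    try oos.writeObject(obj) finally oos.close()
    bos.toByteArray
  }

  // Deserialize and cast back to the common trait.
  def fromBytes(bytes: Array[Byte]): CustomPoly = {
    val ois = new ObjectInputStream(new ByteArrayInputStream(bytes))
    try ois.readObject().asInstanceOf[CustomPoly] finally ois.close()
  }
}
```

Because the concrete subclass is recorded in the byte stream, pattern matching on FooPoly or BarPoly still works after deserialization, which is what the filter in this answer relies on.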
    

    Then register it:

    // NOTE: The file you do this in has to be inside of the org.apache.spark package!
    UDTRegistration.register(classOf[CustomPoly].getName, classOf[CustomPolyUDT].getName)
    

    Then you can use it!

    // As shown above:
    case class UsingPoly(id:Int, poly:CustomPoly)
    val polySeq = Seq(
      UsingPoly(1, new FooPoly(1)),
      UsingPoly(2, new BarPoly("Blah", 123)),
      UsingPoly(3, new FooPoly(1))
    ).toDS
    polySeq.filter(_.poly match {
      case FooPoly(value) => value == 1
      case _ => false
    }).show()
                    I am trying to define a UDT in my project and I am getting this error "Symbol UserDefinedType is inaccessible from this place". Any help ?
    – Rijo Joseph
                    Sep 13, 2018 at 9:33
                    Hi @RijoJoseph. You need make a package org.apache.spark in your project and put your UDT code in that.
    – Choppy The Lumberjack
                    Sep 17, 2018 at 16:48
                    I tried this method putting my code in a package in org.apache.spark.  And called the registration.  But still get errors about  no encoder for my enum... ?
    – DCameronMauch
                    Oct 10, 2020 at 2:41
    

    Encoders work more or less the same in Spark 2.0, and Kryo is still the recommended serialization choice.

    You can look at the following example with spark-shell

    scala> import spark.implicits._
    import spark.implicits._
    scala> import org.apache.spark.sql.Encoders
    import org.apache.spark.sql.Encoders
    scala> case class NormalPerson(name: String, age: Int) {
     |   def aboutMe = s"I am ${name}. I am ${age} years old."
     | }
    defined class NormalPerson
    scala> case class ReversePerson(name: Int, age: String) {
     |   def aboutMe = s"I am ${name}. I am ${age} years old."
     | }
    defined class ReversePerson
    scala> val normalPersons = Seq(
     |   NormalPerson("Superman", 25),
     |   NormalPerson("Spiderman", 17),
     |   NormalPerson("Ironman", 29)
     | )
    normalPersons: Seq[NormalPerson] = List(NormalPerson(Superman,25), NormalPerson(Spiderman,17), NormalPerson(Ironman,29))
    scala> val ds1 = sc.parallelize(normalPersons).toDS
    ds1: org.apache.spark.sql.Dataset[NormalPerson] = [name: string, age: int]
    scala> val ds2 = ds1.map(np => ReversePerson(np.age, np.name))
    ds2: org.apache.spark.sql.Dataset[ReversePerson] = [name: int, age: string]
    scala> ds1.show()
    +---------+---+
    |     name|age|
    +---------+---+
    | Superman| 25|
    |Spiderman| 17|
    |  Ironman| 29|
    +---------+---+
    scala> ds2.show()
    +----+---------+
    |name|      age|
    +----+---------+
    |  25| Superman|
    |  17|Spiderman|
    |  29|  Ironman|
    +----+---------+
    scala> ds1.foreach(p => println(p.aboutMe))
    I am Ironman. I am 29 years old.
    I am Superman. I am 25 years old.
    I am Spiderman. I am 17 years old.
    scala> val ds2 = ds1.map(np => ReversePerson(np.age, np.name))
    ds2: org.apache.spark.sql.Dataset[ReversePerson] = [name: int, age: string]
    scala> ds2.foreach(p => println(p.aboutMe))
    I am 17. I am Spiderman years old.
    I am 25. I am Superman years old.
    I am 29. I am Ironman years old.
    

    Until now, there were no such encoders in the present scope, so our persons were not encoded as binary values. But that will change once we provide some implicit encoders using Kryo serialization.

    // Provide Encoders
    scala> implicit val normalPersonKryoEncoder = Encoders.kryo[NormalPerson]
    normalPersonKryoEncoder: org.apache.spark.sql.Encoder[NormalPerson] = class[value[0]: binary]
    scala> implicit val reversePersonKryoEncoder = Encoders.kryo[ReversePerson]
    reversePersonKryoEncoder: org.apache.spark.sql.Encoder[ReversePerson] = class[value[0]: binary]
    // Encoders will be used since they are now present in scope
    scala> val ds3 = sc.parallelize(normalPersons).toDS
    ds3: org.apache.spark.sql.Dataset[NormalPerson] = [value: binary]
    scala> val ds4 = ds3.map(np => ReversePerson(np.age, np.name))
    ds4: org.apache.spark.sql.Dataset[ReversePerson] = [value: binary]
    // now all our persons show up as binary values
    scala> ds3.show()
    +--------------------+
    |               value|
    +--------------------+
    |[01 00 24 6C 69 6...|
    |[01 00 24 6C 69 6...|
    |[01 00 24 6C 69 6...|
    +--------------------+
    scala> ds4.show()
    +--------------------+
    |               value|
    +--------------------+
    |[01 00 24 6C 69 6...|
    |[01 00 24 6C 69 6...|
    |[01 00 24 6C 69 6...|
    +--------------------+
    // Our instances still work as expected    
    scala> ds3.foreach(p => println(p.aboutMe))
    I am Ironman. I am 29 years old.
    I am Spiderman. I am 17 years old.
    I am Superman. I am 25 years old.
    scala> ds4.foreach(p => println(p.aboutMe))
    I am 25. I am Superman years old.
    I am 29. I am Ironman years old.
    I am 17. I am Spiderman years old.
                    How to convert back to normal non-binary value after using the encoders when we do a .show?
    – jack
                    Mar 8, 2021 at 0:04
    
    import spark.sqlContext.implicits._
    import org.apache.spark.sql.Encoders
    implicit val encoder = Encoders.bean[MyClass](classOf[MyClass])
    

    Now you can simply read the DataFrame as a typed Dataset:

    dataFrame.as[MyClass]
    

    This will create a custom class encoder and not a binary one.

    My examples will be in Java, but I don't imagine it to be difficult adapting to Scala.

    I have been quite successful converting RDD<Fruit> to Dataset<Fruit> using spark.createDataset and Encoders.bean as long as Fruit is a simple Java Bean.

    Step 1: Create the simple Java Bean.

    public class Fruit implements Serializable {
        private String name  = "default-fruit";
        private String color = "default-color";
        // AllArgsConstructor
        public Fruit(String name, String color) {
            this.name  = name;
            this.color = color;
        }
        // NoArgsConstructor
        public Fruit() {
            this("default-fruit", "default-color");
        }
        // ...create getters and setters for above fields
        // you figure it out
    }
    

    I'd stick to classes with primitive types and String as fields until the Databricks folks beef up their Encoders. If you have a class with a nested object, create another simple Java Bean with all of its fields flattened, so you can use RDD transformations to map the complex type to the simpler one. Sure, it's a little extra work, but I imagine it'll help a lot with performance when working with a flat schema.

    Step 2: Get your Dataset from the RDD

    SparkSession spark = SparkSession.builder().getOrCreate();
    // Reuse the session's context rather than constructing a second one
    JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
    List<Fruit> fruitList = ImmutableList.of(
        new Fruit("apple", "red"),
        new Fruit("orange", "orange"),
        new Fruit("grape", "purple"));
    JavaRDD<Fruit> fruitJavaRDD = jsc.parallelize(fruitList);
    RDD<Fruit> fruitRDD = fruitJavaRDD.rdd();
    Encoder<Fruit> fruitBean = Encoders.bean(Fruit.class);
    Dataset<Fruit> fruitDataset = spark.createDataset(fruitRDD, fruitBean);
    

    And voila! Lather, rinse, repeat.

                    I'd suggest pointing out that for simple structures you'd be better served by storing them in native Spark types, rather than serializing them to a blob. They work better across the Python gateway, more transparent in Parquet, and can even be cast to structures of the same shape.
    – metasim
                    Aug 22, 2018 at 18:21
  • I was reading Set-typed data from SQLContext, so the original data format is a DataFrame.

    val sample = spark.sqlContext.sql("select 1 as a, collect_set(1) as b limit 1")
    sample.show()

    +---+---+
    |  a|  b|
    +---+---+
    |  1|[1]|
    +---+---+

  • Then convert it into an RDD using rdd.map() with the mutable.WrappedArray type.

    import scala.collection.mutable
    sample
      .rdd.map(r => (r.getInt(0), r.getAs[mutable.WrappedArray[Int]](1).toSet))
      .collect()
      .foreach(println)

    Result:

    (1,Set(1))

  • In addition to the suggestions already given, another option I recently discovered is that you can declare your custom class including the trait org.apache.spark.sql.catalyst.DefinedByConstructorParams.

    This works if the class has a constructor that uses types the ExpressionEncoder can understand, i.e. primitive values and standard collections. It can come in handy when you're not able to declare the class as a case class, but don't want to use Kryo to encode it every time it's included in a Dataset.

    For example, I wanted to declare a case class that included a Breeze vector. The only encoder that would be able to handle that would normally be Kryo. But if I declared a subclass that extended the Breeze DenseVector and DefinedByConstructorParams, the ExpressionEncoder understood that it could be serialized as an array of Doubles.

    Here's how I declared it:

    class SerializableDenseVector(values: Array[Double])
      extends breeze.linalg.DenseVector[Double](values) with DefinedByConstructorParams
    implicit def BreezeVectorToSerializable(bv: breeze.linalg.DenseVector[Double]): SerializableDenseVector =
      new SerializableDenseVector(bv.toArray) // copy; casting a plain DenseVector would fail at runtime
    

    Now I can use SerializableDenseVector in a Dataset (directly, or as part of a Product) using a simple ExpressionEncoder and no Kryo. It works just like a Breeze DenseVector but serializes as an Array[Double].

    @Alec's answer is great! Just to add a comment in this part of his/her answer:

    import spark.implicits._
    case class Wrap[T](unwrap: T)
    class MyObj(val i: Int)
    // ...
    val d = spark.createDataset(Seq(Wrap(new MyObj(1)),Wrap(new MyObj(2)),Wrap(new MyObj(3))))
    

    @Alec mentions:

    there is no way of passing in custom encoders for nested types (I have no way of feeding Spark an encoder for just MyObj such that it then knows how to encode Wrap[MyObj] or (Int,MyObj)).

    It seems so, because if I add an encoder for MyObj:

    implicit val myEncoder = org.apache.spark.sql.Encoders.kryo[MyObj]
    

    , it still fails:

    java.lang.UnsupportedOperationException: No Encoder found for MyObj
    - field (class: "MyObj", name: "unwrap")
    - root class: "Wrap"
      at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1.apply(ScalaReflection.scala:643)
    

    But notice the important error message:

    root class: "Wrap"

    It actually gives a hint that encoding MyObj isn't enough, and you have to encode the entire chain including Wrap[T].

    So if I do this, it solves the problem:

    implicit val myWrapperEncoder = org.apache.spark.sql.Encoders.kryo[Wrap[MyObj]]
    

    Hence, @Alec's comment is not entirely accurate:

    I have no way of feeding Spark an encoder for just MyObj such that it then knows how to encode Wrap[MyObj] or (Int,MyObj)

    We can still get Spark to encode Wrap[MyObj] or (Int,MyObj), as long as the encoder we supply covers the entire outer type rather than just MyObj.
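For the tuple case specifically, the Encoders.tuple approach shown in an earlier answer composes a per-field encoder with a kryo encoder for MyObj. A minimal sketch, assuming spark-sql is on the classpath; the object and value names are made up for illustration. Only the MyObj field should end up as a binary column, while the Int keeps a column of its own:

```scala
import org.apache.spark.sql.{Encoder, Encoders}

object TupleEncoderSketch {
  class MyObj(val i: Int)

  // Kryo handles MyObj alone; the Int is encoded as a regular integer column.
  val intMyObjEncoder: Encoder[(Int, MyObj)] =
    Encoders.tuple(Encoders.scalaInt, Encoders.kryo[MyObj])
}
```

Given a SparkSession named spark, the encoder would be passed explicitly as the second argument list of spark.createDataset, in the same way as the longBarEncoder example earlier in this thread.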
