Tuesday, December 11, 2018

Create DataFrame from RDD


Creating DataFrame from RDD in Spark:

RDD and DataFrame both are highly used APIs in Spark framework. Converting RDD to DataFrame is a very common technique every programmer has to do in their programming. I would like to take you through the most suitable way to achieve this.

There are 2 most commonly used techniques.
- Inferring the Schema Using Reflection
- Programmatically specifying the schema

Inferring the Schema Using Reflection:

//Creating RDD
val rdd = sc.parallelize(Seq(1,2,3,4))
import spark.implicits._
//Creating Dataframe from RDD
val dataFrame = rdd.toDF()

//Creating Schema
case class Person(name: String, age: Int)
//Creating DataFrame using Refelction
val people = sc.textFile("SaleData.txt").map(_.split(",")).map(p => Person(p(0), p(1).toInt)).toDF()


Programmatically specifying the schema:

//Creating Schema function
def dfSchema(columnNames: List[String]): StructType =
  StructType(
    Seq(
      StructField(name = "name", dataType = StringType, nullable = false),
      StructField(gender= "gender", dataType = StringType, nullable = false),
      StructField(age= "age", dataType = IntegerType, nullable = false)
    )
  )

//Calling Schema function
val schema = dfSchema(Seq("name", "gender","age"))

//Creating RDD
val rdd: RDD[String] = ...

//Creating function to map Row data
def row(line: List[String]): Row = Row(line(0), line(1),line(2).toInt)

//Mapping Row Data
val data = rdd.map(_.split(",").to[List]).map(row)

//Creating DataFrame
val dataFrame = spark.createDataFrame(data, schema)


No comments:

Post a Comment