Creating DataFrame from RDD in Spark:
RDD and DataFrame are both heavily used APIs in the Spark framework, and converting an RDD to a DataFrame is a task almost every Spark programmer runs into. This post walks through the most suitable ways to achieve it.
There are two commonly used techniques:
- Inferring the Schema Using Reflection
- Programmatically specifying the schema
Inferring the Schema Using Reflection:
//Creating RDD
val rdd = sc.parallelize(Seq(1,2,3,4))
import spark.implicits._
//Creating Dataframe from RDD
val dataFrame = rdd.toDF()
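For a single-column RDD like the one above, toDF() names the column "value" by default. You can pass explicit column names instead; a minimal sketch, reusing the same rdd:

```scala
// Without arguments, the single column is named "value"
val df1 = rdd.toDF()

// Pass explicit column names to toDF (one per column)
val df2 = rdd.toDF("number")
df2.printSchema()
df2.show()
```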
//Creating schema via a case class
case class Person(name: String, age: Int)
//Creating DataFrame using reflection
val people = sc.textFile("SaleData.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
  .toDF()
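Once the reflected DataFrame exists, it can be registered as a temporary view and queried with SQL. A short sketch, assuming SaleData.txt contains comma-separated "name,age" lines:

```scala
// Register the DataFrame as a SQL temporary view and query it.
// Assumes each input line looks like "Alice,29".
people.createOrReplaceTempView("people")
val adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()
```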
Programmatically specifying the schema:
//Imports needed for programmatic schemas
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}
import org.apache.spark.rdd.RDD
//Creating schema function
def dfSchema(columnNames: Seq[String]): StructType =
  StructType(
    Seq(
      StructField(name = columnNames(0), dataType = StringType, nullable = false),
      StructField(name = columnNames(1), dataType = StringType, nullable = false),
      StructField(name = columnNames(2), dataType = IntegerType, nullable = false)
    )
  )
//Calling Schema function
val schema = dfSchema(Seq("name", "gender","age"))
//Creating RDD
val rdd: RDD[String] = ...
//Creating function to map Row data
def row(line: List[String]): Row = Row(line(0), line(1),line(2).toInt)
//Mapping Row Data
val data = rdd.map(_.split(",").toList).map(row)
//Creating DataFrame
val dataFrame = spark.createDataFrame(data, schema)
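After createDataFrame, it is worth confirming that the DataFrame carries the schema you built rather than an inferred one. A quick check, assuming the dataFrame from the step above:

```scala
// Inspect the programmatically specified schema and a sample of rows
dataFrame.printSchema()
dataFrame.show(5)
```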