
Monday, December 17, 2018

Spark Interview Questions


Spark is a framework heavily used with Hadoop (Hadoop 2.0/YARN) to run analytical, streaming, and machine learning workloads efficiently. I would like to take you through some of the questions that are frequently asked in interviews.

Spark:

1. What is Spark?
2. Explain the high-level architecture of Spark.
3. What are the Driver and the Executor, and what is the difference between them?
4. What is a DAG?
5. How do you trace a failed job through the DAG?
6. What is persistence? Name the different levels of persistence.
7. Why do we use repartition and coalesce?
8. What are RDD, DataFrame, and Dataset, and what are the differences between them?
9. How do you see the partitions after loading an input file in Spark?
10. How do we load/store a Hive table in Spark?
11. How do you read JSON and CSV files in Spark?
12. What is Spark Streaming?
13. Name some properties you have set in your project.
14. How can a Hive UDF be used in a Spark session?
15. Explain some troubleshooting you have done in Spark.
16. What are Stage, Job, and Task in Spark?
17. What is the difference between groupByKey and reduceByKey? (See the sketch after this list.)
18. What is executor memory and how did you set it in your project?
19. Name some Spark functions you have used in your project.
20. What is a Spark UDF? Write its signature. (See the sketch after this list.)
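
A couple of these lend themselves to short code answers. Here is a minimal sketch (using a small hypothetical pair RDD) of the groupByKey vs reduceByKey contrast from question 17, followed by an example UDF signature for question 20:

//Hypothetical pair RDD for illustration
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
//groupByKey shuffles every value across the network before aggregating
val grouped = pairs.groupByKey().mapValues(_.sum)
//reduceByKey combines values within each partition first (map-side combine),
//so it shuffles far less data for the same result
val reduced = pairs.reduceByKey(_ + _)

//A Spark UDF: udf wraps a Scala function and returns a UserDefinedFunction
import org.apache.spark.sql.functions.udf
val toUpper = udf((s: String) => s.toUpperCase)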

Tuesday, December 11, 2018

Create DataFrame from RDD


Creating DataFrame from RDD in Spark:

RDD and DataFrame are both heavily used APIs in the Spark framework, and converting an RDD to a DataFrame is something every Spark programmer ends up doing. I would like to take you through the most suitable ways to achieve this.

There are two commonly used techniques:
- Inferring the schema using reflection
- Programmatically specifying the schema

Inferring the Schema Using Reflection:

//Creating an RDD (sc and spark are provided by the spark-shell session)
val rdd = sc.parallelize(Seq(1, 2, 3, 4))
//toDF() is brought into scope by the implicits import
import spark.implicits._
//Creating a DataFrame from the RDD
val dataFrame = rdd.toDF()
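
By default toDF() names the single column "value"; you can also pass explicit column names. A quick illustration (the name "id" is just an example):

//Creating the DataFrame with an explicit column name
val named = rdd.toDF("id")
named.printSchema()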

//Creating the schema via a case class
case class Person(name: String, age: Int)
//Creating a DataFrame using reflection (assumes comma-separated name,age records)
val people = sc.textFile("SaleData.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
  .toDF()
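
A DataFrame built this way can be queried directly with SQL. A quick usage sketch, assuming the columns parsed above:

//Registering the DataFrame as a temporary view and querying it
people.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 20").show()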


Programmatically specifying the schema:

//Imports needed for the programmatic approach
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}
import org.apache.spark.sql.Row
import org.apache.spark.rdd.RDD

//Creating a schema function (column names are taken from the argument)
def dfSchema(columnNames: List[String]): StructType =
  StructType(
    Seq(
      StructField(name = columnNames(0), dataType = StringType, nullable = false),
      StructField(name = columnNames(1), dataType = StringType, nullable = false),
      StructField(name = columnNames(2), dataType = IntegerType, nullable = false)
    )
  )

//Calling the schema function
val schema = dfSchema(List("name", "gender", "age"))

//Creating the source RDD of raw lines (placeholder; e.g. the result of sc.textFile)
val rdd: RDD[String] = ...

//Creating a function to map one parsed line to a Row matching the schema
def row(line: List[String]): Row = Row(line(0), line(1), line(2).trim.toInt)

//Mapping the raw lines to Rows
val data = rdd.map(_.split(",").toList).map(row)

//Creating DataFrame
val dataFrame = spark.createDataFrame(data, schema)
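
Finally, it is worth sanity-checking the result:

//Verifying the schema and contents of the new DataFrame
dataFrame.printSchema()
dataFrame.show()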