Showing posts with label Hadoop. Show all posts

Monday, December 17, 2018

Spark Interview Questions

Spark Interview Questions:

Spark is a framework heavily used on Hadoop (Hadoop 2.0/YARN) to run analytical, streaming, and machine-learning workloads very efficiently. I would like to take you through some of the questions that are frequently asked in interviews.

Spark:

1. What is Spark?
2. Explain the high-level architecture of Spark.
3. What are the Driver and Executors, and what is the difference between them?
4. What is a DAG?
5. How do you trace a failed job through the DAG?
6. What is persistence? Name the different levels of persistence.
7. Why do we use repartition and coalesce?
8. What are RDD, DataFrame, and Dataset, and what are the differences between them?
9. How do you see the partitions after loading an input file in Spark?
10. How do you load/store a Hive table in Spark?
11. How do you read JSON and CSV files in Spark?
12. What is Spark Streaming?
13. Name some properties which you have set in your project.
14. How can a Hive UDF be used in a Spark session?
15. Explain some troubleshooting you have done in Spark.
16. What are stages, jobs, and tasks in Spark?
17. What is the difference between groupByKey and reduceByKey?
18. What is executor memory, and how did you set it in your project?
19. Name some Spark functions which have been used in your project.
20. What is a Spark UDF? Write the signature of one.
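Question 17 is a favourite, so a sketch helps: reduceByKey pre-aggregates values per key on the map side before the shuffle, while groupByKey ships every raw value across the network first. The two aggregation semantics can be mimicked with plain Scala collections, no cluster needed; the sample pairs below are made up for illustration:

```scala
// Sample (key, value) pairs, made up for illustration
val pairs = Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4))

// groupByKey-style: gather every value per key first, then sum afterwards
val viaGrouping: Map[String, Int] =
  pairs.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }

// reduceByKey-style: fold each value into a running per-key total,
// which is what lets Spark combine on the map side before shuffling
val viaReduction: Map[String, Int] =
  pairs.foldLeft(Map.empty[String, Int].withDefaultValue(0)) {
    case (acc, (k, v)) => acc.updated(k, acc(k) + v)
  }
```

Both give the same result; on a real RDD, reduceByKey simply gets there with far less shuffle traffic.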

Tuesday, December 11, 2018

Create DataFrame from RDD


Creating DataFrame from RDD in Spark:

RDD and DataFrame are both heavily used APIs in the Spark framework, and converting an RDD to a DataFrame is something almost every Spark programmer has to do at some point. I would like to take you through the most suitable ways to achieve this.

There are two commonly used techniques:
- Inferring the Schema Using Reflection
- Programmatically specifying the schema

Inferring the Schema Using Reflection:

//Implicit conversions such as rdd.toDF() need this import
import spark.implicits._

//Creating an RDD
val rdd = sc.parallelize(Seq(1, 2, 3, 4))
//Creating a DataFrame from the RDD
val dataFrame = rdd.toDF()

//Defining the schema as a case class
case class Person(name: String, age: Int)
//Creating a DataFrame using reflection on the case class
val people = sc.textFile("SaleData.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
  .toDF()


Programmatically specifying the schema:

//Imports needed for Row, RDD and the schema types
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

//Creating a schema function (first two columns are strings, third is an integer)
def dfSchema(columnNames: List[String]): StructType =
  StructType(
    Seq(
      StructField(name = columnNames(0), dataType = StringType, nullable = false),
      StructField(name = columnNames(1), dataType = StringType, nullable = false),
      StructField(name = columnNames(2), dataType = IntegerType, nullable = false)
    )
  )

//Calling the schema function
val schema = dfSchema(List("name", "gender", "age"))

//Creating the RDD
val rdd: RDD[String] = ...

//Creating a function to map each line to a Row
def row(line: List[String]): Row = Row(line(0), line(1), line(2).toInt)

//Mapping the row data
val data = rdd.map(_.split(",").toList).map(row)

//Creating the DataFrame
val dataFrame = spark.createDataFrame(data, schema)
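The brittle part of the programmatic approach is the split-and-convert step (malformed integers, stray whitespace). The same line-to-row parsing can be sketched and checked in plain Scala without a cluster; the tuple return type, the trimming, and the sample line are my own additions for illustration:

```scala
// Mirrors the row() helper above, but returns a plain tuple so it runs without Spark
def parseLine(line: String): (String, String, Int) = {
  val cols = line.split(",").map(_.trim) // trim guards against stray spaces
  (cols(0), cols(1), cols(2).toInt)      // toInt throws on malformed input
}
```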


Saturday, September 29, 2018

Big Data Interview Questions

Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data, enormous processing power, and the ability to handle a virtually limitless number of concurrent tasks or jobs. I would like to share some interview questions.

Hadoop:

1. What are Big Data and Hadoop?
2. What is MapReduce?
3. Which distribution are you using in your project?
4. Write a word-count program in MapReduce.
5. What is YARN?
6. What is the difference between Hadoop 1.x and 2.x?
7. Explain the high-level architecture of YARN.
8. What is the difference between the Application Master and the Application Manager?
9. How does the Node Manager work in the YARN framework?
10. What is data locality?
11. What is speculative execution?
12. What is rack awareness?
13. What is the replication factor? How do you set it?
14. Draw the diagram of the I/O read and write anatomy in the YARN framework.
15. How many primary nodes are maintained in YARN, and why?
16. What is ZooKeeper?
17. What are the different types of clusters?
18. What is the difference between local and cluster mode?
19. What is Mesos?
20. What is the difference between Flume and Kafka?
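Question 4 asks for word count in MapReduce; the same map → shuffle → reduce shape can be sketched in plain Scala, with groupBy standing in for the shuffle (the sample text in the test is made up):

```scala
// Map phase: split into words; shuffle: groupBy; reduce phase: count each group
def wordCount(text: String): Map[String, Int] =
  text.toLowerCase
    .split("\\W+")         // split on any non-word characters
    .filter(_.nonEmpty)    // drop empty tokens from leading punctuation
    .groupBy(identity)     // the "shuffle": collect identical words together
    .map { case (word, occurrences) => word -> occurrences.length }
```

In real MapReduce the mapper emits (word, 1) pairs and the reducer sums them per key; the pipeline above follows the same shape on a single machine.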


Hive:

1. What is Hive?
2. What is the difference between Hive and an RDBMS?
3. What is the underlying storage of Hive?
4. Are ACID properties supported in Hive or not?
5. How will you insert into, append to, or overwrite a Hive table?
6. Which file formats have been used in your project?
7. What is a columnar format? Which columnar format have you used in your project?
8. What are the ORC and Parquet file formats?
9. How do you parse a CSV file in Hive?
10. What is the vectorization technique in Hive?
11. How will you import/export data between an RDBMS and Hive?
12. Which properties did you set in your project?
13. What are Partitioning and Bucketing? Explain the use case of each.
14. What are the limitations of Partitioning?
15. How do you optimize I/O reads in Hive?
16. How many types of tables are there in Hive?
17. Why do we use external tables in Hive?
18. Name some SerDe properties which have been used in your project.
19. Which metastore has been used underlying Hive in your project?
20. Name some date and string manipulation functions.
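Questions 5 and 11 often lead into how Spark talks to Hive, since Spark is the usual client in these projects. A hedged Scala sketch follows; the database and table names are made up, and enableHiveSupport() assumes Spark was built and configured with Hive support:

```scala
import org.apache.spark.sql.SparkSession

// Requires a Hive-configured Spark installation; names below are hypothetical
val spark = SparkSession.builder()
  .appName("hive-io")
  .enableHiveSupport()
  .getOrCreate()

// Load a Hive table as a DataFrame
val sales = spark.table("salesdb.sales")

// Overwrite, then append to, a Hive table (question 5)
sales.write.mode("overwrite").saveAsTable("salesdb.sales_backup")
sales.write.mode("append").saveAsTable("salesdb.sales_backup")
```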