
Monday, December 17, 2018

Spark Interview Questions

Spark Interview Questions:

Spark is a framework heavily used on Hadoop (Hadoop 2.0/YARN) to run analytical, streaming, and machine learning workloads very efficiently. I would like to take you through some of the questions that are frequently asked in interviews.

Spark:

1. What is Spark?
2. Explain the high-level architecture of Spark.
3. What are the Driver and the Executor, and what is the difference between them?
4. What is a DAG?
5. How do you trace a failed job through the DAG?
6. What is persistence? Name the different levels of persistence.
7. Why do we use repartition and coalesce?
8. What are RDD, DataFrame, and Dataset, and what are the differences between them?
9. How do you see the partitions after loading an input file in Spark?
10. How do you load/store a Hive table in Spark?
11. How do you read JSON and CSV files in Spark?
12. What is Spark Streaming?
13. Name some properties which you have set in your project.
14. How can a Hive UDF be used in a Spark session?
15. Explain some troubleshooting you have done in Spark.
16. What are stages, jobs, and tasks in Spark?
17. What is the difference between groupByKey and reduceByKey?
18. What is executor memory, and how did you set it in your project?
19. Name some Spark functions which have been used in your project.
20. What is a Spark UDF, and what is its signature? (A minimal sketch follows this list.)
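
For question 20, here is a minimal sketch of defining and registering a Spark UDF in Scala (Spark 2.x). The function, column, and DataFrame names are hypothetical, and it assumes a SparkSession named spark, as in spark-shell.

//A UDF is an ordinary Scala function wrapped by org.apache.spark.sql.functions.udf
import org.apache.spark.sql.functions.udf
import spark.implicits._

val toUpperUdf = udf((s: String) => if (s == null) null else s.toUpperCase)

//Registering the same function so it can also be called from Spark SQL
spark.udf.register("to_upper_udf", (s: String) => if (s == null) null else s.toUpperCase)

//Hypothetical usage on a small DataFrame
val df = Seq("rahul", "shyam").toDF("name")
df.withColumn("name_upper", toUpperUdf(df("name"))).show()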

Friday, December 14, 2018

Python Dictionary

Dictionary in Python


A dictionary is a collection of key-value pairs; as of Python 3.7 it preserves insertion order. If you look at the other Python data structures (i.e. list and tuple), they hold only values, whereas a dictionary holds key-value pairs. Since Python is a dynamically typed language, there is no need to declare a variable's type; we can simply assign to a variable name.

We can create a dictionary with the commands below.

#Empty dictionary
my_collections = {}


#Dictionary with keys of the same type
my_collections = {1: 'Rahul', 2: 'Shayam'}

#Dictionary with mixed key types
my_collections = {'name': 'Rahul', 2: [2, 3, 4]}

We can access dictionary items with the commands below.

my_collections1 = {1: 'Rahul', 2: 'Shayam'}
my_collections2 = {'name': 'Rahul', 2: [2, 3, 4]}

my_collections1[1]
my_collections2['name']

or

my_collections2.get('name')

We can add or update dictionary entries with the commands below.

#update
my_collections2['name'] = 'John'

#add
my_collections2['age'] = 23

We can delete or remove dictionary items with the commands below.

cubes = {1: 1, 2: 8, 3: 27, 4: 64, 5: 125}

#Deleting particular item
cubes.pop(4)

#Deleting arbitrary item
cubes.popitem()

#Clearing Dictionary 
cubes.clear()



Tuesday, December 11, 2018

Create DataFrame from RDD


Creating DataFrame from RDD in Spark:

RDD and DataFrame are both heavily used APIs in the Spark framework. Converting an RDD to a DataFrame is a very common task every Spark programmer has to perform. I would like to take you through the most suitable ways to achieve this.

There are two commonly used techniques:
- Inferring the Schema Using Reflection
- Programmatically specifying the schema

Inferring the Schema Using Reflection:

//Creating an RDD
val rdd = sc.parallelize(Seq(1, 2, 3, 4))
//spark is the SparkSession (available by default in spark-shell)
import spark.implicits._
//Creating a DataFrame from the RDD
val dataFrame = rdd.toDF()

//Defining the schema with a case class
case class Person(name: String, age: Int)
//Creating a DataFrame using reflection
val people = sc.textFile("SaleData.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt)).toDF()


Programmatically specifying the schema:

//Imports needed for the programmatic schema API
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

//Creating a schema function
def dfSchema(columnNames: List[String]): StructType =
  StructType(
    Seq(
      StructField(name = columnNames(0), dataType = StringType, nullable = false),
      StructField(name = columnNames(1), dataType = StringType, nullable = false),
      StructField(name = columnNames(2), dataType = IntegerType, nullable = false)
    )
  )

//Calling Schema function
val schema = dfSchema(List("name", "gender", "age"))

//Creating RDD
val rdd: RDD[String] = ...

//Creating function to map Row data
def row(line: List[String]): Row = Row(line(0), line(1),line(2).toInt)

//Mapping Row Data
val data = rdd.map(_.split(",").to[List]).map(row)

//Creating DataFrame
val dataFrame = spark.createDataFrame(data, schema)


Saturday, December 8, 2018

PySpark Interview Questions

PySpark Interview Questions

Big Data is evolving day by day, and large organizations are now using Python on Spark to build analytics-based solutions. I would like to share some interview questions.

1. What is Spark?
2. Tell me some use cases where we prefer Python over Scala in the Spark framework.
3. What is the difference between Python and Scala?
4. How will you execute a Python script in the Spark framework?
5. What is Pandas in Python?
6. What is Cython?
7. Explain the OOP features in Python.
8. What is a dictionary in Python?
9. What is the difference between a tuple and a list in Python?
10. Name some packages which you have used in your programs.
11. What is an anonymous function in Python, and what is its use case?
12. What is a module, and how will you import one module into another?
13. How do you handle I/O operations in Python?
14. How do you handle exceptions in Python?
15. How will you execute Hive queries in Python?

Reading JDBC (Oracle/SQL Server) Data using Spark

Reading JDBC table data using Spark Scala


import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext

//appName and master are placeholders for your application name and master URL (e.g. "local[*]")
val conf = new SparkConf().setAppName(appName).setMaster(master)
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

Reading JDBC:

sqlContext.read.format("jdbc")
          .option("url", <jdbc URL>)
          .option("dbtable", <table name>)
          .option("user", <user name>)
          .option("password", <password>)
          .option("driver", <driver name>)
          .load()

Driver Details:
SQL Server: "com.microsoft.sqlserver.jdbc.SQLServerDriver"
Oracle: "oracle.jdbc.driver.OracleDriver"

URL Details:
SQL Server: "jdbc:sqlserver://<serverName>[\<instanceName>][:<portNumber>];<property1=value>;<property2=value>"
Oracle: "jdbc:oracle:thin:@localhost:1521:orcl"

** Please change localhost in the Oracle JDBC URL to your database server's host or IP.
** You can pass the user name and password as property1 and property2 when building the SQL Server JDBC URL.
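
For reference, a complete read might look like the sketch below. The server, database, table, and credentials are purely hypothetical placeholders; replace them with your own connection details.

//A minimal sketch -- all connection details below are hypothetical
val jdbcDF = sqlContext.read.format("jdbc")
                       .option("url", "jdbc:sqlserver://myserver:1433;databaseName=SalesDB")
                       .option("dbtable", "dbo.Orders")
                       .option("user", "spark_user")
                       .option("password", "********")
                       .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
                       .load()

jdbcDF.printSchema()
jdbcDF.show(10)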




JSON File Parsing Using Spark Scala

Parsing JSON file in Spark:


First, create the SparkContext and SQLContext objects.


import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext


val conf = new SparkConf().setAppName(appName).setMaster(master)
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

sqlContext.read.option("multiline", <is_multiline>)
          .option("mode", "PERMISSIVE")
          .option("quote", "\"")
          .option("escape", "\u0000")
          .json(<path>)


** is_multiline: Boolean = true/false
** path = JSON file path
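
For reference, a concrete call might look like the sketch below. The file path is a hypothetical placeholder.

//A minimal sketch -- the input path is hypothetical
val jsonDF = sqlContext.read.option("multiline", "true")
                       .option("mode", "PERMISSIVE")
                       .json("/data/input/sample.json")

jsonDF.printSchema()
jsonDF.show(5)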


Saturday, September 29, 2018

Big Data Interview Questions

Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs. I would like to share some interview questions.

Hadoop:

1. What is Big Data, and what is Hadoop?
2. What is MapReduce?
3. Which distribution are you using in your project?
4. Write a word count program in MapReduce (a sketch of the same idea follows this list).
5. What is YARN?
6. What is the difference between Hadoop 1.x and 2.x?
7. Explain the high-level architecture of YARN.
8. What is the difference between the Application Master and the Application Manager?
9. How does the Node Manager work in the YARN framework?
10. What is data locality?
11. What is speculative execution?
12. What is rack awareness?
13. What is the replication factor, and how do you set it?
14. Draw the diagram of the I/O read and write anatomy in the YARN framework.
15. How many primary nodes are maintained in YARN, and why?
16. What is ZooKeeper?
17. What are the different types of clusters?
18. What is the difference between local and cluster mode?
19. What is Mesos?
20. What is the difference between Flume and Kafka?
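
For question 4, a classic word count is usually written as a Java MapReduce job; as a rough sketch of the same map-and-reduce idea, here is a word count in Spark Scala. The input path is a hypothetical placeholder.

//A minimal word-count sketch in Spark Scala -- the input path is hypothetical
val lines = sc.textFile("/data/input/sample.txt")
val counts = lines.flatMap(_.split("\\s+"))     //map phase: split each line into words
                  .map(word => (word, 1))       //emit (word, 1) pairs
                  .reduceByKey(_ + _)           //reduce phase: sum the counts per word
counts.take(10).foreach(println)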


Hive:

1. What is Hive?
2. What is the difference between Hive and an RDBMS?
3. What is the underlying storage of Hive?
4. Are ACID properties supported in Hive or not?
5. How will you insert/append/overwrite a Hive table?
6. Which file formats have been used in your project?
7. What is a columnar format? Which columnar format have you used in your project?
8. What are the ORC and Parquet file formats?
9. How do you parse a CSV file in Hive?
10. What is the vectorization technique in Hive?
11. How will you import/export data between an RDBMS and Hive?
12. Which properties did you set in your project?
13. What are partitioning and bucketing, and what is the use case of each?
14. What are the limitations of partitioning?
15. How do you optimize I/O reads in Hive?
16. How many types of tables are there in Hive?
17. Why do we use external tables in Hive?
18. Name some SerDe properties which have been used in your project.
19. Which metastore has been used in your project underlying Hive?
20. Name some date and string manipulation functions.