Spark Interview Questions
What are SparkContext and SparkSession?
• A SparkContext represents the connection to a Spark cluster, and can be used to create RDDs, accumulators and broadcast variables on that cluster.
• The SparkContext allows your Spark driver application to access the cluster through a resource manager. The resource manager can be YARN or Spark’s standalone cluster manager.
• Only one SparkContext may be active per JVM.
• In Spark shell, SparkContext is already created for the user with the name sc.
• From Spark 2.0, SparkSession is the new entry point for Spark applications. It includes all the APIs available in the different contexts: SparkContext, SQLContext, StreamingContext, and HiveContext (see the sketch below).
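A minimal sketch of creating a SparkSession in PySpark (the application name is illustrative):
from pyspark.sql import SparkSession
# Build (or reuse) a SparkSession; "ExampleApp" is an illustrative name.
spark = SparkSession.builder.appName("ExampleApp").getOrCreate()
# The underlying SparkContext remains accessible through the session.
sc = spark.sparkContext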
What is Directed Acyclic Graph?
- A DAG in Apache Spark is a set of vertices and edges, where the vertices represent the RDDs and the edges represent the transformations to be applied to those RDDs.
- When an action is called, the DAG is submitted to the DAG Scheduler, which splits the graph into stages of tasks.
What is Transformation and Actions?
• Transformations are lazy operations on an RDD that create one or many new RDDs.
• Transformations are functions that take an RDD as the input and produce one or many RDDs as the output.
• Transformations are lazy, i.e. are not executed immediately. Only after calling an action, transformations are executed.
• Actions are RDD operations that produce non-RDD values.
• Actions trigger the execution of RDD transformations, using the lineage graph to load the real data into the RDD and return a value.
• The results of an action are returned to the driver or written to an external storage system; calling an action sets the lazy RDD computations in motion, as the sketch below shows.
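A minimal sketch in PySpark (assuming a shell where sc is the SparkContext): map() is a lazy transformation and collect() is an action:
nums = sc.parallelize([1, 2, 3, 4])
doubled = nums.map(lambda n: n * 2)   # transformation: nothing executes yet
print(doubled.collect())              # action: triggers execution and returns [2, 4, 6, 8]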
Explain Lineage and Lazy Evaluations.
• RDD Lineage is also called an RDD operator graph or RDD dependency graph.
• When we create new RDDs based on the existing RDDs, Spark manages these dependencies using Lineage Graph.
• RDD Lineage is built as a result of applying transformations to the RDD and creates a logical execution plan.
• Lazy evaluation is an evaluation strategy which delays the evaluation of an expression until its value is needed.
• Lazy Evaluation helps to optimize the Disk & Memory Usage in Spark.
• Transformations are lazy in nature, so when we call a transformation on an RDD, it does not execute immediately.
• Execution starts only when an action is triggered (see the sketch after the benefits list below).
Benefits of Lazy evaluation:
• Increases Manageability
• Saves Computation and increases Speed
• Reduces time and space Complexities
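A minimal sketch of laziness in PySpark: the failing computation below is not evaluated until an action runs:
nums = sc.parallelize([1, 2, 0])
inverted = nums.map(lambda n: 1 / n)   # no error yet: the transformation is lazy
# inverted.collect()                   # this action would fail with a ZeroDivisionError from the worker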
Write a word count program using Python?
file = sc.textFile("spark/Lab1/hello.txt")
file.collect()
words = file.flatMap(lambda line: line.split(" "))
words.collect()
wordNum = words.map(lambda word: (word, 1))
wordNum.collect()
wordCount = wordNum.reduceByKey(lambda total, count: total + count)
wordCount.collect()
sortedCounts = wordCount.sortByKey()      # ascending by key; renamed to avoid shadowing Python's built-in sorted
sortedCounts.collect()
descSorted = wordCount.sortByKey(False)   # descending by key
descSorted.collect()
sortedCounts = (sc.textFile("spark/Lab1/hello.txt")
                .flatMap(lambda line: line.split(" "))
                .map(lambda word: (word, 1))
                .reduceByKey(lambda total, count: total + count)
                .sortByKey())
descSorted = (sc.textFile("spark/Lab1/hello.txt")
              .flatMap(lambda line: line.split(" "))
              .map(lambda word: (word, 1))
              .reduceByKey(lambda total, count: total + count)
              .sortByKey(False))
Explain Actions and Transformations:-
There are two types of operations that you can perform on an RDD
- Transformations
- Actions
1) Transformation
A transformation applies a function to an RDD and creates a new RDD; it does not modify the original RDD. There are two kinds of transformations:
- Narrow transformations:
o When each partition of the parent RDD is used by at most one partition of the child RDD, the transformation is called a narrow transformation.
All the elements required to compute the records in a single partition live in a single partition of the parent RDD.
o A narrow transformation does not require data to be distributed across the partitions. Ex: flatMap, map, filter, etc.
- Wide transformation
In a wide transformation, the elements required to compute the records in a single partition may live in many partitions of the parent RDD, so the data must be shuffled across partitions. Wide transformations are the result of operations such as groupByKey and reduceByKey (see the sketch after this list).
2) Action
- An action applies a function to an RDD and produces a non-RDD value.
- Actions trigger execution using lineage graph.
- An Action is used to either save a result to some location or to display it.
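A minimal PySpark sketch contrasting a narrow transformation, a wide transformation, and an action:
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
narrow = pairs.map(lambda kv: (kv[0], kv[1] * 10))   # narrow: each partition is mapped independently
wide = narrow.reduceByKey(lambda x, y: x + y)        # wide: values for a key are shuffled across partitions
print(wide.collect())                                # action: triggers the whole computation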
Explain RDD Lineage Graph
- When you apply a transformation, Spark creates a lineage.
- A lineage keeps track of all the transformations that have to be applied to that RDD, including where it has to read the data from.
val rdd1 = sc.parallelize(1 to 10)
val rdd2 = rdd1.filter(n => n % 2 == 0)
rdd2.count()
- sc.parallelize() and rdd1.filter() do not get executed immediately; they run only when we call an action on rdd2, namely rdd2.count().
- The lineage graph helps Spark to recompute any intermediate RDD in case of failures. This way spark achieves fault tolerance.
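A minimal sketch, assuming a PySpark shell: toDebugString() returns the lineage graph of an RDD (as bytes in PySpark, hence the decode):
rdd = sc.parallelize(range(10)).filter(lambda n: n % 2 == 0).map(lambda n: n * 10)
print(rdd.toDebugString().decode())   # shows the chain of dependencies back to the source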
Explain Pair RDDs:-
- RDDs that contain key/value pairs are called pair RDDs.
- These RDDs expose operations that allow us to act on each key in parallel or regroup data across the network, e.g. reduceByKey(), join(), etc.
- There are a number of ways in which a pair RDD can be created; Spark also provides APIs for loading data that return pair RDDs (see the sketch below).
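A minimal sketch of creating a pair RDD in PySpark and acting on each key (the sample data is illustrative):
lines = sc.parallelize(["apple 1", "banana 2", "apple 3"])
pairs = lines.map(lambda line: (line.split(" ")[0], int(line.split(" ")[1])))
totals = pairs.reduceByKey(lambda a, b: a + b)   # combines the values for each key in parallel
print(totals.collect())                          # e.g. [('apple', 4), ('banana', 2)]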
Explain Partitions
• A partition is the basic unit of parallelism in an RDD.
• In Hadoop, we divide a data set into multiple blocks and store them on different data nodes; in Spark, we divide an RDD into multiple partitions and store them on worker nodes (data nodes), where they are computed in parallel across all the nodes.
• In Hadoop, we need to replicate the data for fault tolerance; in Spark, replication is not required because lost partitions can be recomputed from the RDD lineage.
• Spark automatically decides the number of partitions required and divides the RDD into partitions.
val rdd1 = sc.textFile("cloud.txt", 3)
val rdd3 = sc.parallelize(1 to 115, 4)
• We can also specify the number of partitions required when we create the RDD, as in the examples above.
• When we execute an action, one task is launched per partition, so the more partitions there are, the more parallelism we get.
• Partitions never span over multiple machines, i.e., all tuples in the same partition are guaranteed to be on the same machine.
• Each machine in the cluster can contain one or more partitions.
• The total number of partitions are configurable. By default, it equals the total number of cores on all executor nodes.
• We can access the number of partitions as follows:
rdd.partitions.size
rdd.getNumPartitions
• We can access the partitioner information as follows:
rdd.partitioner
• We can access the data in the partitions as follows:
rdd.glom.collect
EX1:-
val rdd3 = sc.parallelize(1 to 20, 3)
rdd3.partitions.size
rdd3.getNumPartitions
rdd3.partitioner
rdd3.glom.collect
rdd3.saveAsTextFile("user/output2")
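For reference, a minimal PySpark equivalent of the same inspection calls (note that Python requires parentheses on the method calls):
rdd3 = sc.parallelize(range(1, 21), 3)
print(rdd3.getNumPartitions())   # 3
print(rdd3.partitioner)          # None until a partitioner is applied
print(rdd3.glom().collect())     # one list of elements per partition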
Types of Partitioners
- HashPartitioner
- RangePartitioner
- Custom Partitioner
HashPartitioner
- It uses the hashCode() method of the key to determine the partition in Spark.
- Keys whose hash codes map to the same value modulo the number of partitions end up in the same partition:
partition = key.hashCode() % numPartitions
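A minimal illustration in PySpark, which uses portable_hash in place of Java's hashCode() for the same idea:
from pyspark.rdd import portable_hash
numPartitions = 4
key = "apple"                                 # illustrative key
print(portable_hash(key) % numPartitions)    # the partition index for this key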
RangePartitioner
- It distributes the data to partitions based on the range in which a key falls.
- When the records are sortable, the range partitioner divides them into almost equal ranges.
- The RangePartitioner first sorts the records by key and then divides them into the given number of partitions.
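A minimal PySpark sketch: sortByKey() applies range partitioning under the hood:
pairs = sc.parallelize([("d", 1), ("b", 1), ("c", 1), ("a", 1), ("f", 1), ("e", 1)], 3)
ranged = pairs.sortByKey(numPartitions=2)
print(ranged.glom().collect())   # keys split into roughly equal, sorted ranges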
Custom Partitioner
• We can define our own partitioner class also.
• When we want to implement a custom partitioner, we need to follow the following steps:
i. We need to extend the org.apache.spark.Partitioner class.
ii. We need to override three methods:
• numPartitions: Int: returns the number of partitions
• getPartition(key: Any): Int: returns the partition ID (0 to numPartitions-1) for the given key
• equals(): to check whether the correct partitioner object is being used
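In PySpark there is no Partitioner class to extend; the closest equivalent is passing a custom partition function to partitionBy(). A minimal sketch, with an illustrative partitioning rule:
# Hypothetical rule: keys beginning with 'a'-'m' go to partition 0, the rest to partition 1.
def letterPartitioner(key):
    return 0 if key[0].lower() <= "m" else 1
pairs = sc.parallelize([("apple", 1), ("zebra", 2), ("mango", 3)])
partitioned = pairs.partitionBy(2, letterPartitioner)
print(partitioned.glom().collect())   # e.g. [[('apple', 1), ('mango', 3)], [('zebra', 2)]]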