SPARK RDD – Interview Questions-4


What is Catalyst framework?

Catalyst is the optimization framework built into Spark SQL. It allows Spark to automatically transform SQL and DataFrame queries by applying a series of optimizations to the query plan, producing a faster execution plan.
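A quick way to see Catalyst at work is to print a query plan with explain(). A minimal sketch, assuming spark-shell (i.e. an existing SparkSession named spark) and a small throwaway DataFrame:

// Minimal sketch: inspect the plans Catalyst produces for a simple query.
import spark.implicits._

val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "value")

// explain(true) prints the parsed, analyzed and optimized logical plans
// as well as the physical plan selected by Catalyst.
df.filter($"id" > 1).select($"value").explain(true)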

When running Spark applications, is it necessary to install Spark on all the nodes of YARN cluster?

Spark does not need to be installed on all the nodes when running a job under YARN or Mesos, because Spark can execute on top of a YARN or Mesos cluster without requiring any change to the cluster.

How can you trigger automatic clean-ups in Spark to handle accumulated metadata?

You can trigger the clean-ups by setting the parameter spark.cleaner.ttl, or by dividing long-running jobs into different batches and writing the intermediary results to disk.
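A minimal sketch of setting this property when the application starts (the property takes a value in seconds and applies to older Spark versions):

// Minimal sketch: enable periodic metadata clean-up via spark.cleaner.ttl.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("cleanup-demo")
  .set("spark.cleaner.ttl", "3600")  // clean metadata older than one hour

val sc = new SparkContext(conf)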

What is lineage graph?

The representation of the dependencies between RDDs is known as the lineage graph. Whenever a partition of a persisted RDD is lost, the lost data can be recomputed using the lineage graph information.
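The lineage of an RDD can be inspected with toDebugString. A small sketch, assuming a SparkContext named sc:

// Minimal sketch: print the lineage (dependency) graph of an RDD.
val base = sc.parallelize(1 to 100)
val derived = base.map(_ * 2).filter(_ % 3 == 0)

// toDebugString shows the chain of parent RDDs Spark would use to recompute lost partitions.
println(derived.toDebugString)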

Why is there a need for broadcast variables with Apache Spark?

Broadcast variables are read-only variables cached in memory on every machine. Using broadcast variables eliminates the need to ship a copy of the variable with every task, so data can be processed faster.

They help in storing a lookup table in memory, which is more efficient to retrieve from than an RDD lookup().
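A minimal sketch of a broadcast lookup table, assuming a SparkContext named sc and a made-up country-code mapping:

// Minimal sketch: ship a small lookup table once per executor instead of once per task.
val countryCodes = Map("IN" -> "India", "US" -> "United States")
val bcCodes = sc.broadcast(countryCodes)

val orders = sc.parallelize(Seq(("IN", 100.0), ("US", 250.0)))

// Each task reads the broadcast value locally; no per-task copy is shipped.
val withNames = orders.map { case (code, amount) =>
  (bcCodes.value.getOrElse(code, "Unknown"), amount)
}
withNames.collect().foreach(println)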

How can you minimize data transfers when working with Spark?
  1. Using broadcast variables: broadcast variables enhance the efficiency of joins between a small and a large RDD.
  2. Using accumulators: accumulators update the values of variables in parallel while the job executes (see the accumulator sketch after this list).
  3. The most common way is to avoid ByKey operations (such as groupByKey), repartition, or any other operations that trigger shuffles.
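A minimal accumulator sketch, assuming Spark 2.x and a SparkContext named sc (the "bad records" counter is only an illustration):

// Minimal sketch: count malformed records in parallel without shuffling data back to the driver.
val badRecords = sc.longAccumulator("badRecords")

val lines = sc.parallelize(Seq("1,ok", "oops", "2,ok"))
val parsed = lines.flatMap { line =>
  val parts = line.split(",")
  if (parts.length == 2) Some(parts(0).toInt)
  else { badRecords.add(1); None }
}

parsed.count()              // an action triggers the computation
println(badRecords.value)   // prints 1 on the driver
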
Why is Spark RDD immutable?

– Immutable data is always safe to share across multiple processes as well as multiple threads.
– Since an RDD is immutable, we can recreate it at any time from its lineage graph.
– If the computation is time-consuming, we can cache the RDD, which results in a performance improvement.

By default, how many partitions are created in an RDD in Apache Spark?

number of cores in the cluster = number of partitions

(For RDDs created with parallelize, the default number of partitions comes from spark.default.parallelism, which by default equals the total number of cores available to the application; a small check is sketched below.)
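A quick way to verify this, assuming a SparkContext named sc:

// Minimal sketch: check how many partitions Spark creates by default.
val rdd = sc.parallelize(1 to 1000)
println(rdd.getNumPartitions)    // typically equals spark.default.parallelism

// The default can always be overridden explicitly:
val rdd4 = sc.parallelize(1 to 1000, 4)
println(rdd4.getNumPartitions)   // 4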

What is the difference between DAG and Lineage?

Lineage graph
When a new RDD is created from an existing RDD, the new RDD contains a pointer to the parent RDD. Similarly, all the dependencies between the RDDs are recorded in a graph. This graph is called the lineage graph.

Directed Acyclic Graph(DAG)
A DAG is a combination of vertices and edges. In the DAG, the vertices represent the RDDs and the edges represent the operations to be applied to the RDDs.

What is the difference between Caching and Persistence in Apache Spark?

Cache and persist are both optimization techniques for Spark computations.

With cache(), only the MEMORY_ONLY storage level is available: using cache() we save intermediate results in memory only.

persist() supports several different storage levels (a short sketch follows the list):

  • MEMORY_ONLY,

  • MEMORY_AND_DISK,
  • MEMORY_ONLY_SER,
  • MEMORY_AND_DISK_SER,
  • DISK_ONLY,
  • MEMORY_ONLY_2,
  • MEMORY_AND_DISK_2
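A minimal sketch of the difference, assuming a SparkContext named sc and a placeholder input path:

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY).
import org.apache.spark.storage.StorageLevel

val lines = sc.textFile("/some/path")   // placeholder path

val cached = lines.map(_.length).cache()

// persist() lets you choose any storage level explicitly.
val persisted = lines.map(_.toUpperCase).persist(StorageLevel.MEMORY_AND_DISK)
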
Explain Spark SQL.

The two main components when using Spark SQL are DataFrame and SQLContext.

DataFrame

A DataFrame is a distributed collection of data organized into named columns.

DataFrames can be created from different data sources such as:

  • Existing RDDs
  • Structured data files
  • JSON datasets
  • Hive tables
  • External databases

SQLContext

  • Spark SQL provides SQLContext to encapsulate all relational functionality in Spark.

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
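As a small illustration (a sketch, assuming spark-shell and the sqlContext created above), a DataFrame can also be built from an existing RDD:

// Minimal sketch: build a DataFrame from an existing RDD of case-class objects.
import sqlContext.implicits._

case class Person(name: String, age: Int)

val peopleRDD = sc.parallelize(Seq(Person("Alice", 30), Person("Bob", 25)))
val peopleDF = peopleRDD.toDF()

peopleDF.show()
peopleDF.printSchema()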

How do I skip a header from CSV files in Spark?

Answer) Spark 2.x: spark.read.format("csv").option("header","true").load("filePath")
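If the file is read as a plain RDD instead (an alternative sketch, not part of the original answer; the path is a placeholder), the header can be dropped with mapPartitionsWithIndex:

// Alternative sketch: drop the header line when a CSV is read as a plain RDD.
val raw = sc.textFile("/path/to/file.csv")

val withoutHeader = raw.mapPartitionsWithIndex { (idx, iter) =>
  if (idx == 0) iter.drop(1) else iter   // the header lives in the first partition
}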

How to read multiple text files into a single RDD?

Answer) sc.textFile("/my/dir1,/my/paths/part-00[0-5]*,/another/dir,/a/specific/file")

List the advantages of Parquet files in Apache Spark.

Answer) Parquet is a columnar format supported by many data processing systems. The benefits of columnar storage are:
1) Columnar storage limits IO operations.
2) Columnar storage can fetch only the specific columns that you need to access.
3) Columnar storage consumes less space.
4) Columnar storage gives better summarized data and follows type-specific encoding.
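A minimal read/write sketch, assuming Spark 2.x; the paths and the column names order_id and order_date are placeholders:

// Minimal sketch: write a DataFrame as Parquet and read back only selected columns.
val df = spark.read.format("csv").option("header", "true").load("/data/orders.csv")

df.write.mode("overwrite").parquet("/data/orders_parquet")

// With a columnar format, only the requested columns are read from disk.
val subset = spark.read.parquet("/data/orders_parquet").select("order_id", "order_date")
subset.show()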

How to read JSON file in Spark?

To read data from a text file, we have an API in SparkContext. But certain industries use specific file formats such as Parquet and JSON, and for those kinds of files there is no direct API in SparkContext.

SQLContext has APIs to read industry-standard file formats (JSON, Parquet, ORC). Through the read and load commands, we can read JSON files.

val ordersDF = sqlContext.read.json("/orders")

ordersDF.show

To see its schema, we can use the command below:

ordersDF.printSchema

If you want to see only 2 columns from the DataFrame, we can use the command below:

ordersDF.select("col1", "col2").show

We can use the load command as well.

sqlContext.load("/user/vikas", "json").show

Write some of the Date functions in Spark.

  • current_date
  • current_timestamp
  • date_add
  • date_format
  • date_sub
  • datediff
  • day
  • dayofmonth
  • to_date
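A minimal usage sketch for a few of these functions, assuming spark-shell (an existing SparkSession named spark):

// Minimal sketch: apply some of the date functions above with selectExpr.
import spark.implicits._

val df = Seq("2017-12-12").toDF("order_date")

df.selectExpr(
  "current_date() as today",
  "to_date(order_date) as d",
  "date_add(to_date(order_date), 7) as plus_week",
  "date_format(to_date(order_date), 'yyyyMMdd') as yyyymmdd",
  "datediff(current_date(), to_date(order_date)) as age_in_days"
).show()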

Write transformation logic to convert date (2017:12:12 00:00:00) to 20171212.

Step 1: Let's create an RDD first by reading the text file.

val orders = sc.textFile("/public/retail_db/orders")

Its first element can be seen as below:

orders.first

Assume the 1st column is the order id, the 2nd is the order date and the 3rd is the order customer id.

Step 2: Extract the date first (str below stands for one line of the RDD).

str.split(",")(1).substring(0,10)

Output: 2013-07-25

Step 3: Replace "-" with an empty string.

str.split(",")(1).substring(0,10).replace("-","")

Output: 20130725

Step 4: Convert it to an Int.

str.split(",")(1).substring(0,10).replace("-","").toInt

Output: Int = 20130725

Step 5: Apply the logic to every record and print 10 outputs.

val orderDates = orders.map(str => str.split(",")(1).substring(0,10).replace("-","").toInt)

orderDates.take(10).foreach(println)
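The same conversion can also be done with the DataFrame date functions listed in the previous question (a sketch, assuming Spark 2.x and that the second CSV column, _c1, holds the order date):

// Alternative sketch: the same date-to-Int conversion using DataFrame functions.
import org.apache.spark.sql.functions.{col, to_date, date_format}

val ordersDF = spark.read.csv("/public/retail_db/orders")   // columns come back as _c0, _c1, ...

val withInt = ordersDF.withColumn(
  "order_date_int",
  date_format(to_date(col("_c1")), "yyyyMMdd").cast("int")
)

withInt.select("order_date_int").show(10)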

How to create a Spark project in PyCharm and run a simple program?

Ans: First we have to install PyCharm and Python. Follow the steps below to complete this task.

Step 1: Download and install PyCharm Community Edition.

https://www.jetbrains.com/pycharm/download/#section=windows

Python version: 2.7

Step 2: Create the project PySparkPractice and navigate to Settings.

Ensure that the interpreter is set to Python 2.7.

Step 3: Create the SQLPackege package and create a demo file.

Run this program and you will get the output.

from pyspark.sql import SparkSession
import os

# Point Spark at the local Hadoop/winutils installation (Windows only).
os.environ['HADOOP_HOME'] = 'C:\\hadoop'

# Build a local SparkSession with the MySQL connector and spark-xml packages.
spark = SparkSession.builder.master('local') \
    .config('spark.jars.packages',
            'mysql:mysql-connector-java:5.1.44,com.databricks:spark-xml_2.11:0.4.1') \
    .appName('demoapp').getOrCreate()
print(spark)

# Read an XML file, treating each <contact> element as a row.
df = spark.read.format('xml').options(rowTag='contact') \
    .load(r'C:\Users\cloudvikas\Documents\booknew.xml')
df.show()
df.printSchema()

# Select a nested column and save it as a Parquet table.
df2 = df.select('communications.communication.emailid')
df2.show()
df2.coalesce(1).write.saveAsTable(
    'output2', format='parquet', mode='overwrite',
    path=r'C:\Users\cloudvikas\PycharmProjects\Project1\SQLPackege\spark-warehouse\ouput2')

# Query the result through Spark SQL.
df2.registerTempTable("Students")
result = spark.sql("select * from Students")
print("result")
result.show()
spark.stop()

QUESTION 2: Why do we have to configure winutils as the Hadoop path in a Spark program?

Ans: If we have to use Hadoop functionality in a Windows environment, we have to configure the Hadoop path to point at winutils:

Step 1: Write the code below in your driver program:

import os
os.environ['HADOOP_HOME']='C:\\hadoop'

Step 2: Download winutils and keep it under the same path mentioned above:

https://github.com/steveloughran/winutils/blob/master/hadoop-2.7.1/bin/winutils.exe

If you don't use this, you can get the error below in your program:

IOException: (null) entry in command string: null chmod 0644

Question 3) How to overwrite files in the saveAsTable command?

Ans: Pass mode='overwrite' when calling saveAsTable; Spark then replaces any existing output at the target location:

df2.coalesce(1).write.saveAsTable('output2', format='parquet', mode='overwrite', path=r'C:\Users\cloudvikas\PycharmProjects\Project1\SQLPackege\spark-warehouse\ouput2')