SPARK RDD – Interview Questions

Set 1
What are SparkContext and SparkSession?
What is a Directed Acyclic Graph?
What are Transformations and Actions?
Explain Lineage and Lazy Evaluation.
Write a word-count program using Python.
Explain Actions and Transformations.
Explain the RDD Lineage Graph.
Explain Pair RDDs.
Explain Partitions.
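One way to answer the word-count question above, sketched as a self-contained script (the input path and the helper names `tokenize`/`run_wordcount` are placeholders added here for illustration; actually running it requires a PySpark installation):

```python
from operator import add

def tokenize(line):
    # lowercase a line and split it into words
    return line.lower().split()

def run_wordcount(path="input.txt"):
    # path is a placeholder; requires PySpark
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.master("local[*]").appName("wordcount").getOrCreate()
    counts = (spark.sparkContext.textFile(path)
                   .flatMap(tokenize)
                   .map(lambda w: (w, 1))
                   .reduceByKey(add))
    result = counts.collect()
    spark.stop()
    return result
```

The `flatMap`/`map`/`reduceByKey` chain is the classic RDD word-count shape; the same result can also be expressed with the DataFrame API.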
Set 2
SPARK RDD – Interview Questions – 2
Set 3
What is the Catalyst optimizer?
What are the fundamentals of Catalyst Optimizer?
How will you run an application locally on 8 cores?
How will you run on a Spark standalone cluster in client deploy mode?
How will you run on a Spark standalone cluster in cluster deploy mode with supervise?
How will you run on a YARN cluster?
How will you run a Python application on a Spark standalone cluster?
How will you run on a Mesos cluster in cluster deploy mode with supervise?
How will you run on a Kubernetes cluster in cluster deploy mode?
Do I need Hadoop to run Spark?
What is the Default level of parallelism in Spark?
Is it possible to have multiple SparkContexts in a single JVM?
What is the advantage of broadcasting values across a Spark cluster?
How do you disable INFO messages when running a Spark application?
In your project, you have access to a cluster of 12 nodes, where each node has 2 Intel(R) Xeon(R) CPU E5-2650 2.00GHz processors and each processor has 8 cores. How will you tune the application and observe its performance?
What does it mean that an RDD is lazily evaluated?
How would you control the number of partitions of an RDD?
Data is spread across all the nodes of the cluster; how does Spark process this data?
What is the coalesce transformation?
What is shuffling?
What is the difference between groupByKey and reduceByKey?
What is checkpointing?
What is a stage, with regard to Spark job execution?
What is speculative execution of tasks?
Which cluster managers can be used with Spark?
Hadoop uses replication to achieve fault tolerance. How is this achieved in Apache Spark?
What are the various levels of persistence in Apache Spark?
What is the difference between persist() and cache()?
What are the various data sources available in Spark SQL?
What is the advantage of a Parquet file?
Set 4
What is the Catalyst framework?
When running Spark applications, is it necessary to install Spark on all the nodes of the YARN cluster?
How can you trigger automatic clean-ups in Spark to handle accumulated metadata?
What is lineage graph?
Why is there a need for broadcast variables with Apache Spark?
How can you minimize data transfers when working with Spark?
Why is Spark RDD immutable?
By default, how many partitions are created in an RDD in Apache Spark?
What is the difference between DAG and Lineage?
What is the difference between Caching and Persistence in Apache Spark?
Explain Spark SQL.
How do I skip a header from CSV files in Spark?
How to read multiple text files into a single RDD?
List the advantages of Parquet files in Apache Spark.
How to read a JSON file in Spark?
List some of the date functions in Spark.
Write transformation logic to convert a date (2017:12:12 00:00:00) to 20171212.
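The date conversion asked for in the last question can be sketched with the standard library (`to_yyyymmdd` is a hypothetical helper name, not from the original text):

```python
from datetime import datetime

def to_yyyymmdd(ts):
    # parse "2017:12:12 00:00:00" and reformat it as "20171212"
    return datetime.strptime(ts, "%Y:%m:%d %H:%M:%S").strftime("%Y%m%d")
```

In PySpark the helper could be registered as a UDF, or the conversion done natively with `date_format(to_timestamp(col, 'yyyy:MM:dd HH:mm:ss'), 'yyyyMMdd')`.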
How to create Spark Project in Pycharm and run any simple program?
QUESTION 2: Why do we have to configure winutils as the Hadoop path in a Spark program?
QUESTION 3: How to overwrite files in the saveAsTable command?
QUESTION 4: How to launch Jupyter and execute a simple PySpark program?

Ans: We are going to discuss two ways to implement this.

Way 1:

Step 1: Install Python 3.7.5 and run the command below in cmd to install Jupyter:

pip install jupyter

First navigate to:

C:\Users\cloudvikas\AppData\Local\Programs\Python\Python37\Scripts

Then execute this command in cmd:

C:\Spark\spark-2.4.5-bin-hadoop2.6\bin>pip install jupyter

Once it is installed, we can move on to Step 2.

Step 2: Launch Jupyter Notebook.

Navigate to the Python -> Scripts folder and open cmd.

Run the jupyter notebook command in cmd:

C:\Python27\Scripts>jupyter notebook

    To access the notebook, open this file in a browser:
        file:///C:/Users/cloudvikas/AppData/Roaming/jupyter/runtime/nbserver-12400-open.html
    Or copy and paste one of these URLs:
        http://localhost:8888/?token=e20c5af58e28aef74db963a4e3169042d1a54fa71061fa87

Step 3:

The command prints a URL that opens Jupyter Notebook in the browser.

Launch Jupyter by opening that URL in the browser.

Create a new Python 3 notebook and execute the PySpark program.

Select Python 3.

The SparkSession will be active.

Way 2: Through Anaconda:

Step 1: Install Gnu On Windows

1. Download and install Gnu On Windows (Gow) from https://github.com/bmatzelle/gow/releases/download/v0.8.0/Gow-0.8.0.exe. Select the default options when prompted during the installation of Gow.

Step 2: Install Anaconda and Jupyter Notebook

  1. Download and install Anaconda.

Select python 3.7 version.

Once downloaded, the file name looks like:

Anaconda3-2020.02-Windows-x86_64

Select both options when prompted during the installation of Anaconda.

  2. Open “Anaconda Prompt” by finding it in the Windows (Start) Menu.

Step 3: Anaconda with Jupyter notebook

Install findspark via conda to access the Spark instance from a Jupyter notebook. Check the current version on Anaconda Cloud; at the time of writing:

conda install -c conda-forge findspark
(base) C:\Users\cloudvikas>conda install -c conda-forge findspark
Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: C:\Users\cloudvikas\Anaconda3

  added / updated specs:
    - findspark


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    certifi-2019.11.28         |           py37_0         148 KB  conda-forge
    conda-4.8.3                |   py37hc8dfbb8_0         3.1 MB  conda-forge
    findspark-1.3.0            |             py_1           6 KB  conda-forge
    python_abi-3.7             |          1_cp37m           4 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         3.2 MB

The following NEW packages will be INSTALLED:

  findspark          conda-forge/noarch::findspark-1.3.0-py_1
  python_abi         conda-forge/win-64::python_abi-3.7-1_cp37m

The following packages will be UPDATED:

  conda                       pkgs/main::conda-4.8.2-py37_0 --> conda-forge::conda-4.8.3-py37hc8dfbb8_0

The following packages will be SUPERSEDED by a higher-priority channel:

  certifi                                         pkgs/main --> conda-forge


Proceed ([y]/n)? y
(base) C:\Users\cloudvikas>

Step 4: Download and Install Spark

Go to the Apache Spark download page and download the .tgz file for version 2.3.2.

Extract the file to your chosen directory (7-Zip can open .tgz). In my case, it was C:\spark. There is another compressed directory inside the tar; extract it (into the same place) as well.

Set up the environment variables:

SPARK_HOME  = C:\spark\spark-2.3.2-bin-hadoop2.7
HADOOP_HOME = C:\spark\spark-2.3.2-bin-hadoop2.7

Add the following path to PATH environment variable:

C:\spark\spark-2.3.2-bin-hadoop2.7\bin

Step 5: Download and setup winutils.exe

In the Hadoop binaries repository at https://github.com/steveloughran/winutils, choose your Hadoop version, go to bin, and download the winutils.exe file. In my case: https://github.com/steveloughran/winutils/blob/master/hadoop-2.7.1/bin/winutils.exe

Save winutils.exe into the bin directory of your Spark installation (SPARK_HOME\bin). In my case: C:\spark\spark-2.3.2-bin-hadoop2.7\bin.

Step 6: PySpark with Jupyter notebook

findspark was already installed in Step 3; if needed, install it again with:

conda install -c conda-forge findspark

Open your python jupyter notebook, and write inside:

import findspark
findspark.init()
import pyspark
findspark.find()

The last line will output the SPARK_HOME path. It is just a test; you can delete it.

In this way, we can set up Jupyter using Anaconda and execute a simple PySpark program.

QUESTION 5: How to connect to a MySQL database through a Jupyter notebook?

Prerequisite:

Install MySQL Workbench:

https://mysql-workbench.en.uptodown.com/windows/download/269553

After downloading, the file name will be:

mysql-workbench-6-3-4-64-bit-multi-win

To install this, we also have to install the software below:

https://visualstudio.microsoft.com/downloads/?q=Visual+C%2B%2B+Redistributable+for+Visual+Studio+2019

You can skip this if it is already installed.

Once everything is installed, open MySQL Workbench.

Run the code below in Jupyter:

from pyspark.sql import SparkSession

# create the SparkSession driver process
sparkdriver = SparkSession.builder.master('local').\
    config('spark.jars.packages', 'mysql:mysql-connector-java:5.1.44').\
    appName('cloudvikas').getOrCreate()

# read a MySQL table over JDBC
df1 = sparkdriver.read.format('jdbc').\
    option('url', 'jdbc:mysql://localhost:3306').\
    option('driver', 'com.mysql.jdbc.Driver').\
    option('user', 'root').\
    option('password', 'cloudvikas').\
    option('dbtable', 'dbsample.cloudvikastable').\
    load()
df1.printSchema()
df1.show(5)

QUESTION 6: How will you read a JSON file in PySpark?

Ans:

Step 1: Create a JSON file:

[{
  "id": 1,
  "first_name": "Jeanette",
  "last_name": "Penddreth",
  "email": "vikas1",
  "gender": "Female",
  "ip_address": "26.58.193.2"
}, {
  "id": 2,
  "first_name": "Giavani",
  "last_name": "Frediani",
  "email": "vikas2",
  "gender": "Male",
  "ip_address": "229.179.4.212"
}, {
  "id": 3,
  "first_name": "Noell",
  "last_name": "Bea",
  "email": "vikas3",
  "gender": "Female",
  "ip_address": "180.66.162.255"
}, {
  "id": 4,
  "first_name": "Willard",
  "last_name": "Valek",
  "email": "vikas4",
  "gender": "Male",
  "ip_address": "67.76.188.26"
}]

Step 2: Write Spark SQL code:

from pyspark.sql import SparkSession
from pyspark.sql.types import *
import os


os.environ['HADOOP_HOME']='C:\\hadoop'
spark=SparkSession.builder.master('local').config('spark.jars.packages','mysql:mysql-connector-java:5.1.44,com.databricks:spark-xml_2.11:0.4.1').appName('demoapp').getOrCreate()
print(spark)

# A JSON dataset is pointed to by path.
# The path can be either a single text file or a directory storing text files
path = "C:\\Users\\cloudvikas\\Downloads\\example_2.json"
df1 = spark.read.json(path)

# The inferred schema can be visualized using the printSchema() method

df1.printSchema()

# Creates a temporary view using the DataFrame

df1.createOrReplaceTempView("student")
# SQL statements can be run by using the sql methods provided by spark

data1=spark.sql("select * from student")
data1.show()

Step 3: Execute this script:

We are executing it in PyCharm.

There is an issue found while executing, and it is related to the JSON file:

Since Spark 2.3, queries over raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column. We must keep the data on a single line when using a JSON file.

Reason: while processing JSON data, Spark considers each new line a complete JSON document, so it fails here. You should keep the complete JSON on a single line in compact form by removing all whitespace and newlines.

Like

{"a":{"b":1}}

If you want multiple jsons in a single file keep them like this

{"a":{"b":1}}
{"a":{"b":2}}
{"a":{"b":3}} 

Let's update the file that way, through Notepad++ -> Edit -> Blank Operations:


[{ "id": 1, "first_name": "Jeanette", "last_name": "Penddreth", "email": "vikas1", "gender": "Female", "ip_address": "26.58.193.2" }, { "id": 2, "first_name": "Giavani", "last_name": "Frediani", "email": "vikas2", "gender": "Male", "ip_address": "229.179.4.212" }, { "id": 3, "first_name": "Noell", "last_name": "Bea", "email": "vikas3", "gender": "Female", "ip_address": "180.66.162.255" }, { "id": 4, "first_name": "Willard", "last_name": "Valek", "email": "vikas4", "gender": "Male", "ip_address": "67.76.188.26" }]
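The same cleanup can be scripted instead of using Notepad++. A sketch with the standard library (`to_json_lines` and the file paths are placeholders added here for illustration); it writes one compact JSON object per line, which Spark's default reader expects:

```python
import json

def to_json_lines(records):
    # one compact JSON object per line (JSON Lines format)
    return "\n".join(json.dumps(r, separators=(",", ":")) for r in records)

def compact_file(src, dst):
    # rewrite a pretty-printed JSON array as JSON Lines; paths are placeholders
    with open(src) as f:
        records = json.load(f)
    with open(dst, "w") as f:
        f.write(to_json_lines(records))
```

For example, `compact_file("example_2.json", "example_2_compact.json")` would produce a file Spark can read line by line.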

Step 4: Execute the same script again:

The script now executes successfully and produces the output.

Spark SQL can automatically infer the schema of a JSON dataset and load it as a DataFrame. This conversion can be done using SparkSession.read.json on a JSON file.

Note that a file offered as a JSON file here is not a typical JSON file: each line must contain a separate, self-contained, valid JSON object. For a regular multi-line JSON file, set the multiLine option to True.

QUESTION 7: How to store output to a MySQL table?

First we have to create a DataFrame with some columns (the column count depends on the schema of the output). Then we write the DataFrame data into the MySQL table. Here I am assuming two columns:

df_local2 = df_local.select('column1', 'column2')
df_local2.write.format('jdbc').\
    option('url', 'jdbc:mysql://localhost:3306').\
    option('driver', 'com.mysql.jdbc.Driver').\
    option('user', 'root').\
    option('password', 'vikas').\
    option('dbtable', 'dbsample.table').\
    save()
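Re-running this job fails by default because the target table already exists; setting the save mode explicitly handles that (the same idea answers the saveAsTable overwrite issue from QUESTION 3). A sketch with a hypothetical helper name and assumed connection values:

```python
def jdbc_writer(df, table, user, password, url="jdbc:mysql://localhost:3306"):
    # build a JDBC write that overwrites the target table on re-runs;
    # connection values here are assumptions for illustration
    return (df.write.format("jdbc")
              .option("url", url)
              .option("driver", "com.mysql.jdbc.Driver")
              .option("user", user)
              .option("password", password)
              .option("dbtable", table)
              .mode("overwrite"))

# usage sketch: jdbc_writer(df_local2, "dbsample.table", "root", "vikas").save()
```

Besides "overwrite", Spark also supports the "append", "ignore", and "error" (default) save modes.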