SPARK RDD – Interview Questions-5

Question 4) How to launch Jupyter and execute a simple PySpark program?

Ans: We are going to discuss two ways to implement this.

Way 1:

Step 1: Install Python 3.7.5 and run the command below in cmd to install Jupyter:

pip install jupyter

First navigate to the Python Scripts folder:

C:\Users\cloudvikas\AppData\Local\Programs\Python\Python37\Scripts

Then execute the command in cmd:

C:\Spark\spark-2.4.5-bin-hadoop2.6\bin>pip install jupyter

Once it is installed, we can move on to Step 2.

Step 2: Launch Jupyter Notebook.

Navigate to the Python -> Scripts folder and open cmd there.

Run the jupyter notebook command in cmd:

C:\Python27\Scripts>jupyter notebook

    To access the notebook, open this file in a browser:
        file:///C:/Users/cloudvikas/AppData/Roaming/jupyter/runtime/nbserver-12400-open.html
    Or copy and paste one of these URLs:
        http://localhost:8888/?token=e20c5af58e28aef74db963a4e3169042d1a54fa71061fa87

Step 3:

The output shows a URL link; it opens Jupyter Notebook in the browser.

Launch Jupyter by opening that URL in the browser.

Create a new Python 3 notebook and execute a PySpark program.

Select Python 3 as the kernel; once you create a SparkSession in the notebook, it stays active for the session.
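For example, a minimal PySpark program to try in the new notebook could look like the sketch below (assuming the pyspark package is available to this Python installation, e.g. via pip install pyspark; the sample rows are just placeholders):

from pyspark.sql import SparkSession

# Create (or reuse) a local SparkSession
spark = SparkSession.builder.master('local').appName('jupyter-test').getOrCreate()

# Build a small DataFrame and display it to confirm Spark is working
df = spark.createDataFrame([(1, 'cloudvikas'), (2, 'spark')], ['id', 'name'])
df.show()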

Way 2: Through Anaconda:

Step 1: Install Gnu On Windows

1. Download and install Gnu On Windows (Gow) from https://github.com/bmatzelle/gow/releases/download/v0.8.0/Gow-0.8.0.exe. Select the default options when prompted during the installation of Gow.

Step 2: Install Anaconda and Jupyter Notebook

  1. Download and install Anaconda.

Select the Python 3.7 version.

Once it is downloaded, the file name looks like:

Anaconda3-2020.02-Windows-x86_64

Select both options when prompted during the installation of Anaconda.

  2. Open “Anaconda Prompt” by finding it in the Windows (Start) Menu.

Step 3: Anaconda with Jupyter notebook

Install findspark via conda so that the Spark instance can be accessed from Jupyter Notebook. Check the current package on Anaconda Cloud. At the time of writing:

conda install -c conda-forge findspark
(base) C:\Users\cloudvikas>conda install -c conda-forge findspark
Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: C:\Users\cloudvikas\Anaconda3

  added / updated specs:
    - findspark


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    certifi-2019.11.28         |           py37_0         148 KB  conda-forge
    conda-4.8.3                |   py37hc8dfbb8_0         3.1 MB  conda-forge
    findspark-1.3.0            |             py_1           6 KB  conda-forge
    python_abi-3.7             |          1_cp37m           4 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         3.2 MB

The following NEW packages will be INSTALLED:

  findspark          conda-forge/noarch::findspark-1.3.0-py_1
  python_abi         conda-forge/win-64::python_abi-3.7-1_cp37m

The following packages will be UPDATED:

  conda                       pkgs/main::conda-4.8.2-py37_0 --> conda-forge::conda-4.8.3-py37hc8dfbb8_0

The following packages will be SUPERSEDED by a higher-priority channel:

  certifi                                         pkgs/main --> conda-forge


Proceed ([y]/n)? y
(base) C:\Users\cloudvikas>

Step 4: Download and Install Spark

Go to the Spark home page and download the .tgz file for version 2.3.2 from the

Apache Spark download page.

Extract the file to your chosen directory (7-Zip can open .tgz files). In my case, it was C:\spark. There is another compressed directory inside the tar; extract it into the same location as well.

Set up the environment variables:

SPARK_HOME  = C:\spark\spark-2.3.2-bin-hadoop2.7
HADOOP_HOME = C:\spark\spark-2.3.2-bin-hadoop2.7

Add the following path to PATH environment variable:

C:\spark\spark-2.3.2-bin-hadoop2.7\bin
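If you prefer to set these from inside a Python session rather than the Windows environment settings, a minimal sketch (assuming the same install path as above) is:

import os

# Point Spark and Hadoop to the extracted Spark directory (adjust to your own path)
os.environ['SPARK_HOME'] = 'C:\\spark\\spark-2.3.2-bin-hadoop2.7'
os.environ['HADOOP_HOME'] = 'C:\\spark\\spark-2.3.2-bin-hadoop2.7'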

Step 5: Download and set up winutils.exe

In the Hadoop binaries repository, https://github.com/steveloughran/winutils, choose your Hadoop version, then go to bin and download the winutils.exe file. In my case: https://github.com/steveloughran/winutils/blob/master/hadoop-2.7.1/bin/winutils.exe

Save winutils.exe into the bin directory of your Spark installation, the SPARK_HOME\bin directory. In my case: C:\spark\spark-2.3.2-bin-hadoop2.7\bin.

Step 6: PySpark with Jupyter notebook

findspark was already installed in Step 3 with:

conda install -c conda-forge findspark

Open your Python Jupyter notebook and write inside:

import findspark
# Locate the Spark installation (via SPARK_HOME) and add it to sys.path
findspark.init()
import pyspark
# Print the path of the Spark installation that was found
findspark.find()

The last line will output the SPARK_HOME path. It is just a test; you can delete it.

Run the commands below to verify the setup:
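For example, a minimal end-to-end check could look like the sketch below (the sample numbers are just placeholders):

import findspark
findspark.init()

from pyspark.sql import SparkSession

# Start a local SparkSession using the Spark installation found by findspark
spark = SparkSession.builder.master('local').appName('anaconda-test').getOrCreate()

# Run a trivial job to confirm the setup works end to end
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
print(rdd.sum())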

In this way, we can set up Jupyter using Anaconda and execute a simple PySpark program.

QUESTION 5: How to connect to a MySQL database through Jupyter Notebook?

Prerequisite:

Install MySQL Workbench:

https://mysql-workbench.en.uptodown.com/windows/download/269553

After downloading, the file name will be:

mysql-workbench-6-3-4-64-bit-multi-win

To install it, we also have to install the software below:

https://visualstudio.microsoft.com/downloads/?q=Visual+C%2B%2B+Redistributable+for+Visual+Studio+2019

(you can skip this if it is already installed).

Once it is installed, open MySQL Workbench.

Run the code below in Jupyter:

from pyspark.sql import SparkSession

# Create the SparkSession driver process, pulling in the MySQL JDBC connector
sparkdriver = SparkSession.builder.master('local').\
    config('spark.jars.packages', 'mysql:mysql-connector-java:5.1.44').\
    appName('cloudvikas').getOrCreate()

# Read the MySQL table into a DataFrame over JDBC
df1 = sparkdriver.read.format('jdbc').\
    option('url', 'jdbc:mysql://localhost:3306').\
    option('driver', 'com.mysql.jdbc.Driver').\
    option('user', 'root').\
    option('password', 'cloudvikas').\
    option('dbtable', 'dbsample.cloudvikastable').\
    load()

df1.printSchema()
df1.show(5)
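If only a subset of rows is needed, the same JDBC reader can push a query down to MySQL by passing a subquery as the dbtable option. A sketch (the id filter column is hypothetical; use a column from your own table):

# Read only the rows matched by the subquery instead of the whole table
df2 = sparkdriver.read.format('jdbc').\
    option('url', 'jdbc:mysql://localhost:3306').\
    option('driver', 'com.mysql.jdbc.Driver').\
    option('user', 'root').\
    option('password', 'cloudvikas').\
    option('dbtable', '(select * from dbsample.cloudvikastable where id > 10) as t').\
    load()
df2.show(5)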

QUESTION 6: How will you read a JSON file in PySpark?

Ans:

Step 1: Create a JSON file:

[{
  "id": 1,
  "first_name": "Jeanette",
  "last_name": "Penddreth",
  "email": "vikas1",
  "gender": "Female",
  "ip_address": "26.58.193.2"
}, {
  "id": 2,
  "first_name": "Giavani",
  "last_name": "Frediani",
  "email": "vikas2",
  "gender": "Male",
  "ip_address": "229.179.4.212"
}, {
  "id": 3,
  "first_name": "Noell",
  "last_name": "Bea",
  "email": "vikas3",
  "gender": "Female",
  "ip_address": "180.66.162.255"
}, {
  "id": 4,
  "first_name": "Willard",
  "last_name": "Valek",
  "email": "vikas4",
  "gender": "Male",
  "ip_address": "67.76.188.26"
}]

Step 2: Write Spark SQL code:

from pyspark.sql import SparkSession
from pyspark.sql.types import *
import os

# Point Spark to the local Hadoop/winutils installation
os.environ['HADOOP_HOME'] = 'C:\\hadoop'

spark = SparkSession.builder.master('local').\
    config('spark.jars.packages', 'mysql:mysql-connector-java:5.1.44,com.databricks:spark-xml_2.11:0.4.1').\
    appName('demoapp').getOrCreate()
print(spark)

# A JSON dataset is pointed to by path.
# The path can be either a single text file or a directory storing text files
Path="C:\\Users\\cloudvikas\\Downloads\\example_2.json"
df1=spark.read.json(path)

# The inferred schema can be visualized using the printSchema() method

df1.printSchema()

# Creates a temporary view using the DataFrame

df1.createOrReplaceTempView("student")
# SQL statements can be run by using the sql methods provided by spark

data1=spark.sql("select * from student")
data1.show()

Step 3: Execute this script:

We are executing it in PyCharm.

An issue is found while executing, and it is related to the JSON file:

Since Spark 2.3, queries from raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column. We must keep the data on a single line when using a JSON file.

Reason: Spark, while processing JSON data, considers each new line a complete JSON record, so the multi-line file fails to parse. You should keep your complete JSON on a single line in compact form by removing all whitespace and newlines.

Like

{"a":{"b":1}}

If you want multiple JSON records in a single file, keep them like this (one record per line):

{"a":{"b":1}}
{"a":{"b":2}}
{"a":{"b":3}} 

Let's update the file in this way, using Notepad++ -> Edit -> Blank Operations to remove the line breaks:


[{ "id": 1, "first_name": "Jeanette", "last_name": "Penddreth", "email": "vikas1", "gender": "Female", "ip_address": "26.58.193.2" }, { "id": 2, "first_name": "Giavani", "last_name": "Frediani", "email": "vikas2", "gender": "Male", "ip_address": "229.179.4.212" }, { "id": 3, "first_name": "Noell", "last_name": "Bea", "email": "vikas3", "gender": "Female", "ip_address": "180.66.162.255" }, { "id": 4, "first_name": "Willard", "last_name": "Valek", "email": "vikas4", "gender": "Male", "ip_address": "67.76.188.26" }]

Step 4: Execute the same script again:

The script executes successfully and produces the output.

Spark SQL can automatically infer the schema of a JSON dataset and load it as a DataFrame. This conversion can be done using SparkSession.read.json on a JSON file.

Note that the file that is offered as a JSON file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. For a regular multi-line JSON file, set the multiLine parameter to True.
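So instead of flattening the file, the original multi-line JSON can also be read directly. A minimal sketch, reusing the same SparkSession and path as above:

# multiLine=True lets Spark parse JSON records that span multiple lines
df_multi = spark.read.json("C:\\Users\\cloudvikas\\Downloads\\example_2.json", multiLine=True)
df_multi.printSchema()
df_multi.show()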

Question 7: How to store output to a MySQL table?

First we have to create a DataFrame with some columns (the column count depends on the schema of the output), then write the DataFrame into a MySQL table. Here I am assuming two columns.

Run the command below:

# Select the columns to persist, then write the DataFrame to MySQL over JDBC
df_local2 = df_local.select('column1', 'column2')
df_local2.write.format('jdbc').\
    option('url', 'jdbc:mysql://localhost:3306').\
    option('driver', 'com.mysql.jdbc.Driver').\
    option('user', 'root').\
    option('password', 'vikas').\
    option('dbtable', 'dbsample.table').\
    save()
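By default the save fails if the target table already exists. A write mode can be added to change that behaviour, for example (a sketch assuming you want to append to the existing table):

# Append to the existing table instead of failing when it already exists
df_local2.write.format('jdbc').\
    option('url', 'jdbc:mysql://localhost:3306').\
    option('driver', 'com.mysql.jdbc.Driver').\
    option('user', 'root').\
    option('password', 'vikas').\
    option('dbtable', 'dbsample.table').\
    mode('append').\
    save()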