Question 4) How to launch Jupyter and execute a simple PySpark program?
Ans: We are going to discuss two ways to implement this.
Way 1:
Step 1: Install Python 3.7.5 and run the command below in cmd to install Jupyter:
pip install jupyter
First navigate to:
C:\Users\cloudvikas\AppData\Local\Programs\Python\Python37\Scripts
Then execute the command in cmd:
C:\Spark\spark-2.4.5-bin-hadoop2.6\bin>pip install jupyter
Once it is installed, we can move on to Step 2.
Step 2: Start the Jupyter notebook server.
Navigate to the Python -> Scripts folder and open cmd.
Run the jupyter notebook command in cmd:
C:\Python27\Scripts>jupyter notebook
To access the notebook, open this file in a browser:
file:///C:/Users/cloudvikas/AppData/Roaming/jupyter/runtime/nbserver-12400-open.html
Or copy and paste one of these URLs:
http://localhost:8888/?token=e20c5af58e28aef74db963a4e3169042d1a54fa71061fa87
Step 3:
We can see the URL link; it opens the Jupyter notebook in the browser.
Launch Jupyter by opening that URL in the browser.
Create a new Python 3 notebook and execute a PySpark program.
Select Python 3.
The SparkSession will be active once it is created in the notebook.
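For example, a minimal PySpark program for the new notebook could look like the sketch below (this assumes the pyspark package is importable in the notebook, e.g. installed with pip install pyspark or made available through your Spark installation):

from pyspark.sql import SparkSession

# Create (or reuse) a local SparkSession
spark = SparkSession.builder.master('local').appName('cloudvikas').getOrCreate()

# Build a tiny DataFrame and display it to confirm the session works
df = spark.createDataFrame([(1, 'spark'), (2, 'jupyter')], ['id', 'name'])
df.show()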
Way 2: Through Anaconda:
Step 1: Install Gnu On Windows
1. Download and install Gnu On Windows (Gow) from https://github.com/bmatzelle/gow/releases/download/v0.8.0/Gow-0.8.0.exe. Select the default options when prompted during the installation of Gow.
Step 2: Install Anaconda and Jupyter Notebook
- Download and install Anaconda.
Select the Python 3.7 version.
Once it is downloaded, the file name looks like:
Anaconda3-2020.02-Windows-x86_64
Select both options when prompted during the installation of Anaconda.
- Open “Anaconda Prompt” by finding it in the Windows (Start) Menu.
Step 3: Anaconda with Jupyter notebook
Install findspark with conda to access the Spark instance from a Jupyter notebook. Check the current version on Anaconda Cloud. At the time of writing:
conda install -c conda-forge findspark
(base) C:\Users\cloudvikas>conda install -c conda-forge findspark
Collecting package metadata (current_repodata.json): done
Solving environment: done
## Package Plan ##
environment location: C:\Users\cloudvikas\Anaconda3
added / updated specs:
- findspark
The following packages will be downloaded:
package | build
---------------------------|-----------------
certifi-2019.11.28 | py37_0 148 KB conda-forge
conda-4.8.3 | py37hc8dfbb8_0 3.1 MB conda-forge
findspark-1.3.0 | py_1 6 KB conda-forge
python_abi-3.7 | 1_cp37m 4 KB conda-forge
------------------------------------------------------------
Total: 3.2 MB
The following NEW packages will be INSTALLED:
findspark conda-forge/noarch::findspark-1.3.0-py_1
python_abi conda-forge/win-64::python_abi-3.7-1_cp37m
The following packages will be UPDATED:
conda pkgs/main::conda-4.8.2-py37_0 --> conda-forge::conda-4.8.3-py37hc8dfbb8_0
The following packages will be SUPERSEDED by a higher-priority channel:
certifi pkgs/main --> conda-forge
Proceed ([y]/n)? y
(base) C:\Users\cloudvikas>
Step 4: Download and Install Spark
Go to the Spark home page and download the version 2.3.2 .tgz file from the Apache Spark download page.
Extract the file to your chosen directory (7-Zip can open .tgz files). In my case it was C:\spark. There is another compressed directory inside the tar; extract it there as well.
Set up the environment variables:
SPARK_HOME = C:\spark\spark-2.3.2-bin-hadoop2.7
HADOOP_HOME = C:\spark\spark-2.3.2-bin-hadoop2.7
Add the following path to PATH environment variable:
C:\spark\spark-2.3.2-bin-hadoop2.7\bin
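Alternatively, if you prefer not to change the system-wide environment variables, a sketch like the one below sets them only for the current Python/notebook session (the paths are the ones assumed above; adjust them to your installation):

import os

# Point Spark and Hadoop at the extracted Spark directory (adjust to your paths)
os.environ['SPARK_HOME'] = 'C:\\spark\\spark-2.3.2-bin-hadoop2.7'
os.environ['HADOOP_HOME'] = 'C:\\spark\\spark-2.3.2-bin-hadoop2.7'

# Make spark-submit and winutils.exe discoverable on PATH
os.environ['PATH'] = os.environ['SPARK_HOME'] + '\\bin;' + os.environ['PATH']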
Step 5: Download and setup winutils.exe
In the Hadoop binaries repository at https://github.com/steveloughran/winutils, choose your Hadoop version, then go to bin and download the winutils.exe file. In my case: https://github.com/steveloughran/winutils/blob/master/hadoop-2.7.1/bin/winutils.exe
Save winutils.exe into the bin directory of your Spark installation (the SPARK_HOME\bin directory). In my case: C:\spark\spark-2.3.2-bin-hadoop2.7\bin.
Step 6: PySpark with Jupyter notebook
findspark (installed in Step 3) gives the notebook access to the Spark instance. If it is not installed yet, run:
conda install -c conda-forge findspark
Open your Python Jupyter notebook and write inside it:
import findspark
findspark.init()
import pyspark
findspark.find()
The last line will output the SPARK_HOME path. It is just a test; you can delete it.
Run the commands below:
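For example, a minimal PySpark test (a sketch; the DataFrame contents are just illustrative) could be:

import findspark
findspark.init()

from pyspark.sql import SparkSession

# Create a local SparkSession; findspark.init() has added PySpark to sys.path
spark = SparkSession.builder.master('local').appName('cloudvikas').getOrCreate()

# Quick sanity check: build and show a small DataFrame
df = spark.createDataFrame([(1, 'spark'), (2, 'jupyter')], ['id', 'name'])
df.show()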
In this way, we can set up Jupyter using Anaconda and execute a simple PySpark program.
QUESTION 5: How to connect to a MySQL database through a Jupyter notebook?
Prerequisite:
Install MySQL Workbench.
https://mysql-workbench.en.uptodown.com/windows/download/269553
After downloading, the file name will be:
mysql-workbench-6-3-4-64-bit-multi-win
To install this, some additional prerequisite software has to be installed as well; you can skip this if it is already installed.
Once it is installed, open it.
Run the code below in Jupyter:
from pyspark.sql import SparkSession

# Create the SparkSession driver process
sparkdriver = SparkSession.builder.master('local').\
    config('spark.jars.packages', 'mysql:mysql-connector-java:5.1.44').\
    appName('cloudvikas').getOrCreate()

# Read a MySQL table over JDBC
df1 = sparkdriver.read.format('jdbc').\
    option('url', 'jdbc:mysql://localhost:3306').\
    option('driver', 'com.mysql.jdbc.Driver').\
    option('user', 'root').\
    option('password', 'cloudvikas').\
    option('dbtable', 'dbsample.cloudvikastable').\
    load()

df1.printSchema()
df1.show(5)
QUESTION 6: How will you read a JSON file in PySpark?
Ans:
Step 1: Create a JSON file:
[{
"id": 1,
"first_name": "Jeanette",
"last_name": "Penddreth",
"email": "vikas1",
"gender": "Female",
"ip_address": "26.58.193.2"
}, {
"id": 2,
"first_name": "Giavani",
"last_name": "Frediani",
"email": "vikas2",
"gender": "Male",
"ip_address": "229.179.4.212"
}, {
"id": 3,
"first_name": "Noell",
"last_name": "Bea",
"email": "vikas3",
"gender": "Female",
"ip_address": "180.66.162.255"
}, {
"id": 4,
"first_name": "Willard",
"last_name": "Valek",
"email": "vikas4",
"gender": "Male",
"ip_address": "67.76.188.26"
}]
Step 2: Write Spark SQL code:
from pyspark.sql import SparkSession
from pyspark.sql.types import *
import os
import sqlite3
os.environ['HADOOP_HOME']='C:\\hadoop'
spark=SparkSession.builder.master('local').config('spark.jars.packages','mysql:mysql-connector-java:5.1.44,com.databricks:spark-xml_2.11:0.4.1').appName('demoapp').getOrCreate()
print(spark)
# A JSON dataset is pointed to by path.
# The path can be either a single text file or a directory storing text files
Path="C:\\Users\\cloudvikas\\Downloads\\example_2.json"
df1=spark.read.json(path)
# The inferred schema can be visualized using the printSchema() method
df1.printSchema()
# Creates a temporary view using the DataFrame
df1.createOrReplaceTempView("student")
# SQL statements can be run by using the sql methods provided by spark
data1=spark.sql("select * from student")
data1.show()
Step 3: Execute this script:
We are executing it in PyCharm.
An issue is found while executing:
The issue is related to the JSON file.
Since Spark 2.3, queries over raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column. We must keep the data on a single line when using a JSON file.
Reason: while processing JSON data, Spark considers each new line to be a complete JSON document, so the read fails. You should keep your complete JSON on a single line in compact form by removing all whitespace and newlines.
Like this:
{"a":{"b":1}}
If you want multiple jsons in a single file keep them like this
{"a":{"b":1}}
{"a":{"b":2}}
{"a":{"b":3}}
Let's update the file in that way, through Notepad++ -> Edit -> Blank Operations.
[{ "id": 1, "first_name": "Jeanette", "last_name": "Penddreth", "email": "vikas1", "gender": "Female", "ip_address": "26.58.193.2" }, { "id": 2, "first_name": "Giavani", "last_name": "Frediani", "email": "vikas2", "gender": "Male", "ip_address": "229.179.4.212" }, { "id": 3, "first_name": "Noell", "last_name": "Bea", "email": "vikas3", "gender": "Female", "ip_address": "180.66.162.255" }, { "id": 4, "first_name": "Willard", "last_name": "Valek", "email": "vikas4", "gender": "Male", "ip_address": "67.76.188.26" }]
Step 4: Execute the same script again:
The script is executed successfully and we get the output.
Spark SQL can automatically infer the schema of a JSON dataset and load it as a DataFrame. This conversion can be done using SparkSession.read.json on a JSON file. Note that the file that is offered as a JSON file is not a typical JSON file: each line must contain a separate, self-contained valid JSON object. For a regular multi-line JSON file, set the multiLine parameter to True.
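For instance, the original pretty-printed (multi-line) version of the file could be read directly, without flattening it, by passing multiLine=True (a sketch reusing the same path and SparkSession as in the script above):

# multiLine=True lets Spark parse JSON documents that span multiple lines
df_multi = spark.read.json("C:\\Users\\cloudvikas\\Downloads\\example_2.json", multiLine=True)
df_multi.printSchema()
df_multi.show()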
Question 7: How to store output to a MySQL table?
Run the command below:
First we have to create a DataFrame with some columns (the column count depends on the output schema). Then we write the DataFrame data into the MySQL table. Here I am assuming two columns.
df_local2 = df_local.select('column1', 'column2')

# Write the DataFrame to a MySQL table over JDBC
df_local2.write.format('jdbc').\
    option('url', 'jdbc:mysql://localhost:3306').\
    option('driver', 'com.mysql.jdbc.Driver').\
    option('user', 'root').\
    option('password', 'vikas').\
    option('dbtable', 'dbsample.table').\
    save()
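Note that save() fails by default if the target table already exists. If needed, a save mode can be specified; for example, append (just one of the available modes) adds the rows to the existing table:

df_local2.write.format('jdbc').\
    option('url', 'jdbc:mysql://localhost:3306').\
    option('driver', 'com.mysql.jdbc.Driver').\
    option('user', 'root').\
    option('password', 'vikas').\
    option('dbtable', 'dbsample.table').\
    mode('append').\
    save()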