PySpark Interview Questions

Question 1: How to create a Spark project in PyCharm and run a simple program?

Ans: First, install PyCharm and Python, then follow the steps below.

Step 1: Download and install PyCharm Community Edition from:

https://www.jetbrains.com/pycharm/download/#section=windows

Python version: 2.7

Step 2: Create a project named PySparkPractice and navigate to Settings.

Ensure that the interpreter is set to Python 2.7.

Step 3: Create the SQLPackege package and a demo file.

Run the program below and verify the output:

from pyspark.sql import SparkSession
from pyspark.sql.types import *
import os
import sqlite3

# Point Spark at the local Hadoop/winutils installation (Windows only)
os.environ['HADOOP_HOME'] = 'C:\\hadoop'

# Build a local SparkSession with the MySQL connector and spark-xml packages
spark = SparkSession.builder.master('local').config('spark.jars.packages', 'mysql:mysql-connector-java:5.1.44,com.databricks:spark-xml_2.11:0.4.1').appName('demoapp').getOrCreate()
print(spark)

# Read an XML file, treating each <contact> element as a row
df = spark.read.format('xml').options(rowTag='contact').load('C:\\Users\\cloudvikas\\Documents\\booknew.xml')
df.show()
df.printSchema()

# Select a nested field and save it as a Parquet table
df2 = df.select('communications.communication.emailid')
df2.show()
df2.coalesce(1).write.saveAsTable('output2', format='parquet', mode='overwrite', path='C:\\Users\\cloudvikas\\PycharmProjects\\Project1\\SQLPackege\\spark-warehouse\\ouput2')

# Register the DataFrame as a temporary table and query it with Spark SQL
df2.registerTempTable("Students")
result = spark.sql("select * from Students")
print("result")
result.show()
spark.stop()

Question 2:

Why do we have to configure winutils as the Hadoop path in a Spark program?

Ans: If we want to use Hadoop functionality in a Windows environment, we have to configure the Hadoop path to point at winutils:

Step 1: Write the lines below in your driver script:

import os
os.environ['HADOOP_HOME']='C:\\hadoop'

Step 2: Download winutils.exe and place it in the bin folder under the path configured above (C:\hadoop\bin):

https://github.com/steveloughran/winutils/blob/master/hadoop-2.7.1/bin/winutils.exe
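
A minimal sketch of a driver that sets up winutils before creating the SparkSession (assuming winutils.exe has been saved to C:\hadoop\bin; the app name is a placeholder):

import os

# Assumption: winutils.exe is located at C:\hadoop\bin\winutils.exe
os.environ['HADOOP_HOME'] = 'C:\\hadoop'
os.environ['PATH'] = os.environ['HADOOP_HOME'] + '\\bin;' + os.environ['PATH']

from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local').appName('winutils-check').getOrCreate()
print(spark)  # should print the SparkSession object without chmod/winutils errors
spark.stop()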

If you skip this step, your program can fail with an error like:

IOException: (null) entry in command string: null chmod 0644

Question 3:

How to overwrite existing files with the saveAsTable command?

Ans: Pass mode='overwrite' to saveAsTable:

df2.coalesce(1).write.saveAsTable('output2', format='parquet', mode='overwrite', path='C:\\Users\\cloudvikas\\PycharmProjects\\Project1\\SQLPackege\\spark-warehouse\\ouput2')
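
The same write can also be expressed with the chained DataFrameWriter API; a minimal sketch that reuses the table name and path from above:

# mode('overwrite') replaces existing data; other valid modes are 'append', 'ignore' and 'error'
df2.coalesce(1).write \
    .mode('overwrite') \
    .format('parquet') \
    .option('path', 'C:\\Users\\cloudvikas\\PycharmProjects\\Project1\\SQLPackege\\spark-warehouse\\ouput2') \
    .saveAsTable('output2')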

Question 4:

How to launch Jupyter and execute a simple PySpark program?

Ans: We are going to discuss two ways to implement this.

Way 1:

Step 1: Install Python 3.7.5 and run the command below in cmd to install Jupyter:

pip install jupyter

First navigate to:

C:\Users\cloudvikas\AppData\Local\Programs\Python\Python37\Scripts

and then execute the command in cmd:

C:\Spark\spark-2.4.5-bin-hadoop2.6\bin>pip install jupyter
Collecting jupyter
  Downloading https://files.pythonhosted.org/packages/83/df/0f5dd132200728a86190397e1ea87cd76244e42d39ec5e88efd25b2abd7e/jupyter-1.0.0-py2.py3-none-any.whl
Collecting qtconsole (from jupyter)
  Downloading https://files.pythonhosted.org/packages/d6/de/2a0bda85367881e27370a206a561326a99fbb05ab9402f4c4ad59761eec4/qtconsole-4.7.1-py2.py3-none-any.whl (117kB)
     |████████████████████████████████| 122kB 939kB/s
Collecting jupyter-console (from jupyter)
  Downloading https://files.pythonhosted.org/packages/0a/89/742fa5a80b552ffcb6a8922712697c6e6828aee7b91ee4ae2b79f00f8401/jupyter_console-6.1.0-py2.py3-none-any.whl
Collecting notebook (from jupyter)
  Downloading https://files.pythonhosted.org/packages/b1/f1/0a67f09ef53a342403ffa66646ee39273e0ac79ffa5de5dbe2f3e28b5bdf/notebook-6.0.3-py3-none-any.whl (9.7MB)
     |████████████████████████████████| 9.7MB 731kB/s
Collecting ipykernel (from jupyter)
  Downloading https://files.pythonhosted.org/packages/d7/62/d1a5d654b7a21bd3eb99be1b59a608cc18a7a08ed88495457a87c40a0495/ipykernel-5.1.4-py3-none-any.whl (116kB)
     |████████████████████████████████| 122kB 1.3MB/s
Collecting ipywidgets (from jupyter)
  Downloading https://files.pythonhosted.org/packages/56/a0/dbcf5881bb2f51e8db678211907f16ea0a182b232c591a6d6f276985ca95/ipywidgets-7.5.1-py2.py3-none-any.whl (121kB)
     |████████████████████████████████| 122kB 1.1MB/s
Collecting nbconvert (from jupyter)
  Downloading https://files.pythonhosted.org/packages/79/6c/05a569e9f703d18aacb89b7ad6075b404e8a4afde2c26b73ca77bb644b14/nbconvert-5.6.1-py2.py3-none-any.whl (455kB)
     |████████████████████████████████| 460kB 1.3MB/s
Collecting jupyter-core (from qtconsole->jupyter)
  Downloading https://files.pythonhosted.org/packages/63/0d/df2d17cdf389cea83e2efa9a4d32f7d527ba78667e0153a8e676e957b2f7/jupyter_core-4.6.3-py2.py3-none-any.whl (83kB)
     |████████████████████████████████| 92kB 2.0MB/s
Collecting ipython-genutils (from qtconsole->jupyter)
  Downloading https://files.pythonhosted.org/packages/fa/bc/9bd3b5c2b4774d5f33b2d544f1460be9df7df2fe42f352135381c347c69a/ipython_genutils-0.2.0-py2.py3-none-any.whl
Collecting traitlets (from qtconsole->jupyter)
  Downloading https://files.pythonhosted.org/packages/ca/ab/872a23e29cec3cf2594af7e857f18b687ad21039c1f9b922fac5b9b142d5/traitlets-4.3.3-py2.py3-none-any.whl (75kB)
     |████████████████████████████████| 81kB 2.6MB/s
Collecting pygments (from qtconsole->jupyter)
  Downloading https://files.pythonhosted.org/packages/2d/68/106af3ae51daf807e9cdcba6a90e518954eb8b70341cee52995540a53ead/Pygments-2.6.1-py3-none-any.whl (914kB)
     |████████████████████████████████| 921kB 1.6MB/s
Collecting qtpy (from qtconsole->jupyter)
  Downloading https://files.pythonhosted.org/packages/cd/fd/9972948f02e967b691cc0ca1f26124826a3b88cb38f412a8b7935b8c3c72/QtPy-1.9.0-py2.py3-none-any.whl (54kB)
     |████████████████████████████████| 61kB 2.0MB/s
Collecting jupyter-client>=4.1 (from qtconsole->jupyter)
  Downloading https://files.pythonhosted.org/packages/40/75/4c4eb43749e59db3c1c7932b50eaf8c4b8219b1b5644fe379ea796f8dbe5/jupyter_client-6.0.0-py3-none-any.whl (104kB)
     |████████████████████████████████| 112kB 3.3MB/s
Collecting ipython (from jupyter-console->jupyter)
  Downloading https://files.pythonhosted.org/packages/61/6f/69f1eec859ce48a86660529b166b6ea466f0f4ab98e4fc0807b835aa22c6/ipython-7.13.0-py3-none-any.whl (780kB)
     |████████████████████████████████| 788kB 1.7MB/s
Collecting prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0 (from jupyter-console->jupyter)
  Downloading https://files.pythonhosted.org/packages/ab/29/d744cee89937b7e52a5c20ca237a6c77298f757965eb3eb0c653df1bfb14/prompt_toolkit-3.0.4-py3-none-any.whl (351kB)
     |████████████████████████████████| 358kB 1.1MB/s
Collecting prometheus-client (from notebook->jupyter)
  Downloading https://files.pythonhosted.org/packages/b3/23/41a5a24b502d35a4ad50a5bb7202a5e1d9a0364d0c12f56db3dbf7aca76d/prometheus_client-0.7.1.tar.gz
Collecting Send2Trash (from notebook->jupyter)
  Downloading https://files.pythonhosted.org/packages/49/46/c3dc27481d1cc57b9385aff41c474ceb7714f7935b1247194adae45db714/Send2Trash-1.5.0-py3-none-any.whl
Collecting pyzmq>=17 (from notebook->jupyter)
  Downloading https://files.pythonhosted.org/packages/26/35/d50bbc114c4da8a344efedc82eeb010b6eb2cc1906ccf89c725259a0d0fe/pyzmq-19.0.0-cp38-cp38-win32.whl (910kB)
     |████████████████████████████████| 911kB 1.1MB/s
Collecting jinja2 (from notebook->jupyter)
  Downloading https://files.pythonhosted.org/packages/27/24/4f35961e5c669e96f6559760042a55b9bcfcdb82b9bdb3c8753dbe042e35/Jinja2-2.11.1-py2.py3-none-any.whl (126kB)
     |████████████████████████████████| 133kB 3.2MB/s
Collecting nbformat (from notebook->jupyter)
  Downloading https://files.pythonhosted.org/packages/ac/eb/de575b7a64e7ab8d8c95e4c180ccc36deda3f1379186c4ee7adf6c2f1586/nbformat-5.0.4-py3-none-any.whl (169kB)
     |████████████████████████████████| 174kB 3.2MB/s
Collecting terminado>=0.8.1 (from notebook->jupyter)
  Downloading https://files.pythonhosted.org/packages/ff/96/1d9a2c23990aea8f8e0b5c3b6627d03196a73771a17a2d9860bbe9823ab6/terminado-0.8.3-py2.py3-none-any.whl
Collecting tornado>=5.0 (from notebook->jupyter)
  Downloading https://files.pythonhosted.org/packages/30/38/91349845eea916575ddbe59a08866de3661b1e4564d42a845545066d51c4/tornado-6.0.4-cp38-cp38-win32.whl (416kB)
     |████████████████████████████████| 419kB 3.3MB/s
Collecting widgetsnbextension~=3.5.0 (from ipywidgets->jupyter)
  Downloading https://files.pythonhosted.org/packages/6c/7b/7ac231c20d2d33c445eaacf8a433f4e22c60677eb9776c7c5262d7ddee2d/widgetsnbextension-3.5.1-py2.py3-none-any.whl (2.2MB)
     |████████████████████████████████| 2.2MB 1.1MB/s
Collecting mistune<2,>=0.8.1 (from nbconvert->jupyter)
  Downloading https://files.pythonhosted.org/packages/09/ec/4b43dae793655b7d8a25f76119624350b4d65eb663459eb9603d7f1f0345/mistune-0.8.4-py2.py3-none-any.whl
Collecting defusedxml (from nbconvert->jupyter)
  Downloading https://files.pythonhosted.org/packages/06/74/9b387472866358ebc08732de3da6dc48e44b0aacd2ddaa5cb85ab7e986a2/defusedxml-0.6.0-py2.py3-none-any.whl
Collecting testpath (from nbconvert->jupyter)
  Downloading https://files.pythonhosted.org/packages/1b/9e/1a170feaa54f22aeb5a5d16c9015e82234275a3c8ab630b552493f9cb8a9/testpath-0.4.4-py2.py3-none-any.whl (163kB)
     |████████████████████████████████| 174kB 652kB/s
Collecting entrypoints>=0.2.2 (from nbconvert->jupyter)
  Downloading https://files.pythonhosted.org/packages/ac/c6/44694103f8c221443ee6b0041f69e2740d89a25641e62fb4f2ee568f2f9c/entrypoints-0.3-py2.py3-none-any.whl
Collecting pandocfilters>=1.4.1 (from nbconvert->jupyter)
  Downloading https://files.pythonhosted.org/packages/4c/ea/236e2584af67bb6df960832731a6e5325fd4441de001767da328c33368ce/pandocfilters-1.4.2.tar.gz
Collecting bleach (from nbconvert->jupyter)
  Downloading https://files.pythonhosted.org/packages/6e/87/f88b0e33914420fe11fe8b820c10d045a342c2a015e79ad8309de4bba820/bleach-3.1.1-py2.py3-none-any.whl (150kB)
     |████████████████████████████████| 153kB 819kB/s
Collecting pywin32>=1.0; sys_platform == "win32" (from jupyter-core->qtconsole->jupyter)
  Downloading https://files.pythonhosted.org/packages/9f/cb/d693f7cdaed51d83bfec5c77a9fa895c1a5a18cf242ba53b06b98ee366dd/pywin32-227-cp38-cp38-win32.whl (8.4MB)
     |████████████████████████████████| 8.4MB 504kB/s
Collecting decorator (from traitlets->qtconsole->jupyter)
  Downloading https://files.pythonhosted.org/packages/ed/1b/72a1821152d07cf1d8b6fce298aeb06a7eb90f4d6d41acec9861e7cc6df0/decorator-4.4.2-py2.py3-none-any.whl
Collecting six (from traitlets->qtconsole->jupyter)
  Downloading https://files.pythonhosted.org/packages/65/eb/1f97cb97bfc2390a276969c6fae16075da282f5058082d4cb10c6c5c1dba/six-1.14.0-py2.py3-none-any.whl
Collecting python-dateutil>=2.1 (from jupyter-client>=4.1->qtconsole->jupyter)
  Downloading https://files.pythonhosted.org/packages/d4/70/d60450c3dd48ef87586924207ae8907090de0b306af2bce5d134d78615cb/python_dateutil-2.8.1-py2.py3-none-any.whl (227kB)
     |████████████████████████████████| 235kB 409kB/s
Collecting colorama; sys_platform == "win32" (from ipython->jupyter-console->jupyter)
  Downloading https://files.pythonhosted.org/packages/c9/dc/45cdef1b4d119eb96316b3117e6d5708a08029992b2fee2c143c7a0a5cc5/colorama-0.4.3-py2.py3-none-any.whl
Collecting pickleshare (from ipython->jupyter-console->jupyter)
  Downloading https://files.pythonhosted.org/packages/9a/41/220f49aaea88bc6fa6cba8d05ecf24676326156c23b991e80b3f2fc24c77/pickleshare-0.7.5-py2.py3-none-any.whl
Collecting backcall (from ipython->jupyter-console->jupyter)
  Downloading https://files.pythonhosted.org/packages/84/71/c8ca4f5bb1e08401b916c68003acf0a0655df935d74d93bf3f3364b310e0/backcall-0.1.0.tar.gz
Requirement already satisfied: setuptools>=18.5 in c:\users\cloudvikas\appdata\local\programs\python\python38-32\lib\site-packages (from ipython->jupyter-console->jupyter) (41.2.0)
Collecting jedi>=0.10 (from ipython->jupyter-console->jupyter)
  Downloading https://files.pythonhosted.org/packages/01/67/333e2196b70840f411fd819407b4e98aa3150c2bd24c52154a451f912ef2/jedi-0.16.0-py2.py3-none-any.whl (1.1MB)
     |████████████████████████████████| 1.1MB 2.2MB/s
Collecting wcwidth (from prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0->jupyter-console->jupyter)
  Downloading https://files.pythonhosted.org/packages/58/b4/4850a0ccc6f567cc0ebe7060d20ffd4258b8210efadc259da62dc6ed9c65/wcwidth-0.1.8-py2.py3-none-any.whl
Collecting MarkupSafe>=0.23 (from jinja2->notebook->jupyter)
  Downloading https://files.pythonhosted.org/packages/93/b8/95b1c38f5b00ed2c0d16cf65f2b07a5ae73eeacf66d2010c0e934737d1d9/MarkupSafe-1.1.1-cp38-cp38-win32.whl
Collecting jsonschema!=2.5.0,>=2.4 (from nbformat->notebook->jupyter)
  Downloading https://files.pythonhosted.org/packages/c5/8f/51e89ce52a085483359217bc72cdbf6e75ee595d5b1d4b5ade40c7e018b8/jsonschema-3.2.0-py2.py3-none-any.whl (56kB)
     |████████████████████████████████| 61kB 787kB/s
Collecting pywinpty>=0.5; os_name == "nt" (from terminado>=0.8.1->notebook->jupyter)
  Downloading https://files.pythonhosted.org/packages/5d/97/8e43c2152a638cdb83d45644eb125c752abe67249f94bb3e3e29b0709685/pywinpty-0.5.7.tar.gz (49kB)
     |████████████████████████████████| 51kB 1.7MB/s
Collecting webencodings (from bleach->nbconvert->jupyter)
  Downloading https://files.pythonhosted.org/packages/f4/24/2a3e3df732393fed8b3ebf2ec078f05546de641fe1b667ee316ec1dcf3b7/webencodings-0.5.1-py2.py3-none-any.whl
Collecting parso>=0.5.2 (from jedi>=0.10->ipython->jupyter-console->jupyter)
  Downloading https://files.pythonhosted.org/packages/da/15/888f80e429a971d3838124adde705d7b07650aa3a59f4db07d61f653b8cd/parso-0.6.2-py2.py3-none-any.whl (97kB)
     |████████████████████████████████| 102kB 1.1MB/s
Collecting attrs>=17.4.0 (from jsonschema!=2.5.0,>=2.4->nbformat->notebook->jupyter)
  Downloading https://files.pythonhosted.org/packages/a2/db/4313ab3be961f7a763066401fb77f7748373b6094076ae2bda2806988af6/attrs-19.3.0-py2.py3-none-any.whl
Collecting pyrsistent>=0.14.0 (from jsonschema!=2.5.0,>=2.4->nbformat->notebook->jupyter)
  Downloading https://files.pythonhosted.org/packages/90/aa/cdcf7ef88cc0f831b6f14c8c57318824c9de9913fe8de38e46a98c069a35/pyrsistent-0.15.7.tar.gz (107kB)
     |████████████████████████████████| 112kB 939kB/s
Installing collected packages: pywin32, decorator, ipython-genutils, six, traitlets, jupyter-core, tornado, wcwidth, prompt-toolkit, colorama, pygments, pickleshare, backcall, parso, jedi, ipython, pyzmq, python-dateutil, jupyter-client, ipykernel, qtpy, qtconsole, jupyter-console, prometheus-client, Send2Trash, MarkupSafe, jinja2, attrs, pyrsistent, jsonschema, nbformat, pywinpty, terminado, mistune, defusedxml, testpath, entrypoints, pandocfilters, webencodings, bleach, nbconvert, notebook, widgetsnbextension, ipywidgets, jupyter
  Running setup.py install for backcall ... done
  Running setup.py install for prometheus-client ... done
  Running setup.py install for pyrsistent ... done
  Running setup.py install for pywinpty ... done
  Running setup.py install for pandocfilters ... done
Successfully installed MarkupSafe-1.1.1 Send2Trash-1.5.0 attrs-19.3.0 backcall-0.1.0 bleach-3.1.1 colorama-0.4.3 decorator-4.4.2 defusedxml-0.6.0 entrypoints-0.3 ipykernel-5.1.4 ipython-7.13.0 ipython-genutils-0.2.0 ipywidgets-7.5.1 jedi-0.16.0 jinja2-2.11.1 jsonschema-3.2.0 jupyter-1.0.0 jupyter-client-6.0.0 jupyter-console-6.1.0 jupyter-core-4.6.3 mistune-0.8.4 nbconvert-5.6.1 nbformat-5.0.4 notebook-6.0.3 pandocfilters-1.4.2 parso-0.6.2 pickleshare-0.7.5 prometheus-client-0.7.1 prompt-toolkit-3.0.4 pygments-2.6.1 pyrsistent-0.15.7 python-dateutil-2.8.1 pywin32-227 pywinpty-0.5.7 pyzmq-19.0.0 qtconsole-4.7.1 qtpy-1.9.0 six-1.14.0 terminado-0.8.3 testpath-0.4.4 tornado-6.0.4 traitlets-4.3.3 wcwidth-0.1.8 webencodings-0.5.1 widgetsnbextension-3.5.1
WARNING: You are using pip version 19.2.3, however version 20.0.2 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.

Once the installation finishes, move on to Step 2.

Step 2: Start the Jupyter notebook server.

Navigate to the Python Scripts folder, open cmd, and run the jupyter notebook command:

C:\Python27\Scripts>jupyter notebook
[I 10:43:00.847 NotebookApp] Writing notebook server cookie secret to C:\Users\cloudvikas\AppData\Roaming\jupyter\runtime\notebook_cookie_secret
[W 10:43:01.988 NotebookApp] Terminals not available (error was No module named 'winpty.cywinpty')
[I 10:43:01.989 NotebookApp] Serving notebooks from local directory: C:\Python27\Scripts
[I 10:43:01.989 NotebookApp] The Jupyter Notebook is running at:
[I 10:43:01.989 NotebookApp] http://localhost:8888/?token=e20c5af58e28aef74db963a4e3169042d1a54fa71061fa87
[I 10:43:01.989 NotebookApp]  or http://127.0.0.1:8888/?token=e20c5af58e28aef74db963a4e3169042d1a54fa71061fa87
[I 10:43:01.989 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 10:43:02.158 NotebookApp]

    To access the notebook, open this file in a browser:
        file:///C:/Users/cloudvikas/AppData/Roaming/jupyter/runtime/nbserver-12400-open.html
    Or copy and paste one of these URLs:
        http://localhost:8888/?token=e20c5af58e28aef74db963a4e3169042d1a54fa71061fa87
     or http://127.0.0.1:8888/?token=e20c5af58e28aef74db963a4e3169042d1a54fa71061fa87

Step 3:

The output contains a URL that opens Jupyter Notebook in the browser.

Launch Jupyter by opening that URL in the browser and logging in.

Create a new Python 3 notebook and execute a PySpark program (a minimal example is sketched below).

Once it runs, a SparkSession will be active.
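
A minimal sketch of such a notebook cell (assuming the pyspark package is importable from this Python installation; the app name is a placeholder):

from pyspark.sql import SparkSession

# Build a local SparkSession from inside the notebook
spark = SparkSession.builder.master('local').appName('jupyter-demo').getOrCreate()

# Create a trivial DataFrame just to confirm Spark is working
spark.range(5).show()

spark.stop()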

Way 2: Through Anaconda:

Step 1: Install Gnu On Windows

1. Download and install Gnu On Windows (Gow) from https://github.com/bmatzelle/gow/releases/download/v0.8.0/Gow-0.8.0.exe. Select the default options when prompted during the installation of Gow.

Step 2: Install Anaconda and Jupyter Notebook

  1. Download and install Anaconda.

Select the Python 3.7 version.

Once downloaded, the installer file name looks like:

Anaconda3-2020.02-Windows-x86_64

Select both options when prompted during the installation of Anaconda.

  2. Open “Anaconda Prompt” from the Windows Start menu.

Step 3: Anaconda with Jupyter notebook

Install findspark via conda so that the Spark instance can be accessed from the Jupyter notebook. Check the current version on Anaconda Cloud; at the time of writing:

conda install -c conda-forge findspark
(base) C:\Users\cloudvikas>conda install -c conda-forge findspark
Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: C:\Users\cloudvikas\Anaconda3

  added / updated specs:
    - findspark


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    certifi-2019.11.28         |           py37_0         148 KB  conda-forge
    conda-4.8.3                |   py37hc8dfbb8_0         3.1 MB  conda-forge
    findspark-1.3.0            |             py_1           6 KB  conda-forge
    python_abi-3.7             |          1_cp37m           4 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         3.2 MB

The following NEW packages will be INSTALLED:

  findspark          conda-forge/noarch::findspark-1.3.0-py_1
  python_abi         conda-forge/win-64::python_abi-3.7-1_cp37m

The following packages will be UPDATED:

  conda                       pkgs/main::conda-4.8.2-py37_0 --> conda-forge::conda-4.8.3-py37hc8dfbb8_0

The following packages will be SUPERSEDED by a higher-priority channel:

  certifi                                         pkgs/main --> conda-forge


Proceed ([y]/n)? y


Downloading and Extracting Packages
python_abi-3.7       | 4 KB      | ################################################################################################################################ | 100%
findspark-1.3.0      | 6 KB      | ################################################################################################################################ | 100%
certifi-2019.11.28   | 148 KB    | ################################################################################################################################ | 100%
conda-4.8.3          | 3.1 MB    | ################################################################################################################################ | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done

(base) C:\Users\cloudvikas>

Step 4: Download and Install Spark

Go to the Apache Spark download page and download the .tgz file for version 2.3.2.

Extract the file to a directory of your choice (7-Zip can open .tgz files). In my case it was C:\spark. There is another compressed directory inside the archive; extract it into the same location as well.

Set up the environment variables:

SPARK_HOME  = C:\spark\spark-2.3.2-bin-hadoop2.7
HADOOP_HOME = C:\spark\spark-2.3.2-bin-hadoop2.7

Add the following path to the PATH environment variable:

C:\spark\spark-2.3.2-bin-hadoop2.7\bin

Step 5: Download and set up winutils.exe

In the Hadoop binaries repository, https://github.com/steveloughran/winutils, choose your Hadoop version, go to bin, and download the winutils.exe file. In my case: https://github.com/steveloughran/winutils/blob/master/hadoop-2.7.1/bin/winutils.exe

Save winutils.exe into the bin directory of your Spark installation (SPARK_HOME\bin). In my case: C:\spark\spark-2.3.2-bin-hadoop2.7\bin.

Step 6: PySpark with Jupyter notebook

findspark (installed in Step 3 with conda install -c conda-forge findspark) lets the notebook locate the Spark installation.

Open a Python Jupyter notebook and write:

import findspark
findspark.init()
import pyspark
findspark.find()

The last line prints the SPARK_HOME path. It is only a test; you can delete it.
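
As a quick check, a SparkSession can then be created in the same notebook; a minimal sketch (the app name and sample data are placeholders):

import findspark
findspark.init()

from pyspark.sql import SparkSession

# findspark.init() makes the pyspark package importable from the SPARK_HOME installation
spark = SparkSession.builder.master('local').appName('anaconda-demo').getOrCreate()

# Small in-memory DataFrame just to confirm the session works
df = spark.createDataFrame([(1, 'cloudvikas'), (2, 'spark')], ['id', 'name'])
df.show()

spark.stop()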

In this way, we can set up Jupyter using Anaconda and execute a simple PySpark program.

Question 5:

How to connect to a MySQL database through a Jupyter notebook?

Prerequisite:

Install MySQL Workbench:

https://mysql-workbench.en.uptodown.com/windows/download/269553

After downloading, the file name will be:

mysql-workbench-6-3-4-64-bit-multi-win

To install it, we also have to install the Visual C++ Redistributable for Visual Studio 2019:

https://visualstudio.microsoft.com/downloads/?q=Visual+C%2B%2B+Redistributable+for+Visual+Studio+2019

You can skip this step if it is already installed.

Once everything is installed, open MySQL Workbench.

************************************************************

Run the code below in Jupyter:

from pyspark.sql import SparkSession

# Create the SparkSession (driver process) with the MySQL connector package
sparkdriver = SparkSession.builder.master('local').\
    config('spark.jars.packages', 'mysql:mysql-connector-java:5.1.44').\
    appName('cloudvikas').getOrCreate()

# Read a MySQL table over JDBC into a DataFrame
df1 = sparkdriver.read.format('jdbc').\
    option('url', 'jdbc:mysql://localhost:3306').\
    option('driver', 'com.mysql.jdbc.Driver').\
    option('user', 'root').\
    option('password', 'cloudvikas').\
    option('dbtable', 'dbsample.cloudvikastable').\
    load()

df1.printSchema()
df1.show(5)

Question 6:

How will you read a JSON file in PySpark?

Ans:

Step 1: Create a JSON file:

[{
  "id": 1,
  "first_name": "Jeanette",
  "last_name": "Penddreth",
  "email": "vikas1",
  "gender": "Female",
  "ip_address": "26.58.193.2"
}, {
  "id": 2,
  "first_name": "Giavani",
  "last_name": "Frediani",
  "email": "vikas2",
  "gender": "Male",
  "ip_address": "229.179.4.212"
}, {
  "id": 3,
  "first_name": "Noell",
  "last_name": "Bea",
  "email": "vikas3",
  "gender": "Female",
  "ip_address": "180.66.162.255"
}, {
  "id": 4,
  "first_name": "Willard",
  "last_name": "Valek",
  "email": "vikas4",
  "gender": "Male",
  "ip_address": "67.76.188.26"
}]

Step 2: Write Spark SQL code:

from pyspark.sql import SparkSession
from pyspark.sql.types import *
import os
import sqlite3

os.environ['HADOOP_HOME'] = 'C:\\hadoop'
spark = SparkSession.builder.master('local').config('spark.jars.packages', 'mysql:mysql-connector-java:5.1.44,com.databricks:spark-xml_2.11:0.4.1').appName('demoapp').getOrCreate()
print(spark)

# A JSON dataset is pointed to by path.
# The path can be either a single text file or a directory storing text files
path = "C:\\Users\\cloudvikas\\Downloads\\example_2.json"
df1 = spark.read.json(path)

# The inferred schema can be visualized using the printSchema() method
df1.printSchema()

# Create a temporary view using the DataFrame
df1.createOrReplaceTempView("student")

# SQL statements can be run using the sql method provided by the SparkSession
data1 = spark.sql("select * from student")
data1.show()

Step 3: Execute the script.

We are executing it in PyCharm.

An issue appears during execution:

C:\Users\cloudvikas\.conda\envs\Project1\python.exe C:/Users/cloudvikas/PycharmProjects/Project1/SQLPackege/jsonread.py
Ivy Default Cache set to: C:\Users\cloudvikas\.ivy2\cache
The jars for the packages stored in: C:\Users\cloudvikas\.ivy2\jars
:: loading settings :: url = jar:file:/C:/opt/spark/spark-2.4.5-bin-hadoop2.6/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
mysql#mysql-connector-java added as a dependency
com.databricks#spark-xml_2.11 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-352043a3-f1c0-486c-828b-1e428f6afe60;1.0
	confs: [default]
	found mysql#mysql-connector-java;5.1.44 in central
	found com.databricks#spark-xml_2.11;0.4.1 in central
:: resolution report :: resolve 363ms :: artifacts dl 5ms
	:: modules in use:
	com.databricks#spark-xml_2.11;0.4.1 from central in [default]
	mysql#mysql-connector-java;5.1.44 from central in [default]
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts   |
	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
	---------------------------------------------------------------------
	|      default     |   2   |   0   |   0   |   0   ||   2   |   0   |
	---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-352043a3-f1c0-486c-828b-1e428f6afe60
	confs: [default]
	0 artifacts copied, 2 already retrieved (0kB/13ms)
20/03/19 09:39:44 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
20/03/19 09:39:46 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
<pyspark.sql.session.SparkSession object at 0x000001A3CA613548>
20/03/19 09:39:48 WARN SizeEstimator: Failed to check whether UseCompressedOops is set; assuming yes
root
 |-- _corrupt_record: string (nullable = true)

Traceback (most recent call last):
  File "C:\Users\cloudvikas\.conda\envs\Project1\lib\site-packages\pyspark\sql\utils.py", line 63, in deco
    return f(*a, **kw)
  File "C:\Users\cloudvikas\.conda\envs\Project1\lib\site-packages\py4j\protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o47.showString.
: org.apache.spark.sql.AnalysisException: Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the
referenced columns only include the internal corrupt record column
(named _corrupt_record by default). For example:
spark.read.schema(schema).json(file).filter($"_corrupt_record".isNotNull).count()
and spark.read.schema(schema).json(file).select("_corrupt_record").show().
Instead, you can cache or save the parsed results and then send the same query.
For example, val df = spark.read.schema(schema).json(file).cache() and then
df.filter($"_corrupt_record".isNotNull).count().;
	at org.apache.spark.sql.execution.datasources.json.JsonFileFormat.buildReader(JsonFileFormat.scala:120)
	at org.apache.spark.sql.execution.datasources.FileFormat$class.buildReaderWithPartitionValues(FileFormat.scala:129)
	at org.apache.spark.sql.execution.datasources.TextBasedFileFormat.buildReaderWithPartitionValues(FileFormat.scala:165)
	at org.apache.spark.sql.execution.FileSourceScanExec.inputRDD$lzycompute(DataSourceScanExec.scala:309)
	at org.apache.spark.sql.execution.FileSourceScanExec.inputRDD(DataSourceScanExec.scala:305)
	at org.apache.spark.sql.execution.FileSourceScanExec.inputRDDs(DataSourceScanExec.scala:327)
	at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:627)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
	at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247)
	at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:339)
	at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
	at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3389)
	at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2550)
	at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2550)
	at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3370)
	at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3369)
	at org.apache.spark.sql.Dataset.head(Dataset.scala:2550)
	at org.apache.spark.sql.Dataset.take(Dataset.scala:2764)
	at org.apache.spark.sql.Dataset.getRows(Dataset.scala:254)
	at org.apache.spark.sql.Dataset.showString(Dataset.scala:291)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
	at java.lang.reflect.Method.invoke(Unknown Source)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Unknown Source)


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:/Users/cloudvikas/PycharmProjects/Project1/SQLPackege/jsonread.py", line 14, in <module>
    data1.show()
  File "C:\Users\cloudvikas\.conda\envs\Project1\lib\site-packages\pyspark\sql\dataframe.py", line 380, in show
    print(self._jdf.showString(n, 20, vertical))
  File "C:\Users\cloudvikas\.conda\envs\Project1\lib\site-packages\py4j\java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "C:\Users\cloudvikas\.conda\envs\Project1\lib\site-packages\pyspark\sql\utils.py", line 69, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: 'Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the\nreferenced columns only include the internal corrupt record column\n(named _corrupt_record by default). For example:\nspark.read.schema(schema).json(file).filter($"_corrupt_record".isNotNull).count()\nand spark.read.schema(schema).json(file).select("_corrupt_record").show().\nInstead, you can cache or save the parsed results and then send the same query.\nFor example, val df = spark.read.schema(schema).json(file).cache() and then\ndf.filter($"_corrupt_record".isNotNull).count().;'

Process finished with exit code 1

The issue is related to the JSON file:

Since Spark 2.3, queries from raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column.

We must keep each record on a single line when using a JSON file.

Reason:

While processing JSON data, Spark treats each line as a complete JSON document, so a pretty-printed multi-line file fails to parse.

You should keep each complete JSON document on a single line, in compact form, by removing all whitespace and newlines.

Like this:

{"a":{"b":1}}

If you want multiple JSON documents in a single file, keep them like this (one per line):

{"a":{"b":1}}
{"a":{"b":2}}
{"a":{"b":3}} 

Let's update the file accordingly:

In Notepad++, use Edit -> Blank Operations to remove the line breaks.


[{ "id": 1, "first_name": "Jeanette", "last_name": "Penddreth", "email": "vikas1", "gender": "Female", "ip_address": "26.58.193.2" }, { "id": 2, "first_name": "Giavani", "last_name": "Frediani", "email": "vikas2", "gender": "Male", "ip_address": "229.179.4.212" }, { "id": 3, "first_name": "Noell", "last_name": "Bea", "email": "vikas3", "gender": "Female", "ip_address": "180.66.162.255" }, { "id": 4, "first_name": "Willard", "last_name": "Valek", "email": "vikas4", "gender": "Male", "ip_address": "67.76.188.26" }]

Step 4: Execute the same script again.

The script executes successfully and produces the output:

C:\Users\cloudvikas\.conda\envs\Project1\python.exe C:/Users/cloudvikas/PycharmProjects/Project1/SQLPackege/jsonread.py
Ivy Default Cache set to: C:\Users\cloudvikas\.ivy2\cache
The jars for the packages stored in: C:\Users\cloudvikas\.ivy2\jars
:: loading settings :: url = jar:file:/C:/opt/spark/spark-2.4.5-bin-hadoop2.6/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
mysql#mysql-connector-java added as a dependency
com.databricks#spark-xml_2.11 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-dbcd4693-5986-4263-82fc-61708f7c5df3;1.0
	confs: [default]
	found mysql#mysql-connector-java;5.1.44 in central
	found com.databricks#spark-xml_2.11;0.4.1 in central
:: resolution report :: resolve 295ms :: artifacts dl 4ms
	:: modules in use:
	com.databricks#spark-xml_2.11;0.4.1 from central in [default]
	mysql#mysql-connector-java;5.1.44 from central in [default]
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts   |
	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
	---------------------------------------------------------------------
	|      default     |   2   |   0   |   0   |   0   ||   2   |   0   |
	---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-dbcd4693-5986-4263-82fc-61708f7c5df3
	confs: [default]
	0 artifacts copied, 2 already retrieved (0kB/11ms)
20/03/19 09:51:42 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
20/03/19 09:51:43 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
<pyspark.sql.session.SparkSession object at 0x000001DCFA163508>
20/03/19 09:51:45 WARN SizeEstimator: Failed to check whether UseCompressedOops is set; assuming yes
root
 |-- email: string (nullable = true)
 |-- first_name: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- id: long (nullable = true)
 |-- ip_address: string (nullable = true)
 |-- last_name: string (nullable = true)

+------+----------+------+---+--------------+---------+
| email|first_name|gender| id|    ip_address|last_name|
+------+----------+------+---+--------------+---------+
|vikas1|  Jeanette|Female|  1|   26.58.193.2|Penddreth|
|vikas2|   Giavani|  Male|  2| 229.179.4.212| Frediani|
|vikas3|     Noell|Female|  3|180.66.162.255|      Bea|
|vikas4|   Willard|  Male|  4|  67.76.188.26|    Valek|
+------+----------+------+---+--------------+---------+


Process finished with exit code 0

Finally, we have completed this solution and learned the following:

Spark SQL can automatically infer the schema of a JSON dataset and load it as a DataFrame. This conversion can be done using SparkSession.read.json on a JSON file.

Note that the file that is offered as a JSON file is not a typical JSON file: each line must contain a separate, self-contained, valid JSON object. For a regular multi-line JSON file, set the multiLine parameter to True.
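
A minimal sketch of reading a pretty-printed (multi-line) JSON file directly with that option, reusing the SparkSession and path from the script above:

# Read a multi-line JSON file without flattening it to one record per line first
df_multi = spark.read.option("multiLine", True).json("C:\\Users\\cloudvikas\\Downloads\\example_2.json")
df_multi.printSchema()
df_multi.show()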

Question 7:

How to store output into a MySQL table?

Ans: Run the code below.

First we have to create a DataFrame with the required columns (the column count depends on the schema of the output), then write the DataFrame into the MySQL table over JDBC. Here I assume two columns:

# Select the two output columns from the existing DataFrame
df_local2 = df_local.select('column1', 'column2')

# Write the DataFrame to the MySQL table over JDBC
df_local2.write.format('jdbc').\
    option('url', 'jdbc:mysql://localhost:3306').\
    option('driver', 'com.mysql.jdbc.Driver').\
    option('user', 'root').\
    option('password', 'vikas').\
    option('dbtable', 'dbsample.table').\
    save()