SPARK – THEORY-PROBLEM IQ

Theory IQ | Problem IQ
1) Theory: How do Spark RDD, DataFrame, and Dataset compare?
   Problem: Program 1 - You have records of employee IDs and their joining dates. Write a script to find the number of employees who joined on 4 May 2020.
2) Theory: Why is the Parquet file format recommended?
   Problem: How will you find the partition count and control the output file size?
3) Theory: Explain transformations with respect to DataFrames.
   Problem: How do you read data from a column and add a DOB column?
4) Theory: How will you define a schema explicitly, and what is the Spark DataFrameWriter API?
   Problem: How will you remove duplicate names and drop a column?
5) Theory: How will you store data in an Avro file, and in what scenario do the number of output files and the number of partitions differ?
   Problem: How do you display the person ID and the distance covered in 2 hours, where col1 is the ID and col2 is the speed in km/h?

DataFrame DF2 has details of male and female students. How will you select only the male students' details?

In DataFrame DF1, how will you find the records where Occupation is Programmer and age > 25? DF1 has 3 columns: Col1 is Occupation, Col2 is age, and Col3 is place.
6) Theory: How will you get the same number of output files as the partition count, and why do we use partitionBy on a DataFrame? How can we control the output file size?
   Problem: Suppose you have a table of employee details with 2 columns, name and salary. How will you add a 3rd column, Label, whose value depends on salary? It should be High if the salary is more than 20000, otherwise Low.

Write a query to find the High Score, Low Score, Total No. of Users (irrespective of status), and Total No. of Active Users (based on enrollment.status = 'active' and user.status = TRUE) for each course.

Write DataFrame code to generate the top salary and employee name for each department from the following table data:
7) What are the types of tables in Spark?

Why do we need enableHiveSupport in Spark, and how do we use it?

Why would we save a DataFrame to a managed table instead of a Parquet file?
How will you read a JSON file and perform the operations below: show data, print schema, show column names, describe the DataFrame? If you have a JSON file with age and name columns, how will you get the count of records and the mean, min, and max of age using Spark SQL?
8) We can save a DataFrame to a managed table. What happens when the user saves data 1) without using partitionBy or bucketBy? 2) using partitionBy? 3) using bucketBy?

What is the purpose of transformations on a DataFrame?

We have a requirement to convert a date string into date format. Example: 22/3/2020 into 2020-03-22.

You want to read an Employee CSV file (with 3 string-type columns and 1 float-type column) and perform the operations below: read, print schema.
9) What are the different functions available on a DataFrame?

How do you read data from a column?
Add a DOB column at the end of the table.
There is a running competition. The race is over 20 meters, there are 10 participants, and each participant's time in seconds has been noted.

How do you select a single column, CL1, from DataFrame DF2?
10) Why do we need performance tuning in Spark?

How do you improve performance in Spark?
How will you create a DataFrame with multiple columns without importing any API in PySpark?
11) What are the various levels of persistence in Apache Spark?

What is the difference between cache and persist in Spark?

Why is persistence important in Spark?
How can we create a DataFrame with a StructType schema?
12) Explain broadcast variables in Spark and how they help with performance tuning.

Explain a use case for broadcast variables.

How does a Spark broadcast work?

How do you create a broadcast variable in the Spark shell?

Spark RDD broadcast variable example
Spark DataFrame broadcast variable example
13) Explain accumulators.

Why do we need DataFrames?

How do you save the contents of a DataFrame to a CSV file?

Write the DataFrame to a CSV file.

How do you save a DataFrame as a JSON file?

How will you read a Parquet file and save a DataFrame as a Parquet file?
How will you save a DataFrame as an ORC file?
How will you read a table from MySQL and save the contents of a DataFrame to MySQL?

How will you read a table from a PostgreSQL database and save a DataFrame to PostgreSQL?

How will you read a table of data from a Cassandra database?
14)