SPARK-IQ-7

What are the types of tables in Spark?

Managed Table
Unmanaged Table (External Table)

For a managed table, Spark manages both the metadata and the data; for an unmanaged (external) table, Spark manages only the metadata, while the data stays at an external location.
Data for managed tables is stored under the directory configured by spark.sql.warehouse.dir.
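A minimal sketch of the difference, assuming a Spark build with Hive support (the table names, sample data, and path below are hypothetical): calling saveAsTable without a path creates a managed table under spark.sql.warehouse.dir, while supplying a path creates an unmanaged (external) table.

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .master("local[3]") \
    .appName("TableTypesDemo") \
    .enableHiveSupport() \
    .getOrCreate()

# Directory where managed table data is stored
print(spark.conf.get("spark.sql.warehouse.dir"))

df = spark.createDataFrame([(1, 'a'), (2, 'b')], ['id', 'value'])

# No path option => managed table: Spark owns both data and metadata,
# and dropping the table deletes both.
df.write.mode('overwrite').saveAsTable('demo_managed')

# Explicit path => unmanaged (external) table: Spark owns only the metadata,
# and dropping the table leaves the data files in place.
df.write.mode('overwrite').option('path', '/tmp/demo_external').saveAsTable('demo_external')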

Why do we need enableHiveSupport in Spark, and how can we use it?

In Spark, if a user wants to connect to Hive, we can use enableHiveSupport(). It enables connectivity to a persistent Hive metastore, along with support for Hive SerDes and Hive user-defined functions.

from pyspark.sql import SparkSession

if __name__ == "__main__":
    # enableHiveSupport() connects this session to the Hive metastore
    spark = SparkSession \
        .builder \
        .master("local[3]") \
        .appName("SparkSQLTableDemo") \
        .enableHiveSupport() \
        .getOrCreate()
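A quick usage sketch (assuming the session above and a local metastore): with Hive support enabled, table metadata lives in the metastore and can be inspected through the catalog.

spark.sql("SHOW DATABASES").show()
print(spark.catalog.listTables())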

Why do we save a DataFrame into a managed table instead of a Parquet file?

After saving output data into tables or files, we may need that data in Tableau, Power BI, or other tools for reporting or analysis.
We can save output data in a Parquet/Avro file format, but whenever we read data from those files we need the DataFrame reader API.
On the other hand, if we save data into a managed table, we can access it through a JDBC/ODBC connection in SQL fashion, and many SQL tools can read managed table data directly.
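A hedged sketch of this workflow (table and column names are hypothetical; it assumes the Hive-enabled session from above): once the DataFrame is saved with saveAsTable, a SQL client, e.g. one connected through the Spark Thrift JDBC/ODBC server, can query it with plain SQL.

df = spark.createDataFrame([('2024-01-01', 100), ('2024-01-02', 150)], ['day', 'sales'])

# Save as a managed table instead of writing Parquet files directly
df.write.mode('overwrite').saveAsTable('daily_sales')

# SQL tools would issue queries like this over JDBC/ODBC:
spark.sql("SELECT day, sales FROM daily_sales ORDER BY day").show()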

How will you read a JSON file and perform the below operations:
Show data, print the schema, show column names, describe the DataFrame.
If you have a JSON file with age and name as columns, and you want the count of records, the mean of their ages, and the min and max age values, how will you get them?
We can do all of this with the Spark SQL API:
1) Create a SparkSession
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('cloudvikas').getOrCreate()
2) Read the JSON file
df = spark.read.json('cloudvikas.json')
3) Show the data
df.show()
4) Print the schema
df.printSchema()
5) Show column names
df.columns
6) Describe the DataFrame
df.describe()
7) If you have a JSON file with age and name as columns, and you want the count of records, the mean of their ages, and the min and max age values, how will you get them?
df.describe().show()
This gives the count of all records and provides the mean, min, max, and stddev values for each numeric column.
8) Apply an explicit schema (performing SQL on this data is sketched after these steps):
from pyspark.sql.types import StructField, StringType, IntegerType, StructType
data_schema = [StructField('age', IntegerType(), True),
               StructField('name', StringType(), True)]
final_struc = StructType(fields=data_schema)
df = spark.read.json('cloudvikas.json', schema=final_struc)
df.printSchema()
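For the SQL part of step 8, here is a minimal sketch (the view name people is an assumption): register the DataFrame as a temporary view and compute the same statistics with a SQL query.

df.createOrReplaceTempView('people')  # hypothetical view name
spark.sql("""
    SELECT COUNT(*) AS record_count,
           AVG(age)  AS mean_age,
           MIN(age)  AS min_age,
           MAX(age)  AS max_age
    FROM people
""").show()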