What are the types of Table in Spark?
There are two types: Managed Table and Unmanaged Table (External Table).
For a managed table, Spark manages both the metadata and the data. All managed tables are stored under the location set by spark.sql.warehouse.dir.
For an unmanaged (external) table, Spark manages only the metadata; the data stays at the external path we specify.
Why do we need enableHiveSupport in Spark, and how do we use it?
If a user wants to connect Spark to Hive (use the Hive metastore, Hive SerDes, and Hive UDFs), we call enableHiveSupport() on the SparkSession builder.
```python
if __name__ == "__main__":
    spark = SparkSession \
        .builder \
        .master("local") \
        .appName("SparkSQLTableDemo") \
        .enableHiveSupport() \
        .getOrCreate()
```
Why do we save dataframe into managed table instead of parquet file?
After saving the output data into a table or files, we may need that data in Tableau, Power BI, or other tools for reporting or analysis.
We can save the output in Parquet/Avro file format, but whenever we read data from these files we need the DataFrame reader API.
Alternatively, if we save the data into a managed table, we can access it through a JDBC/ODBC connection in SQL fashion. There are many SQL tools that can read managed-table data.
How will you read a JSON file and perform the following operations:
show the data, print the schema, show the column names, and describe the DataFrame?
If the JSON file has age and name as columns and we want the record count, the mean age, and the min and max age values, how will we get them?
1) Create a SparkSession:
```python
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('cloudvikas').getOrCreate()
```
2) Read the JSON file:
```python
df = spark.read.json('cloudvikas.json')
```
3) Show the data: df.show()
4) Print the schema: df.printSchema()
5) Show the column names: df.columns
6) Describe the DataFrame: df.describe()
7) For the count of records and the mean, min, and max age values: df.describe().show() gives the count of all records along with the mean, min, max, and stddev for each numeric column.
8) Read the file with an explicit schema:
```python
from pyspark.sql.types import StructField, StringType, IntegerType, StructType

data_schema = [StructField('age', IntegerType(), True),
               StructField('name', StringType(), True)]
final_struc = StructType(fields=data_schema)
df = spark.read.json('cloudvikas.json', schema=final_struc)
df.printSchema()
```