How will you define a schema explicitly?
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, DateType, StringType, IntegerType

from lib.logger import Log4j

if __name__ == "__main__":
    spark = SparkSession \
        .builder \
        .master("local[3]") \
        .appName("SparkSchemaDemo") \
        .getOrCreate()

    logger = Log4j(spark)

    # Programmatic schema definition using StructType
    inputSchemaStruct = StructType([
        StructField("col1", DateType()),
        StructField("col2", StringType()),
        StructField("col3", IntegerType()),
        StructField("col4", IntegerType())
    ])

    # The same schema expressed as a DDL-formatted string
    inputSchemaDDL = "col1 DATE, col2 STRING, col3 INT, col4 INT"

    # FAILFAST aborts the read on any malformed record instead of
    # silently inserting nulls; dateFormat tells Spark how to parse col1
    inputTimeCsvDF = spark.read \
        .format("csv") \
        .option("header", "true") \
        .schema(inputSchemaStruct) \
        .option("mode", "FAILFAST") \
        .option("dateFormat", "M/d/y") \
        .load("data/input*.csv")

    inputTimeCsvDF.show(5)
    logger.info("CSV Schema:" + inputTimeCsvDF.schema.simpleString())
What is the Spark DataFrameWriter API?
DataFrameWriter is the interface that describes how data (the result of executing a structured query) should be saved to an external data source. DataFrameWriter defaults to the parquet data source format; you can change this default through the spark.sql.sources.default configuration property.
Its general structure is:

DataFrameWriter
    .format(...)
    .option(...)
    .partitionBy(...)
    .bucketBy(...)
    .sortBy(...)
    .save()
Take an example:

dataframe.write \
    .format("parquet") \
    .mode(saveMode) \
    .option("path", "/data/cloud") \
    .save()