Please go through the previous IQ sets (link below) before reading IQ 5.
How will you store data in an Avro file?
First, we should include the spark-avro package (built for Scala 2.11) in the spark-defaults.conf file so that Spark can use Avro files.
Write the line below in the spark-defaults.conf file:
spark.jars.packages org.apache.spark:spark-avro_2.11:2.4.5
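The package coordinate follows the usual Maven layout of group, artifact and version, with the Scala version appended to the artifact name. As a minimal sketch (the helper name spark_avro_coordinate is hypothetical and only illustrates the layout), the string can be assembled like this:

```python
def spark_avro_coordinate(scala_version: str, spark_version: str) -> str:
    """Build the coordinate groupId:artifactId_scalaVersion:sparkVersion."""
    return f"org.apache.spark:spark-avro_{scala_version}:{spark_version}"


# Versions taken from the configuration line above; they should match
# the Scala and Spark versions of your own installation.
coordinate = spark_avro_coordinate("2.11", "2.4.5")
print(coordinate)
```

The same string can also be supplied when the session is built, through SparkSession.builder.config("spark.jars.packages", coordinate), instead of editing spark-defaults.conf.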
Next, we can use the DataFrameWriter API to write data into an Avro file.
from pyspark.sql import SparkSession
from pyspark.sql.functions import spark_partition_id
from lib.logger import Log4j  # Log4j comes from the project's local lib/logger module

if __name__ == "__main__":
    spark = SparkSession \
        .builder \
        .master("local[3]") \
        .appName("SparkSchemaDemo") \
        .getOrCreate()

    logger = Log4j(spark)

    # Read the source Parquet files into a DataFrame.
    flightTimeParquetDF = spark.read \
        .format("parquet") \
        .load("dataSource/cloud*.parquet")

    # Write the DataFrame in Avro format, replacing any existing output.
    flightTimeParquetDF.write \
        .format("avro") \
        .mode("overwrite") \
        .option("path", "data/avro/") \
        .save()
In what scenario are the number of output files and the number of partitions not the same?
The two counts are not required to match; the number of output files can differ from the number of partitions.
Spark does not create an output file for a partition that has no records.
For example, if you have two partitions and one of them is empty, only one output file is created.
logger.info("Num Partitions before: " + str(flightTimeParquetDF.rdd.getNumPartitions()))
flightTimeParquetDF.groupBy(spark_partition_id()).count().show()  # displays the record count grouped by partition
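The rule above can be modelled in plain Python without Spark. This is only a simplified sketch of the idea (one file per non-empty partition), not Spark's actual writer; the names count_output_files and partition_of are hypothetical.

```python
from collections import defaultdict


def count_output_files(records, num_partitions, partition_of):
    """Simplified model: one output file per partition that holds records."""
    partitions = defaultdict(list)
    for record in records:
        partitions[partition_of(record) % num_partitions].append(record)
    # Only partitions that received at least one record appear as keys.
    return len(partitions)


# Two partitions, but every record is even, so all of them land in
# partition 0 and partition 1 stays empty.
records = [2, 4, 6, 8]
files_written = count_output_files(records, num_partitions=2,
                                   partition_of=lambda r: r)
print(files_written)
```

With records that spread over both partitions, for example [1, 2, 3], the same function reports two files.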
How will you display the person id and the distance covered in 2 hours, where col1 is id and col2 is speed in km/h?
Df2.show(2)
+-----+-----+
|   ID|Speed|
+-----+-----+
|vikas|    4|
|mohan|    6|
+-----+-----+
Df5 = Df2.select("id", Df2.Speed * 2)  # distance in km = speed (km/h) * 2 hours
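As a quick sanity check of the arithmetic that the select performs, here is a plain-Python sketch using the two sample rows shown above. The function name distance_covered is hypothetical and no Spark is involved.

```python
# Sample rows copied from the show(2) output above: (ID, Speed in km/h).
rows = [("vikas", 4), ("mohan", 6)]
HOURS = 2


def distance_covered(rows, hours):
    """Return (id, distance in km) pairs, where distance = speed * hours."""
    return [(person_id, speed * hours) for person_id, speed in rows]


result = distance_covered(rows, HOURS)
print(result)
```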
DataFrame Df2 has details of male and female students. How will you find the details of only the male students?
Df3 = Df2.filter(Df2.Gender == 'Male')
Df3.show()
Output:
+---+------+----------+------------+-----+
| id|Gender|Occupation|TimeInSecond|Speed|
+---+------+----------+------------+-----+
|id1|  Male|Programmer|          16|    1|
|id3|  Male|   Manager|          15|    1|
+---+------+----------+------------+-----+
In DataFrame Df1, how will you find the records where Occupation is Programmer and age > 25? Df1 has 3 columns: col1 is Occupation, col2 is age and col3 is place.
Df1.filter((Df1.Occupation == 'Programmer') & (Df1.age > 25)).show(3)
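The parentheses around each condition matter, because in Python the & operator binds more tightly than == and >. The plain-Python sketch below uses integers in place of column values (illustrative values only, not taken from Df1) to show how the two forms are parsed differently.

```python
# Plain integers stand in for column values (illustrative only).
age = 30
occupation_matches = 1  # 1 means the Occupation check held

# With parentheses: each comparison is evaluated first, then combined with &.
with_parens = (occupation_matches == 1) & (age > 25)

# Without parentheses: & binds tighter than == and >, so Python reads this
# as the chained comparison  occupation_matches == (1 & age) > 25.
without_parens = occupation_matches == 1 & age > 25

print(with_parens, without_parens)
```

With Spark columns, the unparenthesized form typically fails with an error instead of returning a wrong answer, since a Column cannot be converted to a Python bool inside a chained comparison.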