Why do we need Performance tuning in Spark?
  • In Spark, data is processed in memory, so jobs can run into CPU, network bandwidth, or memory bottlenecks.
  • In some cases the data being processed does not fit into memory, which causes a significant performance penalty.
  • If the data blocks being shuffled are large, they put pressure on the network during the shuffle.
How do we improve performance in Spark?

We have the following performance tuning techniques:

  1. Data Serialization
  2. Memory Tuning
  3. Tuning Data Structures
  4. Serialized RDD Storage
  5. Garbage Collection Tuning
  6. Level of Parallelism
  7. Broadcast Large Variables
  8. Data Locality

Data Serialization:

There are two serialization libraries: Java serialization and Kryo serialization. Java serialization is quite slow and produces large serialized formats for many classes. Because of that, it costs us network bandwidth and memory whenever data is moved.
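To see where the overhead comes from, here is a minimal sketch using plain JDK serialization (no Spark involved; the `Point` class and object name are illustrative assumptions). Two Ints are 8 bytes of actual payload, but Java serialization also writes class metadata into the stream:

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Hypothetical record type: two Ints, i.e. 8 bytes of actual payload.
// Scala case classes are Serializable by default.
case class Point(x: Int, y: Int)

object JavaSerSize extends App {
  val bos = new ByteArrayOutputStream()
  val oos = new ObjectOutputStream(bos)
  oos.writeObject(Point(1, 2))   // Java serialization also writes class metadata
  oos.close()
  // The serialized form is several times larger than the 8-byte payload.
  println(s"payload: 8 bytes, serialized: ${bos.toByteArray.length} bytes")
}
```

Kryo avoids much of this per-object metadata, which is why its output is smaller and faster to produce.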
In the case of Kryo serialization, it can be up to 10x faster than Java serialization. To use Kryo, we can set the configuration below.
We can register custom classes with Kryo. Kryo will still work without registering them, but it then has to store the full class name with each object, which is wasteful.
To register custom classes with Kryo, we can use the registerKryoClasses method:

val conf = new SparkConf().setMaster("local[*]").setAppName("KryoExample")
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
conf.registerKryoClasses(Array(classOf[MyClass1], classOf[MyExample]))

val sc = new SparkContext(conf)
If your objects are too large to fit into the Kryo buffer, we can increase the maximum buffer size:

spark.kryoserializer.buffer.max=64m
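As a sketch, the same limit can also be set programmatically on the SparkConf before creating the context (the 128m value here is just an illustrative choice, not a recommendation):

```scala
import org.apache.spark.SparkConf

// Raise the Kryo buffer ceiling when individual serialized objects are large.
// "128m" is an illustrative value; tune it to your largest object.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryoserializer.buffer.max", "128m")
```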
MEMORY_ONLY_SER and MEMORY_AND_DISK_SER are two storage levels that support serialized RDDs.
With MEMORY_ONLY_SER, RDDs are stored as serialized objects, one byte array per partition.
With MEMORY_AND_DISK_SER, partitions that do not fit into memory are spilled to disk.

import org.apache.spark.storage.StorageLevel

val words = sc.textFile("words")
words.persist(StorageLevel.MEMORY_ONLY_SER)

The trade-off is that serialized storage adds extra CPU cycles, because objects must be deserialized every time they are accessed.
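The trade-off above can be sketched with plain JDK serialization (not Spark internals; the `Word` class and object name are illustrative assumptions): the stored byte array is compact, but every read pays CPU to rebuild the object.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Hypothetical record type; Scala case classes are Serializable by default.
case class Word(text: String)

object DeserializationCost extends App {
  val bos = new ByteArrayOutputStream()
  val oos = new ObjectOutputStream(bos)
  oos.writeObject(Word("spark"))
  oos.close()
  val bytes = bos.toByteArray              // the compact, stored form

  // Accessing the value again costs CPU to rebuild the object graph:
  val word = new ObjectInputStream(new ByteArrayInputStream(bytes))
    .readObject().asInstanceOf[Word]
  println(word.text)
}
```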