Why do we need performance tuning in Spark?
- Spark processes data in memory, so jobs can run into bottlenecks around CPU, memory, or network bandwidth.
- In some cases, the data being processed does not fit into memory, which creates a significant performance penalty.
- If a large volume of data has to move across the network during a shuffle, network bandwidth becomes a problem.
How to improve performance in Spark?
We have the following performance tuning techniques:
- Data Serialization
- Memory Tuning
- Tuning Data Structures
- Serialized RDD Storage
- Garbage Collection Tuning
- Level of Parallelism
- Broadcast Large Variables
- Data Locality
Data Serialization:
1. Spark offers two serialization libraries: Java serialization and Kryo serialization. Java serialization is quite slow and produces large serialized formats for many classes, which wastes network bandwidth and memory whenever data is moved.
2. Kryo serialization is much faster and more compact than Java serialization (often around 10x). To use Kryo, set the following configuration:
   conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
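A minimal sketch of enabling Kryo when building a SparkContext; the app name and master below are placeholders for illustration:

    import org.apache.spark.{SparkConf, SparkContext}

    // Switch the serializer used for shuffles and serialized caching to Kryo.
    val conf = new SparkConf()
      .setAppName("kryo-demo")   // placeholder app name
      .setMaster("local[*]")     // placeholder master
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(conf)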
3. We can register custom classes with Kryo. Kryo still works without registration, but it then has to store the full class name with every serialized object, which is wasteful. To register custom classes, use the registerKryoClasses method:
   val conf = new SparkConf().setMaster("local[*]").setAppName("app")
   conf.registerKryoClasses(Array(classOf[MyClass1], classOf[MyExample]))
   val sc = new SparkContext(conf)
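If you want Spark to fail fast instead of silently falling back to writing full class names, there is an optional setting for that; a one-line sketch, assuming the conf object from above:

    // Optional: throw an error when an unregistered class is serialized,
    // rather than silently storing its full class name with each record.
    conf.set("spark.kryo.registrationRequired", "true")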
4. If your objects are too large for the serialization buffer, increase the limit via spark.kryoserializer.buffer.max (the default is 64m).
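A one-line sketch of raising the limit; 128m is an illustrative value, not a recommendation:

    // Raise the Kryo buffer ceiling when single objects exceed the 64m default.
    conf.set("spark.kryoserializer.buffer.max", "128m")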
5. MEMORY_ONLY_SER and MEMORY_AND_DISK_SER are the two storage levels that keep RDDs serialized. With MEMORY_ONLY_SER, RDDs are stored as serialized objects, one byte array per partition.
6. With MEMORY_AND_DISK_SER, partitions that do not fit into memory are spilled to disk instead of being recomputed:
   val words = sc.textFile("words")
   words.persist(StorageLevel.MEMORY_ONLY_SER)
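A self-contained sketch showing both serialized levels side by side; the path "words" is a placeholder input file:

    import org.apache.spark.storage.StorageLevel

    // Serialized, memory-only: partitions that don't fit are recomputed when needed.
    val words = sc.textFile("words")
    words.persist(StorageLevel.MEMORY_ONLY_SER)

    // Serialized, memory-and-disk: partitions that don't fit are spilled to disk.
    val lines = sc.textFile("words")
    lines.persist(StorageLevel.MEMORY_AND_DISK_SER)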
7. The trade-off of serialized storage is extra CPU cost, since data must be deserialized every time it is accessed.