What are the various levels of persistence in Apache Spark?
MEMORY_ONLY – This is the default behavior of the RDD cache() method; it stores the RDD or DataFrame as deserialized objects in JVM memory.
When there is not enough memory available, some partitions are not cached and are recomputed each time they are needed.
MEMORY_ONLY_SER – This is the same as MEMORY_ONLY, the difference being that it stores the RDD as serialized objects in JVM memory.
It takes less memory (it is space-efficient) than MEMORY_ONLY because objects are stored in serialized form, at the cost of a few extra CPU cycles to deserialize them.
MEMORY_AND_DISK – This is the default behavior for a DataFrame or Dataset.
In this storage level, the DataFrame is stored in JVM memory as deserialized objects.
When the required storage is greater than the available memory, the excess partitions are spilled to disk and read back from disk when needed.
This is slower because disk I/O is involved.
MEMORY_AND_DISK_SER – This is the same as the MEMORY_AND_DISK storage level, the difference being that it stores the DataFrame objects in serialized form in memory, spilling them to disk when memory is not available.
DISK_ONLY – In this storage level, the DataFrame is stored only on disk, and CPU computation time is higher because I/O is involved.

import org.apache.spark.storage.StorageLevel

val rdd2 = rdd.persist(StorageLevel.MEMORY_ONLY_SER)
val df2 = df.persist(StorageLevel.MEMORY_ONLY_SER)
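A minimal end-to-end sketch, assuming a local Spark installation (the `local[*]` master and the `spark.range` demo data are illustrative choices, not part of the original):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("persist-demo")
  .getOrCreate()

val df = spark.range(0, 1000000).toDF("id")

val df2 = df.persist(StorageLevel.MEMORY_ONLY_SER)
df2.count()               // an action materializes the cache
println(df2.storageLevel) // confirm which level is in effect
df2.unpersist()           // release the cached data when done
```

`Dataset.storageLevel` is a convenient way to verify at runtime which level a DataFrame is actually persisted at.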

Spark – Difference between Cache and Persist?
  • Spark cache() and persist() are optimization techniques for iterative and interactive Spark applications, used to improve the performance of jobs.
  • Both caching and persisting are used to save the Spark RDD, DataFrame, and Dataset. The difference is that the RDD cache() method saves to the default storage level (MEMORY_ONLY), whereas persist() stores it at a user-defined storage level.
  • Cost efficient – Spark computations are expensive, so reusing them saves cost.
  • Time efficient – Reusing repeated computations saves a lot of time.
  • Execution time – Caching saves job execution time, so more jobs can be run on the same cluster.

Spark DataFrame or Dataset caching by default saves it to storage level `MEMORY_AND_DISK` because recomputing the in-memory columnar representation of the underlying table is expensive. Note that this is different from the default cache level of `RDD.cache()`, which is `MEMORY_ONLY`.
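The differing defaults can be seen side by side in a short sketch (assumes an existing `sc: SparkContext` and `spark: SparkSession`):

```scala
// cache() is shorthand for persist() at the API's default level,
// and that default differs between the RDD and DataFrame APIs:
val cachedRdd = sc.parallelize(1 to 100).cache()    // MEMORY_ONLY
val cachedDf  = spark.range(100).toDF("id").cache() // MEMORY_AND_DISK

println(cachedRdd.getStorageLevel) // RDDs expose getStorageLevel
println(cachedDf.storageLevel)     // Datasets expose storageLevel
```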

How Persistence is important in Spark?
  • Spark automatically monitors every persist() and cache() call you make, checks usage on each node, and evicts persisted data that is not used in least-recently-used (LRU) fashion.
  • It is one of the optimization techniques to improve the performance of Spark jobs.
  • For RDD cache() the default storage level is `MEMORY_ONLY`, but for DataFrame and Dataset the default is `MEMORY_AND_DISK`.
  • On the Spark UI, the Storage tab shows which partitions are held in memory or on disk across the cluster.
  • Dataset cache() is an alias for persist(StorageLevel.MEMORY_AND_DISK)
  • Caching of Spark DataFrame or Dataset is a lazy operation, meaning a DataFrame will not be cached until you trigger an action.
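The laziness described above can be sketched as follows (assumes an existing `spark: SparkSession`; the `spark.range` data is illustrative):

```scala
// Caching is lazy: marking a DataFrame cached does not materialize it.
val df = spark.range(0, 1000).toDF("id").cache()
println(df.storageLevel) // the level is already set...
df.count()               // ...but only this first action fills the cache
df.count()               // subsequent actions read from the cache
```

This is why a `count()` (or any other action) is often issued right after `cache()` when you want the data materialized eagerly.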