What are the various levels of persistence in Apache Spark?
|Storage level|Description|
|---|---|
|MEMORY_ONLY|This is the default behavior of the RDD cache() method. When there is not enough memory available, the partitions that do not fit are not cached and are recomputed as and when required.|
|MEMORY_ONLY_SER|This is the same as MEMORY_ONLY, except that objects are stored in serialized form. It takes less memory (is more space-efficient) than MEMORY_ONLY, at the cost of a few extra CPU cycles to deserialize.|
|MEMORY_AND_DISK|This is the default behavior of the DataFrame or Dataset. In this storage level, the DataFrame is stored in JVM memory as deserialized objects. When the required storage is greater than the available memory, the excess partitions are written to disk and read back from disk when required. It is slower because I/O is involved.|
|MEMORY_AND_DISK_SER|This is the same as MEMORY_AND_DISK, except that objects are stored in serialized form.|
|DISK_ONLY|In this storage level, the DataFrame is stored only on disk, so reading it back is slower because I/O is involved.|
import org.apache.spark.storage.StorageLevel
val rdd2 = rdd.persist(StorageLevel.MEMORY_ONLY_SER)
val df2 = df.persist(StorageLevel.MEMORY_ONLY_SER)
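As a minimal sketch of how the storage levels above might be applied and inspected, the following script builds a local session, persists an RDD and a DataFrame at explicit levels, and reads the level back. The session settings, the sample data, and the names `rddLevel` and `dfLevel` are assumptions made for illustration; it should run in spark-shell or as a script with Spark on the classpath.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Local session for illustration only; the app name is arbitrary.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("storage-level-sketch")
  .getOrCreate()

// Hypothetical sample data standing in for a real RDD and DataFrame.
val rdd = spark.sparkContext.parallelize(1 to 1000)
val df  = spark.range(1000).toDF("id")

// persist() marks the data for caching at the given level
// and returns the same object.
val rdd2 = rdd.persist(StorageLevel.MEMORY_ONLY_SER)
val df2  = df.persist(StorageLevel.DISK_ONLY)

// The first action materializes the cached partitions.
rdd2.count()
df2.count()

// Read back the level each one was marked with.
val rddLevel = rdd2.getStorageLevel // RDD API
val dfLevel  = df2.storageLevel     // Dataset API
println(s"RDD:       ${rddLevel.description}")
println(s"DataFrame: ${dfLevel.description}")

// Release the cached data once it is no longer needed.
rdd2.unpersist()
df2.unpersist()
```

Choosing a level is a trade-off between memory footprint, CPU cost for deserialization, and disk I/O, as the table describes.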
Spark – Difference between Cache and Persist?
- Spark cache() and persist() are optimization techniques that improve the performance of iterative and interactive Spark applications.
- Both caching and persisting are used to save Spark RDDs, DataFrames, and Datasets. The difference is that the RDD cache() method saves to memory (MEMORY_ONLY) by default, whereas the persist() method stores the data at a user-defined storage level.
- Cost efficient – Spark computations are very expensive, so reusing computed results saves cost.
- Time efficient – Reusing repeated computations saves a lot of time.
- Execution time – It reduces the execution time of the job, so more jobs can run on the same cluster.
Spark DataFrame or Dataset caching by default saves it to storage level `MEMORY_AND_DISK` because recomputing the in-memory columnar representation of the underlying table is expensive. Note that this is different from the default cache level of `RDD.cache()`, which is `MEMORY_ONLY`.
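The two defaults can be checked directly. The sketch below calls cache() on an RDD and on a DataFrame and compares the resulting storage levels against the constants named above. The sample data and the names `rddDefault` and `dfDefault` are assumptions for illustration, and the comparison reflects the Scala API; other language bindings may report a differently named level.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Local session for illustration only; the app name is arbitrary.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("cache-defaults-sketch")
  .getOrCreate()

// Hypothetical sample data.
val rdd = spark.sparkContext.parallelize(Seq("a", "b", "c"))
val df  = spark.range(10).toDF("id")

// cache() takes no arguments, so each API applies its own default level.
rdd.cache()
df.cache()

val rddDefault = rdd.getStorageLevel
val dfDefault  = df.storageLevel

// Both comparisons are expected to hold, per the defaults described above.
println(s"RDD default is MEMORY_ONLY: ${rddDefault == StorageLevel.MEMORY_ONLY}")
println(s"DataFrame default is MEMORY_AND_DISK: ${dfDefault == StorageLevel.MEMORY_AND_DISK}")

rdd.unpersist()
df.unpersist()
```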
Why is persistence important in Spark?
- Spark automatically monitors every persist() and cache() call you make, checks usage on each node, and drops persisted data that is no longer used, following a least-recently-used (LRU) algorithm.
- It is one of the optimization techniques to improve the performance of Spark jobs.
- For RDD cache(), the default storage level is `MEMORY_ONLY`, but for DataFrame and Dataset the default is `MEMORY_AND_DISK`.
- In the Spark UI, the Storage tab shows which partitions are held in memory or on disk across the cluster.
- cache() is an alias for persist() with the default storage level.
- Caching of Spark DataFrame or Dataset is a lazy operation, meaning a DataFrame will not be cached until you trigger an action.
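The lazy behavior in the last point can be observed with a small experiment. The sketch below uses an RDD rather than a DataFrame because it is easier to instrument; the same laziness applies to both. An accumulator counts how often the map function actually runs. The names `evaluations`, `doubled`, and the `after...` values are hypothetical, and the counts assume a local run where every partition fits in memory and no task is retried.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration only; the app name is arbitrary.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("lazy-cache-sketch")
  .getOrCreate()
val sc = spark.sparkContext

// Counts how many times the map function is actually executed.
val evaluations = sc.longAccumulator("evaluations")

val doubled = sc.parallelize(1 to 100).map { x =>
  evaluations.add(1)
  x * 2
}.cache() // only marks the RDD for caching; nothing is computed yet

val afterCacheCall = evaluations.value.longValue

doubled.count() // first action: computes and caches every partition
val afterFirstAction = evaluations.value.longValue

doubled.count() // second action: should be served from the cache
val afterSecondAction = evaluations.value.longValue

println(s"after cache():       $afterCacheCall")
println(s"after first action:  $afterFirstAction")
println(s"after second action: $afterSecondAction")

// Manually release the cached partitions instead of waiting for LRU eviction.
doubled.unpersist()
```

If the counter does not grow between the first and second action, the second job reused the cached partitions instead of recomputing them.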