Speed Up Your Spark Jobs Using Caching
- Added 26 Jul 2024
- Welcome to our easy-to-follow guide on Spark Performance Tuning, homing in on the essentials of Caching in Apache Spark. Ever been curious about Lazy Evaluation in Spark? I've got it broken down for you. Dive into the world of Spark's Lineage Graph and understand its role in performance.
The age-old debate, Spark Persist vs. Cache, is also tackled in this video to clear up any confusion. Learn about the different Storage Levels in Spark used with Persist and how they can make a difference in your tasks.
📄 Complete Code on GitHub: github.com/afaqueahmad7117/sp...
🎥 Full Spark Performance Tuning Playlist: • Apache Spark Performan...
🔗 LinkedIn: / afaque-ahmad-5a5847129
Table credits (Storage Levels, When to use what?): sparkbyexamples.com/spark/spa...
Chapters:
00:00 Introduction
00:39 Why Should You Use Caching?
06:45 Lazy Evaluation & How Could Caching Help You?
10:12 Code + Spark UI Explanation: Caching vs No Caching
14:21 Persist & Storage Levels In Persist
#spark #dataengineering #apachespark #lazyevaluation #lineagegraph #storagelevel #persist #cache #persistvscache #sparkperformancetuning #sparkoptimization #uncache #unpersist
Content is useful.
Please make more video 😊
Appreciate it @HimanshuGupta-xq2td, thank you :)
great explanation, please create one end-to-end project also
Great explanation. Waiting for new videos.
Explained very well!
Great content!
Very informative video. Thanks for sharing
Excellent content. Very Helpful.
Thanks for the videos... keep going
kindly cover Apache Spark scenario-based questions also
Can we persist any dataframe irrespective of the size of the data it has? Or are there any limitations in caching dataframes?
Thanks for sharing, small query
Do we need to cache based on the number of transformations being done on that DataFrame, or on whether we are performing more actions on (i.e. reusing) that DataFrame?
Thanks @gananjikumar5715! Transformations are accumulated until an action is called, so it's based on the number of actions. If you're performing several actions, it's better to cache the DataFrame first; otherwise Spark will re-build the DAG and recompute the lineage every time a new action executes.
If we do not explicitly unpersist, what would happen to the data? Would it be cleaned by the next GC cycle? Also, what is the best practice: explicitly unpersist, or leave it to GC?
Hey @anirbansom6682, the data would be kept in memory until the Spark application ends, the context is stopped, or the data is evicted because Spark needs to free up memory for other blocks. It may also be evicted during a GC cycle, but this process is somewhat uncertain, as it depends entirely on Spark's own memory management policies and the JVM's garbage collection process.
Leaving it to GC is a passive approach over which you have less control; it behaves much like a black box unless you're well aware of its policies.
The best practice, however, is to explicitly unpersist DataFrames when they're no longer needed. This gives you more control over your application's memory usage and can help prevent memory issues in long-running Spark applications where different datasets are cached over time.
Nice video. By the way, what device do you use to write on the screen for teaching, bro?
Thanks @reyazahmed4855, I use an iPad