20 Data Caching in Spark
- Added 26. 07. 2024
- The video explains how Spark works with cached data, the difference between Spark cache and persist, and the impact of partial caching (a short sketch of both APIs follows the description below).
Chapters
00:00 - Introduction
00:29 - Demonstration
03:20 - Spark Cache
09:20 - Spark Storage Level with Persist
12:54 - Cache vs Persist
Local PySpark Jupyter Lab setup - • 03 Data Lakehouse | Da...
Python Basics - www.learnpython.org/
GitHub URL for code - github.com/subhamkharwal/pysp...
The series provides a step-by-step guide to learning PySpark, a popular open-source distributed computing framework used for big data processing.
New video every 3 days ❤️
#spark #pyspark #python #dataengineering
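To make the cache vs persist distinction concrete, here is a minimal sketch of the two APIs, assuming a local SparkSession and a toy DataFrame (the names are illustrative, not from the video; the exact default storage level varies across Spark versions):

from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.appName("cache-vs-persist").getOrCreate()
df = spark.range(1_000_000)   # toy DataFrame standing in for real data

# cache() is shorthand for persist() with the default storage level
# (MEMORY_AND_DISK; deserialized in recent Spark versions)
df.cache()
df.count()                # caching is lazy; an action materializes it
print(df.storageLevel)    # inspect the effective storage level
df.unpersist()

# persist() lets you choose the storage level explicitly
df.persist(StorageLevel.DISK_ONLY)
df.count()
df.unpersist()

unpersist() frees the cached blocks, which matters on a small local setup where executor memory is shared with shuffle and execution.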
one of the best in-depth explanations, Thanks :)
Could you please make a video on an "end to end data engineering" project, from requirement gathering to deployment.
Thanks ❤️ Please make sure to share with your network on LinkedIn 🛜
thanks for your efforts, it helps a lot
Thanks ❤️ Please make sure to share with your network over LinkedIn 🛜
Excellent content in this playlist! Thanks for sharing and keep up the good work 🚀
Nice job. Can you please provide more details on serialized and deserialized data when dealing with cache/persist in upcoming lectures?
Thanks. Your explanation is really good. Keep making such videos.
Also, if possible, make some videos on scenario-based interview questions.
As already mentioned in a comment, please make a video on serialization/deserialization of the objects.
will definitely try.
I have one query: cache() is equal to persist(pyspark.StorageLevel.MEMORY_AND_DISK). The only difference in this scenario is that cache() stores the data deserialized while persist stores it serialized. So, if persist is better in terms of data serialization and functionality, what is the use case for choosing cache over persist?
You already have the answer in your question: with cache the data is stored deserialized, so there is no extra work, but with persist the data is serialized and needs to be deserialized before processing.
@easewithdata Got it, thank you for the explanation! I went through all the videos in this playlist. I really loved it!
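For anyone who wants to see the serialized/deserialized knob directly, here is a small sketch using PySpark's StorageLevel constructor (the DataFrame here is a stand-in, not code from the video):

from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.getOrCreate()
df = spark.range(100_000)   # stand-in for any real DataFrame

# StorageLevel(useDisk, useMemory, useOffHeap, deserialized, replication=1)
serialized = StorageLevel(True, True, False, False)    # compact in memory, CPU cost to read back
deserialized = StorageLevel(True, True, False, True)   # raw objects, faster reads, more memory

df.persist(serialized)
df.count()        # persisting is lazy; an action materializes it
df.unpersist()

The trade-off discussed in the reply above is exactly this flag: serialized storage saves memory at the cost of deserialization CPU, while deserialized storage reads faster but occupies more memory.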
Consider you have an orders DataFrame with 25 million records.
Now you apply a projection and a filter and cache this DataFrame as shown below:
orders_df.select("order_id", "order_status").filter("order_status == 'CLOSED'").cache()
Now you execute the statements below...
1) orders_df.select("order_id", "order_status").filter("order_status == 'CLOSED'").count()
2) orders_df.filter("order_status == 'CLOSED'").select("order_id", "order_status").count()
3) orders_df.select("order_id").filter("order_status == 'CLOSED'").count()
4) orders_df.select("order_id", "order_status").filter("order_status == 'OPEN'").count()
Please answer the queries below...
question 1) At what point in time is the data cached (partially/completely)?
question 2) Which queries are served from the cache, and which have to go back to disk? Please explain.
As you have already written the complete query, why not just try it out and share the result with us?
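A minimal sketch of how to try it, assuming an orders_df with order_id and order_status columns (the schema from the comment above): Spark matches cached data by logical plan, and a physical plan containing InMemoryTableScan means the query is served from the cache.

# Assumes orders_df exists with order_id/order_status columns
cached_df = orders_df.select("order_id", "order_status").filter("order_status == 'CLOSED'")
cached_df.cache()
cached_df.count()   # the action that actually materializes the cache

# If the physical plan shows InMemoryTableScan, the query hits the cache;
# otherwise Spark reads the source data again.
orders_df.select("order_id", "order_status").filter("order_status == 'CLOSED'").explain()
orders_df.select("order_id", "order_status").filter("order_status == 'OPEN'").explain()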