Apache Spark Memory Management

  • Date added: 26. 07. 2024

Comments • 54

  • @senthilkumarpalanisamy365

    Excellent and clear-cut explanation, thanks so much for taking the time and preparing the content. Please do more.

  • @PratikPande-k5h
    @PratikPande-k5h 3 days ago

    Really appreciate your efforts. This was very easy to understand and comprehensive as well.

  • @hritikapal683
    @hritikapal683 3 months ago +6

    Please don't stop making videos, they're highly insightful!

  • @himanshuxyz87
    @himanshuxyz87 3 months ago +2

    I have read so many articles before on Spark Memory Management but this is the first time I have understood the allocation and other details so clearly. Thanks a lot. Really helpful video.

    • @afaqueahmad7117
      @afaqueahmad7117  3 months ago +1

      @himanshuxyz87 This means a lot, thank you for the appreciation :)

  • @nayanroy13
    @nayanroy13 4 months ago +3

    The best 23 mins and 8 secs I have ever spent :). This is easily one of the most useful videos on YouTube!

  • @Pratik0917
    @Pratik0917 3 months ago +1

    All videos are of high quality. I don't think we could get this level of knowledge anywhere else. Thank you, Afaque.

    • @afaqueahmad7117
      @afaqueahmad7117  2 months ago +1

      Thank you @Pratik0917, appreciate it, means a lot to me :)

  • @cloudanddatauniverse
    @cloudanddatauniverse 25 days ago

    Top class, brother! Simple, amazing and impactful. You deserve great appreciation for bringing out these internals. May God bless you with great health, peace of mind and prosperity! Keep growing.

    • @afaqueahmad7117
      @afaqueahmad7117  19 days ago +1

      Many thanks @cloudanddatauniverse, this means a lot, thank you for the kind words :)

  • @iamexplorer6052
    @iamexplorer6052 4 months ago +2

    Thank you, we are expecting more solid content like this from you.

  • @nikhillingam4630
    @nikhillingam4630 10 days ago

    It's very useful ❤

  • @rgv5966
    @rgv5966 24 days ago

    Hey @afaque, this is top class stuff, thanks for putting in all the effort and making it available for us. Keep going :)

    • @afaqueahmad7117
      @afaqueahmad7117  19 days ago

      Many thanks @rgv5966, this means a lot, appreciate it :)

  • @technicalsuranii
    @technicalsuranii 3 months ago

    Very in-depth description of Apache Spark Memory management 🎉🎉❤

  • @vinitrai5020
    @vinitrai5020 4 months ago +2

    Hey Afaque, thanks for the wonderful explanation.
    Now I have a few questions, please clear up these doubts:
    1. In unified memory, what if the execution memory needs the full space occupied by storage memory? Can blocks from storage memory be evicted to make room for execution memory? In other words, can execution memory occupy 100% of the unified memory (execution + storage)?
    2. If yes, then suppose an event where the execution memory occupies the full unified memory and still needs more memory.
    3. In this case, we have two choices: a disk spill or off-heap memory, and we should opt for off-heap memory over disk spill, as you explained in your video.
    4. The most important question now is: if we can use disk spill or off-heap memory, why do we get Out Of Memory errors in executors?
    I hope you got my points and will soon provide clear explanations from your end.
    Thanks again.

    • @afaqueahmad7117
      @afaqueahmad7117  4 months ago +4

      Hi @vinitrai5020, good question!
      Yes, execution can request 100% of the space from the unified memory manager pool. However, in cases where you want to protect cached blocks from eviction, you can set `spark.memory.storageFraction`. If you set this value to, e.g., 0.1, then 10% of the unified memory cannot be evicted by execution. It is important to note that this is on-demand: if `spark.memory.storageFraction` is set to 0.1 (10%) but nothing is cached, execution will simply go ahead and use that storage memory, and storage will have to wait for that 10% of memory to free up before it can use it. Refer to the Spark documentation here: spark.apache.org/docs/latest/tuning.html#memory-management-overview
      As for why Spark throws OOM errors despite always having the option to spill to disk: most in-memory structures used for joins, aggregations, sorting and shuffling cannot be "split". Consider an example where you're doing a join or an aggregation. In this operation, the same keys land in the same partition. Imagine one of the join/aggregation keys being so large that it doesn't fit in memory. Spilling doesn't work here because the in-memory structure holding that large key cannot be "split", i.e. depending on the nature of the data, half of the join cannot be done while spilling the rest, with the spilled data later read back to finish the join for the other half.
      Enabling off-heap memory would help reduce the memory pressure, and then:
      - Total execution memory = execution (on-heap) + execution (off-heap)
      - Total storage memory = storage (on-heap) + storage (off-heap)
      If the large key (as discussed above) is small enough to fit in the total execution memory after enabling off-heap memory, an OOM will be avoided.
      Hope this clarifies :)
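
For reference, the settings discussed in the reply above could be applied roughly as follows. This is a minimal PySpark sketch; the specific values (a 10% storage fraction, 2 GB of off-heap memory) are illustrative assumptions, not recommendations from the video.

```python
from pyspark.sql import SparkSession

# Minimal sketch: reserving a storage fraction and enabling off-heap memory.
# These settings must be in place when the application starts (builder or spark-submit).
spark = (
    SparkSession.builder
    .appName("memory-management-demo")
    # Fraction of unified memory immune to eviction by execution (default 0.5).
    .config("spark.memory.storageFraction", "0.1")
    # Allow execution and storage to also use off-heap memory.
    .config("spark.memory.offHeap.enabled", "true")
    .config("spark.memory.offHeap.size", "2g")
    .getOrCreate()
)
```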

    • @tahiliani22
      @tahiliani22 3 months ago

      @@afaqueahmad7117 Thanks for explaining this. I had the same question and this really helps.

  • @coledenesik
    @coledenesik 2 months ago

    I have two accounts on YouTube and subscribed from both; the reason is that you are putting some serious effort into the content. Beautiful diagrams, clear explanations and accurate information are the beauty of your content. Thanks, Afaque Bhai

    • @afaqueahmad7117
      @afaqueahmad7117  2 months ago +1

      Bohot shukriya @coledenesik bhai :) This comment made my day. Thank you for appreciating my efforts, it means a lot to me brother

  • @amiyakumarnayak8286
    @amiyakumarnayak8286 3 months ago

    Very detailed explanation. Thanks.

  • @iamkiri_
    @iamkiri_ 3 months ago

    Good one, bro. You are one of the elite data engineering YouTubers :-)

  • @prabas5646
    @prabas5646 A month ago

    Excellent.. please keep posting on the internals of Spark.

  • @pratikparbhane8677
    @pratikparbhane8677 3 months ago

    You are a real gem ❤, thanks bhai for the crystal clear explanation ❤❤

  • @ybalasaireddy1248
    @ybalasaireddy1248 4 months ago

    Thanks for the Fabulous content. More power to you

    • @afaqueahmad7117
      @afaqueahmad7117  4 months ago

      Thank you @ybalasaireddy1248, really appreciate it :)

  • @user-sk8vi1xy7q
    @user-sk8vi1xy7q 2 months ago

    Rare video, thanks for making it. Please make more videos ❤

    • @afaqueahmad7117
      @afaqueahmad7117  2 months ago

      Thank you @user-sk8vi1xy7q, appreciate the kind words :)

  • @rambabuposa5082
    @rambabuposa5082 3 months ago

    Thanks Afaque Ahmad, a very good series; I loved all of the videos. Good work!
    I have a few questions for you; maybe we can discuss them here if possible, or if you are planning a new video, I will wait for it.
    1. Here you discussed Executor Memory Management. What about Driver Memory Management? How does it work internally?
    2. What are the similarities between Executor and Driver Memory Management?
    3. What are the differences between Executor and Driver Memory Management?
    Many thanks in advance.

    • @afaqueahmad7117
      @afaqueahmad7117  3 months ago +2

      Hey @rambabuposa5082, thank you for the kind words, really appreciate it :)
      Regarding `Driver Memory Management`: I appreciate the ask, but I do not have plans for a video yet. The reason is that I believe driver and executor memory management go hand in hand, and it is relatively easy to reason about the driver if your concepts are clear on executor memory management, because of the several similarities (as you noted in one of your questions).
      Internally, their memory components look similar in the sense that both have JVM (on-heap) and off-heap memory, and the division/logic of memory in the driver is just the same as in the executor.
      The key differences are in roles and usage. You have one driver which is solely responsible for creating tasks, scheduling those tasks, communicating back and forth with the executors on progress, and aggregating the results (if needed); therefore its memory usage patterns differ from those of the executors, which perform the actual data processing and storage.
      Another important difference is the way OOM (out of memory) errors happen on drivers vs executors. Hopefully, I'll be creating some content specifically on OOM and other issues and how to navigate through them.
      Hope that clarifies :)
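
For illustration, the driver-side and executor-side memory settings touched on above are typically passed at submit time, since the driver JVM is already running by the time application code executes. A minimal sketch; the sizes and the application file name are illustrative assumptions.

```python
# Sketch: typical memory-related settings for the driver vs the executors,
# passed at submit time. Sizes below are placeholders, not recommendations.
import subprocess

submit_cmd = [
    "spark-submit",
    "--driver-memory", "4g",                        # driver JVM heap
    "--conf", "spark.driver.memoryOverhead=512m",   # driver overhead (non-JVM memory)
    "--executor-memory", "8g",                      # executor JVM heap
    "--conf", "spark.executor.memoryOverhead=1g",   # executor overhead
    "--num-executors", "4",
    "my_job.py",                                    # hypothetical application file
]
subprocess.run(submit_cmd, check=True)
```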

  • @bhargaviakkineni
    @bhargaviakkineni 3 months ago

    Excellent video, sir. Could you please make a video on garbage collection in Spark and the JVM?

  • @AlluArjun-ds9hh
    @AlluArjun-ds9hh 4 months ago +1

    Can you please explain more about serialization and deserialization in Spark?

  • @i_am_out_of_office_
    @i_am_out_of_office_ 4 months ago

    keep it coming!!

  • @deepakgonugunta
    @deepakgonugunta 3 months ago

    Please don't stop making videos

  • @avinash7003
    @avinash7003 3 months ago +1

    Please do one full end-to-end project on Apache Spark.

    • @afaqueahmad7117
      @afaqueahmad7117  2 months ago

      Thanks for the suggestion @avinash7003! It's in the plan.

    • @avinash7003
      @avinash7003 2 months ago +1

      @@afaqueahmad7117 Please also upload the most commonly asked questions in data engineering interviews.

  • @marreddyp3010
    @marreddyp3010 4 months ago

    Thanks for the excellent content. Can we see all the mentioned memory details in the Spark UI?

    • @afaqueahmad7117
      @afaqueahmad7117  4 months ago +1

      Thanks @marreddyp3010! Re: the Spark UI, on the "Executors" tab you can see most of the memory components: storage memory, on-heap and off-heap memory, and disk usage.
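
As a side note, the per-executor numbers shown on the "Executors" tab can also be read programmatically from Spark's monitoring REST API. A small sketch, assuming a running application with its UI on the default localhost:4040; adjust the host and port for your cluster.

```python
# Sketch: reading per-executor memory metrics from the Spark monitoring REST API.
import requests

base = "http://localhost:4040/api/v1"  # driver UI; default port assumed
app_id = requests.get(f"{base}/applications").json()[0]["id"]

for ex in requests.get(f"{base}/applications/{app_id}/executors").json():
    mm = ex.get("memoryMetrics", {})
    print(
        ex["id"],
        "storage memory used:", ex["memoryUsed"],
        "disk used:", ex["diskUsed"],
        "on-heap storage used:", mm.get("usedOnHeapStorageMemory"),
        "off-heap storage used:", mm.get("usedOffHeapStorageMemory"),
    )
```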

    • @marreddyp3010
      @marreddyp3010 4 months ago

      @@afaqueahmad7117 I am confused about user memory. As per the Spark documentation, by default it is 40% of the total memory. How can we check the usage of this memory in the Spark UI? Could you kindly help to sort this out? Please also make a PoC (proof of concept) video on resource usage with GBs of data.
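
For context on the 40% figure: with default settings, user memory is the part of the on-heap space left over after Spark's unified (execution + storage) region is carved out, and it is generally not reported as a separate metric in the UI. A rough back-of-the-envelope sketch, assuming default settings and an illustrative 10 GB executor heap:

```python
# Rough split of an executor heap under default settings:
#   spark.memory.fraction = 0.6, reserved memory = 300 MB (fixed).
heap_mb = 10 * 1024              # example: 10 GB executor heap (illustrative)
reserved_mb = 300                # reserved for Spark internals
usable_mb = heap_mb - reserved_mb

unified_mb = usable_mb * 0.6     # execution + storage (the "unified" region)
user_mb = usable_mb * 0.4        # user memory: user data structures, UDF objects, etc.

print(f"unified ~ {unified_mb:.0f} MB, user ~ {user_mb:.0f} MB")
```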

  • @grim_rreaperr
    @grim_rreaperr 3 months ago

    Thanks a lot bhai

  • @dileepn2479
    @dileepn2479 2 months ago

    What is the use of overhead memory?

    • @afaqueahmad7117
      @afaqueahmad7117  2 months ago

      Hey @dileepn2479, as mentioned at 4:18, it's used for internal, system-level operations. These are not directly related to data processing but are essential for the proper functioning of the executor, e.g. memory for JVM internals and network buffers during shuffling. Hope this clarifies :)
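
For reference, on YARN and Kubernetes this overhead region is sized by default as a fraction of the executor heap with a fixed floor, and it can also be set explicitly. A small sketch of the default sizing; the 8 GB heap is an illustrative assumption.

```python
# Default overhead sizing on YARN/Kubernetes:
#   overhead = max(executor_memory * spark.executor.memoryOverheadFactor, 384 MiB),
#   with spark.executor.memoryOverheadFactor defaulting to 0.10.
executor_memory_mb = 8 * 1024
overhead_mb = max(int(executor_memory_mb * 0.10), 384)
print(f"default overhead ~ {overhead_mb} MB")  # roughly 819 MB for an 8 GB executor

# It can also be set explicitly, e.g.:
#   spark-submit --conf spark.executor.memoryOverhead=1g ...
```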

    • @dileepn2479
      @dileepn2479 2 months ago

      Thank you @@afaqueahmad7117. I wasn't expecting such a swift response from your end. Thanks so much again!!