19 Understand and Optimize Shuffle in Spark

  • Published 26. 07. 2024
  • Video explains - How does Shuffle work in Spark? How to optimize Shuffle in Spark? (A short illustrative PySpark sketch follows after this description.)
    Chapters
    00:00 - Introduction
    00:20 - Understand Pipelining in Spark
    02:18 - Demonstration
    11:40 - Performance with Partitioned Data
    14:19 - Few More Tips
    Local PySpark Jupyter Lab setup - • 03 Data Lakehouse | Da...
    Python Basics - www.learnpython.org/
    GitHub URL for code - github.com/subhamkharwal/pysp...
    The series provides a step-by-step guide to learning PySpark, a popular open-source distributed computing framework that is used for big data processing.
    New video every 3 days ❤️
    #spark #pyspark #python #dataengineering
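
A minimal PySpark sketch of the ideas covered in the video, assuming a local SparkSession. The dataset names (orders, customers) and the partition count of 64 are hypothetical, chosen only for illustration: narrow transformations are pipelined within a single stage, wide transformations such as groupBy and join introduce a shuffle (an Exchange in the physical plan), broadcasting a small table avoids shuffling the large side, and spark.sql.shuffle.partitions controls how many shuffle partitions are created.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("shuffle-demo").getOrCreate()

# Hypothetical synthetic data, used only to illustrate the query plans.
orders = spark.range(1_000_000).select(
    F.col("id").alias("order_id"),
    (F.col("id") % 1000).alias("customer_id"),
    (F.rand() * 100).alias("amount"),
)
customers = spark.range(1000).select(
    F.col("id").alias("customer_id"),
    F.concat(F.lit("cust_"), F.col("id").cast("string")).alias("name"),
)

# Narrow transformations (select, filter, withColumn) are pipelined inside one
# stage. A wide transformation like groupBy forces a shuffle, visible as an
# Exchange in the physical plan.
per_customer = orders.groupBy("customer_id").agg(F.sum("amount").alias("total"))
per_customer.explain()

# Joining a large table with a small one: broadcasting the small side avoids
# shuffling the large side entirely (BroadcastHashJoin instead of a
# shuffle-based join).
joined = orders.join(F.broadcast(customers), "customer_id")
joined.explain()

# When a shuffle is unavoidable, size the number of shuffle partitions to the
# data volume instead of relying on the default of 200.
spark.conf.set("spark.sql.shuffle.partitions", "64")
```

Running explain() on each DataFrame shows whether an Exchange (shuffle) appears in the physical plan, which is the quickest way to check whether an optimization actually removed a shuffle.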

Comments • 6

  • @anveshkonda8334
    @anveshkonda8334 12 days ago

    Thanks a lot for sharing. It would be very helpful if you could add the data directory to the GitHub repo.

    • @easewithdata
      @easewithdata 7 days ago

      Some data files are too big to be uploaded to GitHub. Most of the data is uploaded at - github.com/subhamkharwal/pyspark-zero-to-hero/tree/master/datasets

  • @at-cv9ky
    @at-cv9ky 5 months ago

    Great explanation! And the article in the comments section is very good.

  • @sarthaks
    @sarthaks 6 months ago

    Regarding your statement "to avoid un-necessary shuffle wherever necessary", can you give some examples or scenarios?

    • @easewithdata
      @easewithdata 6 months ago

      Check out this article - blog.devgenius.io/pyspark-worst-use-of-window-functions-f646754255d2
      It is an example of unnecessary use of shuffle; a sketch of the pattern is below.
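
A minimal sketch of the kind of unnecessary shuffle the linked article discusses, using hypothetical sales data (column names are illustrative): a window defined without a partitionBy column moves every row into a single partition to compute a global aggregate, whereas a plain aggregation produces the same number with an ordinary partial-plus-final aggregate and far less data movement.

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("unnecessary-shuffle").getOrCreate()

# Hypothetical sales data for illustration.
sales = spark.range(1_000_000).select(
    (F.col("id") % 500).alias("store_id"),
    (F.rand() * 100).alias("amount"),
)

# Anti-pattern: an unpartitioned window pulls every row into a single partition
# on one executor (Spark warns "No Partition Defined for Window operation").
w = Window.partitionBy()
with_total = sales.withColumn("grand_total", F.sum("amount").over(w))

# Simpler alternative: an ordinary aggregation computes the same total with a
# regular partial + final aggregate instead of a single-partition window.
grand_total = sales.agg(F.sum("amount").alias("grand_total"))
```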

    • @sarthaks
      @sarthaks 6 months ago

      @@easewithdata Very useful, thanks for sharing the details.