How to Read Spark DAGs | Rock the JVM

  • Added 26 Jul 2024
  • Written version to keep for later: blog.rockthejvm.com/spark-dags/
    Check out the Spark performance courses:
    rockthejvm.com/p/spark-optimi...
    rockthejvm.com/p/spark-perfor...
    Download Spark: spark.apache.org/downloads.html
    Related video on reading Spark Query Plans: • How to Read Spark Quer...
    This video is for Spark programmers who know the essentials (e.g. creating a DataFrame, basic selects/joins) and want a sneak peek at how Spark works internally, along with an essential skill for performance analysis and improvement.
    In this video I'll show you how Spark creates computation dependencies before it runs anything, and how to read and interpret Directed Acyclic Graphs (DAGs) to identify performance bottlenecks while a job is running.
    Follow Rock the JVM on:
    LinkedIn: / rockthejvm
    Twitter: / rockthejvm
    Blog: rockthejvm.com/blog
    -------------------------------------------------------------------------
    Home: rockthejvm.com
    -------------------------------------------------------------------------
  • Science & Technology

Comments • 65

  • @anandahs6078
    @anandahs6078 2 years ago +2

    I don't know how to express my gratitude to you. I went through a lot of other YouTube videos, but none of them explained DAG execution like you do. I have become a fan. Your video is worth watching, and I recommend it to everyone.

  • @recker2326
    @recker2326 3 years ago +4

    This is the best video explaining DAGs. You helped me with a really important piece of work at my company! Cheers👍👍🥂🥂🍻

    • @rockthejvm
      @rockthejvm  3 years ago +1

      Happy to hear it! Help others by sharing the tips here.

  • @saketverma1471
    @saketverma1471 2 years ago +4

    Hi Daniel, Thanks for the video. Really appreciate the way you explained such a complex topic in simple terms.

  • @turnawayandgo
    @turnawayandgo 4 years ago +1

    this is literally my favorite video on spark

  • @murthyspec2002
    @murthyspec2002 2 years ago

    Thanks, Daniel, for your nice explanation of a complex topic in a simpler way!! Good job!!

  • @jacco2952
    @jacco2952 1 year ago +1

    This was really, really helpful for analysing big data tasks!

  • @mouadmuslim
    @mouadmuslim 2 months ago +1

    very well explained

  • @ashishlimaye2408
    @ashishlimaye2408 3 years ago +2

    The best explanation of DAGs.

  • @arpsdac
    @arpsdac 4 years ago +3

    I love your explanations; DAGs are very important for Spark optimization. I am a big follower of your courses on Udemy and other platforms. Please keep sharing such amazing videos with us.

  • @book_beats
    @book_beats 2 years ago

    Super clear explanation, thank you.

  • @arpsdac
    @arpsdac 3 years ago +1

    Big fan of your explanation.

  • @rupalgakkhar5913
    @rupalgakkhar5913 1 year ago +1

    This one is useful.. Thanks for making it

  • @rishabhkohli7170
    @rishabhkohli7170 2 years ago

    Amazing explanation
    Thanks

  • @tadastadux
    @tadastadux 1 year ago +1

    It's a very useful video, thank you. I'd like more advice on how to relate the mapPartitions steps in the DAG back to the actual code.

  • @gian963
    @gian963 4 years ago +1

    Really great video! I was following you on Udemy and I was hoping to have more videos on the subject of advanced Spark optimization features. :)

    • @rockthejvm
      @rockthejvm  4 years ago +1

      Glad you like it! The home for future material will be rockthejvm.com, so check there for the latest and greatest!

  • @ygorallandefraga7805
    @ygorallandefraga7805 3 years ago +3

    Awesome. You should record other videos talking about what shuffle read and shuffle write mean (how to interpret them) and a little bit about how to understand the state of spark (mainly in the dag). I think this kind of information is missing in the Spark Documentation (or at least is pretty hidden). Keep doing videos like these! Thx

    • @rockthejvm
      @rockthejvm  3 years ago +1

      Glad it helps - we'll be doing more of these.

  • @abhi121able
    @abhi121able 2 years ago

    Really meaningful content.

  • @learn_technology_with_panda

    Excellent

  • @animeshsrivastava2398
    @animeshsrivastava2398 3 months ago +1

    Such a detailed video. Why is this so low on views?

    • @rockthejvm
      @rockthejvm  3 months ago +1

      Anybody's guess: algorithm, I don't put an angry face on my thumbnails, or something else 😅 in any case, share it!

  • @rs-research-laboratory
    @rs-research-laboratory 4 years ago +1

    Really a good explanation

  • @sagarghanwat2225
    @sagarghanwat2225 3 years ago

    Great explanation!!

  • @ashishchoksi8501
    @ashishchoksi8501 2 years ago +1

    Great video!
    Visiting your channel for the first time; glad that YouTube showed this video in my suggestions.
    I already know the basics of Spark, having learned a few things here and there, but I suspect I'm missing many things.
    Do you have a Spark beginner course?

    • @rockthejvm
      @rockthejvm  2 years ago

      Yes! rockthejvm.com/p/spark-essentials

  • @ibnu0muhammad
    @ibnu0muhammad 1 year ago +1

    Oh my god, this is so good! Can you create a video on how we interact with the executors through every operation?

    • @rockthejvm
      @rockthejvm  1 year ago

      Check out the course, we have this at length: rockthejvm.com/p/spark-essentials

  • @thevijayraj34
    @thevijayraj34 2 years ago +1

    Great

  • @shuaibsaqib5085
    @shuaibsaqib5085 2 years ago

    Hi Daniel, nice explanation. A small suggestion from my side: you speak a little fast; I'm watching your video at 0.75x speed.

  • @romi2015
    @romi2015 3 years ago +1

    Excellent video and explanations.
    One comment: at 14:18 the join seems to happen in stage 6 with the broadcasted DataFrame from the previous job. The shuffle between stages 6 and 7 is for collecting the local sums of each partition and summing them up to get the final result. The 413 bytes of Shuffle Write in stage 6 is an indication. Is that correct?

  • @databiceps
    @databiceps 4 years ago +1

    Hi Daniel,
    Thanks for the great video. I have been following you on Udemy. I was really looking out for the DAG explanation. Thanks a ton!!
    Also, could you point me to any resource where I can learn more about Spark DAGs?

    • @rockthejvm
      @rockthejvm  4 years ago

      I'll create some more material in time - for now, the best resources I have are in the Spark Optimization courses on Rock the JVM: rockthejvm.com/p/spark-optimization

  • @rmiliming
    @rmiliming 1 year ago +1

    Nice video! Can we have a video on how to avoid/reduce shuffling? Thanks

    • @rockthejvm
      @rockthejvm  1 year ago

      A whole course, actually: rockthejvm.com/p/spark-optimization

  • @ishangaur9667
    @ishangaur9667 3 years ago +1

    Awesome explanation of Spark DAGs in layman's language. A follow-up question: how is the number of tasks (partitions) decided for the other stages present in the DAG? You talked about the stages that had 7 or 9 tasks, which we could identify from the repartition(7) or repartition(9) transformations you did, but what about the other stages? How are the tasks calculated for those?

    • @rockthejvm
      @rockthejvm  3 years ago

      There are various defaults that Spark uses. For example, parallelize uses the number of virtual cores as the number of partitions (in the absence of other config). Other operations use 200 partitions as a default, etc.
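      To make the even-split intuition concrete, here is a pure-Python sketch (my own illustration, not Spark's actual code; the function name is made up) of how a collection can be divided as evenly as possible into a given number of partitions, the way parallelize distributes data across virtual cores:

```python
# Sketch of even slicing into partitions: partition i covers the index
# range [i*n/p, (i+1)*n/p), so sizes differ by at most one element.
def slice_collection(data, num_partitions):
    n = len(data)
    return [
        data[(i * n) // num_partitions:((i + 1) * n) // num_partitions]
        for i in range(num_partitions)
    ]

# With 10 elements and 4 "cores", partitions get 2/3/2/3 elements.
parts = slice_collection(list(range(10)), 4)
print([len(p) for p in parts])  # → [2, 3, 2, 3]
```

      The same idea explains why, without any config, the number of tasks in a stage often equals the number of cores: each core gets roughly the same slice of the data.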

  • @anirbansom6682
    @anirbansom6682 3 years ago +1

    Hi Daniel, before explaining the join step you mentioned that the smaller table is broadcast across the executors. Then why is there an exchange due to the join?

    • @rockthejvm
      @rockthejvm  3 years ago +1

      It should appear as BroadcastExchange in the query plans - that's a sign that the broadcast is working.
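      A rough pure-Python sketch (an assumed illustration, not Spark internals) of what a broadcast hash join does: the small side is turned into a hash map and copied to every executor, so each partition of the big side joins locally and the big side is never shuffled:

```python
# Broadcast hash join sketch: build a lookup from the small (broadcast)
# side once, then probe it row by row within each big-side partition.
def broadcast_hash_join(big_partition, small_rows, key):
    lookup = {row[key]: row for row in small_rows}  # the "broadcast" side
    joined = []
    for row in big_partition:
        match = lookup.get(row[key])
        if match is not None:           # inner join: keep only matches
            joined.append({**row, **match})
    return joined

small = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
big_part = [{"id": 1, "v": 10}, {"id": 3, "v": 30}]
print(broadcast_hash_join(big_part, small, "id"))
# → [{'id': 1, 'v': 10, 'name': 'a'}]
```

      This is why BroadcastExchange in the plan is good news: only the small table moves, once, instead of both tables being shuffled by key.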

  • @akhilraj8614
    @akhilraj8614 4 years ago

    Hey Daniel, great content!! I was wondering if you could lay out the order in which you would advise someone coming from a Python background to take your courses, and whether there are any supplementary reading materials you would advise reading in between courses. Thanks in advance :)

    • @rockthejvm
      @rockthejvm  4 years ago +1

      Yes, absolutely. Will put something out for people with different backgrounds.

  • @SpiritOfIndiaaa
    @SpiritOfIndiaaa 3 years ago

    Thank you so much, nice explanation. You are doing a "sum"; instead, can you do a "count" of records after aggregation? When we use "count" we see only one task, i.e. one partition, and it's killing performance. How do we handle/repartition the data in such a scenario?

    • @rockthejvm
      @rockthejvm  2 years ago

      It's not killing performance - that aggregation works in the same way, as partial aggregations are computed per partition before being collapsed into one value.

  • @ashutoshsamal7158
    @ashutoshsamal7158 1 year ago +1

    Why is there a shuffle for the sum? I think it could have been done in parallel.

    • @rockthejvm
      @rockthejvm  1 year ago +1

      Partial sums are computed in parallel, but the final result is computed on one executor as the partial results are aggregated.
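      A pure-Python sketch of that two-step aggregation (my own illustration of the idea, not Spark code): each partition computes its partial sum independently, and only the tiny partial results travel across the shuffle to be combined into the final answer:

```python
# Partial aggregation sketch: the per-partition sums can run in
# parallel; only their small results are collected for the final sum.
partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]

partial_sums = [sum(p) for p in partitions]  # "map side", one task per partition
total = sum(partial_sums)                    # "reduce side", a single final task

print(partial_sums, total)  # → [6, 9, 30] 45
```

      The shuffle you see in the DAG moves those three small numbers, not the whole dataset, which is why the final single-task stage is cheap.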

  • @Prashanth-yj6qx
    @Prashanth-yj6qx 2 years ago

    Hi Rock, can you create a data skew/data spill code example and explain how to do performance improvements?

    • @rockthejvm
      @rockthejvm  2 years ago

      Yep! rockthejvm.com/p/spark-optimization

  • @rishigc
    @rishigc 3 years ago +1

    Hi Daniel, incredible video. I work with Spark SQL queries and often need to optimize poorly performing SQL queries by analyzing details in the Spark UI. Are there any of your resources that discuss this in detail? If yes, could you please point me in the right direction? Would love to hear more from you. Thanks

    • @rockthejvm
      @rockthejvm  3 years ago

      Yes - I have long-form, in-depth courses on Spark performance at rockthejvm:
      rockthejvm.com/p/spark-optimization
      rockthejvm.com/p/spark-performance-tuning

    • @rishigc
      @rishigc 3 years ago

      @@rockthejvm Thanks a bunch, Daniel. Looking forward to taking up the courses.

    • @rishigc
      @rishigc 3 years ago

      @@rockthejvm I took a look at the courses, but I think they go into detail with Scala, which I don't know. Do you have a course on performance-tuning SQL queries that is generic in nature, e.g. selects, joins, where clauses, etc.?

    • @rishigc
      @rishigc 3 years ago

      @@rockthejvm Hi Daniel, I am looking forward to purchasing the "spark-optimization" course. Could you please let me know which of these 2 courses discusses SparkSQL query optimizations, reading query plans, join optimization, and interpreting the Spark UI (rewriting queries etc.) extensively? Thanks

    • @rishigc
      @rishigc 3 years ago

      @@rockthejvm Hi Daniel, I bought the "spark-optimization" course :-)

  • @rsamurti
    @rsamurti 2 years ago

    Is it possible to capture the DAGs while a job is running, to determine the average job completion time? This would be helpful for fine-tuning the job scheduler to reduce the average job completion time.

    • @rockthejvm
      @rockthejvm  2 years ago

      You can build such a tool that inspects Spark at runtime.

  • @Gauravkumar-xw9ug
    @Gauravkumar-xw9ug 4 years ago

    Hey Daniel, are you planning to make your "Spark Optimization" course available on Udemy?

    • @rockthejvm
      @rockthejvm  4 years ago

      Nope, only on the rockthejvm.com website for the most dedicated people. Hoping you'll join us there!