DAG and Lazy Evaluation in Spark

  • Published 22. 08. 2024
  • In this video I have talked about DAG and lazy evaluation in Spark in great detail. Please watch the video in full and ask your doubts in the comment section below.

Comments • 129

  • @tnmyk_
    @tnmyk_ 6 months ago +8

    Awesome explanation! Finally someone explained why lazy evaluation actually works better for big data processing. Amazing examples, very nice code! Loved the way you explained each line and each job step by step.

  • @prasadBoyane
    @prasadBoyane 4 months ago +2

    I think Spark considers 'sum' an action, hence 4 jobs. Great series!

  • @ChandanKumar-xj3md
    @ChandanKumar-xj3md a year ago +2

    "How does a job get created?" This question was never clear to me before, but thanks Manish for clearing it up; the add-on was the understanding of lazy evaluation. 👍

  • @AnuragsMusicChannel
    @AnuragsMusicChannel a month ago

    sum() is an action, but the key here is to understand that it's not triggered until another action triggers it. Example: select(sum("value")) is effectively a transformation; it creates a new DataFrame that represents the sum of the value column but does not immediately compute the result. The actual computation happens only when an action is triggered. Later, when we call show() (or even collect()), the action triggers and runs the sum() inside the select(), which creates a job corresponding to sum(). That's why 4 jobs were seen.
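
    To see this ordering concretely, here is a minimal, self-contained sketch (the toy data and session name are illustrative): the select(sum(...)) line only extends the DAG, and the job shows up only when show() runs.

        from pyspark.sql import SparkSession
        from pyspark.sql.functions import sum as spark_sum

        spark = SparkSession.builder.master("local[2]").appName("lazy-sum-demo").getOrCreate()

        df = spark.createDataFrame([(1,), (2,), (3,)], ["value"])

        total = df.select(spark_sum("value"))  # transformation: builds the plan, no job yet
        total.show()                           # action: the job runs and the sum is computed here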

  • @SanjayKumar-rw2gj
    @SanjayKumar-rw2gj 2 months ago

    Truly impressed, Manish bhai. Great explanation; as you already said, "You won't find this much detail anywhere else."

  • @akhiladevangamath1277
    @akhiladevangamath1277 3 months ago +1

    Thank you Thank you Thank you Manish for this video✨✨✨

  • @Watson22j
    @Watson22j a year ago +3

    wow! very nicely explained. Thank you! :)

  • @krishnavamsirangu1727
    @krishnavamsirangu1727 3 months ago

    Hi Manish
    Thanks for explaining the concept in detail by running the code.
    I have understood the concepts of DAG, lazy evaluation and optimization.

  • @PARESH_RANJAN_ROUT
    @PARESH_RANJAN_ROUT 5 days ago

    Great Bhai

  • @maurifkhan3029
    @maurifkhan3029 a year ago +4

    I too got confused about why the number of jobs is sometimes more or less than the number of actions. Try clearing the state using the menu option Run -> Clear state, and then re-run the cell that has the code from reading the file through everything you want to perform. I think Databricks intelligently stores the state of the system, so when you later run the same read command the job count might not match.
    I tried this and it seems to be working.

    • @jatinyadav6158
      @jatinyadav6158 7 months ago +4

      The jobs count is right; it is 4 because the sum() function is an action, which I guess Manish missed by mistake. Btw @Manish, thank you so much for the amazing course.

    • @deepanshuaggarwal7042
      @deepanshuaggarwal7042 4 months ago

      @@jatinyadav6158 If 'sum' is an action, then why didn't it create a job before the 'show' line was added?

    • @jatinyadav6158
      @jatinyadav6158 4 months ago +2

      @deepanshuaggarwal7042 Yes, sum is an action; I am not sure why it didn't show a job earlier.

  • @arijitsamaddar268
    @arijitsamaddar268 3 months ago +1

    Spot-on explanation!

  • @fervabatool1037
    @fervabatool1037 8 days ago

    Excellent

  • @abinashpandit5297
    @abinashpandit5297 a year ago

    Very good, Bhaiya. Today I learned a lot in depth from this that I didn't know before. Keep it up 👍

  • @souradeep.official
    @souradeep.official 5 months ago

    Detailed Explanation. Better than paid lectures.

  • @prabhatgupta6415
    @prabhatgupta6415 a year ago

    He has truly mastered Spark.

  • @prabhakarkumar8022
    @prabhakarkumar8022 5 months ago +1

    Awesome bhaiyaji!!!!!

  • @220piyush
    @220piyush 4 months ago

    Had a great time watching the video... Wow ❤

  • @aasthagupta9381
    @aasthagupta9381 a month ago

    You are an excellent teacher; you make lectures so interesting! By giving these answers we'll end up teaching the interviewers :D

  • @bhavindedhia3976
    @bhavindedhia3976 5 months ago +1

    amazing content

  • @prathapganesh7021
    @prathapganesh7021 5 months ago

    Thank you so much for clarifying my doubts 🙏

  • @choubeysumit246
    @choubeysumit246 2 months ago

    "One action, one job" is true for the RDD API only. One action on a DataFrame or Dataset can lead to multiple jobs being generated internally, and due to Adaptive Query Execution as well, multiple jobs are created in Databricks, which you can see using the describe method.
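
    For anyone who wants to verify this locally, here is a small sketch using PySpark's status tracker (the exact counts are illustrative and vary with Spark version and AQE settings):

        from pyspark.sql import SparkSession
        from pyspark.sql.functions import col

        spark = SparkSession.builder.master("local[2]").appName("job-count-demo").getOrCreate()
        sc = spark.sparkContext

        # RDD API: one action generally maps to one job
        sc.setJobGroup("rdd_group", "count on an RDD")
        sc.parallelize(range(1000)).count()
        print("RDD jobs:", len(sc.statusTracker().getJobIdsForGroup("rdd_group")))

        # DataFrame API: a single action can spawn several jobs internally (shuffles, AQE)
        sc.setJobGroup("df_group", "show on an aggregated DataFrame")
        spark.range(1000).withColumn("bucket", col("id") % 10).groupBy("bucket").count().show()
        print("DataFrame jobs:", len(sc.statusTracker().getJobIdsForGroup("df_group")))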

  • @SqlMastery-fq8rq
    @SqlMastery-fq8rq 4 months ago

    Very well explained, Sir. Thank you.

  • @ankitachauhan6084
    @ankitachauhan6084 2 months ago

    Thank you! Great teaching style.

  • @ravikiran3672
    @ravikiran3672 7 hours ago

    For a wide transformation an extra job will be created. For n transformations there will be n+1 jobs.

  • @akashprabhakar6353
    @akashprabhakar6353 4 months ago

    Awesome lecture...thanks a lot!

  • @yugantshekhar782
    @yugantshekhar782 5 months ago

    Great explanation sir, really helpful!

  • @mmohammedsadiq2483
    @mmohammedsadiq2483 10 months ago +1

    I have a confusion: read and inferSchema are typically used with Spark's DataFrame API, which is part of Spark SQL. They are not transformations or actions; they belong to the logical and physical planning phase of Spark, which occurs before any actions are executed.
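
    In practice, inferSchema does make the read eager: Spark scans the CSV once to guess the column types, and that scan shows up as a job in the UI before any explicit action. A minimal sketch, assuming a notebook where spark is predefined and the flight_data.csv used in this series (LongType for count is an assumption):

        from pyspark.sql.types import StructType, StructField, StringType, LongType

        # Eager: inferSchema forces a scan of the file at read time
        eager = spark.read.format("csv") \
            .option("header", "true") \
            .option("inferSchema", "true") \
            .load("dbfs:/FileStore/tables/flight_data.csv")

        # Lazy: with an explicit schema, no job runs until an action is called
        schema = StructType([
            StructField("DEST_COUNTRY_NAME", StringType()),
            StructField("ORIGIN_COUNTRY_NAME", StringType()),
            StructField("count", LongType()),
        ])
        lazy = spark.read.format("csv") \
            .option("header", "true") \
            .schema(schema) \
            .load("dbfs:/FileStore/tables/flight_data.csv")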

  • @kavyabhatnagar716
    @kavyabhatnagar716 10 months ago

    Wow! Thank you for such a great explanation. ❤

  • @a26426408
    @a26426408 4 months ago

    Very well explained.

  • @rishav144
    @rishav144 a year ago +1

    great video Manish bro

  • @tahiliani22
    @tahiliani22 4 months ago +1

    Awesome. By the way, do we know why it's creating 4 Spark jobs instead of 3?

  • @welcomefoodies6901
    @welcomefoodies6901 6 days ago

    Hi Manish bhaiya, here 4 actions were hit: read, inferSchema, sum, show.

  • @arpitchaurasia5132
    @arpitchaurasia5132 6 months ago

    Bhai, you teach amazingly; I really enjoyed it.

  • @manish_kumar_1
    @manish_kumar_1  a year ago

    Directly connect with me on:- topmate.io/manish_kumar25

  • @pramod3469
    @pramod3469 a year ago

    very well explained...thanks Manish

  • @jasvirsinghwalia401
    @jasvirsinghwalia401 2 months ago +1

    Sir, read and inferSchema are transformations, right, not actions? Then why were separate jobs created for them?

  • @SanjayKumar-rw2gj
    @SanjayKumar-rw2gj 2 months ago

    Is there any cheat sheet to know which calls are transformations and which are actions — like, is read an action whereas filter is a transformation?

  • @rohitjhunjhunwala9174
    @rohitjhunjhunwala9174 a month ago

    One thing: spark.read is a transformation, not an action. Does it become an action because we included certain options? Please clarify.

  • @manojkaransingh5848
    @manojkaransingh5848 a year ago

    Wow! Very nice, bro.

  • @krushitmodi3882
    @krushitmodi3882 a year ago

    Sir, please finish this series a bit sooner so we can give interviews; I have watched your entire channel.
    Thank you

  • @abhilovefood4102
    @abhilovefood4102 a year ago

    Sir, your teaching is good.

  • @mission_possible
    @mission_possible a year ago

    Thanks for the session, and please make a video on Spark lineage.

  • @Storytime389
    @Storytime389 26 days ago

    4 jobs come because you calculated sum and then called .show(). I think sum() increases the job count. Correct me if I'm wrong.

  • @roshniagrawal4777
    @roshniagrawal4777 10 days ago

    sum is an action

  • @koushlendrasinghrajput6040
    @koushlendrasinghrajput6040 a month ago

    Please share the dataset so we can practice on that.

  • @user-ww6yf3iq8q
    @user-ww6yf3iq8q 4 months ago

    A job is created because of the groupBy.

  • @DevendraYadav-yz2so
    @DevendraYadav-yz2so 11 months ago

    How do we use Databricks Community Edition, and how do we set up Spark with Databricks? Please explain this so that we can write the code.

  • @AmitSharma-ow8wm
    @AmitSharma-ow8wm a year ago

    Waiting for your next video...

    • @AmitSharma-ow8wm
      @AmitSharma-ow8wm a year ago

      @@rampal4570 Is it true, bro?

    • @manish_kumar_1
      @manish_kumar_1  a year ago

      It will come out today.

  • @ankitas4019
    @ankitas4019 4 months ago

    Where did he explain how to download the flight data?

  • @pramod3469
    @pramod3469 a year ago

    Does lazy evaluation consider partitioning as well?
    For example, after we have applied orderBy on the salary column and now want to show only the first two highest salaries:
    will lazy evaluation work here too, so that Spark processes only the partition that has those two salary records, or will it process all partitions and then extract the first two highest salary records for us?

    • @manish_kumar_1
      @manish_kumar_1  a year ago +1

      Yes. Until you write .head(2) for the 2 highest records, your process will not start, although in the backend it will create the DAG.
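
      A small sketch of that behaviour, with a hypothetical employees DataFrame (assuming spark is already created): nothing executes while the plan is built; the job starts only at head(2).

          from pyspark.sql.functions import col

          employees = spark.createDataFrame(
              [("a", 100), ("b", 300), ("c", 200)], ["name", "salary"])

          ranked = employees.orderBy(col("salary").desc())  # transformation: only the DAG grows
          top_two = ranked.head(2)                          # action: the job actually runs here
          print(top_two)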

  • @mahendrareddych334
    @mahendrareddych334 4 months ago

    Bro, you are explaining superbly, but why don't you explain in English? Not everyone knows Hindi. I don't know Hindi, but I'm watching your videos to understand the concepts and not getting them fully because they are explained in Hindi.

  • @ruinmaster5039
    @ruinmaster5039 a year ago

    Bro, please add a summary at the end.

  • @aditya_1005
    @aditya_1005 a year ago +1

    Well explained. Sir, could you please clarify: 3 actions, yet 4 jobs created?

    • @manish_kumar_1
      @manish_kumar_1  a year ago

      Your 3 actions created 4 jobs?
      Did you use show? And please paste your code in the comment section as well.

    • @hazard-le7ij123
      @hazard-le7ij123 11 months ago

      @@manish_kumar_1 The code you wrote also created 4 jobs. Can you explain that?
      Below is my code, and the same thing is happening: 4 jobs get created. A stage gets skipped, but why do we have an extra job, with 4 different job IDs?

      from pyspark.sql import SparkSession
      from pyspark.sql.functions import *

      spark = SparkSession.builder.master('local[5]') \
          .appName("Lazy Evaluation internal working") \
          .getOrCreate()

      flight_data = spark.read.format("csv") \
          .option("header", "true") \
          .option("inferSchema", "true") \
          .load("D:\\Spark\\flight_data.csv")

      flight_data_repartition = flight_data.repartition(3)
      us_flight_data = flight_data.filter(col("DEST_COUNTRY_NAME") == 'United States')
      us_india_data = us_flight_data.filter((col("ORIGIN_COUNTRY_NAME") == 'India') | (col("ORIGIN_COUNTRY_NAME") == 'Singapore'))
      total_flight_ind_sing = us_india_data.groupby("DEST_COUNTRY_NAME").sum("count")
      total_flight_ind_sing.show()

      input("Enter to terminate")

  • @vaibhavdimri7419
    @vaibhavdimri7419 3 months ago

    Sir, did you understand how hitting one action created 2 jobs?

  • @abhishekchaturvedi9855
    @abhishekchaturvedi9855 8 months ago

    Hello Manish.
    When you mentioned that the SQL query gets optimized by Spark, I just wanted to know: will it help improve execution time if we use the already-optimized query in our code itself, so that Spark need not do it?

    • @manish_kumar_1
      @manish_kumar_1  8 months ago +1

      Spark's optimization is very limited, so as developers we should write optimized code to run our process faster.

  • @chandanpatra1053
    @chandanpatra1053 6 months ago

    Given a piece of code written using Spark, how can you tell by looking at it whether a call is an 'action' or a 'transformation'?

    • @manish_kumar_1
      @manish_kumar_1  6 months ago

      You should Google which ones are actions; the rest are transformations.
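
      There is no official cheat sheet, but as a rough, non-exhaustive reference for the DataFrame API (assuming an existing spark session):

          # Transformations (lazy: they only extend the DAG):
          #   select, filter/where, withColumn, groupBy (+ agg/sum), join,
          #   orderBy, distinct, repartition, coalesce
          df = spark.range(10).filter("id > 5")   # nothing runs yet

          # Actions (eager: each triggers at least one job):
          #   show, collect, count, take, head, first, foreach, write/save
          df.count()                               # a job runs here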

  • @user-gt3pi6ir5u
    @user-gt3pi6ir5u 4 months ago

    Any idea now where the 4th job came from?

  • @vsbnr5992
    @vsbnr5992 a year ago +1

    NameError: name 'flight_data_repartition' is not defined. What should I do in this case? I get this even after importing functions and types from pyspark.
    Please help, I'm stuck here.
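
    For anyone hitting this NameError: importing pyspark.sql.functions does not create the variable; the name flight_data_repartition has to be defined (and its cell run) before any cell that uses it. A minimal sketch, using the flight_data.csv path from this series:

        flight_data = spark.read.format("csv") \
            .option("header", "true") \
            .option("inferSchema", "true") \
            .load("dbfs:/FileStore/tables/flight_data.csv")

        # Run this line before any cell that references flight_data_repartition:
        flight_data_repartition = flight_data.repartition(3)
        flight_data_repartition.show(5)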

  • @anirbanadhikary7997
    @anirbanadhikary7997 a year ago

    Today you didn't share the interview questions.

    • @manish_kumar_1
      @manish_kumar_1  a year ago

      Basic questions would be there, like: what is a DAG, and what are the edges and vertices in it?

  • @abhayjr11
    @abhayjr11 3 months ago

    Bhai, please give me the video before this one; I can't find it.

  • @ChetanSharma-oy4ge
    @ChetanSharma-oy4ge 6 months ago

    I am trying to find out why 4 jobs are generated here although we have written only 3 actions.

  • @chethanmk5852
    @chethanmk5852 5 months ago

    Why do we have 4 jobs when we are using only 3 actions in the application?

  • @welcomefoodies6901
    @welcomefoodies6901 12 days ago

    I have one doubt, bhai: what is the difference between Apache Spark and Apache Airflow?

    • @manish_kumar_1
      @manish_kumar_1  12 days ago +1

      Completely different purposes and different technologies. Please Google it.

    • @welcomefoodies6901
      @welcomefoodies6901 12 days ago

      @@manish_kumar_1 Thank you bhaiya; please make a series on Apache Airflow too. I really connect with you: what I couldn't understand from paid courses, you explain so well. Thank you bhai, love you bro 🙌❤️

  • @deepaliborde25
    @deepaliborde25 8 months ago

    Where is the practical session link?

  • @techworld5477
    @techworld5477 6 months ago

    Hi Sir, when I run this code I am getting the error: name 'col' is not defined.
    How do I solve this?

  • @snehalkathale98
    @snehalkathale98 5 months ago

    Where do I get the CSV file?

  • @prateekpawar1871
    @prateekpawar1871 11 months ago

    Do you have theory notes for Spark?

  • @prabhatsingh7391
    @prabhatsingh7391 a year ago

    Hi Manish bhaiya, in the code snippet you said there are three actions in this application (read, infer schema and show), but in the Spark UI 4 jobs were created. Can you please explain this?

    • @manish_kumar_1
      @manish_kumar_1  a year ago

      One job must have been skipped. If the data is small, check with explain; it should come to 3.
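
      For reference, explain() is a way to check the plan without triggering more jobs; a minimal sketch using the DataFrame name from this lecture's code:

          total_flight_ind_sing.explain()       # prints the physical plan; no job is triggered
          total_flight_ind_sing.explain(True)   # extended: parsed, analyzed, optimized and physical plans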

  • @DevendraYadav-yz2so
    @DevendraYadav-yz2so 11 months ago

    I have watched up to lecture 7. How do I run the code you are showing, with Databricks and PySpark?

    • @manish_kumar_1
      @manish_kumar_1  11 months ago

      You have to watch the practical and fundamentals playlists together. I explained this in the first video itself.

  • @ordinary_indian
    @ordinary_indian 6 months ago

    Where can I find the files? I have just started the course.

  • @asif50786
    @asif50786 a year ago

    How many more videos are to come on Apache Spark?

  • @ajaysinghjadoun9799
    @ajaysinghjadoun9799 a year ago

    Please make a video on window functions.

  • @raghavsisters
    @raghavsisters a year ago

    Why is it called acyclic?

    • @manish_kumar_1
      @manish_kumar_1  a year ago

      Because it doesn't form a cycle. If it got into a cycle (think of it as a circle), it would run endlessly.

  • @navjotsingh-hl1jg
    @navjotsingh-hl1jg a year ago

    Bhai, please share the file for this lecture.

  • @amlansharma5429
    @amlansharma5429 a year ago

    us_india_data = us_flight_data.filter((col("ORIGIN_COUNTRY_NAME") == 'India') |
    (col("ORIGIN_COUNTRY_NAME") == 'Singapore'))
    This shows the error: NameError: name 'col' is not defined
    How do I define it?

    • @AliKhanLuckky
      @AliKhanLuckky a year ago

      You will have to import col; it's a function, so import it from functions, I think.

    • @manish_kumar_1
      @manish_kumar_1  a year ago

      Correct: "from pyspark.sql.functions import *"

  • @khurshidhasankhan4700
    @khurshidhasankhan4700 9 months ago

    Sir, how are two jobs being created when reading the CSV? We are calling only one action, read.
    If possible, please clarify.

    • @manish_kumar_1
      @manish_kumar_1  9 months ago

      You must have used inferSchema as well; that's why it's showing up.

    • @khurshidhasankhan4700
      @khurshidhasankhan4700 9 months ago

      @@manish_kumar_1 Thank you, sir. Can you please share the action list, i.e. how many actions there are in Spark? If possible, please share, sir.

  • @avanibafna6207
    @avanibafna6207 a year ago

    In my case the same code created 5 jobs. I have imported col; will that also be treated as an action and create a new job? Is that so?

    • @manish_kumar_1
      @manish_kumar_1  a year ago

      Can you please paste your code in the comment section?

    • @avanibafna6207
      @avanibafna6207 a year ago

      @@manish_kumar_1 from pyspark.sql.functions import col

      flight_data = spark.read.format("csv") \
          .option("header", "true") \
          .option("inferSchema", "true") \
          .load("dbfs:/FileStore/tables/flight_data.csv")
      flight_data_reparition = flight_data.repartition(3)
      us_flight_data = flight_data_reparition.filter("DEST_COUNTRY_NAME='United States'")
      us_india_data = us_flight_data.filter((col("ORIGIN_COUNTRY_NAME") == 'India') | (col("ORIGIN_COUNTRY_NAME") == 'Singapore'))
      total_flight_ind_sing = us_india_data.groupby("DEST_COUNTRY_NAME").sum("count")
      total_flight_ind_sing.show()

      (5) Spark Jobs
      Job 22 View (Stages: 1/1)
      Job 23 View (Stages: 1/1)
      Job 24 View (Stages: 1/1)
      Job 25 View (Stages: 1/1, 1 skipped)
      Job 26 View (Stages: 1/1, 2 skipped)

      flight_data: pyspark.sql.dataframe.DataFrame = [DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]
      flight_data_reparition: pyspark.sql.dataframe.DataFrame = [DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]
      us_flight_data: pyspark.sql.dataframe.DataFrame = [DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]
      us_india_data: pyspark.sql.dataframe.DataFrame = [DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]
      total_flight_ind_sing: pyspark.sql.dataframe.DataFrame = [DEST_COUNTRY_NAME: string, sum(count): long]

      +-----------------+----------+
      |DEST_COUNTRY_NAME|sum(count)|
      +-----------------+----------+
      |    United States|       100|
      +-----------------+----------+

  • @shrikantpandey6401
    @shrikantpandey6401 a year ago

    Could you provide the notebook link? It would be good for hands-on practice.

    • @manish_kumar_1
      @manish_kumar_1  a year ago +2

      I don't provide notebooks or PDFs. Take notes and type every line of code by yourself; this will give you confidence.

    • @sankuM
      @sankuM a year ago

      @@manish_kumar_1 This is indeed a really great point! However, if possible, do share your own reference material for our benefit! Thanks! This series is really helpful. I have 4+ years of experience in DE but never tried to go into Spark internals; now, while interviewing for a switch, I'm definitely going to utilize all this! Keep 'em coming!! 🙌🏻👏🏻

  • @3mixmusic564
    @3mixmusic564 11 months ago

    Guru, the i-button never showed up, neither here nor there 😂😂😂

  • @akhilgupta2460
    @akhilgupta2460 a year ago

    Hi Manish bhai, could you provide the flight data file?

    • @manish_kumar_1
      @manish_kumar_1  a year ago

      I explained this in one of the videos. Please follow all the videos in sequence.

  • @Tushar0797
    @Tushar0797 a year ago

    Bhai, please clear up the doubt about how that extra job got created.

    • @manish_kumar_1
      @manish_kumar_1  a year ago

      Go to the SQL tab and check how many jobs were skipped. And share your code and a screenshot of the SQL tab with me on LinkedIn or Instagram.

    • @rohitgade2382
      @rohitgade2382 a year ago

      @@manish_kumar_1 Dude, he's talking about your own video 😂

  • @shivakrishna1743
    @shivakrishna1743 a year ago

    Where can I get the flight_data.csv file? Please help.

  • @Tanc369
    @Tanc369 5 months ago

    Where do I get the CSV, sir?

    • @manish_kumar_1
      @manish_kumar_1  5 months ago

      There are 2 playlists; watch them in parallel. In the practical one you will find the data in the description; copy it and save it as a CSV.

  • @PARESH_RANJAN_ROUT
    @PARESH_RANJAN_ROUT 5 days ago

    If you can do it, Manish bhai, then I can do it too.

  • @aishwaryamane5732
    @aishwaryamane5732 5 months ago

    Hi sir, in which video of the series have you explained schema? @manish_kumar_1