I built data pipelines at Netflix that ran 2000 TBs per day, here’s what I learned about huge data!

  • Added on 23 March 2024
  • Check out my academy at www.DataExpert.io where you can learn all this in much more detail!
    Use code EARLYSUB30 at checkout to be one of my first 100 paid academy subscribers!
    #dataengineering
    #netflix
  • Science & Technology

Comments • 379

  • @sevrantw8931
    @sevrantw8931 4 months ago +3117

    I’m so glad I found this video, I was just sitting here with 60 million gigabytes and was figuring out what joins to use so this was perfect timing.

    • @aripapas1098
      @aripapas1098 4 months ago +11

      if all u registered was 60 mil gb & joins ur not flowing

    • @smackastan5697
      @smackastan5697 4 months ago +32

      You're kidding, but somehow I just started a data analysis project of two terabytes and this video shows up.

    • @hi-mn5rg
      @hi-mn5rg 4 months ago +13

      @@aripapas1098 if you think comments must indicate a user registered every aspect of a video, ur not following

    • @derickd6150
      @derickd6150 4 months ago +2

      @@aripapas1098 this is a sad comment

    • @00Tenrai00
      @00Tenrai00 4 months ago

      Sarcasm ???? 😂

  • @bilbobeutlin3405
    @bilbobeutlin3405 4 months ago +2288

    Can't wait to build hyperscale pipelines for my startup with 0 users

    • @92kosta
      @92kosta 4 months ago +65

      But it sounds powerful when you say it, like you mean business.

    • @npc-drew
      @npc-drew 4 months ago +6

      Based

    • @vikingthedude
      @vikingthedude 4 months ago +6

      1 user (me)

    • @JGComments
      @JGComments 4 months ago +12

      If you build it, they will come.

    • @abhilashpatel6852
      @abhilashpatel6852 4 months ago +1

      I have 1k TB of data just sitting around in my backyard. Glad your video came up to get me started on at least something.

  • @subhasishsarkar5106
    @subhasishsarkar5106 4 months ago +412

    What I absolutely love about your videos is that, for a beginner in the data engineering field like me, you often talk about things that I had no conception of. In this video, for example: I had never heard of SMBs or broadcast joins. This gives me an opportunity to learn these things, even just from hearing them mentioned by someone as widely experienced as you.
    You don't even necessarily have to go into detail; these short-form videos act as beacons of knowledge that I can throw myself into learning about.
    Thanks a lot, and keep these coming, Zach!

    • @EcZachly_
      @EcZachly_ 4 months ago +71

      Really appreciate this comment! It reminds me that the value I'm putting out there is important!

    • @vasudevreddy3527
      @vasudevreddy3527 4 months ago +2

      @@EcZachly_ ✌

    • @eric.batdorff
      @eric.batdorff 4 months ago +10

      Great summation! I was thinking the exact same thing while watching. It's nice hearing even the specialized lingo from technical experts in their fields; it piques my curiosity.

    • @MrAmitkr007
      @MrAmitkr007 4 months ago

      @@EcZachly_ thanks

    • @prawtism
      @prawtism 4 months ago +2

      @@EcZachly_ did you already know the importance of these two before Netflix, or did you learn that while working at Netflix?

  • @supercompooper
    @supercompooper 4 months ago +657

    In the future a wrist watch will have a little blinking light that will have 60 million gigabytes of data in it

    • @dhillaz
      @dhillaz 4 months ago +93

      You mean an Electron app?

    • @aripapas1098
      @aripapas1098 4 months ago

      yeah okay crack smoker

    • @mrevilducky
      @mrevilducky 4 months ago +38

      And it will still lag and hit 99% singularities

    • @Ivan-Bagrintsev
      @Ivan-Bagrintsev 4 months ago +12

      @@dhillaz that will just show current time

    • @supercompooper
      @supercompooper 3 months ago +9

      @@Ivan-Bagrintsev Yes it will show the time, but with full DRM. Unless you have a license to view certain minutes it will be denied.

  • @lucas.p.f
    @lucas.p.f 4 months ago +517

    Boyfriend simulator: you sit with your bf and he starts talking about this nerdy stuff you have no idea about but need to keep listening because you love him

    • @EcZachly_
      @EcZachly_ 4 months ago +44

      This is exactly correct

    • @CU.SpaceCowboy
      @CU.SpaceCowboy 4 months ago +10

      aww 🥰

    • @heykike
      @heykike 3 months ago

      After marriage they no longer pretend to listen

    • @rajns8643
      @rajns8643 2 months ago +2

      If only a girl would fall for me when I speak nerdy stuff 🫠

    • @lucas.p.f
      @lucas.p.f 2 months ago +1

      @@rajns8643 are you kidding me? This is what most people like the most! Intelligent people are extremely attractive

  • @Bostonaholic
    @Bostonaholic 4 months ago +45

    I love that you kept it short and to the point.

    • @tobiastho9639
      @tobiastho9639 4 months ago +2

      He sure wanted to save some data… 😅

  • @supafiyalaito
    @supafiyalaito 4 months ago +83

    Thanks Zach, hopefully one day I will understand what all of that means

    • @mu3076
      @mu3076 12 days ago

      😂😂😂, I’m starting now

  • @WM-eg4gh
    @WM-eg4gh 4 months ago +5

    Thank you Zach for taking the time to give us the hard truth and hand down your experience. It helps a lot of enthusiastic students/people to know how we can in some way support or help others in the subjects we like. I don't imagine myself processing 2000 TB per day, but it helps give a bigger picture. Once again, I appreciate the short video, and thank you for sharing.

  • @JGComments
    @JGComments 4 months ago +13

    2 pita bites a day, the same as me when I’m on a diet.😊

  • @RichardOles
    @RichardOles 4 months ago +61

    Holy crap. I’m currently learning about data science, the various roles, etc. -with the hope of one day switching careers. But the current state of learning is all about the languages and software used etc, not about the infrastructure and what to do with massive datasets. So this just 🤯

    • @samuelisaacs7557
      @samuelisaacs7557 3 months ago +2

      It's really about math, but no one talks about it. Get at least one year of university-level math comprehension and then get into Python and the tech tools. The most competent and successful data engineers are always people with a good STEM background. For example, Zach has a Bachelor's Degree in Applied Mathematics and a Bachelor's Degree in Computer Science, so he is a heavy numbers guy. That's what most Data Science / Engineering YouTubers don't tell their viewers, because it would cause them to lose viewers.

    • @byRoyalty
      @byRoyalty 2 months ago +1

      learning the tools can be very different from solving real world problems.

    • @rajns8643
      @rajns8643 2 months ago

      @@samuelisaacs7557 True asf

    • @stevess7777
      @stevess7777 2 months ago

      @@samuelisaacs7557 Yep, even a business administration bachelor's will have a lot of maths, and it's nowhere near data science, which is 3x that.

  • @rembautimes8808
    @rembautimes8808 4 months ago +74

    Great content, an honour to be able to listen to someone who has handled that volume of data.

    • @deleater
      @deleater 4 months ago +1

      literally 🎉

    • @codecaine
      @codecaine 4 months ago

      Have ChatGPT or some other LLM explain it to you.

  • @mohammedaamer4201
    @mohammedaamer4201 4 months ago +4

    Just started following you. Really appreciate you for sharing your knowledge with the community.

  • @oakleyorbit
    @oakleyorbit a month ago +1

    Half of what you said, I had no idea what you were talking about, but I was very engaged and now I’m gonna look all this stuff up for centering my div!

  • @Adhanks91
    @Adhanks91 4 months ago +4

    Informative and straight to the point, great stuff as usual

  • @JT-zb6vi
    @JT-zb6vi 4 months ago +2

    instant subscribe - really appreciate the concise explanation and clear examples

  • @tanujkhochare3498
    @tanujkhochare3498 4 months ago +5

    Hey Zach, your content is consistently amazing! As a newcomer to the field, I'm considering diving into data engineering. What roadmap would you recommend, and are there any certifications that could enhance my journey? I already have a solid grasp of Python and SQL in data analysis.

  • @LambOverSpicyRice
    @LambOverSpicyRice 4 months ago +3

    Excellent video, thanks Zach!

  • @rohanbhakat2922
    @rohanbhakat2922 4 months ago +8

    Thanks for the info, Zach. Could you please make a detailed video on the SMB join?

  • @jacobp8294
    @jacobp8294 4 months ago +11

    I am a regional IT installer who runs Cat6 Ethernet pipelines for managing 1gb loads on HP laptops, this video is really awesome and breaks down your workflow and mindset in a complicated field really efficiently. I would love to get more short videos about the industry like this.

    • @EcZachly_
      @EcZachly_ 4 months ago +2

      I'll keep them coming. I make much more on Tiktok and Instagram since I like making vertical content!

    • @jacobp8294
      @jacobp8294 4 months ago

      @@EcZachly_ I'll check it out! Keep it up!

  • @arbol41
    @arbol41 3 months ago +2

    Thanks Zach, but I have a question: a broadcast join is used when a small dimension table is joined with a big table. Is that your case? Or did you use a hash join with two large tables?

  • @SahilKashyap64
    @SahilKashyap64 4 months ago +4

    I've never heard of these terms. Thank you for sharing your real-case scenarios (the FB notification example)

  • @dazzassti
    @dazzassti 4 months ago +19

    In the 37 years I’ve been working in data, I’ve never heard anyone call it Peter 😂. PETA

  • @sharpsrain8302
    @sharpsrain8302 4 months ago +2

    I just found ur stuff but thanks for the content mang keep it up 🙏

  • @souravghosh358
    @souravghosh358 4 months ago +3

    Very important concept in such short time.. thank u so very much ❤

  • @ngneerin
    @ngneerin 4 months ago +2

    Thanks, looking forward to more such content

  • @nikolagrkovic8769
    @nikolagrkovic8769 4 months ago +2

    The amount of knowledge you shared here is astonishing

  • @solitary200
    @solitary200 a month ago

    Great points to remember!
    There are a lot more underlying abstraction layers you can add at these different points to further optimize the second network hop. Caching is a simple one.
    Can you implement an efficient snapshot system with delta encoding of entities and compress the message? Would be a cool video for you to implement!

  • @JimRohn-u8c
    @JimRohn-u8c 3 months ago +2

    Did Facebook use Databricks or did they have HPC Clusters for you to run Spark on?

  • @dungenwalkerr619
    @dungenwalkerr619 2 months ago +1

    Thanks for sharing, now I can finally put some good numbers on my resume 🎉

  • @ArjunRajaS
    @ArjunRajaS 3 months ago +2

    If you come across a scenario where you need to join 2 large datasets, you could do an iterative broadcast join. Basically, you break one of the DataFrames into multiple smaller DataFrames and join them in a loop until all of the pieces are joined.

    • @jordanmessec5332
      @jordanmessec5332 3 months ago

      You’ll require a lot of memory and have long start times, no?
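
The iterative broadcast join described in this thread can be sketched in plain Python. This is only an illustration under stated assumptions, not Spark's API: `chunked`, `iterative_broadcast_join`, the `id` key and the sample rows are all made up for the example. Each chunk of the smaller side plays the role of one "broadcast" table.

```python
def chunked(rows, size):
    """Yield consecutive slices of `rows`, each small enough to 'broadcast'."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def iterative_broadcast_join(large, medium, key, chunk_size):
    """Inner-join `large` with `medium` one chunk of `medium` at a time."""
    out = []
    for chunk in chunked(medium, chunk_size):
        # Build an in-memory lookup for this chunk only.
        lookup = {}
        for row in chunk:
            lookup.setdefault(row[key], []).append(row)
        # Every pass rescans the large side: that is the cost of this approach.
        for left in large:
            for right in lookup.get(left[key], []):
                out.append({**right, **left})
    return out


# Hypothetical sample data.
events = [
    {"id": 1, "event": "play"},
    {"id": 2, "event": "pause"},
    {"id": 1, "event": "stop"},
]
plans = [
    {"id": 1, "plan": "basic"},
    {"id": 2, "plan": "premium"},
    {"id": 3, "plan": "basic"},
]
joined_iter = iterative_broadcast_join(events, plans, key="id", chunk_size=2)
```

Memory per pass is bounded by `chunk_size`, but the large side is read once per chunk, so total work grows with the number of chunks. That trade-off is what the reply above is asking about.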

  • @Llanowyn
    @Llanowyn 4 months ago +2

    I would be interested in the architecture and content delivery for pre and post cdn from a network design perspective. Are there any examples or presentations regarding networking at netflix?

  • @ChrisMPerry
    @ChrisMPerry 4 months ago +4

    Insightful as always.💯

  • @john_paul
    @john_paul 3 months ago +1

    I love how you acronym Sorted Bucket Merge as SMB. Think you may have had Super Mario Bros on the mind 😂

  • @Jc12x06
    @Jc12x06 4 months ago +12

    Dude has beef with Bezos😂

  • @RyanSaplanPT
    @RyanSaplanPT 4 months ago +1

    Please more data stuff!!! I hardly understood what you said, but it sounds interesting

  • @ChuckNorris-lf6vo
    @ChuckNorris-lf6vo 4 months ago +1

    Hi, what about replacing torrents with IPFS? That's data pipelining, right ?

  • @theAnupamAnandoriginal
    @theAnupamAnandoriginal 4 months ago +1

    you can make a BIOS optimized for throughput and without interrupts, to speed things up 67x and more

  • @theactualslimshady
    @theactualslimshady 4 months ago +1

    Please keep up the great content!

  • @vikrampandit2174
    @vikrampandit2174 4 months ago +2

    Never thought broadcast join is a Netflix saviour

  • @remo
    @remo 2 months ago +2

    Damn I just wanted to shuffle like there’s no tomorrow and then I found this video.

  • @revel-88
    @revel-88 18 days ago +1

    Subscribing just for the britto. One of my favourite hoods

  • @seegreen6484
    @seegreen6484 3 months ago +1

    I love that I’m only a software engineer but I can understand all of this

  • @iloos7457
    @iloos7457 4 months ago +1

    Hey, are you familiar with Cosmos DB from Azure? It's a DB like Mongo but claims to be able to scale infinitely... What are your thoughts on that?

  • @earthling_parth
    @earthling_parth 4 months ago +6

    Imma wait for Primeagen to confirm this as well when he reacts to this video inevitably 😁

  • @uwize5897
    @uwize5897 3 months ago

    optimizing selling personal data to minimize cost is something i never thought about

  • @explosivecl
    @explosivecl 4 months ago +1

    Thanks for the video

  • @TLOGhx
    @TLOGhx 4 months ago +1

    Insanely valuable content

  • @schwarzie2478
    @schwarzie2478 4 months ago +1

    I just felt like drinking from the fountain of knowledge and instantly drowning. Definitely haven't had to deal with these kinds of volumes yet...

  • @emerald42481
    @emerald42481 2 months ago +1

    Very useful and interesting, even to a layman

  • @yippykayyay
    @yippykayyay 14 days ago +1

    No idea what this guy is talking about, but thankful YouTube sent me this

  • @SamCyanide
    @SamCyanide 4 months ago +2

    My medical science clients called, they need an 800tb imaging data set parsed by end of day (thank you kubernetes)

  • @MFsyrup
    @MFsyrup 3 months ago

    Thank you Tony Hawk, very cool!

  • @TheDa6781
    @TheDa6781 2 months ago +1

    Managing retention, storage and flow is always important. I'm sitting on a toilet as I'm writing this.

  • @aamadmi5848
    @aamadmi5848 4 months ago +1

    Thanks Zach for the video

  • @user-op5vc9qw6o
    @user-op5vc9qw6o 4 months ago +1

    That's cool bro. Will it fix the Netflix app where it shows the title of one show but the preview and description of another?

    • @EcZachly_
      @EcZachly_ 4 months ago

      It was to look at network traffic to keep your credit card data secure

  • @hearhaw
    @hearhaw a month ago +1

    I'd like to learn more about these pitabytes. What are they? What do they taste like?

  • @narbwow8168
    @narbwow8168 4 months ago +1

    Pretty interesting, even though I had no idea about most of what he was talking about.

  • @internetcancer1672
    @internetcancer1672 4 months ago +4

    My problem is how do people even find out about the careers that they go into?

  • @dark_lord98
    @dark_lord98 4 months ago +3

    Are those joins available in MySQL, or are they specific to the DBMS you worked with at Meta?

    • @juanbrekesgregoris4405
      @juanbrekesgregoris4405 4 months ago +1

      I think they're not available on MySQL because it's an OLTP database. Those joins are used for analytics

    • @jordanmessec5332
      @jordanmessec5332 3 months ago

      These are not database joins, they are processing joins. Frameworks such as Flink and Spark would leverage broadcasts.
      It basically boils down to a single coordinator instance that publishes a small, often changing dataset to all parallel processors. Usually used to enrich, prune, or map the main dataset.
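
A minimal sketch of the broadcast idea from the reply above, in plain Python rather than any real framework's API. The names `broadcast_join`, `country_id` and the sample rows are assumptions made up for illustration. The small table is turned into a lookup once, and every partition of the big dataset probes its own copy, so the big side never has to be shuffled.

```python
def broadcast_join(fact_partitions, small_dim, key="country_id"):
    """Inner-join every partition of a large dataset against one small table."""
    # Built once; a real engine would ship a copy of this to every worker.
    lookup = {row[key]: row for row in small_dim}
    joined = []
    for partition in fact_partitions:  # each partition would run on its own worker
        for event in partition:
            dim = lookup.get(event[key])
            if dim is not None:  # inner join: drop events with no match
                joined.append({**event, "country": dim["name"]})
    return joined


# Hypothetical sample data: one small dimension table, two partitions of events.
dims = [{"country_id": 1, "name": "US"}, {"country_id": 2, "name": "CZ"}]
partitions = [
    [{"user": "a", "country_id": 1}, {"user": "b", "country_id": 3}],
    [{"user": "c", "country_id": 2}],
]
result = broadcast_join(partitions, dims)
```

This only works while the small side fits in each worker's memory, which is why it is used to enrich, prune, or map the main dataset rather than to join two large ones.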

  • @TheInterestingInformer
    @TheInterestingInformer 4 months ago +2

    I’m trying to get into data analytics and most of this went over my head but this still sounds lit 🔥

  • @IAmAlpharius14
    @IAmAlpharius14 2 months ago +4

    Sir this is a Wendy's.

  • @orppranator5230
    @orppranator5230 4 months ago +1

    Bro can figure out how to send my entire homework folder in 1/500th of a second but can’t flip the camera sideways

  • @_sonicfive
    @_sonicfive 3 months ago +1

    Whenever I hold on to more than 60 petabytes I just call the assistant to the regional manager and he runs a fix from his mainframe.

  • @GameCyborgCh
    @GameCyborgCh 2 months ago +1

    gotta love a good pita byte

  • @cry2love
    @cry2love 4 months ago +2

    I still bite my gigas when my man hustling meta in peta

  • @rashshawn779
    @rashshawn779 4 months ago +1

    Very nice. Short and sweet.

  • @ATX_Engineer
    @ATX_Engineer a month ago +1

    Ah yes, data structures and sorting… but with the “can you even scale bro” tick enabled.

  • @Manhunternew
    @Manhunternew 4 months ago +1

    How do you deal with log data

  • @3dilson
    @3dilson 4 months ago +1

    "FNA developer"
    I'm sorry, my brain couldn't let go of it

  • @LucTaylor
    @LucTaylor 3 months ago +1

    I might get 5 users on my site this month so this will come in handy

  • @xasm83
    @xasm83 3 months ago +1

    my data pipeline usually processes one pitabyte every other day and one shawarmabyte every week

  • @chrism3790
    @chrism3790 2 months ago +1

    What engine were you using to do these massive joins? Spark?

  • @49erman2
    @49erman2 4 months ago +1

    Quality content!

  • @Dmytro-kt3fr
    @Dmytro-kt3fr 2 months ago +1

    Would you say that using bucketing, and basically constraining against “acceptable” throughput as well as risking creating a gazillion files in the process, is a more acceptable approach than more ad hoc ones like Z-ordering and Bloom filters?

  • @Hishamhh93
    @Hishamhh93 2 months ago +1

    Bro is the PewDiePie of data Engineering

  • @chrishabgood8900
    @chrishabgood8900 4 months ago +2

    Is this only available with Spark SQL?

    • @jordanmessec5332
      @jordanmessec5332 3 months ago

      No, broadcasts can be leveraged in any processing framework that leverages two sets of processing logic. Your highly parallelized logic as well as a commonly single process. The single process “broadcasts” data for all of the parallel instances. It can be implemented other ways but that is the most common.

  • @liamvstech
    @liamvstech 4 months ago +1

    When I was hired to do data engineering, it was always data that could fit on a single hard drive and it was boring af. I hated it. This sounds way more challenging and interesting.

  • @picdu2891
    @picdu2891 2 months ago +1

    I love technology and I know more than your average user, yet I have no IT qualifications and I am light years away from this knowledge, but for some reason, I love watching these videos as if I was ever going to use the information 😂

  • @OurNewestMember
    @OurNewestMember 2 months ago +1

    Interesting! I would have thought something like sharding (or partitioning and clustering) so data processing and access can scale horizontally.

    • @EcZachly_
      @EcZachly_ 2 months ago

      Bucketing and clustering are similar

  • @theAnupamAnandoriginal
    @theAnupamAnandoriginal 4 months ago +1

    : multiple streams across entire ddrs directly accessible

  • @twitchizle
    @twitchizle 18 days ago +1

    I really wonder how netflix achieves 100tb/hr just with only streaming videos.

  • @bacfjib9874
    @bacfjib9874 4 months ago +1

    Very informative. I want to ask you: which certification can help me as a fresh graduate? Is the AWS data engineer certification worth it or not? And thanks a lot, Zach

  • @bandanaboii3136
    @bandanaboii3136 3 months ago

    Interviewer: name 5 data types
    Me:

  • @sneakybutpirate
    @sneakybutpirate 4 months ago

    Oh yeah that’s really great and insightful, now what’s a join?

  • @GeneralKenobi69420
    @GeneralKenobi69420 4 months ago +2

    The Venn diagram of people who use TikTok and data scientists is two circles my dude lol

    • @EcZachly_
      @EcZachly_ 4 months ago +1

      I have 66k followers on TikTok and this video did 375k views there.

  • @tschaderdstrom2145
    @tschaderdstrom2145 4 months ago +2

    I love pita bites as much as the next guy, but I don't think I can take more than 35 before I'm full

  • @user-to4md9xm2d
    @user-to4md9xm2d 4 months ago +1

    Hey, absolutely curious about the content you are doing.
    In my company we are working with dbt and Snowflake. I can't find a way to work with broadcast joins there. Do you see a possibility to replicate this process?

    • @EcZachly_
      @EcZachly_ 4 months ago

      Snowflake isn’t suitable for volumes >100 TB in my opinion.
      Clustering is an option in Snowflake that helps, though

  • @ungeschaut
    @ungeschaut 4 months ago +1

    I use just a database with just value as field (long string) and nothing else

  • @ankandatta4352
    @ankandatta4352 3 months ago

    Bucketing is a one-time process. But what if new data comes in every day?
    For example, if our bucketing takes, say, 2 hr per day for 10 GB of data (right table), and this grows by 10 GB every day, don't you think it'll take more and more time as more data accumulates?

    • @EcZachly_
      @EcZachly_ 3 months ago +1

      You have to partition your data. Unless your data is genuinely doubling every day (which I doubt it is)

    • @EcZachly_
      @EcZachly_ 3 months ago +1

      The bucket joins should only be between events for that day and dimensions for that day. Not all the data going back

    • @EcZachly_
      @EcZachly_ 3 months ago +1

      As the business grows, this can still get bigger because 10 GB/day might become 50 after some time and you need to account for that
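
The replies above (partition by day, then bucket-join only that day's events with that day's dimensions) can be sketched roughly as follows. This is a plain-Python illustration, not how any particular engine implements it. `NUM_BUCKETS`, `bucketize`, `merge_join`, `join_one_day`, the `user_id` key and the sample rows are all hypothetical. In a real pipeline the bucketing and sorting would happen once at write time; here it is done inline to keep the example self-contained.

```python
from collections import defaultdict

NUM_BUCKETS = 4  # assumed bucket count; both sides must use the same value


def bucketize(rows, key):
    """Assign rows to buckets by key, then sort each bucket by key."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row[key] % NUM_BUCKETS].append(row)
    for bucket in buckets.values():
        bucket.sort(key=lambda r: r[key])
    return buckets


def merge_join(left, right, key):
    """Inner-join two lists that are already sorted by `key`."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of right rows sharing this key.
            j_end = j
            while j_end < len(right) and right[j_end][key] == lk:
                j_end += 1
            # Pair every left row with that key against the run.
            while i < len(left) and left[i][key] == lk:
                for r in right[j:j_end]:
                    out.append({**r, **left[i]})
                i += 1
            j = j_end
    return out


def join_one_day(events_by_day, dims_by_day, day, key="user_id"):
    """Join only the requested day's partition, bucket by bucket."""
    ev = bucketize(events_by_day.get(day, []), key)
    dm = bucketize(dims_by_day.get(day, []), key)
    out = []
    for b in range(NUM_BUCKETS):
        # Matching keys always land in the same bucket number on both sides.
        out.extend(merge_join(ev.get(b, []), dm.get(b, []), key))
    return out


# Hypothetical sample data: older partitions are never touched.
events_by_day = {
    "2024-03-22": [{"user_id": 1, "event": "old"}],
    "2024-03-23": [
        {"user_id": 5, "event": "play"},
        {"user_id": 2, "event": "pause"},
        {"user_id": 5, "event": "stop"},
    ],
}
dims_by_day = {
    "2024-03-23": [
        {"user_id": 2, "country": "CZ"},
        {"user_id": 5, "country": "US"},
        {"user_id": 7, "country": "DE"},
    ],
}
joined_day = join_one_day(events_by_day, dims_by_day, "2024-03-23")
```

Because each run only reads one day's partition, the cost tracks the daily volume rather than the full history. It still grows if the daily volume itself grows, as the last reply points out.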

  • @YishuaiLiu
    @YishuaiLiu 4 months ago +3

    Short and informative

    • @EcZachly_
      @EcZachly_ 4 months ago

      Thank you! What other videos would you like to see from me?

  • @elferpe27
    @elferpe27 3 days ago

    Wow, didn't know Owen Wilson was working on data

  • @RajveerSingh-vf7pr
    @RajveerSingh-vf7pr 5 days ago

    Wow, if I knew all this, it's pretty amazing content...
    If only...

  • @awesomebears
    @awesomebears 2 months ago +1

    Wait, i have 200TB/hr what do I do? Please help!

  • @aarjunpp
    @aarjunpp 3 months ago +2

    1. Are you a data engineer?
    2. What tech is this? AWS, Snowflake?

  • @dexnow
    @dexnow 2 months ago +1

    I suddenly feel like pita bread...

  • @TheGoodContent37
    @TheGoodContent37 2 months ago +1

    Love the way you tried to make it sound more complicated than it actually is and failed.

  • @manh9105
    @manh9105 4 months ago +1

    ok, so how to do that ...can you make a screencast and show us how to do it!

  • @mohdmuneeb4851
    @mohdmuneeb4851 3 months ago

    I am a senior-year software engineering intern. I didn't understand anything you said except "joins". Not even the variants. Where can I learn things like that, please?

  • @mikishwagg
    @mikishwagg 4 months ago +2

    Me watching this, not knowing anything he's talking about, makes me feel like starting a big tech company 😀

  • @brandonheaton6197
    @brandonheaton6197 2 months ago +1

    He is channeling a young William Benney over here isn't he