How to avoid a single point of failure in distributed systems ✅

  • Added on 9 July 2024
  • A single point of failure (SPOF) in computing is a critical component whose failure can take down the entire system. A lot of time and resources are spent on removing single points of failure from an architecture or design.
    Single points of failure often pop up when setting up coordinators and proxies. These services distribute load and discover services as they join and leave the system. Because they perform critical, centralized tasks, they are more prone to becoming SPOFs.
    One way to mitigate the problem is to run multiple instances of every component in the service. The dependency graph then becomes more flexible, allowing the system to switch to another instance instead of failing requests.
    Another approach is to keep backups that allow a quick switchover on failure. Backups are especially useful in components that deal with data, like databases.
    Allocating more resources, distributing the system, and replicating data are some ways of mitigating SPOFs. Hence designs include horizontal scaling capabilities and partitioning.
    It is important to note that the CAP theorem does not allow removing SPOFs if perfect consistency is required: during a network partition, a distributed system has to give up either consistency or availability.
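
The idea of switching to another instance instead of failing a request can be sketched roughly as follows. This is a minimal illustration, not the design from the video: `call_with_failover`, `AllInstancesFailed`, and the two stand-in instances are hypothetical names, and a real instance would be a network call rather than a local function.

```python
from typing import Callable, Sequence


class AllInstancesFailed(Exception):
    """Raised when every redundant instance rejected the request."""


def call_with_failover(instances: Sequence[Callable[[str], str]], request: str) -> str:
    """Try each redundant instance in order and return the first successful reply."""
    errors = []
    for instance in instances:
        try:
            return instance(request)
        except ConnectionError as exc:
            # This instance is down; remember why and switch to the next one.
            errors.append(exc)
    raise AllInstancesFailed(f"{len(errors)} instance(s) failed for {request!r}")


# Hypothetical stand-ins for two copies of the same service.
def broken_instance(request: str) -> str:
    raise ConnectionError("instance unreachable")


def healthy_instance(request: str) -> str:
    return f"handled {request}"


reply = call_with_failover([broken_instance, healthy_instance], "GET /profile")
```

With a single instance in the list, the first `ConnectionError` ends the request, which is exactly the single point of failure the description warns about.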

Comments • 124

  • @semihkekul 5 years ago +76

    Elon Musk has already seen the Earth as a single point of failure and has been trying to create a slave on Mars. In that case, maybe the Moon will be a load balancer.

    • @gkcs 5 years ago +8

      Lol. Maybe that will be true one day!

    • @Shankusu993 2 years ago

      Parents send a request for a child to the Moon, which redirects it to either Mars or Earth. A baby pops up there and is returned to the parents, who are strolling through space in a spaceship. Sounds good XD

    • @DesiTennis 2 years ago

      😂

  • @WittyGeek 6 years ago +5

    The Netflix example is a good one. I saw their PyCon 2018 talk and they showed how they do regional failovers in under 7 minutes. It was a good talk.

  • @Nithin_Coorg 2 years ago

    So crisp as always!

  • @fchas15 4 years ago +5

    Your positive energy makes me feel good! I feel like even I can get through an interview after watching you! Excellent!

    • @gkcs 4 years ago

      Fantastic!

  • @manojmj5479 6 years ago +2

    Very informative! Please keep making more videos!

  • @sumitlahiri209 6 years ago +1

    Informative video, especially the meteorite scene. Awesome!!

  • @anastasianaumko923 1 year ago

    Thank you, very clear!

  • @songs4enjoy 5 years ago +2

    A few more observations:
    All your examples use reverse proxies to achieve HA (except the browser, though that is not explicitly mentioned), but there are other techniques:
    1. Client-side load balancing: using a service registry (a somewhat smarter DNS) and smart clients to achieve HA. In your example the browser can be considered a smart client, but far more of this happens in server-to-server communication to achieve HA in request-response flows.
    2. Also, your LB in a single zone is usually kept in an HA configuration using something like keepalived and a floating IP.
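
The client-side load balancing described in point 1 could look roughly like the sketch below. Everything here is an assumption for illustration: `ServiceRegistry` and `SmartClient` are hypothetical in-memory classes, the addresses are made up, and a production registry would also need health checks and network access.

```python
import itertools
from typing import Dict, List


class ServiceRegistry:
    """Hypothetical in-memory registry mapping a service name to live instances."""

    def __init__(self) -> None:
        self._instances: Dict[str, List[str]] = {}

    def register(self, service: str, address: str) -> None:
        self._instances.setdefault(service, []).append(address)

    def deregister(self, service: str, address: str) -> None:
        self._instances[service].remove(address)

    def lookup(self, service: str) -> List[str]:
        return list(self._instances.get(service, []))


class SmartClient:
    """Picks an instance itself (round robin) instead of going through a proxy."""

    def __init__(self, registry: ServiceRegistry) -> None:
        self._registry = registry
        self._counter = itertools.count()

    def pick(self, service: str) -> str:
        instances = self._registry.lookup(service)
        if not instances:
            raise LookupError(f"no live instance of {service}")
        # Rotate through whatever the registry currently reports as live.
        return instances[next(self._counter) % len(instances)]


registry = ServiceRegistry()
registry.register("profile", "10.0.0.1:8080")
registry.register("profile", "10.0.0.2:8080")
client = SmartClient(registry)
picks = [client.pick("profile") for _ in range(3)]

# Once a failed instance is deregistered, the client stops choosing it.
registry.deregister("profile", "10.0.0.1:8080")
after_failure = client.pick("profile")
```

Because the client chooses the instance itself, there is no proxy in the request path that could become a single point of failure, although the registry then needs to be redundant too.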

  • @kumarakantirava429 4 years ago +1

    Prof. Gaurav Sen,
    Learning a lot, sir. Thank you.

  • @abhikeshu 5 years ago

    Hey Gaurav, a big thanks for the effort you are putting in. I really learnt many things that were myths to me till now. You have explained the concepts in very simple terms. Keep up the good work.

  • @karandutt4534 3 years ago +1

    Hello Gaurav,
    Thanks a million for sharing your knowledge and helping us.
    Keeping your examples and explanations as simple as possible makes you stand out.
    Please do add such small topics; they are definitely useful and otherwise go unnoticed in a larger video/topic.

  • @thisisfunc4529 6 years ago

    Thanks, your video is awesome

  • @prashantsrivastava9550 3 years ago

    Very well explained, and great positive energy with a great smile :) ... Thanks, buddy

  • @geekengr 5 years ago

    Nice work!

  • @suraj-gd9qy 2 years ago

    Nice bro, I had to use this in recent development and I understood the concept... Thanks:)

  • @tusharverma03 2 years ago

    Thank you Gaurav, we need to know these small details about each part.
    Your videos are amazing. Keep making videos on large systems, and whenever you come up with a subtopic, you can link it in the description so that one can master that topic before moving ahead.
    Thanks a lot

  • @sachinakinapally5061 4 years ago +1

    Man, you literally helped me finish my assignment! Learned a lot. Great content. Thanks!

  • @sankalparora9374 1 year ago

    Thanks for the video!

  • @influencer737 5 years ago

    Bro, all the very best in your new role at Uber, wishing you all success

  • @sheshitkarthikeya1528 5 years ago

    Awesome!!

    • @gkcs 5 years ago

      Thank you!

  • @PankajKumarSingla 5 years ago

    Thanks Gaurav for this video. Please add more videos on server failure and on how to divide by region or master-slave.

  • @lien3723 5 years ago +1

    Could anyone please explain what a profile server is? Or what profiling means in system design? Are these two different things? I have heard these terms in multiple contexts but don't quite understand what they mean. Thank you!

  • @raj_kundalia 1 year ago

    thanks

  • @harisridhar1668 3 years ago +2

    3:35 Gaurav, we don't have to worry about the Domain Name System (DNS) being an articulation point / SPOF (single point of failure) since DNS is already a decentralized distributed system, correct? In a sense, we are already taking advantage of an existing scalable and resilient network architecture, correct?

  • @vijaykidecha7491 5 years ago +1

    Hi Gaurav, thanks for the video. All the videos on system design are really informative. Can you please make videos on ESB and messaging queues like the one you made on load balancing?

    • @gkcs 5 years ago +1

      I have one on messaging queues in the playlist. I'll check out what ESB is 🙂

  • @kinjalthehero 4 years ago

    What is a profile server? Thank you for the informative videos.

  • @saitejajonnadula 3 years ago

    Chaos engineering is applied to an application or node before it goes to the production phase: triggering controlled attacks while keeping the ability to roll back the attack and return to the original stable state.

    • @gkcs 3 years ago

      Netflix does this really well.
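
A controlled failure injection with rollback, as described in this thread, might be sketched like this. It is a toy model under stated assumptions: `Node`, `serve`, and `chaos_experiment` are hypothetical names, and the "attack" is just a flag on an in-memory object rather than a real process being killed.

```python
import random
from typing import List


class Node:
    """Hypothetical service node that can be switched off by the experiment."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.alive = True

    def handle(self, request: str) -> str:
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


def serve(nodes: List[Node], request: str) -> str:
    """Return the reply from the first node that is still alive."""
    for node in nodes:
        try:
            return node.handle(request)
        except ConnectionError:
            continue
    raise RuntimeError("total outage")


def chaos_experiment(nodes: List[Node], rng: random.Random) -> bool:
    """Kill one random node, check the service still answers, then roll back."""
    victim = rng.choice(nodes)
    victim.alive = False  # controlled failure injection
    try:
        serve(nodes, "health-check")
        survived = True
    except RuntimeError:
        survived = False
    finally:
        victim.alive = True  # roll back to the original stable state
    return survived
```

Running the experiment against a two-node cluster should report survival, while a one-node "cluster" exposes itself as a single point of failure.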

  • @mohanreddy4669 5 years ago

    Hey Gaurav,
    Can you please do a video on designing e-commerce like Amazon/Walmart?

  • @alirezamosavi6185 10 months ago

    Hi, I did not get why we need DNS above the load balancers. Why aren't the load balancers alone enough to distribute traffic to all nodes? Please explain.

  • @sugyansahu9120 6 years ago +1

    Well, this was informative. ☺ Waiting for your Tinder system. 😎

  • @biswajeetsethi7689 1 year ago

    Hi, is clock synchronization a must in every distributed system? Can I still call it a distributed system if the nodes work on different data sets at different locations independently, in order to make some business decision at the end?

  • @venkatreddy6851 6 years ago +32

    Hi Gaurav, thanks for the videos. I'm really enjoying them and learning a lot. I have a question: as you mentioned in the video, when a load balancer fails we overcome the problem by running multiple load balancers and keeping all of their IPs in the DNS. But how does the DNS know whether the first load balancer is working or not? DNS is simply a name-to-address resolver, and once that is done it is out of the picture. And where do we write the logic that says, if loadbalancer1 fails, contact loadbalancer2?

    • @gkcs 6 years ago +20

      Hey Venkat, DNS services are pretty smart these days. They send requests to an IP address and can redirect to the next IP in their list if they get the appropriate error code.
      503 means service unavailable, and seems like a good error to set.

    • @venkatreddy6851 6 years ago +1

      Thanks. Can you please make one on Uber?

    • @kartkat 5 years ago +1

      @venkatreddy6851 Your computer can do round robin (RR), or the DNS server may already have applied RR before sending you the response. More info can be found here: blogs.technet.microsoft.com/networking/2009/04/17/dns-round-robin-and-destination-ip-address-selection/. Also check the RFC that covers load balancing with DNS: www.faqs.org/rfcs/rfc1794.html

    • @trushapatel9012 4 years ago +1

      @gkcs You can also consider a timeout: if there is no ACK for a packet sent across the network within N time, the same packet is automatically sent to another IP address. That's called 3-way handshaking.
      That's where routing comes into play and a more structured networking architecture gets built. You can cover some of that in your System Design videos.

    • @lakshminarayanansairam2739 4 years ago

      +1
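
One way the failover discussed in this thread can work, offered here as an assumption rather than as what the video describes: the DNS name resolves to several load balancer addresses, and the client tries them in turn with a timeout. `resolve_all` and `connect_to_first_reachable` are hypothetical helpers; `socket.getaddrinfo` and `socket.create_connection` are the standard library calls they wrap.

```python
import socket
from typing import Any, Callable, Iterable, List, Tuple

Address = Tuple[str, int]


def resolve_all(hostname: str, port: int) -> List[Address]:
    """Return every IPv4 address the resolver reports for the hostname."""
    infos = socket.getaddrinfo(
        hostname, port, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr).
    return [info[4] for info in infos]


def connect_to_first_reachable(
    addresses: Iterable[Address],
    connect: Callable[[Address, float], Any] = socket.create_connection,
    timeout: float = 2.0,
) -> Any:
    """Try each address in order; a timeout or refusal moves on to the next."""
    last_error = None
    for address in addresses:
        try:
            return connect(address, timeout)
        except OSError as exc:
            last_error = exc  # this load balancer is unreachable
    raise ConnectionError("no load balancer reachable") from last_error
```

In real use this would be called as `connect_to_first_reachable(resolve_all("example.com", 443))`. The `connect` parameter exists only so that the retry logic can be exercised without a network.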

  • @ankitakashyap4289 5 years ago

    Hi Gaurav. When you say multiple databases for data replication in a master-slave pattern, do the multiple databases fall under the concept of sharding? Also, what is meant by cross-data replication and cross-region? Is it simply that replication happens across multiple regions?

  • @kumarakantirava429 4 years ago +1

    Sir,
    It's wonderful of you to pin that question with your insightful answer. I was struggling to understand how clusters can offer high availability for websites. Your DNS answer enlightened me on a lot of design aspects. Directly prostrating at your feet.

  • @nikhilsingh2233 5 years ago

    Hi Gaurav!
    Great videos. They show your passion for explaining these topics. Could you please tell me how the changes in the database of a particular node get mirrored to another node?

    • @gkcs 5 years ago +1

      Thank you!
      I'll be getting to this topic soon 😁

  • @ravindrababu4759 4 years ago +2

    Rephrase "more nodes" as "redundant nodes" when addressing single points of failure.

  • @GeorgeChi1 3 years ago

    p*p usually does not hold when you have a hot-data issue that simply migrates to the backup/replica and hoses that down as well, because the system was not redesigned in time for the scale it now has to support. In that case, the particular technology's ability to handle load becomes a single point of failure.

    • @gkcs 3 years ago

      Yes, this is a simplification. Communication lines and response times tend to suffer as the number of nodes increases.
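
The p*p figure in this thread is the chance that two replicas fail together if each fails independently with probability p. The sketch below shows that arithmetic next to a deliberately simple model of the correlated case raised above. The `shared` parameter (the probability that one common cause, such as a hot-data overload, takes every replica down at once) is an assumption made for illustration and does not come from the video.

```python
def probability_all_fail(p: float, replicas: int) -> float:
    """Chance that every replica is down at once, assuming independent failures."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    if replicas < 1:
        raise ValueError("need at least one replica")
    return p ** replicas


def probability_all_fail_correlated(p: float, replicas: int, shared: float) -> float:
    """Same question when a shared cause can take out all replicas together.

    With probability `shared` the common cause strikes and everything fails;
    otherwise the replicas fail independently as before.
    """
    if not 0.0 <= shared <= 1.0:
        raise ValueError("shared must be a probability")
    return shared + (1.0 - shared) * probability_all_fail(p, replicas)


independent = probability_all_fail(0.01, 2)
correlated = probability_all_fail_correlated(0.01, 2, 0.005)
```

Under these made-up numbers, the shared cause dominates the result, which is the commenter's point: adding replicas helps little when they all fail for the same reason.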

  • @abhishekbansal3425 5 years ago

    Hi Gaurav, in your explanation of mapping DNS to multiple IP addresses, how will DNS choose the IP address? Might it send all requests to one IP and overload it?

  • @ishasingh6726 4 months ago

    i found thissssssss vid bcoz of striverrrr a gem knows another gemmmm

  • @adilsaju 4 years ago

    Thank you Gaurav, I just realized that Moriarty in Sherlock Holmes was actually that Chernobyl officer 😂

    • @gkcs 4 years ago

      Hahaha!

  • @jiamingxing6333 4 years ago

    I am new to some of this technology. Can anyone explain to me what backup services are?

  • @anuraggharat5453 6 months ago

    So basically if you think something will fail, add a replica of it to handle failure. Then add a balancer to figure out which one to use from the original and copy. Then the balancer can fail too, so add another balancer to support the original balancer😭

  • @lakshminarayanansairam2739

    Which book do you read a lot? You have good potential in explaining. Even I know a few things, but presenting them in front of people and a camera is hard; minus 90% would be my outcome. I like your confidence.

    • @gkcs 4 years ago

      Designing Data Intensive Applications :)

  • @adityamudaliar1145 4 years ago

    Isn't finding an inconsistency tougher in these systems? If some bug occurs in between, what caused it?
    Which module caused it?
    Just asking because the financial exchange systems I worked on were more recovery-oriented and were built around a single point of failure, at least where things were majorly time-critical.
    Inconsistencies in data weren't tolerable at all.

  • @shivamvishwakarma1475 3 years ago +1

    Is DNS failure in this a Single point of failure?

  • @sonalinaresh7013 6 years ago

    Keep the videos going

  • @shivangidhakad9807 5 years ago +2

    It's funny how the 'single point of failure', aka our Iron Man, is wiped out now! Disasters happen...

    • @gkcs 5 years ago +1

      Hahaha, I loved Avengers too 😁

  • @adheipsingh1797 5 years ago

    What about edge topologies, where we basically have an API gateway to handle north-south traffic and spread it east-west using service meshes?

  • @bsummer 2 years ago

    AWS load balancers brought me here

  • @Joe99 5 years ago

    Does SPOF impact serverless platforms where VMs are dynamically provisioned for you? Or does it become a non-issue as a result?

    • @gkcs 5 years ago

      It's never a non-issue, only an insulated one. Eventually we have to rely on ourselves to ensure resiliency :)

  • @GerardBeaubrun 2 years ago

    I'm looking forward to the Tinder architecture

  • @thongtran1653 5 years ago

    Hi Gaurav. Can you make a video showing the master-slave architecture?

    • @gkcs 5 years ago

      I have. Check out the data replication video. czcams.com/video/GeGxgmPTe4c/video.html

  • @rusrushal13 6 years ago +1

    Hey Gaurav, in the description you said that "the CAP theorem does not allow removing SPOFs if perfect consistency is required". What does that mean? What do you mean by perfect consistency? Can't we achieve consistency if we are distributing and replicating our system?

    • @gkcs 6 years ago

      We can't if we want the data to be consistent. The distributed system is either available or consistent in case of a network partition.
      At best we can have eventual consistency or a really low chance of inconsistency.

    • @rusrushal13 6 years ago

      Doesn't the master-slave concept (the leader election thing that services like ZooKeeper do) make our distributed system consistent?
      The CAP theorem suggests that out of three things we can achieve at most two! So in that sense, if we are using a distributed system, consistency and availability could be achieved, right?

    • @rusrushal13 6 years ago

      I just visited en.wikipedia.org/wiki/CAP_theorem to get a better understanding of the CAP theorem. I get the point you are trying to make, but theoretically the things I am saying also make sense (at least to me); maybe in practice it doesn't happen. Can you give more explanation of why the CAP theorem is right?
      PS: And yeah, thanks for the explanation and videos. I don't know whether I thanked you before or not, so I don't want to miss my chance here!

    • @gkcs 6 years ago

      The CAP theorem proof is quite involved. Let's take it at face value.
      Even with a master slave, the request might be lost in transit from master to slave, which then has inconsistent information.

    • @rusrushal13 6 years ago

      But as a whole, our system is consistent, as our databases are consistent with our incoming data even if our slave is inconsistent with the master database.
      So does this mean that the CAP theorem applies independently to every component of our system?
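
The trade-off argued over in this thread can be made concrete with a toy pair of replicas. This is a sketch under simplifying assumptions, not a proof of the CAP theorem: `Replica`, `make_pair`, and the `partitioned` flag are hypothetical, replication is a plain in-memory assignment, and the modes are labelled "CP" (refuse writes during a partition) and "AP" (accept writes and diverge).

```python
from typing import Optional, Tuple


class Replica:
    """Hypothetical replica that mirrors every write to one peer."""

    def __init__(self, mode: str) -> None:
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.value: Optional[int] = None
        self.peer: Optional["Replica"] = None
        self.partitioned = False  # True while the link to the peer is cut

    def write(self, value: int) -> bool:
        """Return True if the write was accepted."""
        if not self.partitioned and self.peer is not None:
            # Healthy network: apply the write on both replicas.
            self.value = value
            self.peer.value = value
            return True
        if self.mode == "CP":
            # Stay consistent by refusing the write: availability is lost.
            return False
        # AP: stay available by accepting locally: the replicas now disagree.
        self.value = value
        return True


def make_pair(mode: str) -> Tuple[Replica, Replica]:
    first, second = Replica(mode), Replica(mode)
    first.peer, second.peer = second, first
    return first, second
```

Cutting the link on a "CP" pair makes writes fail while both replicas keep the same value; on an "AP" pair the write succeeds but the peer is left with stale data, which matches the "either available or consistent" reply above.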

  • @AmanNidhi 6 years ago

    Can you also explain the tech stack of CodeChef?

    • @gkcs 6 years ago

      I work for Flock, a team messaging app 😊

  • @Akashdeepkashyap 2 years ago

    Gaurav: If Earth is destroyed, humanity will end.
    Musk: I accept the challenge.

  • @sikorpro 5 years ago

    If one load balancer (gateway) is down, how technically does the user get redirected to the working one? Is the user sent back automatically to the DNS server? Or maybe the user grabs all the addresses available for google.com from the DNS server? Please explain.

    • @kartkat 5 years ago

      It's actually up to the company whose LB is down to stop advertising that IP to the internet. In that case they should remove the DNS entry for the faulty one and serve only the other entry (if only two LBs are present). There are other ways to avoid sending traffic to the faulty LB, and one of them is CDNs. Almost all large-scale companies use CDNs to distribute their traffic. If you are using a CDN, you can force your traffic to go only to the good LB IP. You as an individual can also force traffic to a good IP by taking it from the DNS result you received and adding it to your hosts file.

  • @sayandey1478 5 years ago

    Could we do something like this: important updates go to a global cache, and not-so-important updates go to an in-memory cache?

    • @gkcs 5 years ago

      Yes you could :)

  • @jayeshudhani99 3 years ago

    What if DNS fails? Isn't DNS a single point of failure? In that case, having a spare backup copy of DNS would work right?

    • @gkcs 3 years ago +1

      DNS clusters are the backbone of the internet. It's unlikely for them to fail in their entirety.

    • @jayeshudhani99 3 years ago

      @gkcs Thanks man. Great explanation.

  • @UmangSardesai 5 years ago +4

    So Elon Musk is making humanity 'Fault Tolerant' ;)

    • @gkcs 5 years ago

      In a way, yes 🙂
      However, if Earth explodes, can Mars still be inhabitable?

    • @UmangSardesai 5 years ago +1

      Well, let's hope Mars is habitable (or hope we're able to make it habitable). Only time will tell. 😬

  • @kirankjoseph 2 years ago

    Guess Facebook had a single point of failure a couple of days back!

  • @chiragbansal2891 1 year ago

    Your analogy is not that good but I get it! :D

  • @mannysingh6618 5 years ago

    What happens if a transaction is processing on a node and that node fails?
    You may have completed part of the transaction.

    • @gkcs 5 years ago

      A transaction should be rolled back in that case.
      If the node fails, the request fails, and the client can retry.
      If the transaction isn't reversible...well...then it isn't a "transaction" is it? :)

    • @mannysingh6618 5 years ago +1

      @gkcs
      Okay, then what would you do with a sequence of operations?
      I only ask because this is a "gotcha" situation.

    • @gkcs 5 years ago +1

      That's a problem, but a sequence of operations cannot be treated as a transaction anyway. Either we reverse them, or retry from the point of failure.
      Most systems can't reverse a sequence of operations. One of the operations might contact an external system or send the user an email. To reverse them, they perform some compensation operations.
      Have a look at Sagas.
      microservices.io/patterns/data/saga.html
      Cheers!

    • @mannysingh6618 5 years ago

      Yep, tricky stuff here. Thanks for the pointer to the pattern!
      How long have you been coding?

    • @gkcs 5 years ago

      Not too long, I have been working for 4 years now. What about you?
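
The compensation idea mentioned earlier in this thread (see the Sagas pointer) can be sketched as a list of steps, each paired with an operation that undoes it. This is a minimal illustration under assumptions, not the pattern's reference implementation: `run_saga`, `Step`, and the example steps are hypothetical, and a real saga would also persist its progress so that it can resume after a crash.

```python
from typing import Callable, List, Tuple

# Each step is (action, compensation): the compensation undoes the action.
Step = Tuple[Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step]) -> bool:
    """Run the steps in order; on failure, compensate completed steps in reverse."""
    completed: List[Callable[[], None]] = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            for undo in reversed(completed):
                undo()
            return False
        completed.append(compensate)
    return True


log: List[str] = []


def reserve_seat() -> None:
    # Hypothetical step that fails, for example because a node went down.
    raise RuntimeError("seat service unavailable")


succeeded = run_saga([
    (lambda: log.append("charge card"), lambda: log.append("refund card")),
    (reserve_seat, lambda: log.append("release seat")),
])
```

Only the step that actually completed is compensated: the card charge is refunded, while the seat was never reserved and so is never released.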

  • @jimitshah7636 1 month ago

    Is DNS a single point of failure?

    • @gkcs 1 month ago +1

      DNS is a distributed service.

  • @davidlee2117 4 years ago

    I was wondering, is the DNS server now a single point of failure? 🤔

    • @gkcs 4 years ago

      There are multiple DNS servers spread across the globe, to which nearby users connect.

  • @taneja_unchained 3 years ago

    So DNS is a single point of failure here? How is that made more resilient?

    • @gkcs 3 years ago +1

      The DNS network is the backbone of the internet.
      Everything is prone to failure. Eventually, Earth is a single point of failure.
      We have to settle for a risk factor that's acceptable for our use case.

    • @shashanksoni9539 2 years ago

      @gkcs After the Facebook outage, I think we need to give some thought to DNS failure too.

    • @gkcs 2 years ago

      @shashanksoni9539 Hahaha 😛
      Well, the DNS didn't fail. FB got itself unregistered on the DNS servers.

  • @amitjalan 3 years ago

    Hey Gaurav, I really enjoy these videos and have not just learned a lot, but you have revived my interest in academic explanations behind practical solutions, which is not something I thought would have been possible after stressful university years. Can I ask one question (or potentially start a discussion)? Referencing primary/backup database servers as master/slave makes me cringe every time I hear it. It is pervasive in the industry, and that is the standard way to refer to it even in the most "woke" circles. What's the chance that we as an industry can stop using the master/slave terminology and use primary/backup or production/shadow language instead? Not just to be more accurate, but also to not demean the experience of enslaved people, which was a lot less glorified than what a backup database is responsible for.

    • @gkcs 3 years ago +1

      I would prefer to use the terms primary or backup instead of the terms of master and slave, because they describe the roles of these computers better. Words are tools that describe or communicate ideas. I want to use the best tools available.
      However, in my experience, I have heard the term master/slave more often than primary/backup. That's probably the reason why I end up using them: it's a subconscious thing.

    • @amitjalan 3 years ago

      @gkcs Oh, I totally get that. It's a long process of unlearning these deeply ingrained things. Takes a lot of discipline. I very recently learned the origin of the words "whitelisting/blacklisting" and it was quite eye-opening.

  • @shubhamverma1407 3 years ago

    A stupid question:
    What if the DNS server crashes? Isn't it a single point of failure?

    • @gkcs 3 years ago +1

      There are multiple DNS servers we can connect to.

  • @prithwishdasgupta4508 6 years ago +2

    Waiting for tinder..... 😀

  • @alirezaasadi8656 5 months ago

    Humans are a single point of failure 😂😂