How to avoid a single point of failure in distributed systems ✅

Gaurav Sen

131,160 views

Added on 9 July 2024
A single point of failure (SPOF) in computing is a critical point in the system whose failure can take down the entire system. A lot of resources and time are spent on removing single points of failure from an architecture/design.
Single points of failure often pop up when setting up coordinators and proxies. These services help distribute load and discover services as they join and leave the system. Because these services perform critical, centralized tasks, they are especially prone to becoming SPOFs.
One way to mitigate the problem is to use multiple instances of every component in the service. The graph of dependencies then becomes more flexible, allowing the system to resiliently switch to another service instead of failing requests.
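As a rough illustration of switching to another instance instead of failing the request, here is a minimal Python sketch. The instance names and the `send` transport function are hypothetical placeholders, not part of any real system described in the video.

```python
from typing import Callable, Sequence


class AllInstancesFailed(Exception):
    """Raised when every instance in the pool failed the request."""


def call_with_failover(instances: Sequence[str],
                       send: Callable[[str], str]) -> str:
    """Try each instance in order and return the first successful response.

    `send` is a hypothetical transport function: it takes an instance
    address and either returns a response or raises ConnectionError.
    """
    errors = []
    for address in instances:
        try:
            return send(address)
        except ConnectionError as exc:
            # This instance is down; remember why and move on to the next.
            errors.append(f"{address}: {exc}")
    raise AllInstancesFailed("; ".join(errors))


def fake_send(address: str) -> str:
    # Simulated transport: pretend the first instance has crashed.
    if address == "app-1.internal":
        raise ConnectionError("connection refused")
    return f"handled by {address}"


if __name__ == "__main__":
    print(call_with_failover(["app-1.internal", "app-2.internal"], fake_send))
```

The request only fails when every instance in the list is unreachable, so no single instance is a SPOF for this call path.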
Another approach is to have backups which allow a quick switch over on failure. The backups are useful in components dealing with data, like databases.
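A switch-over to a backup could look roughly like the sketch below. It assumes a toy in-memory `Store` and synchronous copying to a single backup; both are simplifications chosen for illustration, and real databases use far more careful replication and promotion logic.

```python
class Store:
    """Toy in-memory key-value store that can be marked as failed."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.alive = True

    def write(self, key, value):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        self.data[key] = value

    def read(self, key):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return self.data[key]


class ReplicatedStore:
    """Sends writes to the primary, copies them to the backup,
    and promotes the backup when the primary stops responding."""

    def __init__(self, primary, backup):
        self.primary = primary
        self.backup = backup

    def write(self, key, value):
        try:
            self.primary.write(key, value)
        except ConnectionError:
            self._fail_over()
            self.primary.write(key, value)
            return
        # Copy the write so the backup can take over without losing it.
        if self.backup is not None and self.backup.alive:
            self.backup.write(key, value)

    def read(self, key):
        try:
            return self.primary.read(key)
        except ConnectionError:
            self._fail_over()
            return self.primary.read(key)

    def _fail_over(self):
        if self.backup is None or not self.backup.alive:
            raise ConnectionError("no healthy backup to promote")
        # Promote the backup; there is no spare left until one is added.
        self.primary, self.backup = self.backup, None
```

After a fail-over the system is back to a single copy, so a new backup has to be provisioned before the next failure.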
Allocating more resources, distributing the system and replication are some ways of mitigating the problem of SPOF. Hence designs include horizontal scaling capabilities and partitioning.
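One simple way to combine partitioning with replication is to hash each key to an owner node and store copies on the next nodes in the list. The sketch below assumes a fixed node list and a hypothetical `replication_factor` parameter; production systems typically use consistent hashing so that adding a node moves fewer keys.

```python
import hashlib
from typing import List


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition index using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def replicas_for(key: str, nodes: List[str],
                 replication_factor: int = 2) -> List[str]:
    """Return the nodes holding `key`: the owner plus the next nodes in order."""
    if not 1 <= replication_factor <= len(nodes):
        raise ValueError("replication_factor must be between 1 and len(nodes)")
    start = partition_for(key, len(nodes))
    return [nodes[(start + i) % len(nodes)] for i in range(replication_factor)]


if __name__ == "__main__":
    print(replicas_for("user:42", ["node-0", "node-1", "node-2"]))
```

With a replication factor of 2, losing any one node still leaves a copy of every key on another node.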
It is important to note that the CAP theorem does not allow removing SPOFs if perfect consistency is required.

Comments • 124

@semihkekul 5 years ago ⁺⁷⁶
Elon Musk had already seen the Earth as a single point of failure and has been trying to create a slave on Mars. In that case, maybe the Moon will be a Load Balancer.
@gkcs 5 years ago ⁺⁸
Lol. Maybe that will be true one day!
@Shankusu993 2 years ago
Parents send a request for a child to the Moon and it redirects it to either Mars or Earth, and there a baby pops up and is returned back to the parents, who are strolling through space in a spaceship. Sounds good XD
@DesiTennis 2 years ago
😂
@WittyGeek 6 years ago ⁺⁵
The Netflix example is a good one. I saw their PyCon 2018 talk and they showed how they do regional failovers in under 7 minutes. It was a good talk.
@Nithin_Coorg 2 years ago
So crisp as always!
@fchas15 4 years ago ⁺⁵
Your positive energy makes me feel good! I feel like even I can get through an interview after watching you! Excellent!
@gkcs 4 years ago
Fantastic!
@manojmj5479 6 years ago ⁺²
Very informative! Please keep making more videos!
@sumitlahiri209 6 years ago ⁺¹
Informative video, especially the meteorite scene. Awesome!!
@anastasianaumko923 1 year ago
Thank you, very clear!
@songs4enjoy 5 years ago ⁺²
A few more observations:
All your examples use reverse proxies to achieve HA (except the browser, though not explicitly mentioned), but there are other techniques:
1. Client-side load balancing: using a service registry (a somewhat smarter DNS) and smart clients to achieve HA. In your example, the browser can be considered a smart client, but there is a lot more in server-to-server communication to achieve HA in request-response flows.
2. Also, your LB in a single zone is usually kept in an HA configuration using something like keepalived and a floating IP.
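The client-side load balancing idea from the first point can be sketched as follows. `ServiceRegistry` and `SmartClient` are hypothetical names invented for this illustration and do not correspond to any particular registry product; the sketch also leaves out health checking, which a real registry would provide.

```python
import itertools


class ServiceRegistry:
    """Toy in-memory registry: service name -> list of instance addresses."""

    def __init__(self):
        self._instances = {}

    def register(self, service, address):
        self._instances.setdefault(service, []).append(address)

    def deregister(self, service, address):
        self._instances[service].remove(address)

    def lookup(self, service):
        # Return a copy so callers cannot mutate the registry's state.
        return list(self._instances.get(service, []))


class SmartClient:
    """Picks an instance round-robin from whatever the registry lists now."""

    def __init__(self, registry, service):
        self._registry = registry
        self._service = service
        self._counter = itertools.count()

    def pick(self):
        instances = self._registry.lookup(self._service)
        if not instances:
            raise LookupError(f"no instances registered for {self._service}")
        return instances[next(self._counter) % len(instances)]
```

Because the client asks the registry on every pick, a deregistered instance stops receiving traffic without any proxy sitting in the request path. The registry itself then needs to be replicated, or it becomes the new SPOF.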
@kumarakantirava429 4 years ago ⁺¹
Prof. Gaurav Sen,
Learning a lot, Sir. Thank you.
@abhikeshu 5 years ago
Hey Gaurav, a big thanks for the effort you are putting in. I really learnt many things which were myths to me till now. You have explained the concepts in very simple terms. Keep up the good work.
@karandutt4534 3 years ago ⁺¹
Hello Gaurav,
Thanks a million for sharing your knowledge and helping us.
Keeping your examples and explanations as simple as possible makes you stand out.
Please do add such small topics, which are definitely useful; otherwise they go unnoticed in a larger video/topic.
@thisisfunc4529 6 years ago
Thanks, your video is awesome
@prashantsrivastava9550 3 years ago
Very well explained, and great positive energy with a great smile :)... thanks buddy
@geekengr 5 years ago
Nice work!
@suraj-gd9qy 2 years ago
Nice bro, I had to use this in recent development and I understood the concept... Thanks :)
@tusharverma03 2 years ago
Thank you Gaurav, we need to know these small details about each part.
Your videos are amazing. Keep making videos on large systems, and whenever you come up with a sub-topic, you can link it in the description so that one can master that topic before moving ahead.
Thanks a lot
@sachinakinapally5061 4 years ago ⁺¹
Man, you literally helped me finish my assignment! Learned a lot. Great content. Thanks!
@sankalparora9374 1 year ago
Thanks for the video!
@gkcs 1 year ago
Cheers!
@influencer737 5 years ago
Bro, all the very best in your new role at Uber, wishing you all success
@sheshitkarthikeya1528 5 years ago
Awesome!!
@gkcs 5 years ago
Thank you!
@PankajKumarSingla 5 years ago
Thanks Gaurav for this video. Please add more videos on server failure and on how to divide by region or master-slave.
@lien3723 5 years ago ⁺¹
Could anyone please explain what a profile server is? Or what profiling in system design means? Are these two different things? I have heard these terms in multiple contexts but don't quite understand what they mean. Thank you!
@raj_kundalia 1 year ago
thanks
@harisridhar1668 3 years ago ⁺²
3:35 Gaurav - we don't have to worry about the Domain Name System (DNS) being an articulation point / SPOF (Single Point of Failure) since DNS is already a decentralized distributed system, correct? In a sense, we are already taking advantage of an existing scalable and resilient network architecture, correct?
@vijaykidecha7491 5 years ago ⁺¹
Hi Gaurav, thanks for the video. All the videos on system design are really informative. Can you please make videos on ESB and messaging queues like the one you made on load balancing?
@gkcs 5 years ago ⁺¹
I have one on messaging queues in the playlist. I'll check out what ESB is 🙂
@kinjalthehero 4 years ago
What is a profile server? Thank you for the informative videos.
@saitejajonnadula 3 years ago
Chaos engineering is applied to an application/node before it goes to the production phase: triggering controlled attacks and having the ability to roll back the attack to return to the original stable state.
@gkcs 3 years ago
Netflix does this really well.
@mohanreddy4669 5 years ago
Hey Gaurav,
Can you please do a video on designing an e-commerce site like Amazon/Walmart?
@alirezamosavi6185 10 months ago
Hi, I did not get why we need a kind of DNS above the load balancers. Why aren't the load balancers alone enough to distribute the traffic to all nodes? Please explain this...
@sugyansahu9120 6 years ago ⁺¹
Well, this was informational. ☺️ Waiting for your Tinder system. 😎
@biswajeetsethi7689 1 year ago
Hi, is clock synchronization a must in every distributed system? Can I still call it a distributed system if the nodes work on different data sets at different locations independently in order to make some business decision at the end?
@venkatreddy6851 6 years ago ⁺³²
Hi Gaurav, thanks for the videos. Really enjoying and learning a lot from them. I have a question. As you mentioned in the video, when a load balancer fails we overcome the problem by placing multiple load balancers and keeping all the IPs of the load balancers in the DNS. But how does the DNS know whether the first load balancer is working fine or not? DNS is simply a name-to-address resolver, and once that is done it is no longer in the picture. Where do we write the logic saying that if loadbalancer1 fails, contact loadbalancer2, or something like this?
@gkcs 6 years ago ⁺²⁰
Hey Venkat, DNS are pretty smart these days. They send requests to an IP Address and can redirect to the next IP in their list if they get the appropriate error code.
503 means service unavailable, and seems like a good error to set.
@venkatreddy6851 6 years ago ⁺¹
Thanks. Can you please make one on Uber?
@kartkat 5 years ago ⁺¹
@@venkatreddy6851 Your computer can do round robin (RR), or the DNS server may have done RR before sending you the response. More info can be found here: blogs.technet.microsoft.com/networking/2009/04/17/dns-round-robin-and-destination-ip-address-selection/. Also check the RFC which details LB on DNS: www.faqs.org/rfcs/rfc1794.html
@trushapatel9012 4 years ago ⁺¹
@@gkcs You can also consider a timeout: if there is no ACK for a packet sent across the network after N attempts, it will automatically send the same packet to another IP address. That's called the 3-way handshake.
That's where routing comes into play and a more structured networking architecture will be built. You can cover some of that in your System Design videos.
@lakshminarayanansairam2739 4 years ago
+1
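The question in this thread, where the "try loadbalancer2 if loadbalancer1 fails" logic lives, can be illustrated with a client-side sketch. A plain DNS lookup only returns addresses, so in this sketch the client walks the list itself with a timeout. The helper names `resolve_all` and `connect_first_reachable` are made up for the example; Python's own `socket.create_connection` already tries every resolved address in turn when given a hostname.

```python
import socket


def resolve_all(host, port):
    """Return every IP address DNS gives for `host`, in resolver order."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    addresses = []
    for _family, _type, _proto, _canonname, sockaddr in infos:
        if sockaddr[0] not in addresses:
            addresses.append(sockaddr[0])
    return addresses


def connect_first_reachable(addresses, port,
                            connect=socket.create_connection, timeout=2.0):
    """Try each address in order; return (ip, connection) for the first that works.

    `connect` is injectable so the logic can be exercised without a network.
    """
    last_error = None
    for ip in addresses:
        try:
            return ip, connect((ip, port), timeout)
        except OSError as exc:
            # Refused or timed out: fall through to the next address.
            last_error = exc
    raise ConnectionError(f"no address reachable: {last_error}")
```

In practice this would be called as `connect_first_reachable(resolve_all("example.com", 443), 443)`. Managed DNS services can also run health checks and stop returning a dead address, which moves the same decision to the DNS provider.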
@ankitakashyap4289 5 years ago
Hi Gaurav. When you say multiple databases for data replication in a master-slave pattern, do the multiple databases fall under the concept of sharding? Also, what is meant by cross-data replication and cross-region? Is it simply that replication happens over multiple regions?
@kumarakantirava429 4 years ago ⁺¹
Sir,
It's wonderful of you to pin that question with your insightful answer... I was struggling to understand how clusters can offer high availability for web sites. Your DNS answer enlightened me on a lot of design aspects... Directly prostrating at your feet.
@nikhilsingh2233 5 years ago
Hi Gaurav!
Great videos. They show your passion for explaining these topics. Could you please tell me how changes in the database of a particular node get mirrored to another node?
@gkcs 5 years ago ⁺¹
Thank you!
I'll be getting to this topic soon 😁
@ravindrababu4759 4 years ago ⁺²
Rephrase "More nodes" as "redundant nodes" to address single points of failure
@GeorgeChi1 3 years ago
p*p is usually not the case when you have a hot data issue that just migrates to the backup/replica and hoses that down as well, due to a system that was not redesigned in time for the scale it now has to support. In that case the particular technology's ability to handle load becomes a single point of failure.
@gkcs 3 years ago
Yes, this is a simplification. Communication lines and response times tend to suffer as the number of nodes increases.
@abhishekbansal3425 5 years ago
Hi Gaurav, in your explanation of the DNS to multiple IP address mapping, how will DNS choose the IP address? Might it send all requests to one IP and overload it?
@ishasingh6726 4 months ago
i found thissssssss vid bcoz of striverrrr a gem knows another gemmmm
@adilsaju 4 years ago
Thank you Gaurav, I just realized that Moriarty in Sherlock Holmes was actually that Chernobyl officer 😂
@gkcs 4 years ago
Hahaha!
@jiamingxing6333 4 years ago
I am new to some of this technology. Can anyone explain to me what backup services are?
@anuraggharat5453 6 months ago
So basically if you think something will fail, add a replica of it to handle failure. Then add a balancer to figure out which one to use from the original and copy. Then the balancer can fail too, so add another balancer to support the original balancer😭
@lakshminarayanansairam2739 4 years ago
Which book do you read a lot? You have good potential in explaining... Even I know a few things, but presenting them in front of people and a camera is hard. Minus 90% will be my outcome. Like your confidence.
@gkcs 4 years ago
Designing Data Intensive Applications :)
@adityamudaliar1145 4 years ago
Isn't finding an inconsistency tougher in these systems? Or, if some bug occurs in between, finding what caused it?
Which module caused it?
Just asking because the financial exchange systems which I worked on were more recovery-oriented and were based on a single point of failure, at least where it was majorly time-critical.
Inconsistencies of data weren't tolerable at all.
@shivamvishwakarma1475 3 years ago ⁺¹
Is DNS failure a single point of failure here?
@sonalinaresh7013 6 years ago
Keep the videos going
@shivangidhakad9807 5 years ago ⁺²
It's funny how the 'single point of failure' aka our Iron Man is wiped out now! Disasters happen...
@gkcs 5 years ago ⁺¹
Hahaha, I loved Avengers too 😁
@adheipsingh1797 5 years ago
What about edge topologies, where basically we have an API gateway to handle north-south traffic, spreading it east-west using service meshes?
@gkcs 5 years ago ⁺¹
I'll look into it, thanks! 🙂
@adheipsingh1797 5 years ago
@@gkcs :)
@bsummer 2 years ago
AWS load balancers brought me here
@Joe99 5 years ago
Does SPOF impact serverless platforms where VMs are dynamically provisioned for you? Or does it become a non-issue as a result?
@gkcs 5 years ago
It's never a non-issue, only an insulated one. Eventually we have to rely on ourselves to ensure resiliency :)
@GerardBeaubrun 2 years ago
I'm looking forward to the Tinder architecture
@thongtran1653 5 years ago
Hi Gaurav. Can you make a video showing the master-slave architecture?
@gkcs 5 years ago
I have. Check out the data replication video. czcams.com/video/GeGxgmPTe4c/video.html
@rusrushal13 6 years ago ⁺¹
Hey Gaurav, in the description you said that "the CAP theorem does not allow removing SPOFs if perfect consistency is required". What does that mean? What do you mean by perfect consistency? Can't we achieve consistency if we are distributing and replicating our system?
@gkcs 6 years ago
We can't if we want the data to be consistent. The distributed system is either available or consistent in case of a network partition.
At best we can have eventual consistency or a really low chance of inconsistency.
@rusrushal13 6 years ago
Doesn't the master-slave concept (the leader election thing which services like ZooKeeper do) make our distributed system consistent?
The CAP theorem suggests that out of three things we can achieve at most two! So in that sense, if we are using a distributed system, consistency and availability could be achieved, right?
@rusrushal13 6 years ago
I just visited en.wikipedia.org/wiki/CAP_theorem to get a better understanding of the CAP theorem. I get the point you are trying to make, but theoretically the things I am saying also make sense (at least to me); maybe practically it doesn't happen. Can you give more explanation on why the CAP theorem is right?
PS: And yeah, thanks for the explanation and videos. I don't know whether I thanked you before or not, so I don't want to miss my chance at least here!
@gkcs 6 years ago
The CAP theorem proof is quite involved. Let's take it at face value.
Even with a master slave, the request might be lost in transit from master to slave, which then has inconsistent information.
@rusrushal13 6 years ago
But as a whole, our system is consistent, as our databases are consistent with our incoming data even if our slave is inconsistent with the master database.
So does this mean that the CAP theorem applies independently to every component of our system?
@AmanNidhi 6 years ago
Can you also explain the tech stack of CodeChef?
@gkcs 6 years ago
I work for Flock, a team messaging app 😊
@Akashdeepkashyap 2 years ago
Gaurav: If earth is destroyed humanity will end.
Musk: I accept the challenge.
@sikorpro 5 years ago
If one load balancer (gateway) is down, how technically does the user get redirected to the working one? Is the user sent back automatically to the DNS server? Or maybe the user grabs all the addresses available for google.com from the DNS server? Please explain.
@kartkat 5 years ago
It's actually up to the company whose LB is down to stop advertising that IP to the internet. In that case they should remove the DNS entry for the faulty one and serve just one DNS entry (if only 2 LBs are present). There are other ways to recover from sending traffic to the faulty LB, and one of them is CDNs. Almost all large-scale companies use CDNs to distribute their traffic. If you are using a CDN, you can force your traffic to go only to the good LB IP. You as an individual can force the traffic to a good IP from the DNS result you received by adding it to your hosts file.
@sayandey1478 5 years ago
Could we do something like this: important updates in a global cache, not-so-important updates in an in-memory cache?
@gkcs 5 years ago
Yes you could :)
@jayeshudhani99 3 years ago
What if DNS fails? Isn't DNS a single point of failure? In that case, having a spare backup copy of DNS would work, right?
@gkcs 3 years ago ⁺¹
DNS clusters are the backbone of the internet. It's unlikely for them to fail in their entirety.
@jayeshudhani99 3 years ago
@@gkcs Thanks man. Great explanation.
@UmangSardesai 5 years ago ⁺⁴
So Elon Musk is making humanity 'Fault Tolerant' ;)
@gkcs 5 years ago
In a way, yes 🙂
However, if Earth explodes, can Mars still be inhabitable?
@UmangSardesai 5 years ago ⁺¹
Well, let's hope Mars is habitable (or hope we're able to make it habitable). Only time will tell. 😬
@kirankjoseph 2 years ago
Guess Facebook had a single point of failure a couple of days back!
@chiragbansal2891 1 year ago
Your analogy is not that good but I get it! :D
@mannysingh6618 5 years ago
What happens if a transaction is processing on a node and that node fails?
You may have completed part of the transaction.
@gkcs 5 years ago
A transaction should be rolled back in that case.
If the node fails, the request fails, and the client can retry.
If the transaction isn't reversible...well...then it isn't a "transaction" is it? :)
@mannysingh6618 5 years ago ⁺¹
@@gkcs
Okay, then what would you do with a sequence of operations?
I only ask because this is a "gotcha" situation.
@gkcs 5 years ago ⁺¹
That's a problem, but a sequence of operations cannot be treated as a transaction anyway. Either we reverse them, or retry from the point of failure.
Most systems can't reverse a sequence of operations. One of the operations might contact an external system or send the user an email. To reverse them, they perform some compensation operations.
Have a look at Sagas.
microservices.io/patterns/data/saga.html
Cheers!
@mannysingh6618 5 years ago
Yep, tricky stuff here. Thanks for the pointer to the pattern!
How long have you been coding?
@gkcs 5 years ago
Not too long, I have been working for 4 years now. What about you?
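The compensation idea from the Saga discussion in this thread can be sketched in a few lines. This is a simplified, single-process illustration with a hypothetical `run_saga` helper. Real sagas coordinate steps across services and must also persist their progress so that a crashed coordinator can resume.

```python
from typing import Callable, List, Tuple

# Each step pairs an action with the compensation that undoes its effect.
Step = Tuple[Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step]) -> bool:
    """Run actions in order; if one fails, compensate completed steps in reverse.

    Returns True when every action succeeded, False when the saga was undone.
    """
    completed: List[Callable[[], None]] = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            # Undo the steps that already succeeded, newest first.
            for undo in reversed(completed):
                undo()
            return False
        completed.append(compensate)
    return True


if __name__ == "__main__":
    log = []

    def charge_card():
        raise RuntimeError("payment gateway down")

    ok = run_saga([
        (lambda: log.append("reserve stock"), lambda: log.append("release stock")),
        (charge_card, lambda: log.append("refund card")),
    ])
    print(ok, log)
```

Only the steps that actually completed are compensated, so the failed step's own compensation never runs.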
@jimitshah7636 1 month ago
Is DNS a single point of failure?
@gkcs 1 month ago ⁺¹
DNS is a distributed service.
@davidlee2117 4 years ago
I was wondering, is the DNS server now a single point of failure? 🤔
@gkcs 4 years ago
There are multiple DNS servers spread across the globe, to which nearby users connect.
@taneja_unchained 3 years ago
So DNS is a single point of failure here? How is that made more resilient?
@gkcs 3 years ago ⁺¹
The DNS network is the backbone of the internet.
Everything is prone to failure. Eventually, Earth is a single point of failure.
We have to settle for a risk factor that's acceptable for our use case.
@shashanksoni9539 2 years ago
@@gkcs After the Facebook outage, I think we need to give some thought to DNS failure too.
@gkcs 2 years ago
@@shashanksoni9539 Hahaha 😛
Well, the DNS didn't fail. FB got itself unregistered on the DNS servers.
@amitjalan 3 years ago
Hey Gaurav, I really enjoy these videos and have not just learned a lot, but you have revived my interest in academic explanations behind practical solutions, which is not something I thought would have been possible after stressful university years. Can I ask one question (or potentially start a discussion)? Referring to primary/backup database servers as master/slave makes me cringe every time I hear it. It is pervasive in the industry, and that is the standard way to refer to it even in the most "woke" circles. What's the chance that we as an industry can stop using the master/slave terminology and use primary/backup or production/shadow language instead? Not just to be more accurate, but also to not demean the experience of enslaved people, which was a lot less glorified than what a backup database is responsible for.
@gkcs 3 years ago ⁺¹
I would prefer to use the terms primary or backup instead of the terms of master and slave, because they describe the roles of these computers better. Words are tools that describe or communicate ideas. I want to use the best tools available.
However, in my experience, I have heard the term master/slave more often than primary/backup. That's probably the reason why I end up using them: it's a subconscious thing.
@amitjalan 3 years ago
@@gkcs Oh, I totally get that. It's a long process of unlearning these deeply ingrained things. Takes a lot of discipline. I very recently learned the origin of the words "whitelisting/blacklisting" and it was quite eye-opening.
@shubhamverma1407 3 years ago
A stupid question:
What if the DNS server crashes? Isn't it a single point of failure?
@gkcs 3 years ago ⁺¹
There are multiple DNS servers we can connect to.
@prithwishdasgupta4508 6 years ago ⁺²
Waiting for Tinder..... 😀
@alirezaasadi8656 5 months ago
Humans are a single point of failure 😂😂
