Distributed Systems 2.4: Fault tolerance

  • Added on 27 July 2024
  • Accompanying lecture notes: www.cl.cam.ac.uk/teaching/212...
    Full lecture series: • Distributed Systems le...
    This video is part of an 8-lecture series on distributed systems, given as part of the undergraduate computer science course at the University of Cambridge. It is preceded by an 8-lecture course on concurrent systems for which videos are not publicly available, but slides can be found on the course web page: www.cl.cam.ac.uk/teaching/212...

Comments • 12

  • @eyadkhayat (8 months ago)

    Watching this course while brushing up on my system design skills. Very useful. Thank you!

  • @bermick (a month ago)

    brilliant! thanks a lot for the content Martin!

  • @dmytrozaporizkiy3599 (2 years ago)

    Brilliant!

  • @ascyrax8507 (2 years ago)

    nice content. thanks a lot.

  • @mantistoboggan537 (3 years ago, +5)

    So wait, how does the eventual failure detection get implemented? Don't we still fundamentally have the same problem if we have asynchronous timings? How would I know that my node has failed, as opposed to just going through a huge garbage collection protocol, or thrashing, or anything else?

    • @AZAssazin (3 years ago, +5)

      I think the idea is that *eventually* may mean a very long time, e.g. if you don't get a response in a few weeks, the node crashed. Alternatively, you could probably enforce (maybe via an SLA) what a failed node will look like, especially if the service you're calling is another service your company owns. "If we don't respond within 1 minute, then even if we were just stalled due to garbage collection, we'll discard the message and consider the node faulty."

    • @yogeshedekar6078 (3 years ago, +4)

      You can simply have a heartbeat signal sent to every node, usually called a liveness probe in cloud terminology. If the node does not reply to the heartbeat, say, 3 times consecutively, you know that the node has failed and can trigger an automatic restart. If the restart does not fix the issue either, you take that node out of rotation and put another node in its place.
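
The "3 consecutive missed heartbeats" rule described above can be sketched as a small counter. This is only an illustration: the class name `HeartbeatMonitor` and the parameter `max_misses` are hypothetical, and the probe results are fed in by hand rather than coming from a real network call.

```python
class HeartbeatMonitor:
    """Counts consecutive missed heartbeats for one node (hypothetical helper)."""

    def __init__(self, max_misses=3):
        self.max_misses = max_misses  # assumed threshold, taken from the "3 times" in the comment
        self.misses = 0

    def record(self, replied):
        """Record one probe result; return True once the node is suspected."""
        if replied:
            self.misses = 0           # any reply resets the streak
        else:
            self.misses += 1
        return self.misses >= self.max_misses


monitor = HeartbeatMonitor(max_misses=3)
probe_results = [True, False, False, True, False, False, False]
suspected = [monitor.record(r) for r in probe_results]
```

Only the last probe in this sequence completes a streak of three misses. Note that the monitor can only suspect a failure: a node that is merely slow for three probe intervals looks the same as a crashed one.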

    • @allyourcode (2 years ago, +3)

      I think the answer is in the title of the slide: a PARTIALLY SYNCHRONOUS model is being considered, not async.

    • @kleppmann (2 years ago, +17)

      That's exactly the point: if you don't get a reply from some node within some timeout, it might be that the node crashed, but it could also be that the node or the network is just temporarily being slow. And we can't definitively distinguish between crash and slowness. However, if slowness is only temporary, then eventually the node will start responding again if it's not crashed. The problem is that in an asynchronous or partially synchronous system, we don't know how long that might take.
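
One common way to cope with this uncertainty (not spelled out in the comment, so treat it as an assumption) is to suspect a node after a timeout, and to withdraw the suspicion and lengthen the timeout whenever a suspected node replies after all. The sketch below uses a hypothetical `TimeoutDetector` class and explicit `now` timestamps instead of a real clock, so that it runs deterministically.

```python
class TimeoutDetector:
    """Timeout-based failure detector for a single node (illustrative sketch)."""

    def __init__(self, timeout=1.0):
        self.timeout = timeout      # assumed initial timeout in seconds
        self.last_heard = 0.0
        self.suspected = False

    def on_reply(self, now):
        """Called whenever a reply from the node arrives."""
        if self.suspected:
            # The suspicion was wrong: the node was slow, not crashed.
            self.suspected = False
            self.timeout *= 2       # be more patient next time
        self.last_heard = now

    def check(self, now):
        """Return True if the node is currently suspected of having crashed."""
        if now - self.last_heard > self.timeout:
            self.suspected = True
        return self.suspected


d = TimeoutDetector(timeout=1.0)
d.on_reply(now=0.0)
first = d.check(now=1.5)    # silent for 1.5 s, longer than the 1.0 s timeout
d.on_reply(now=2.0)         # late reply: suspicion withdrawn, timeout doubled
second = d.check(now=3.5)   # silent for 1.5 s, within the new 2.0 s timeout
```

If slowness really is temporary and bounded, the growing timeout eventually exceeds it, and from then on only crashed nodes stay suspected. What the detector cannot tell you is how long that will take.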

  • @sarathkumarmutnuru1177

    At 6:51, how can any fault detector label a node as correct if it has actually crashed? The fault detector labels a node as correct only if it receives an acknowledgment of some sort, so there is no way a crashed node can acknowledge.
    Unless the node has crashed in between the fault detector's probe intervals.

    • @khaldrogo9451 (2 years ago, +1)

      Well, one example is to think of the time in between messages being passed. A sends a message to B, asking if B is still up. B responds by saying "yes, I'm good", and crashes right away. Now, A will get a message saying that B is up, but in reality B has actually crashed. So, until A goes around and asks B for its status again, it will not know and will keep B marked as correct.

    • @GooseBerry390 (a year ago, +1)

      @khaldrogo9451 Excellent response. Note that there is the timeout period itself as well, so even after A has asked B, it will wait for a particular length of time until it decides that a timeout has actually occurred.
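
The two delays mentioned in this thread (waiting for the next probe, then waiting for the timeout) can be added up in a few lines. This is a simplified model with made-up numbers: the function name `detection_time` is hypothetical, A is assumed to probe at times 0, interval, 2 × interval and so on, and B is assumed to answer instantly while it is alive.

```python
import math


def detection_time(crash_time, probe_interval, timeout):
    """Time at which A first suspects B under the simplified model above."""
    # The first probe sent at or after the crash is the first one to go unanswered.
    first_unanswered = math.ceil(crash_time / probe_interval) * probe_interval
    # A only gives up on that probe once the timeout has elapsed.
    return first_unanswered + timeout


# B answers the probe at t=0 and crashes at t=0.1; A probes every 5 s with a 2 s timeout.
detected_at = detection_time(crash_time=0.1, probe_interval=5.0, timeout=2.0)
stale_window = detected_at - 0.1   # how long A wrongly considers B correct
```

In this example A keeps B marked as correct for almost 7 seconds after the crash, which is close to the worst case of one probe interval plus one timeout.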