Martin Kleppmann | Kafka Summit SF 2018 Keynote (Is Kafka a Database?)

  • Added 27 Jul 2024
  • Kafka Summit SF 2018 Keynote by Martin Kleppmann (Researcher, University of Cambridge). Martin Kleppmann is a distributed systems researcher at the University of Cambridge and author of the acclaimed O’Reilly book “Designing Data-Intensive Applications” (dataintensive.net/). Previously he was a software engineer and entrepreneur, co-founding and selling two startups, and working on large-scale data infrastructure at LinkedIn.
    LEARN MORE
    ► Learn about Apache Kafka on Confluent Developer: cnfl.io/confluent-developer-k...
    ► Use CLOUD100 to get $100 of free Confluent Cloud usage: cnfl.io/try-confluent-cloud-c...
    ► Promo code details: cnfl.io/promo-code-details-ka...
    ABOUT CONFLUENT
    Confluent, founded by the creators of Apache Kafka®, enables organizations to harness the business value of live data. The Confluent Platform manages the barrage of stream data and makes it available throughout an organization. It provides various industries, from retail, logistics, and manufacturing to financial services and online social networking, with a scalable, unified, real-time data pipeline that enables applications ranging from large-volume data integration to big-data analysis with Hadoop to real-time stream processing. To learn more, please visit confluent.io
    #kafkasummit #apachekafka #kafka #confluent
  • Science & Technology

Comments • 48

  • @markodimic
    @markodimic 4 years ago +65

    This video should be titled - The best ACID explanation on the whole f-in Internet

    • @homerg
      @homerg 4 years ago

      LMAO

    • @traviscramer
      @traviscramer 4 years ago

      was about to comment with something similar, but you said it perfectly

  • @_sudipidus_
    @_sudipidus_ 5 years ago +31

    For those who liked this talk, I strongly recommend reading his book, Designing Data-Intensive Applications.

    • @MrJoao6697
      @MrJoao6697 4 years ago +5

      This guy is a fucking guru, the book is amazing.

    • @_sudipidus_
      @_sudipidus_ 5 months ago

      Here after 4 years of writing this comment to revisit this topic haha

  • @YoyoMoneyRing
    @YoyoMoneyRing 3 years ago +8

    I was reading his Designing Data-Intensive Applications book and could relate everything mentioned in the Transactions chapter to this video. I would highly recommend reading the Transactions chapter and then watching this video.

  • @patricknazar
    @patricknazar 5 years ago +1

    I love it, thank you.

  • @balaramganesh2970
    @balaramganesh2970 3 years ago +1

    I have a question about how he explained atomicity is expressed in Kafka. When an event that needs to be consumed by a number of consumers is added to a partition, yes, it means that eventually each consumer will consume this event and get the job done. However, if one of these consumers goes down, there is a time interval during which all the other consumers have consumed the event but this one hasn't, which is a violation of atomicity. I'm not saying this issue doesn't exist in an RDBMS; in fact, all I am saying is that the same issue is there in an RDBMS, where until the crashed system/consumer is back up, we'll still be violating the all-or-nothing policy. Open to views from everyone!

  • @rpisal1976
    @rpisal1976 5 years ago +15

    Thanks for a very detailed explanation of this topic, Martin. When you started explaining Consistency, one question popped into my mind in relation to the account debit/credit example. In a case where I have split the debit and credit transactions into two separate stream processors, they have in turn lost knowledge of the overall 'account transfer' action, haven't they? You mentioned that consistency is a property that needs to be enforced by the application and not really by the DB. Does that mean both of these stream processors return some response (status of the transaction) back to the application? Consider the following scenario.
    1. An event is put on the first (parent) topic, which is processed by the first stream processor, and two events (one debit and one credit) get put into two separate topics.
    2. When the debit stream processor tries to debit account 12345, if doing so would make the account balance go negative, the transaction will fail.
    3. The credit stream processor will succeed in crediting account 54321, which kind of makes the whole transaction not succeed, as it did not do what it was supposed to.
    How will the above scenario be handled? Only if the application is notified of both outcomes and then initiates a compensating (reversing) event, isn't it?

    • @josemose2000
      @josemose2000 5 years ago

      Maybe there could be a step before debiting that checks the transaction will succeed first.

    • @anandsakthivel9425
      @anandsakthivel9425 5 years ago

      Hi Rahul,
      Great question. When I thought about this design with your question in mind, I think they would validate the balance of the transaction owner and only then put the transfer event into the parent topic.

    • @MahmoudTantawy
      @MahmoudTantawy 3 years ago +1

      Validating the balance of the transaction owner does not really ensure anything, because you might have debit events that were not applied yet. In an async system you can't have a reliable synchronous check/validation step; you have no idea how many async tasks/events have not been applied yet.
      Example
      1. User A withdraws money
      2. Validation happens and there is money in balance
      3. System that applies the event of debiting is slow or down or whatever
      4. User A immediately afterwards transfers an amount of money to User B
      5. Validation happens and still sees there is money in balance (remember in an "eventually-consistent" system, there is no guarantee of "when")
      6. Transfer is OK'ed and proceeds
      Now you have a negative balance
      A similar concern is the idea of different systems seeing different balances: yes, they all converge to the same balance, but when?

    • @alvin3832
      @alvin3832 2 years ago

      @@MahmoudTantawy I think the validation can be included in the same transaction and done synchronously before the money transfer.
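The compensating (reversing) event this thread converges on can be sketched in a few lines. This is an illustrative Python simulation of the debit/credit scenario from the question above (account numbers reused from it), not real Kafka code; the two `apply_*`/`try_*` functions stand in for the independent stream processors.

```python
# Hypothetical sketch of the compensating-event (saga) pattern discussed above.
# Balances and account numbers are illustrative.
balances = {"12345": 50, "54321": 0}

def apply_credit(account, amount):
    # Crediting always succeeds.
    balances[account] += amount

def try_debit(account, amount):
    # Debiting fails rather than driving the balance negative.
    if balances[account] < amount:
        return False
    balances[account] -= amount
    return True

def process_transfer(src, dst, amount):
    # Two independent "stream processors": one credits, one debits.
    apply_credit(dst, amount)
    if not try_debit(src, amount):
        # Debit failed: emit a compensating (reversing) event
        # that undoes the credit already applied.
        apply_credit(dst, -amount)
        return "reversed"
    return "committed"

print(process_transfer("12345", "54321", 100))  # insufficient funds -> "reversed"
print(balances)  # both balances back where they started
```

As the thread notes, without coordination between the two processors, it is the application (or a third processor observing both outcomes) that must emit the reversal.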

  • @sshawnuff
    @sshawnuff 5 years ago +4

    13:19: "In financial systems you only want to preserve money, only move it around..." Wow.

  • @adamosloizou538
    @adamosloizou538 3 years ago +8

    While I really enjoyed Martin's beautiful talk on Transactions (czcams.com/video/5ZjhNTM8XU8/video.html) I find this one at odds with it.
    Martin's take on atomicity with Kafka is misleading, to me.
    Perhaps Kafka's log could be used as an ingredient to manually, painfully build a distributed database (a very hard problem) but it sure doesn't give you enough out-of-the-box.
    From wikipedia's definition (en.wikipedia.org/wiki/Atomicity_(database_systems)) I'd like to bring to your attention this:
    "A guarantee of atomicity prevents updates to the database occurring only *partially*..."
    and
    "As a consequence, the transaction *cannot be observed to be in progress by another database client*. At one moment in time, it has not yet happened, and at the next it has already occurred in whole (or nothing happened if the transaction was cancelled in progress)."
    Now, consider Martin's example of transferring an amount between 2 bank accounts.
    Important: in practice, the observable behaviour of the total distributed system is at the end-storage DB(s), not at the Kafka level. This is a crucial detail left out. This is the point where the last stream processors write those 2 account-based events, produced by the original event.
    At the point of splitting the original transaction into 2 isolated events, we effectively lose atomicity.
    Why?
    Because the 2 stream apps are not aware of each other and don't coordinate.
    So, in practice, you'll definitely be getting partial state at the end DBs!
    Best case, a temporary one (i.e., for the time it takes the 2 apps to perform both their operations: eventual consistency).
    Worst case, when 1 of them goes down, we stay with a partial result at the DB level (remember, this is the observable side of our distributed system!) until someone realises the app is down and brings it back up. No rollback, no auto-detection and recovery.
    Then 1 of the 2 accounts will have updated their balance at the DB and the other one won't.
    And, more crucially, the whole system will have no idea that that is the case.
    So, in a system where money can stay in limbo, or be created out of thin air, I'd say we still have a long way to go.
    Also, let's bear in mind that Atomicity is not independent of the rest of the ACID principles (see Orthogonality on the Wikipedia article). So Atomicity without Consistency is not atomicity.
    I hope this helps bring things down to earth a little.

  • @Oswee
    @Oswee 5 years ago +2

    Berglund + Kleppmann = Power! :) Love their stuff.

  • @hl5768
    @hl5768 2 years ago

    Traditional 2PC is CA from a CAP point of view, but the Kafka-based implementation is a kind of eventual consistency.

  • @olegyablokov4334
    @olegyablokov4334 4 years ago +2

    So you make your system serializable by using Kafka as a single point of entry for writes. Actually, you can do the same with single-leader replication, and you don't necessarily need Kafka or other stream-processing software to achieve that. I understand this is not the case with several systems (DB, cache, index); you already have to somehow synchronize them. But in the money-transfer example, using just one database can be enough. So why use Kafka in the money-transfer example?
    P.S. And of course thank you for a great talk! :-)

  • @user-qx5dy4nz3n
    @user-qx5dy4nz3n 2 years ago

    Great! Thanks.

  • @TyzFix
    @TyzFix 2 years ago

    When achieving consistency using Kafka, do we need to account for the latency the system introduces?

  • @lgylym
    @lgylym 5 years ago +4

    It seems like many of the properties are achieved by turning synchronous processing into asynchronous processing.

  • @adamdude
    @adamdude 2 years ago

    For the username-claims example, how can you scale this to multiple partitions if there needs to be some central store of whether a username is taken? The process that checks whether a claim can be accepted must be checking some database to see if the name already exists, and that database must hold all of the usernames, right? How is that scalable, then?
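One common answer to this scaling question, sketched under the assumption (also raised by a reply further down) that the username itself is used as the partition key: every claim for a given name then lands in the same partition, so each partition's single consumer can own a disjoint shard of the username store, and no central database is needed. Illustrative Python, not real Kafka code; `partition_for` stands in for Kafka's key hashing.

```python
# Illustrative sketch: route username claims by key hash so that all claims
# for one name reach the same partition, whose single consumer decides
# first-come-first-served over its own shard of the taken-names set.
NUM_PARTITIONS = 4
taken = [set() for _ in range(NUM_PARTITIONS)]  # one shard per partition

def partition_for(username):
    # Stand-in for Kafka's key-based partitioner.
    return hash(username) % NUM_PARTITIONS

def claim(username):
    # No other partition's consumer ever sees this username,
    # so a local shard lookup is enough.
    shard = taken[partition_for(username)]
    if username in shard:
        return "rejected"
    shard.add(username)
    return "granted"

print(claim("jane"))  # granted
print(claim("jane"))  # rejected: the same partition saw the earlier claim
print(claim("joe"))   # granted
```

The design choice is that scalability comes from sharding the uniqueness check along the same key used for partitioning, so no cross-partition coordination is required.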

  • @abhiitechie
    @abhiitechie 5 years ago +3

    Hey Martin, what an amazing talk. I had one question, though, about the eventual-consistency model for this event-driven architecture: the end users may, for a period of time, see different versions of the data at the different data stores, i.e. the database, the cache, and the search index. How do we solve this problem?

    • @karthikrangaraju9421
      @karthikrangaraju9421 5 years ago +1

      abhishek gupta You'd need distributed transactions across the different DBs.
      But that's not possible/available in all DBs.
      It's still an active research area.

    • @aljjxw
      @aljjxw 5 years ago

      In some other video (I can't remember which), someone said this is unavoidable, and when everything works correctly it will not happen. They also suggested always using the same source when showing data to the end user, say, always the cache... I know it is not a satisfying answer.

    • @sachintiwari1579
      @sachintiwari1579 5 years ago +1

      Hi, it seems the Confluent guys have picked up your question and tried to answer it. Not sure if you are aware of it, so sharing it here.
      m.czcams.com/video/CeDivZQvdcs/video.html
      IMO they have not understood your question, so their answer is not going to address your concern.
      Please check it yourself and share your thoughts.

  • @pankajsinghv
    @pankajsinghv 5 years ago +4

    An amazing talk... Very informative... I'm thinking of making use of it in my projects...

  • @slimbouguerra7111
    @slimbouguerra7111 5 years ago +6

    What a great idea (minute 12:00): to get atomicity, let's serialize all the writes through a Kafka queue instead of two-phase commit, because 2PC can be slow!

    • @songs4enjoy
      @songs4enjoy 5 years ago +4

      For developers who are dumb like me: this comment is sarcasm. Don't take it literally :D

  • @slimbouguerra7111
    @slimbouguerra7111 5 years ago +8

    Another great design, at min 17:00: instead of using traditional transactional updates, let's use events and make each update (here, 2 updates) idempotent. Well, the question: if one of your stream processors fails or is slow, do you consider the state of your DBs correct? Is this really atomicity?

    • @songs4enjoy
      @songs4enjoy 5 years ago +5

      His diagram says the RDBMS can consume in parallel with the cache & search index. I assume this must be a mistake, as consistency has to be guaranteed by some snapshot-aware store (the RDBMS) before the cache/search index can be updated.
      Again, this model has its uses, and the gaping hole of additional complexity to achieve this operation. But this whole model should only be looked at when an RDBMS is not able to meet the throughput & latency requirements of the business. Otherwise, just stick with the RDBMS, which is quite a safe place.

  • @rydmerlin
    @rydmerlin 1 year ago

    How can you say that something that isn't performed in the same time quantum as something else is atomic? If it happens later, that isn't a guarantee it's atomic, is it? A time bound has to apply for atomicity, doesn't it?

  • @hydrogeneration7353
    @hydrogeneration7353 5 years ago +1

    Broadens one's database-architecture perspective.

  • @_SoundByte_
    @_SoundByte_ 3 years ago

    A totally ordered log is not possible with multiple partitions, right?
    It's not possible to have a number of partitions equal to the number of customers, is it?
    Or do banks really have lakhs of partitions?

    • @john-taylorsmith3100
      @john-taylorsmith3100 3 years ago +2

      You can have any number of different keys in a partition, and the guarantee is that the events for each key in the partition will be in order.
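The per-key ordering guarantee described in this reply can be demonstrated with a small simulation: many keys share a few partitions, yet each key's events stay in append order within its partition. Illustrative Python only; `route` stands in for Kafka's key hashing, and the account names are made up.

```python
# Sketch: far fewer partitions than keys, yet per-key order is preserved,
# because all events for one key append to the same partition log.
NUM_PARTITIONS = 3

def route(key):
    # Stand-in for Kafka's key-based partitioner.
    return hash(key) % NUM_PARTITIONS

partitions = [[] for _ in range(NUM_PARTITIONS)]

# Produce three ordered events per account, for four accounts.
for seq in range(3):
    for account in ("acc-1", "acc-2", "acc-3", "acc-4"):
        partitions[route(account)].append((account, seq))

# Within any partition, events for a given key are still in append order.
for p in partitions:
    for account in {a for a, _ in p}:
        seqs = [s for a, s in p if a == account]
        assert seqs == sorted(seqs)
print("per-key order preserved in every partition")
```

So a bank does not need one partition per customer: it only needs the per-account events to share a key, and can size the partition count for throughput.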

  • @super-ulitka
    @super-ulitka 2 years ago

    I don't like it, for its free-form ACID interpretation. Consistency also means that a read after a write returns the same value; this is not guaranteed in the Kafka example, as there is a time gap in between. Same for atomicity: what if the cache got updated first, so the user makes an order looking at the new value, while the DB has not yet received it and the app charges based on the previous one? I would love to see how you'd like to use your Twitter feed with eventual consistency ;) Kafka is great software, but this keynote is misleading.

  • @exocom2
    @exocom2 2 years ago

    Why is it easier to have 3 topics and 3 consumers that eventually write to a DB, and that additionally have to check whether a transaction id has already been processed, compared to two pretty standard updates in the already existing DB? Maybe someone can enlighten me here; I would be very much interested in figuring this out... (I am talking about the -100/+100 debit example.)

    • @micahhutchens7942
      @micahhutchens7942 2 years ago

      Because you no longer need to support transactions, and there is no need for row locks, etc. It increases throughput, and it is not a one-size-fits-all solution; it's mostly for scalability purposes. Imagine you had a hot row in a database that is constantly being locked (and if the transaction spans multiple rows, it needs to hold locks on all of those rows, which only gets worse if the rows are on different partitions). That is a lot of overhead that can be avoided by making it asynchronous using a single-threaded stream processor. Not to mention that the other 2 consumers are able to continue working normally even if the third goes down.
      In the debit example, it is much more efficient to serialize the financial transaction via a single-threaded application than it is to support distributed transactions over a whole database. Also, it's pretty trivial to reject a message with an id lower than the current id (offsets in Kafka) in order to support deduplication.

  • @stoneshou
    @stoneshou 4 years ago

    Does a Kafka-based database exist yet?

    • @adnxn
      @adnxn 4 years ago

      Kinda: www.confluent.io/blog/intro-to-ksqldb-sql-database-streaming/

  • @johnboy14
    @johnboy14 3 years ago

    It can be used to build a database, but it is not a database. That's a lot of code we would all need to write.

  • @ArashBizhanzadeh
    @ArashBizhanzadeh 2 years ago

    The second example of atomicity didn't make any sense. Debit and credit should be simultaneous; otherwise people can double-spend.

  • @arariks
    @arariks 4 years ago +2

    Having worked on relational databases for 10 years, I'm shaking my head in disbelief at your serialization example. You know you are ultimately relying on the database for checking uniqueness, right? ..... The account transfer would take an unacceptable amount of work to make happen, and the Jane example is downright misleading.
    Kafka is good tech. Use it where it is needed: click-streams, user-action tracking and logging, distributing event information across a network; not to replace RDBMS processes.

  • @karthikrangaraju9421
    @karthikrangaraju9421 5 years ago

    If those two events, for different users claiming a username, end up in different partitions, different threads will consume them concurrently,
    and we are back to square one.
    If we only had one partition, it would work, because there would be only one consumer writing to the DB.
    Which is the same idea as serializable isolation: it literally forces writes to happen in a single thread.

    • @aljjxw
      @aljjxw 5 years ago +8

      But what Martin is saying is to use the username as the partition key.

  • @davidbenyaminovch6761
    @davidbenyaminovch6761 3 years ago +1

    What a bunch of marketing bullshit. It should have been titled "Is a Message Bus an Application Framework?", or: how fast can the notion of "smart endpoints / dumb pipes" go down the drain?