QCon London '23 - A New Era for Database Design with TigerBeetle

  • Published 21 Jul 2024
  • Pivotal moments in database durability, I/O, systems programming languages and testing techniques, and how they influenced our design decisions for TigerBeetle.
    This is the pre-recording of our talk, which was later given live at QCon London '23, in the Innovations in Data Engineering track hosted by Sid Anand... and a stone's throw from Westminster Abbey!
    The live QCon talks were in-person only this year, and were not recorded, but thankfully, we stuck to the script, so that what you see here is what you would have seen, if you had been there.
    The cover art is a special illustration by Joy Machs. We wanted to bring together the London skyline to showcase old and new design in the form of the historic London Bridge alongside the futuristic Shard. If you happen to be in London, take a walk across the Millennium Footbridge, and see if you can see Joy's vision as you look across the water.
    Thanks to Sid Anand for the special invitation. It was an honor to present TigerBeetle alongside DynamoDB, StarTree, Gunnar Morling, and our friends in the Animal Database Alliance: Redpanda and DuckDB!
    qconlondon.com/presentation/m...
  • Science & Technology

Comments • 23

  • @themichaelw • 3 months ago • +2

    18:00 that's the same Andres Freund who discovered the XZ backdoor. Neat.

  • @dannykopping • 1 year ago • +7

    Super talk! Very information-dense and clear, with a strong narrative.
    Also, so great to hear a South African accent in highly technical content on the big stage 😊

  • @luqmansen • 4 months ago • +1

    Great talk, thanks for sharing! ❤

  • @youtux2 • 10 months ago • +1

    Absolutely amazing.

  • @asssheeesh2 • 1 year ago • +4

    That was really great!

  • @LewisCampbellTech • 8 months ago • +2

    Every month or so I'll watch this talk while cooking. The first time, I didn't really understand Part I. This time around I got most of it. Crazy how the Linux kernel prioritised users yanking USB sticks out over database durability.

    • @jorandirkgreef • 8 months ago • +1

      Thanks Lewis, special to hear this, and I hope that the “durability” of the flavors in your cooking is all the better! ;)

  • @dwylhq874 • 1 month ago • +1

    This is one of the few channels I have *notifications on* for. 🔔
    TigerBeetle is _sick_ !! Your whole team is _awesome_ !! 😍
    So stoked to _finally_ be using this in a real project! 🎉
    Keep up the great work. 🥷

    • @jorandirkgreef • 1 month ago

      Thank you so much! You're sicker still! :) And we're also so stoked to hear that!

  • @timibolu • 5 months ago • +1

    Amazing. Really amazing

  • @rabingaire • 11 months ago • +2

    Wow what an amazing talk

  • @YuruCampSupermacy • 1 year ago • +2

    absolutely loved the talk.

  • @jonathanmarler5808 • 11 months ago • +2

    Great talk. I'm at 15:20 and have to comment. Even if you crash and restart to handle an fsync failure, that still doesn't address the problem, because another process could have called fsync and marked the pages as clean, meaning the database process would never see the fsync failure.

    • @jorandirkgreef • 11 months ago

      Hey Jonathan, thanks! Agreed, for sure. I left that out to save time, and because it's nuanced (a few kernel patches ameliorate this). Ultimately, Direct I/O is the blanket fix for all of these issues with buffered I/O. Awesome to see you here and glad you enjoyed the talk! Milan '24?! :)
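      The discipline being discussed can be sketched in a few lines of C. This is a toy illustration, not TigerBeetle's actual code: after the "fsyncgate" findings, a failed fsync may leave the affected pages marked clean, so retrying fsync can falsely report success; the only safe reaction for a database is to abort and recover from its log on restart.

      ```c
      /* Sketch (assumption: not TigerBeetle's code) of crash-on-fsync-failure:
       * never retry a failed fsync, because the kernel may have already
       * marked the dirty pages clean, so a retry could "succeed" without
       * the data ever reaching the disk. */
      #include <fcntl.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <unistd.h>

      static void durable_append(int fd, const void *data, size_t len) {
          if (write(fd, data, len) != (ssize_t)len) {
              perror("write");
              abort();
          }
          if (fsync(fd) != 0) {
              /* Do NOT retry fsync here: treat the write as lost. */
              perror("fsync");
              abort(); /* crash; recover from the write-ahead log on restart */
          }
      }

      int main(void) {
          int fd = open("wal.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
          if (fd < 0) { perror("open"); return 1; }
          durable_append(fd, "record\n", 7);
          close(fd);
          printf("appended durably\n");
          return 0;
      }
      ```

      (As the reply notes, even this is incomplete on older kernels, since another process may consume the error first; Direct I/O sidesteps the problem entirely.)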

  • @uncleyour3994 • 1 year ago • +2

    Really good stuff

  • @Peter-bg1ku • 1 month ago

    I never thought Redis AOF was this simple.

  • @tenthlegionstudios1343 • 1 year ago • +5

    Very good talk. It takes a lot of the previous deep dives I have watched and puts them all together. I am curious about the points made about the advantages of the single-threaded execution model, especially in the context of using the VOPR and having deterministic behavior.
    When you look at something like Redpanda, with a thread-per-core architecture using Seastar and a bunch of advanced Linux features, are design choices like that making it harder to test and to have some sense of deterministic bug reproduction? This is not a tradeoff I had ever considered before, and for a DB most concerned with strict serializability and no data loss, it must have greatly shaped the design. I am curious about the potential speedups at the cost of losing the deterministic nature of TigerBeetle, not to mention the cognitive load of a more complex codebase.

    • @tigerbeetledb • 1 year ago • +2

      Thanks, great to hear that! We are huge fans of Redpanda, and indeed RP and TB share a similar philosophy (direct async I/O, single binary, and of course, single thread per core). In fact, we did an interview on these things with Alex Gallego, CEO of Redpanda, last year: czcams.com/video/jC_803mW448/video.html
      With care, it's possible to design a system from the outset for concurrency, that can then run either single threaded or in parallel, or with varying degrees of parallelism determined by the operator at runtime, with the same deterministic result, even across the cluster as a whole. Dominik Tornow has a great post comparing concurrency with parallelism and determinism (the latter two are orthogonal, which is what makes this possible): dominik-tornow.medium.com/a-tale-of-two-spectrums-df6035f4f0e1
      For example, within TigerBeetle's own LSM-Forest storage engine, we are planning to have parts of the compaction process eventually run across threads, but with deterministic effects on the storage data file.
      For now, we're focusing on single-core performance, to see how far we can push that before we introduce CPU thread pools (separated by ring buffers) for things like sorting or cryptography. The motivation for this is Frank McSherry's paper, “Scalability! But at what COST?”, which is a great read! www.frankmcsherry.org/graph/scalability/cost/2015/01/15/COST.html
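      The idea that parallelism and determinism are orthogonal can be shown with a toy example (an illustration, not TigerBeetle code): if each task owns a fixed output slot and the final reduction happens in a fixed order, the result is bit-identical whether the work runs on one thread or four.

      ```c
      /* Toy sketch: deterministic result under varying parallelism.
       * Each task i always writes slot i, and the reduction always
       * sums slots in index order, so thread count cannot change the answer. */
      #include <pthread.h>
      #include <stdint.h>
      #include <stdio.h>

      #define TASKS 64
      static uint64_t slot[TASKS];

      typedef struct { int begin, end; } range_t;

      static void *worker(void *arg) {
          range_t *r = arg;
          for (int i = r->begin; i < r->end; i++)
              slot[i] = (uint64_t)i * i; /* task i fills slot i, always */
          return NULL;
      }

      static uint64_t run(int threads) {
          pthread_t tid[TASKS];
          range_t ranges[TASKS];
          int per = TASKS / threads;
          for (int t = 0; t < threads; t++) {
              ranges[t] = (range_t){ t * per, (t + 1) * per };
              pthread_create(&tid[t], NULL, worker, &ranges[t]);
          }
          for (int t = 0; t < threads; t++)
              pthread_join(tid[t], NULL);
          uint64_t sum = 0;
          for (int i = 0; i < TASKS; i++)
              sum += slot[i]; /* fixed reduction order */
          return sum;
      }

      int main(void) {
          printf("1 thread:  %llu\n", (unsigned long long)run(1));
          printf("4 threads: %llu\n", (unsigned long long)run(4));
          return 0;
      }
      ```

      Both runs print 85344 (the sum of i² for i = 0..63); the operator-chosen degree of parallelism never leaks into the result, which is what keeps deterministic simulation viable.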

    • @tenthlegionstudios1343 • 1 year ago • +1

      @tigerbeetledb These articles are gold. Thanks for the in-depth reply! Can't wait to see where this all goes.

  • @pervognsen_bitwise • 10 months ago • +1

    Thanks for the talk, Joran.
    Genuine question since I don't know and it's very surprising to me: Is there really no way to get buffered write syscall backpressure on Linux? The Windows NT kernel has a notoriously outdated (and slow) IO subsystem which has always provided mandatory disk write backpressure by tracking the number of outstanding dirty pages. So if disk block writes cannot keep up with the rate of dirty pages, the dirty page counter will reach the cap and will start applying backpressure by blocking write syscalls (and also blocking page-faulting writes to file-mapped pages, though the two cases differ in the implementation details).
    I'm assuming Linux's choice to not have backpressure must be based on fundamental differences in design philosophy, closely related to the situation with memory overcommit? Certainly the NT design here hurts bursty write throughput in cases where you want to write an amount that is large enough that it exceeds the dirty page counter limit but not so large that you're worried about building up a long-term disk backlog (a manually invoked batch-mode program like a linker would fall in this category). Or you're worried about accumulating more than a desired amount of queueing-induced latency that would kill the throughput of fsync-dependent applications; considering this point makes me think that you wouldn't want to rely on any fixed dirty page backpressure policy anyway, since you want to control the max queuing-induced latency.

  • @stevesteve8098 • 1 month ago

    Yes... I remember back in the '90s when Oracle tried this system of "Direct I/O". They blew lots of trumpets and announced it HAS to be better and faster because... "insert reasoning here".
    Well, you know what: it was complete bullshit, because they made lots of assumptions and did very little real testing.
    Because even when you THINK you are writing directly to the "disk", YOU ARE NOT.
    You are writing to a BLACK BOX. You have absolutely NO idea of HOW or WHAT is implemented in that black box.
    There may be a thousand buffer levels in that box, with all sorts of swings and roundabouts.
    So... no... you are NOT writing directly to disk. Such a basic lack of insight and depth of thought is a worry with this sort of "data" evangelism...

    • @jorandirkgreef • 1 month ago

      Thanks Steve, I think we're actually in agreement here. That's why TigerBeetle was designed with an explicit storage fault model, where we expect literally nothing of the "disk" (whether physical or virtualized). For example, we fully expect that I/O may be sent to the wrong sector or corrupted, and we test this to extreme lengths with the storage fault injection that we do.
      Again, we fully expect to be running in virtualized environments or across the network, or on firmware that doesn't fsync etc. and pretty much all of TigerBeetle was designed with this in mind.
      However, at the same time, to be clear, this talk is not so much about the "disk" as hardware as it is about the kernel page cache as software, and what the kernel page cache does in response to I/O errors (whether from a real disk or a virtual disk).
      We're really trying to shine a spotlight on the terrific work coming out of UW-Madison in this regard: www.usenix.org/system/files/atc20-rebello.pdf
      To summarize their findings: while Direct I/O is not sufficient on its own, it is still necessary. It's just one of many little things you need to get right, if you have an explicit storage fault model, and if you want to preserve as much durability as you can.
      At least
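      To make "necessary but not sufficient" concrete, here is a minimal sketch of what Direct I/O demands of the caller. Assumptions: a 4096-byte logical block size, and a filesystem that supports O_DIRECT (some, like tmpfs, reject it, so this demo falls back to buffered I/O rather than fail); the filename is arbitrary.

      ```c
      /* Sketch: O_DIRECT requires the user buffer, file offset, and I/O
       * size to all be aligned to the logical block size. O_DIRECT alone
       * does not guarantee durability, hence O_DSYNC and the final fsync
       * for file metadata. */
      #define _GNU_SOURCE
      #include <fcntl.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>
      #include <unistd.h>

      int main(void) {
          const size_t block = 4096; /* assumed logical block size */
          void *buf;
          if (posix_memalign(&buf, block, block) != 0) { /* aligned buffer */
              perror("posix_memalign");
              return 1;
          }
          memset(buf, 'x', block);

          int fd = open("demo.dat",
                        O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT | O_DSYNC, 0644);
          if (fd < 0) /* fall back for filesystems without O_DIRECT support */
              fd = open("demo.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
          if (fd < 0) { perror("open"); return 1; }

          ssize_t n = write(fd, buf, block); /* offset 0 and size both aligned */
          if (n != (ssize_t)block) { perror("write"); return 1; }
          if (fsync(fd) != 0) { perror("fsync"); return 1; }
          close(fd);
          free(buf);
          printf("wrote %zd aligned bytes\n", n);
          return 0;
      }
      ```

      Even with all of this in place, the UW-Madison paper cited above shows the data can still be lost or corrupted below the syscall boundary, which is why an explicit storage fault model is needed on top.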