Write-Ahead Logging Challenges and Append only B-Trees

Sdílet
Vložit
  • čas přidán 23. 07. 2024
  • In this video I talk about the challenges with Write-Ahead Logging and how Append only B-Trees can be used as an alternative.
    00:00 Write-Ahead Logging Challenges and Append only B-Trees
    01:02 WAL - A quick recap
    02:43 Challenges with Write-Ahead Logging
    03:01 Crash Recovery Time
    06:15 Optimise Storage Space
    08:09 WAL Complexity
    10:00 Append Only B-Trees (No WAL)
    15:22 Optimising Storage space for Append only B-Trees
    Follow me on Linkedin and Twitter: / kaivalyaapte and / thegeeknarrator
    If you like this episode, please hit the like button and share it with your network.
    Also please subscribe if you haven't yet.
    Database internals series: • Write-ahead-logging
    Popular playlists:
    Realtime streaming systems: • Realtime Streaming Sys...
    Software Engineering: • Software Engineering
    Distributed systems and databases: • Distributed Systems an...
    Modern databases: • Modern Databases
    Stay Curios! Keep Learning!
    Cheers,
    The GeekNarrator

Komentáře • 22

  • @ajinkyajadhav5058
    @ajinkyajadhav5058 Před 5 měsíci +1

    You explained this very simply thank you and keep it up!!

  • @AtharvaRao0104
    @AtharvaRao0104 Před 8 měsíci +2

    I think including the in-memory page cache in the diagram of checkpoint process will help. This checkpoint process controls both WAL and page cache to ensure that all changes are persisted.
    1. When DB crashes, the in-memory page cache is lost. All in-memory updates which have not yet been flushed to Data files on disk are lost. To make sure we can recover these changes , we write the operations (read/write) to WAL append only log file on disk while doing the in-memory page cache updates..
    2. We can replay the WAL contents to restore the in-memory page cache and hence recover the state after crash.
    3. We should not be replaying all the operations in WAL as it can take a long time when DB restarts.
    When the page cache is flushed to disk, it makes the changes durable. Now we can discard the WAL log records associated with the operations that were applied to those cached pages which were flushed.

  • @vinothraoful
    @vinothraoful Před 9 měsíci +3

    Thanks for this awesome series. it would be more helpful if we can get an example like on what scenarios we use what kind of DBs? When to choose WAL based db? when to choose Append only B-Tree? Though we understand the internals I feel use cases give more clarity.

    • @TheGeekNarrator
      @TheGeekNarrator  Před 9 měsíci +3

      Yes - it is in the plan. Upcoming in this series:
      - B-Trees - B*, B-link, B-w and so on.
      - LSM trees
      - Improving LSM
      - Columnar vs Row orientation
      - Isolation levels
      - Locking and MVCC
      - Indexing
      - Consensus models
      And much more.

  • @nareshgb1
    @nareshgb1 Před 7 měsíci +1

    WOW.....this is an absolutely brilliant explanation for the append only B-tree. I was starting to get into couchbase (which I believe uses this) and my first search on youtube showed this video. Though I had read about append only b-tree before, this time I think I am not going to forget :) Awesome, keep it up Sir.

  • @varunvats32
    @varunvats32 Před 8 měsíci +1

    Correct me if I'm wrong,
    1. When a update comes, the relevant pages would be loaded up in memory.
    2. Then B-Tree processing => copy of pages, creating new links, or a new leaf node would all be done in memory only.
    3. And then the fsync happens, which will be sequential I/O on the disk.

    • @TheGeekNarrator
      @TheGeekNarrator  Před 8 měsíci

      Yes, that sounds right except that the copy is started from the leaf node which is the actual data. The goal is to keep as many hot pages as possible in memory, but implementation varies in different databases.
      IIRC CouchDB, MongoDB also uses copy-on-write with Garbage collection (but please verify). But there are some databases that do not use any cache, for example LMDB doesn’t used any cache but instead uses memory mapping which uses OS cache only. I will talk about memory mapping and how it is used in databases in a separate video.

  • @rajeshkumar1149
    @rajeshkumar1149 Před 9 měsíci

    Great video. Is the alternative to WAL described in th video a LSM tree used by rocksdb? If not how is it different from LSM?

    • @TheGeekNarrator
      @TheGeekNarrator  Před 9 měsíci +2

      No, its not an LSM Tree. Its an Append only B-Tree. LSM (as the name suggests) is a log based structure, but what I described in the video is done on a B-Tree. In LSM (which I am going to cover in depth in this series), you simply append in the log, but in the append only B-Tree, you start from the leaf node and you copy the exact structure of the B-Tree that exists on disk. So they are very different. The only thing common between them is append-only property.
      This is copy-on-write mechanism on B-Trees, where the entire idea is to avoid random writes and have sequential buffered IO which is far more efficient. I hope it helps.

  • @ksramchandani
    @ksramchandani Před 3 měsíci +1

    What is the corresponding in-memory update for the log that is written?
    Log is also written to disk or is it in memory and then flushed to disk?
    What happens when log flushing to disk fails?

    • @TheGeekNarrator
      @TheGeekNarrator  Před 3 měsíci +1

      A Log is flushed to disk, yes. Databases like Postgres also support fsync mode which waits for the log be flushed to the disk, which adds a little performance impact (but nothing comparable to updating data pages and indexes directly). Not using fsync mode OR in simple terms, not waiting for the OS to flush the log to the disk may result in lost committed transactions.
      Typically a database dealing with concurrent transactions can write log entries to disk with a single fsync which is very efficient. So choose your tradeoff, but the thumb rule I use is, enable fsync by default to ensure maximum reliability and tweak if performance ever become a problem. Does that answer your question?

    • @ksramchandani
      @ksramchandani Před 3 měsíci

      @@TheGeekNarrator Yes. Thank you! Wasn't aware of the DB option to wait for log flushing to succeed.

  • @balajisrinivasan6618
    @balajisrinivasan6618 Před 8 měsíci +1

    Does append-only B-Tree follow sequential IO on disk? In the multi version B-tree mentioned, looks like we are updating a page that might in turn get into a random IO on the disk block right?
    My understanding is : Database deals with pages, OS deals with files and Disk deals with blocks. Our goal is to have a sequential IO on disk (the end layer). WAL achieves it through immutability until the disk level. There is no updates on any level, only inserts and deletes (during compaction). Curious to understand how it works for append-only B-Tree. Thanks !

    • @TheGeekNarrator
      @TheGeekNarrator  Před 8 měsíci +2

      Thats a great question. It is still copy-on-write with LMDB. To avoid random IO it keeps these two versions old and new close to each other in a file. Page 0 is the old version Page 1 is the new version. And it keeps this ping pong going with each update. These two pages (0 and 1) are in the same file and are used for the two roots that it uses for the two versions of data. Since the location is predetermined Random IO isn’t involved and it is pretty quick in moving between versions. (There will of course be some head movement but nothing close to random IO).
      Imagine something like this:
      +-----------+-----------+-----------+-----------+-----+
      | Page 0 | Page 1 | Page 2 | Page 3 | ... |
      +-----------+-----------+-----------+-----------+-----+
      Page 0 and Page 1 are used for the two roots. Predetermined IO vs random IO.
      Does that make sense? Sorry I should have added more details, but I only realise it now 😀. I will plan another video on this topic.

    • @balajisrinivasan6618
      @balajisrinivasan6618 Před 8 měsíci

      @@TheGeekNarrator Got it. Does that mean the head needle on the magnetic disk has a capability to move backward as well? thanks for the quick reply, much appreciated !

    • @TheGeekNarrator
      @TheGeekNarrator  Před 8 měsíci +3

      The head can move across the radius and IIUC these pages would be stored such that the head can access them across the radius.
      I will try to find if there is more information here, but good questions. keep them coming.

  • @harshitpandey6664
    @harshitpandey6664 Před 7 měsíci +1

    Nice insights. Any durable database which is based on append only B-Trees (and not WAL) ?

    • @TheGeekNarrator
      @TheGeekNarrator  Před 7 měsíci

      Thanks- IIUC CouchDb also uses Append only Btrees.

    • @nareshgb1
      @nareshgb1 Před 7 měsíci

      @@TheGeekNarrator does it not use a WAL - I assume CouchDB is the same as couchbase?
      BTW - another advantage of WAL is that the IO for checkpointing can be done in background mode - so commits can be "instantaneous," as only a single write to the WAL is needed. Thanks again for the great video.

  • @KumarSahil78
    @KumarSahil78 Před 7 měsíci +1

    The background music is little loud, please try to lower it or maybe remove all together.