Implementing Hardware-Friendly Databases (with DuckDB co-creator, Hannes Mühleisen)

  • Added on 7 July 2024
  • SQLite could do with a little competition, so when I invited the co-creator of DuckDB in to talk, I thought we'd be discussing the perils of trying to build a new in-process database engine. I quickly realised things went much deeper than just a tech refresh.
    Hannes Mühleisen joins me this week to blend his academic credentials as a database researcher with his vehement need to make that research practical. And so we dive into what modern database literature has to say on making queries faster, more parallelizable, and closer to the metal, and how it all comes together in a user-friendly package that’s found its way into my day-to-day workload, and might well help out yours.
    If you’re curious about the gory details of database queries, how they can take advantage of modern hardware, or how all that research actually turns into a useful tool, Hannes has some great answers.
    --
    DuckDB: duckdb.org/
    Database Systems Book: infolab.stanford.edu/~ullman/d...
    Kris’ first computer: en.wikipedia.org/wiki/File:ZX...
    Volcano Query Evaluation System [pdf]: paperhub.s3.amazonaws.com/dac...
    Morsel Query Engine [pdf]: cs.brown.edu/~kayhan/papers/m...
    Unnesting Arbitrary Queries [pdf]: cs.emis.de/LNI/Proceedings/Pr...
    Papers Hannes' team have published: duckdb.org/why_duckdb#peer-re...
    DuckDB on Mastodon: mastodon.social/@duckdb
    Kris on Twitter: / krisajenkins
    Kris on LinkedIn: / krisjenkins
    Kris on Mastodon: mastodon.social/@krisajenkins
    --
    #softwaredevelopment #podcast #programming #database #duckdb #sql #sqlite
    0:00 Intro
    2:15 Podcast
    8:15 From Professor to Implementor
    15:28 Deciding What To Build
    22:43 How does DuckDB work?
    42:50 Parallelization
    1:10:18 A real world use-case
    1:18:59 Outro

Comments • 51

  • @fredguth1315
    @fredguth1315 5 months ago +44

    This is becoming one of my favorite podcasts. Great work!

  • @DreamsAPI
    @DreamsAPI 5 months ago +12

    The professor has a learning voice and the host shows so much interest that I thoroughly enjoyed this discussion and the sharing of a fun subject that I did not know was possible.

  • @datenschauer
    @datenschauer 24 days ago +1

    DuckDB changed my (Data-Science)-Life!

  • @joshuamorris9597
    @joshuamorris9597 5 months ago +4

    I can't comment substantively because I'm 9 minutes into this video, but I love your interviews--and I'm already learning from this one. Hannes is brilliant. Thanks for making this!

  • @taouinche
    @taouinche 4 months ago +4

    Thank you for this fascinating interview. The part about parallel processing reminded me of the 90s, when I was lucky enough to take part in a few benchmarks with Informix's first parallel database engine. Back then, a database of a few hundred gigabytes was a VLDB, an address book these days.

  • @DreamsAPI
    @DreamsAPI 5 months ago +3

    Oh my God, he just touched on 2 of these issues. I am glad there are people who are trying to make it better so I can just worry about connecting, creating tables and columns, performing CRUD, and some other minor issues. I never knew there was a universe at work to enable me to save and retrieve information.

  • @Mik1604
    @Mik1604 5 months ago +3

    We have reached peak geekdom with the SQL license plate. Great stuff as always!

  • @steveoc64
    @steveoc64 5 months ago +4

    Been using DuckDB for some utility things, and it's been amazing. The ability to treat non-database things as just another DB table ends up making light work of otherwise difficult tasks. (By non-DB things, I mean CSV files, Excel files, remote cloud-based buckets, etc.)
    As a bonus, it has a great Zig library as well, which adds even more value.

  • @pietropeterlongo2695
    @pietropeterlongo2695 4 months ago +2

    This is really a great interview and kudos Kris for being able to keep up with the volcanic (or should I say morseful?) Hannes!

  • @janholland2224
    @janholland2224 4 months ago +1

    Thanks for featuring Hannes. Great to hear some common sense. Delighted he has chosen to work in my tiny country. I am not a developer but certainly a DBMS guy and business/information analyst, so I am going to have a look at his Duckling. The last database that impressed me (conceptually) was Illustra, the universal DBMS, and another Stonebraker project. Cheers and thx, J@n!

  • @nexovec
    @nexovec 5 months ago +2

    Lately I've been watching a lot of your interviews and this is probably my current favorite podcast.

  • @ReynirOrnBachmannGudmundsson
    @ReynirOrnBachmannGudmundsson 1 month ago +1

    This is wicked cool, very educational and entertaining and it gives so much context to SQL queries, most likely I will think about this podcast every time I write some SQL hereafter.

  • @JT-mr3db
    @JT-mr3db 5 months ago +2

    Your interviewing skills are astonishingly good. Wonderful channel!

  • @SzTz100
    @SzTz100 4 months ago +1

    This is so interesting, I've always been interested in database design, ever since working with Sql Server and KDB+ a decade ago.

  • @AK-vx4dy
    @AK-vx4dy 5 months ago +1

    I really like your interviews 😁
    And this guest is pure gold for me. I have never had to write a database engine, but I've been immersed in SQL for 20+ years, and before that I fetched data programmatically from big flat databases, so optimization is near to my database soul 😅
    This time I actually understand and know what the guest is talking about.
    I also very much appreciate the guest's opinions about academia, funding, etc.

  • @sillybuttons925
    @sillybuttons925 4 months ago +1

    This was such an entertaining interview. "A staff of nurses" love it.

  • @jailop7013
    @jailop7013 4 months ago

    I start watching your interviews just to explore what is going on, but then I can't stop watching until the end of the program, because everything is so interesting and provocative. Thanks to you and your guests.

  • @jaedavas3050
    @jaedavas3050 5 months ago +1

    What perfect timing! I just heard about DuckDB and wanted to learn more. Thanks for another great interview.

  • @Debrugger
    @Debrugger 4 months ago

    Best episode yet!

  • @geraldodev
    @geraldodev 5 months ago +1

    Thanks a lot to both Kris and Hannes.

  • @WolfoxBR
    @WolfoxBR 5 months ago +2

    Super interesting and fun episode. Thank you both!

  • @stanislavtrifan96
    @stanislavtrifan96 3 months ago

    Amazing talk, unexpectedly so for me. Kudos to both host and guest for organizing it!

  • @LtdJorge
    @LtdJorge 2 months ago

    The guys at CWI are insane. Especially their Database division: the best papers for pushing the boundaries of performance in DBs most often come from them. MonetDB is another interesting (and much older) project, normally at the cutting edge (at least in terms of academics). The original vectorized engine paper Hannes mentions is the one from MonetDB/X100.
    Also, fun fact, Python was born at CWI, too.

  • @DarenC
    @DarenC 4 months ago

    This was fascinating! I followed some of it, but some parts definitely went over my head. I don't have any big data to process, but I'll find some reason to give it a whirl. I enjoyed it all the more because it made me think "if Jürgen Klopp wrote database engines" ♥🥰

  • @AlesNajmann
    @AlesNajmann 3 months ago

    What a wonderful guest! ❤

  • @VolodymyrPavlyshyn
    @VolodymyrPavlyshyn 4 months ago

    It is an amazing talk. I'm keen to see an episode about CozoDB or any Datalog DBs.

  • @v0lke
    @v0lke 5 months ago

    That was just amazing content! Thank you for giving us access to such incredibly valuable information in such an approachable way. Love it ❤

  • @TheEVEInspiration
    @TheEVEInspiration 5 months ago

    Another drawback of parallelizing by re-partitioning streams instead of using parallel operators is that a lot more memory is claimed, just to avoid spillovers. And spillovers really hurt performance.

  • @aleclippe6213
    @aleclippe6213 4 months ago

    10/10 podcast

  • @abc_cba
    @abc_cba 10 days ago

    Why is there no from-scratch-to-advanced tutorial on DuckDB anywhere?

  • @mikemcculley
    @mikemcculley 4 months ago +1

    OMG and you provide links to the papers in the video description! Thank you! This is an awesome podcast.
    Also, here's a 30-or-so-minute presentation from the Mark mentioned by Hannes, on unnesting arbitrary queries: czcams.com/video/ajpg_pMX620/video.htmlsi=YGfKwes6O1J-slvn

  • @nexovec
    @nexovec 5 months ago

    21:10 I didn't believe that was actually a thing. Maybe you should get someone who can talk to us about error correction in databases (or in other software, like space computers, yaay, and more interestingly, at the hardware level too), but I don't know who that might even be. 😆 👍

  • @karlkeller2662
    @karlkeller2662 1 month ago

    BIOS password resonated with me
    Bought the laptop I'm writing on used from eBay and found the BIOS password locked. It's a more modern laptop with a Security chip (Lenovo Yoga 20C0), so it wasn't as simple as taking the CMOS battery out ....but, I found a solution after many months of not giving up!
    (it involved shorting two pins on the BIOS chip)

  • @TheEVEInspiration
    @TheEVEInspiration 5 months ago +1

    I hope the database can be used in client-server mode and has backup/restore functionality.
    Else its practicality in the field will be quite limited.
    Even a crude client/server with just a few broad rights (like read-only) and limited number of users would go a long way.

    • @LtdJorge
      @LtdJorge 2 months ago +1

      The entire point of DuckDB is that it is not a client-server model. It is an in-process DBMS like SQLite, with a CLI tool that embeds the DBMS and lets you operate on raw CSV files and the like, instead of having to install some Postgres or ClickHouse, create your tables, do the import, etc. Also, since there’s no client/server, there are no users or permissions to set up; you can use the data right away (similar to what one would do with Pandas/Polars).

    • @TheEVEInspiration
      @TheEVEInspiration 2 months ago

      @@LtdJorge IMO, you just described its crippling limitations for real applications, not its strengths.
      There are very few use-cases other than taking a first look at some data with the CLI tool as far as I can see from your reply.
      Actual applications for automation need more than that and cannot accept data of unknown structures to work with meaningfully. And also require support for other processes to access the data or the output after processing of said data.
      So as an explorer tool that uses SQL to open CSV files etc., sure. It will be better at that than the typical RDBMSs I've seen, but once that is done, one still needs to use an RDBMS for the real deal.
      Can't call it a database then, call it a data-explorer tool.

    • @TapetBart
      @TapetBart 28 days ago

      @@TheEVEInspiration It has a lot of use cases, actually. It can be used for ingesting and cleaning data (DuckDB can export to Parquet files or whatever format you want). You can put DuckDB in the browser and then have access control on the files it will query (say you have a Parquet table in blob storage that you want to share with people in your company: you give users read access to those Parquet files and let them query them using DuckDB. I am currently experimenting with this at work, and it seems very promising).
      You can use it in conjunction with Polars or Pandas (or Spark, with a bit of a workaround).

  • @VladPalacios
    @VladPalacios 4 months ago

    I wonder how much of a problem a Flipped Bit could be.... if it is a DB for common folk then ECC memory is not yet that ubiquitous

  • @radicalbyte
    @radicalbyte 3 months ago +1

    I don't agree with what Hannes is saying about "no-one thinking" about making databases easy. Microsoft Access was, at least in the earlier versions, exactly that. A fantastic little easy-to-use database. It was easy to get data in. Easy to get data out. You didn't need to install any complex servers. Everything was in a file. It also made data creation easy because it combined the database with a fully featured programming language, form designer/manager and report designer/manager.
    In some ways SQLite has replaced the data side of that (it's also low-overhead, single-file based and super simple), but not in all ways.

  • @kindoblue
    @kindoblue 3 months ago

    Exceptional people who perhaps don't realize how hard it is to do what they have done. Not that I'm looking for excuses for myself, eh 😉

  • @Jankoekepannekoek
    @Jankoekepannekoek 4 months ago

    What do you mean setting up a database is inconvenient? How old is SQLite now?

  • @budiardjo6610
    @budiardjo6610 5 months ago

    This person and the CEO of TigerBeetle have the same vibes.

  • @AK-vx4dy
    @AK-vx4dy 5 months ago

    Maybe Google has smart people, but having academics attack these problems has some merit, because Google may not always want to share its findings, especially in detail.

  • @anthonvanderneut
    @anthonvanderneut 4 months ago

    Interesting interview (as most of the time). Being into databases, maybe you want to try and do an interview with Howard Chu on Lightning Database (en.wikipedia.org/wiki/Lightning_Memory-Mapped_Database) underlying OpenLDAP (which had a vastly different starting point/problem to solve/constraints than DuckDB).
    BTW not all German license plates start with an S, those are plates issued by the Stuttgart area. Mine starts with BS-

  • @mlliarm
    @mlliarm 5 months ago +2

    What an interesting interview! Thanks again, Kris! My eyes are on DuckDB now :)

    • @DeveloperVoices
      @DeveloperVoices 5 months ago

      You're welcome. I hope you have as good an experience with it as I have. :-)