Using the {arrow} and {duckdb} packages to wrangle medical datasets that are Larger than RAM

  • Added July 16, 2024
  • From R/Medicine Conference 2022
    Peter D.R. Higgins, MD, Ph.D., MSc, Director of the Inflammatory Bowel Disease (IBD) Program at the University of Michigan.
    Deck: speakerdeck.com/higgi13425/bi...
    Sections
    0:00 Introduction
    0:40 Starting point
    1:09 The motivating problem
    2:10 The data
    3:08 Options
    4:25 Lots to like about {data.table}
    5:23 Data on disk vs data in RAM
    6:37 How to wrangle bigger-than-RAM data in R?
    8:15 Speed-wrangling
    9:42 What about the bigger-than-RAM problem?
    10:19 Let’s try it out
    11:35 What if data are still bigger-than-RAM?
    15:42 Back to the question…
    16:19 There’s always that (more than) one guy
    16:43 Take home points - speed
    17:15 Take home points - bigger-than-RAM data
    18:12 Closing
    More Resources
    Main Site: www.r-consortium.org/
    News: www.r-consortium.org/news
    Blog: www.r-consortium.org/news/blog
    Join: www.r-consortium.org/about/join
    Twitter: / rconsortium
    LinkedIn: / r-consortium
  • Science & Technology

Comments • 14

  • @tmuffly1 3 months ago +1

    This talk blew my mind. Thank you very much!

  • @tomfenn4 1 year ago +6

    Really useful presentation, and timely for me. Personally I find data.table statements are greatly improved with just a little whitespace.

  • @tdawry 2 months ago

    A neat question to answer.
    I'm using the duckplyr library and it's nice to not have to think about anything. It does make a strong argument for having a fast hard drive (an SSD is an order of magnitude faster than a traditional HDD, an M.2 is an order of magnitude faster than that, and modern NVMe drives are even faster).

  • @multitaskprueba1 2 months ago

    You are a genius! Fantastic video! Thanks!

  • @musicspinner 1 year ago +1

    Masterful deployment of the "Kobayashi Maru" reference. 🖖

  • @VictorOrdu 1 year ago +2

    Wow, thank you for this illuminating presentation.

  • @gueyenono 1 year ago +2

    Great presentation.

  • @higgi13425 1 year ago +3

    For further learning, here are the links from the next-to-last slide:
    Arrow
    cheatsheet: raw.githubusercontent.com/rstudio/cheatsheets/master/arrow.pdf
    video intro: czcams.com/video/O42LUmJZPx0/video.html
    full workshop from useR!: arrow-user2022.netlify.app
    DuckDB
    website: duckdb.org
    R package: cran.r-project.org/web/packages/duckdb/index.html
    data.table
    website: rdatatable.gitlab.io/data.table
    dtplyr (a data.table translator): dtplyr.tidyverse.org

  • @matthewson8917 1 year ago

    Perfectly summarizes my big data journey. Really good!

  • @JohnoScott 1 year ago

    Great talk. Concise and to the point.

  • @porlando12 1 year ago

    Excellent presentation!

  • @torbjornstorli2880 6 months ago

    Loved your presentation. Well done, Sir! 😊

  • @ZachRenwickData 1 year ago

    Great video and interesting analysis use case!

  • @arunabhbarua1924 6 days ago

    How about just using duckdb and SQL?