Using the {arrow} and {duckdb} packages to wrangle medical datasets that are Larger than RAM
- Added 16. 07. 2024
- From R/Medicine Conference 2022
Peter D.R. Higgins, MD, Ph.D., MSc, Director of Inflammatory Bowel Disease (IBD) Program at the University of Michigan.
Deck: speakerdeck.com/higgi13425/bi...
Sections
0:00 Introduction
0:40 Starting point
1:09 The motivating problem
2:10 The data
3:08 Options
4:25 Lots to like about {data.table}
5:23 Data on disk vs data in ram
6:37 How to wrangle bigger-than-RAM data in R?
8:15 Speed-wrangling
9:42 What about the bigger-than-RAM problem?
10:19 Let’s try it out
11:35 What if data are still bigger-than-RAM?
15:42 Back to the question…
16:19 There’s always that (more than) one guy
16:43 Take home points - speed
17:15 Take home points - bigger-than-RAM data
18:12 Closing
More Resources
Main Site: www.r-consortium.org/
News: www.r-consortium.org/news
Blog: www.r-consortium.org/news/blog
Join: www.r-consortium.org/about/join
Twitter: / rconsortium
LinkedIn: / r-consortium - Science & Technology
This talk blew my mind. Thank you very much!
Really useful presentation, and timely for me. Personally I find data.table statements are greatly improved with just a little whitespace.
A neat question to answer.
I'm using the duckplyr library and it's nice to not have to think about anything. It does make a strong argument for having a fast hard drive (an SSD is an order of magnitude faster than a traditional HDD, an M.2 is an order of magnitude faster than that, and modern NVMe drives are even faster).
You are a genius! Fantastic video! Thanks!
Masterful deployment of the "Kobayashi Maru" reference. 🖖
Wow, thank you for this illuminating presentation.
Great presentation.
For further learning, here are the links from the next to last slide:
Arrow
cheatsheet: raw.githubusercontent.com/rstudio/cheatsheets/master/arrow.pdf
video intro: czcams.com/video/O42LUmJZPx0/video.html
full workshop from useR!: arrow-user2022.netlify.app
DuckDB
website: duckdb.org
R package: cran.r-project.org/web/packages/duckdb/index.html
data.table
website: rdatatable.gitlab.io/data.table
dtplyr (a data.table translator): dtplyr.tidyverse.org
Perfectly summarizes my big data journey. Really good!
Great talk. Concise and to the point.
Excellent presentation!
Loved your presentation. Well done Sir!😊
great video and interesting analysis use case!
How about just using duckdb and SQL?