Dask
Dask Demo Day 2024-03-21
Today's Talks:
00:00 Intro
00:38 Dask DataFrame is Fast - @fjetter
14:15 Large scale population of vector databases for RAG - @mrocklin
26:36 Easy GPU access with Coiled - @jrbourbeau
Next Demo Day is April 18th, sign up here: github.com/dask/community/issues/307
---
What is Dask Demo Day?
Each month we solicit 5-10 minute demos that show off ongoing and/or lesser-known work. Meetings will be recorded and advertised on social. Hopefully, this helps educate folks on some of the great work people do.
If you're interested, please reply to this issue with a brief (a couple sentences) description. If you have colleagues who you think should be interested, please let them know. If you would like to present but not this month, check out the dates and sign up for an upcoming one:
coiled.io/dask-demo-days
----
What is Dask?
Dask is a free and open-source library for parallel computing in Python. Dask is a community project maintained by developers and organizations.
Share your feedback on this video in the comments and let us know:
- Did you find this video helpful?
- Have you used Dask before?
Learn more at dask.org
Views: 740

Video

Dask Demo Day - 2024-02-15
578 views, 4 months ago
Today's Talks: 00:00 Intro 01:18 One trillion row challenge - @mrocklin 06:20 Deploying Dask on Databricks - @jacobtomlinson 15:09 Deploying Prefect workflows on the cloud with Coiled - @jrbourbeau 29:22 Scaling embedding pipelines (LlamaIndex Dask) - @quasiben 46:45 Using AWS Cost Explorer to see the cost of public IPv4 addresses - @ntabris Next Demo Day is March 21st, sign up here: github.com...
Dask Demo Day - 2024-01-18
646 views, 5 months ago
Today's Talks: 00:00 Intro 00:47 Apache Beam DaskRunner - @cisaacstern 15:45 Array expressions - @mrocklin 26:27 One billion row challenge - @scharlottej13 What is Dask Demo Day? Each month we solicit 5-10 minute demos that show off ongoing and/or lesser-known work. Meetings will be recorded and advertised on social. Hopefully, this helps educate folks on some of the great work people do. If yo...
Dask Demo Day - 2023.10.19
777 views, 8 months ago
October 19th, 2023 Today's Talks: 00:00 Intro 00:31 @jacobtomlinson - "Who uses RAPIDS?" 10:51 @mrocklin - TPC-H benchmarks for Spark, Dask, Polars, DuckDB 24:27 @jhamman Dask - Arraylake integration 37:24 @mrchtr - Fondant We'd like to solicit 5-10 minute demos that show off ongoing or lesser-known work. I hope to have 3-5 of these during the meeting. Meetings will be recorded and advertised o...
Dask Demo Day - 2023-09-21
394 views, 9 months ago
Today's Talks 00:00 Intro 00:21 @fjetter - Performance with P2P array rechunking 14:04 @phofl - Dask expressions 27:07 @scharlottej13 @dcherian - Processing a quarter petabyte geospatial dataset in the cloud We'd like to solicit 5-10 minute demos that show off ongoing or lesser-known work. I hope to have 3-5 of these during the meeting. Meetings will be recorded and advertised on social. Hopefu...
Dask Demo Day - 2023-08-17
355 views, 10 months ago
Last Dask Demo Day of the summer! Today's Talks: @fjetter - Memray Integration for Memory Management @mrocklin - Some new updates and news @jrbourbeau - Analyzing Sea Levels in the Cloud with Earthaccess and Coiled We'd like to solicit 5-10 minute demos that show off ongoing or lesser-known work. I hope to have 3-5 of these during the meeting. Meetings will be recorded and advertised on social....
How to Install Dask
856 views, 10 months ago
Learn how to install Dask and the Dask JupyterLab extension with either conda or pip. This video goes through how to set up a clean working environment with Dask. 00:00 Introduction 00:51 Pip install Dask 02:21 Create LocalCluster 03:27 Use Dashboard in JupyterLab
Dask Demo Day - 2023-07-20
549 views, 11 months ago
Today's talks @hendrikmakait - Shuffle resilience @Matt711 - Dask-Kubernetes update @GueroudjiAmal - External tasks in Dask distributed (github.com/GueroudjiAmal/distributed) @skrawcz Dask - Hamilton integration We'd like to solicit 5-10 minute demos that show off ongoing or lesser-known work. I hope to have 3-5 of these during the meeting. Meetings will be recorded and advertised on social. Ho...
Dask Demo Day - 2023-06-15
517 views, 1 year ago
Today's Talks dask-geopandas demo by @martinfleis Fine performance dask metrics and spans @crusaderky (10-15 min) GIL monitoring on dask @milesgranger We'd like to solicit 5-10 minute demos that show off ongoing or lesser-known work. I hope to have 3-5 of these during the meeting. Meetings will be recorded and advertised on social. Hopefully, this helps educate folks on some of the great work p...
Dask Demo Day 2023-05-18
443 views, 1 year ago
These are 5-10 minute demos that show off ongoing or lesser-known work. We hope to have 3-5 of these during the meeting. Meetings will be recorded and advertised on social. Hopefully, this helps to educate folks on some of the great work people are up to. Meetings are 3rd Thursday of every month at 11am EDT on zoom, Zoom link: us06web.zoom.us/j/89383035703?pwd=WkRJSzNnRTh4T2R1ZjJuVVdJWlMxQT09 W...
Dask Demo Day 2023-04-20
483 views, 1 year ago
Talks: Lindsey Gray - dask-awkward and dask-histogram for high energy physics analysis Amine Diro - daskqueue : a dask-based distributed task queue James Bourbeau - Pyarrow strings in Dask DataFrames Jacob Tomlinson - Launching a Jupyter/Dask cluster on NVIDIA Base Command Platform Want to present in one of the upcoming Dask Demo Days? Sign up here: github.com/dask/community/issues/307 Key Mome...
Dask Demo Day - 2023-03-16
443 views, 1 year ago
Dask Demo Days Talks: Analyzing Terabytes of Ocean Simulation model output with Xarray, xgcm and xhistogram - Tom Nicholas P2P shuffling - Hendrik Makait Scaling weather radar data analysis with Dask - Max Grover Automatic package synchronization in Coiled Dask Clusters - David Chudzicki Graph Neural Networks training with Dask - Vibhu Jawa Want to present at one of the upcoming Dask Demo Days?...
Dask Demo Day - 2023-02-16
482 views, 1 year ago
Monthly Dask Demo Day: February 2023 Talks: 00:00 Intro 00:28 New Dask integration in Flyte - Bernhard Stadlbauer 11:37 Parallelizing FTP downloads from a janky government server - Paul Hobson 22:45 Configurable Dataframe backends - Rick Zamora 34:36 Parallelize HPO of XGBoost with Optuna and Dask (multi-cluster) - Guido Imperiale 43:20 Accelerated Jaccard similarity using RAPIDS and Dask - Jiw...
Dask Demo Day - 2022-11-16
1K views, 1 year ago
Monthly demo day for Dask for November 2022 Github Issue: github.com/dask/community/issues/286 Talks: 00:00 Intro 03:05 2,000,000,000 lightning flashes - @ktyle 14:44 Dask CLI - @douglasdavis 21:44 Optuna - @jrbourbeau 32:00 Community Interlude - @mrocklin 34:02 Dask Awkward - @douglasdavis 46:02 Dask PySpy - @gjoseph92 01:03:30 Closing Follow us on twitter @dask_dev or sign up for the newslett...
Dask Demo Day - 2022-10-27
1.5K views, 1 year ago
Dask Demo Days - October 2022. Five quick talks using and developing Dask. Talks: 00:00 Intro 01:43 Scraping arXiv to determine Matplotlib popularity - Matthew Rocklin 08:36 Reducing memory use with task queuing - Florian Jetter 20:54 Kubernetes Operator and KubeFlow - Jacob Tomlinson 33:23 Prometheus - Nat Tabris 42:46 Apache Beam on Dask - Alex Merose 54:52 Conclusion github.com/dask/communit...
Dask in Production | How Dask Can Help in Production
523 views, 1 year ago
Dask Use Case | Who Uses Dask: GrubHub
281 views, 1 year ago
Dask Use Case | Who Uses Dask: CapitalOne
289 views, 2 years ago
Dask Use Case | Who Uses Dask: Geophysical Sciences Studying Ocean Currents
297 views, 2 years ago
Dask Use Case | Who Uses Dask: UK Meteorology Office
183 views, 2 years ago
Dask Use Case | Who Uses Dask: WalMart
300 views, 2 years ago
Dask Use Case | CapitalOne: Adding Dask to Your Existing Pipeline
313 views, 2 years ago
Dask Scientific Libraries | Scaling Science | Genevieve Buckley
332 views, 2 years ago
New Dask Branding | Dask Gets an Upgrade
1.1K views, 2 years ago
Dask Use Case | Who Uses Dask: Financial Institutions
496 views, 2 years ago
Dask Best Practices | Scaling Up Science | Genevieve Buckley
3K views, 2 years ago
Dask for Science | Dask Example | Genevieve Buckley
324 views, 2 years ago
Scientific Computing & Dask | Leveraging Dask for Life Sciences | Genevieve Buckley
632 views, 2 years ago
What is Dask? A Brief Introduction
1.8K views, 2 years ago
Scalable Machine Learning with Data Scientist Eric Ma
263 views, 2 years ago

Comments

  • @gemini_537
    @gemini_537 1 month ago

    Gemini 1.5 Pro: The video mentions that group by operations can fail due to large datasets and unsorted data. Here are the reasons for failure and how to compensate for them:

    * **Large datasets:** When dealing with large datasets, it is recommended to tune the split_out parameter. This parameter sets the number of output partitions, and therefore their size; a good starting point is to target 100 megabyte partitions. You can estimate the split_out value by considering the number of groups in your data and the size of each group.
    * **Unsorted data:** Dask performs better when the data is sorted by the group by fields. If your data is not sorted, Dask will shuffle the data to group it, which can be expensive. There are two ways to address this:
        * Sort your data before performing the group by operation.
        * Use map_partitions. It can be used when your data is already sorted by an index matching one of your group-by fields. In this case, Dask can perform the group by operation on each partition without shuffling the data.

    Here are additional tips to improve the performance of group by operations in Dask:

    * **Optimize memory usage:**
        * Use the pandas string dtype instead of the object dtype for strings.
        * Use categorical data types when applicable. Categoricals are efficient when you have a small number of unique strings and the strings are large.
        * Drop unnecessary columns before performing the group by operation.
    * **Repartition your data:** Repartitioning your data ensures that the partitions are uniform in size. This can improve the performance of group by operations by avoiding situations where some partitions are significantly larger than others.
    * **Prioritize reductions before group by:** Perform any filtering or data reduction operations before the group by operation. This will reduce the amount of data that needs to be shuffled or grouped.

  • @gemini_537
    @gemini_537 1 month ago

    Gemini 1.5 Pro: This video is about Dask Bag, a library for processing large datasets in parallel. The video starts with a basic introduction to Dask Bag. It explains that Dask Bag is useful for embarrassingly parallel analyses and for a lot of pre-processing, especially of text, JSON, or Avro data.

    Then the video dives into details with an example. The speaker constructs a bag with ten elements separated into four different partitions to demonstrate what a bag is. A bag is like a bunch of lists. Users can perform map, filter, and reduce functions on the bag. For instance, the speaker uses the map function to square every element in the bag, and the filter function to get only the even elements.

    Next, the video shows how to use Dask Bag on real data. The data used in the example is a bunch of JSON files from a web service called MyBinder. The speaker reads the data using the read_text function from Dask Bag. Then the speaker uses the map function to convert the JSON-encoded text into Python dictionaries. After converting the data into Python dictionaries, the speaker uses the frequencies function to count how many times each GitHub repository shows up. The result shows that ipython is the most common repository in the data.

    The video then talks about how to use Dask Bag to pre-process data. The speaker filters out data that does not have "task" in the "spec" field and converts the data back into JSON format. Finally, the speaker writes the data to a text file.

    The last part of the video talks about the data frame. The speaker mentions that Dask Bag may not be the right choice for complex analyses, and that Dask DataFrame might be a better option for such cases. The speaker also mentions that a Dask Bag can be converted to a Dask DataFrame using the to_dataframe function.
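The steps summarized above can be sketched roughly as follows. This is a self-contained approximation, not the notebook from the video: the records, the `repo` and `status` field names, and the use of `from_sequence` in place of reading real files with `read_text` are all assumptions made so the example runs without external data.

```python
import json
import dask.bag as db

# Ten elements split across four partitions, as in the toy example.
b = db.from_sequence(range(10), npartitions=4)
squares = b.map(lambda x: x ** 2).compute(scheduler="synchronous")
evens = b.filter(lambda x: x % 2 == 0).compute(scheduler="synchronous")

# Hypothetical JSON-encoded lines standing in for the files read with read_text.
raw_records = [
    {"repo": "ipython/ipython-in-depth", "status": "success"},
    {"repo": "jupyterlab/jupyterlab-demo", "status": "success"},
    {"repo": "ipython/ipython-in-depth", "status": "failure"},
]
lines = [json.dumps(r) for r in raw_records]

# Parse each line into a Python dictionary.
parsed = db.from_sequence(lines, npartitions=2).map(json.loads)

# Count how often each repository appears; frequencies yields (key, count) pairs.
counts = dict(parsed.pluck("repo").frequencies().compute(scheduler="synchronous"))

# Pre-process: keep a subset of records and encode them back to JSON text.
kept = (
    parsed.filter(lambda r: r["status"] == "success")
    .map(json.dumps)
    .compute(scheduler="synchronous")
)

print(squares, evens, counts, len(kept))
```

The synchronous scheduler is used here only to keep the sketch single-process; on real data the default scheduler or a distributed cluster would do the work in parallel.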

  • @miriamdixon1870
    @miriamdixon1870 1 month ago

    The resolution is very bad.

  • @JohnMatthew-dt1vq
    @JohnMatthew-dt1vq 2 months ago

    Excellent video, I wish all tech videos were this good.

  • @apachaves
    @apachaves 3 months ago

    Very interesting. Thank you for this view on the new dask_databricks functionalities.

  • @mwd6478
    @mwd6478 4 months ago

    Dask on Databricks is really cool. There's so many times you're on Databricks doing Python data science and don't want to use Spark.

  • @DanielJahn-fu2ev
    @DanielJahn-fu2ev 5 months ago

    Question regarding Array Expressions: how do they play together with the Dask (high-level) graph? A concrete xarray example: a problem with very large arrays is that even just their computational graph is too large to be materialized. A strategy is to read them without Dask (chunks=None), slice, and then again turn them into a dask-backed array by chunking. Would Array Expression simplify this, pushing the slicing before the graph materialization, or are those operating at different levels?

    • @Coiled
      @Coiled 5 months ago

      Expressions will eventually replace high-level graphs. They generate low-level task graphs directly. Slicing is definitely pushed through before graph generation, which will likely help reduce overall graph generation overhead. It's still possible to create large graphs though, just less likely. We're also shipping the expressions directly to the scheduler, so there will be less pain to large graphs (they won't have to travel over a wire).

    • @DanielJahn-fu2ev
      @DanielJahn-fu2ev 5 months ago

      @Coiled Thanks for the answer! That actually sounds great, would help our workflows quite a bit.

  • @lalitchoudhary1095
    @lalitchoudhary1095 5 months ago

    Show its use with xarray

  • @rodrigoluca6296
    @rodrigoluca6296 6 months ago

    Thank you for having Portuguese subtitles.

  • @pingzhong-pl5sb
    @pingzhong-pl5sb 7 months ago

    Where can I get Paul Hobson's source code?

  • @sagniksarkar506
    @sagniksarkar506 7 months ago

    Awesome video, Trevor. Do you know of any resources that I can use to learn more about Zarr and its built-in configurations? I have seen the documentation, but it seems a little overwhelming to me.

  • @DrTallin
    @DrTallin 8 months ago

    Nice video. Is there a detailed review of how your colleagues analyze billions of records? You mentioned it here: czcams.com/video/8aQ3xcX8e9Y/video.htmlsi=0FRQOT9TEnDz9FUs&t=1621

  • @dogosousa
    @dogosousa 9 months ago

    @martinfleis can we access your notebook?

  • @AlverGant
    @AlverGant 10 months ago

    Had some issues with Ray, but Dask worked out of the box! Congratulations to the developers!

  • @cleitonluiz7136
    @cleitonluiz7136 10 months ago

    What is the name of the environment where you are running these commands?

  • @aria_nukil
    @aria_nukil 10 months ago

    Great intro. Also, how do I show the additional panes on the right, seen at 2:05, that display memory usage, progress, etc.? That is pretty awesome. Thanks so much.

  • @be12
    @be12 1 year ago

    Great work you guys

  • @habruti7215
    @habruti7215 1 year ago

    1:08:00

  • @loveyou-pi5gj
    @loveyou-pi5gj 1 year ago

    Could I use async/await with dask?

  • @loveyou-pi5gj
    @loveyou-pi5gj 1 year ago

    55:15

  • @Kai-iy7pe
    @Kai-iy7pe 1 year ago

    Is there an official Dask community channel?

  • @kristiantorres1080
    @kristiantorres1080 1 year ago

    Hi Matt. Amazing stuff as always. Do you know if there is something similar for VScode? Thank you!

  • @Dynamitegaming125
    @Dynamitegaming125 1 year ago

    pin me

  • @jacobgomez_
    @jacobgomez_ 1 year ago

    Kept checking my Slack because I didn't realize it was coming from the video...

  • @billyblackburn864
    @billyblackburn864 1 year ago

    Is the notebook for the local GPU available?

  • @jijie133
    @jijie133 1 year ago

    Thank you!

  • @RobertAlbrecht-mw7er

    Dask is the bomb.

  • @RoguesAndEvolution
    @RoguesAndEvolution 1 year ago

    Hiya, you mentioned Xarray in passing. Is there a multi-dimensional equivalent to cudf?

  • @alexanderlyapin8057

    Please correct me if I am wrong, but it may be better to open the file for writing at 4:48 in 'a' mode; otherwise every worker will overwrite the data inside and you will have only the result of the last worker to run.
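The difference the commenter describes can be illustrated with plain Python file modes. This is a sequential simulation under stated assumptions: the `write_results` helper and file names are hypothetical, and the code from the video at 4:48 is not reproduced here. Truly concurrent writers would additionally need locking or one output file per worker.

```python
import os
import tempfile


def write_results(path, mode, results):
    """Simulate several workers that each open the same file independently."""
    for value in results:
        with open(path, mode) as f:
            f.write(f"{value}\n")


with tempfile.TemporaryDirectory() as tmp:
    w_path = os.path.join(tmp, "results_w.txt")
    a_path = os.path.join(tmp, "results_a.txt")

    # 'w' truncates the file on every open, so earlier results are lost.
    write_results(w_path, "w", [1, 2, 3])
    # 'a' appends, so every result is kept.
    write_results(a_path, "a", [1, 2, 3])

    with open(w_path) as f:
        w_lines = f.read().splitlines()
    with open(a_path) as f:
        a_lines = f.read().splitlines()

print(w_lines)  # only the last write survives
print(a_lines)  # all writes survive
```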

  • @parikannappan1580
    @parikannappan1580 1 year ago

    Where can we download the CSV files?

  • @jylpah
    @jylpah 1 year ago

    Highest resolution available is 360p. It’s hard to read the code

  • @samsammurphy
    @samsammurphy 1 year ago

    These videos are fantastic but sometimes difficult to hear (even with my volume set to max)

  • @paveevad
    @paveevad 1 year ago

    Hi, since this video was posted, the dask-report.html page has an extra tab called "Summary" - is there a doc where I can read what the various stats in that summary mean?

  • @iamworstgamer
    @iamworstgamer 1 year ago

    Can't even create a dataframe from a Python list. You need to create a pandas DataFrame first, which kind of defeats the whole purpose.

  • @shpundk
    @shpundk 1 year ago

    Thank you for this recorded Dask Demo Day! Are these Jupyter notebooks available for users?

    • @fjetter4295
      @fjetter4295 1 year ago

      We don't have a single repo for this, yet. My notebooks are available here github.com/fjetter/dask-demo

  •  1 year ago

    Hi! Thank you for this! Regards

  • @k1zmt
    @k1zmt 1 year ago

    Does Dask have some kind of linter?

  • @annawilson3824
    @annawilson3824 1 year ago

    Really a great talk!

  • @PalataoArmy
    @PalataoArmy 1 year ago

    Thank you for the explanation. Now it clears up my confusion on compute() vs persist()

  • @carlosmateosamudiolezcano2463

    Where can I get access to the notebooks used here?

    • @Dask-dev
      @Dask-dev 1 year ago

      Hi Carlos, you can find them here: github.com/quasiben/rapids-dask-summit-2021

  • @carlosmateosamudiolezcano2463

    The quality of this video makes it impossible to read the code

  • @585ghz
    @585ghz 1 year ago

    This lib is awesome!!! Thanks a lot 😍😍

  • @hamidrezahosseinkhani5980

    what a boring speaker, such a disgusting english!

  • @MsStoCa
    @MsStoCa 1 year ago

    Thanks for the great explanation!

  • @Queeno11
    @Queeno11 2 years ago

    This is without doubt the best short guide on dask futures. Been reading lots of documentation but this video makes it so simple yet so powerful. Thanks a lot!

  • @arkadipbasu828
    @arkadipbasu828 2 years ago

    Thank you Dask Team, will explore this and join the community

  • @sebbie2e
    @sebbie2e 2 years ago

    That's a quality video, well done

  • @danielabatalha5434
    @danielabatalha5434 2 years ago

    Hello. Are the notebooks available somewhere?

  • @Amapramaadhy
    @Amapramaadhy 2 years ago

    Dask and all the Python magic aside, Matt should hold master classes in delivering public lectures ♥ Also +100 on the "mature deployment" issue.

  • @antribera2138
    @antribera2138 2 years ago

    💔 🄿🅁🄾🄼🄾🅂🄼