Dask
United States
Joined 22 Feb 2016
Content, tutorials, and more on how to use Dask effectively.
Dask is a flexible open-source Python library for parallel computing. Dask scales Python code from multi-core local machines to large distributed clusters in the cloud. Dask provides a familiar user interface by mirroring the APIs of other libraries in the PyData ecosystem including Pandas, Scikit-learn, and NumPy. It also exposes low-level APIs that help programmers run custom algorithms in parallel.
Dask was created by Matthew Rocklin in 2014 and is used by retail, financial, and governmental organizations, as well as life science and geophysical institutes.
Dask Demo Day 2024-03-21
Today's Talks:
00:00 Intro
00:38 Dask DataFrame is Fast - @fjetter
14:15 Large scale population of vector databases for RAG - @mrocklin
26:36 Easy GPU access with Coiled - @jrbourbeau
Next Demo Day is April 18th, sign up here: github.com/dask/community/issues/307
---
What is Dask Demo Day?
Each month we solicit 5-10 minute demos that show off ongoing and/or lesser-known work. Meetings will be recorded and advertised on social. Hopefully, this helps educate folks on some of the great work people do.
If you're interested, please reply to this issue with a brief description (a couple of sentences). If you have colleagues who you think would be interested, please let them know. If you would like to present but not this month, check out the dates and sign up for an upcoming one:
coiled.io/dask-demo-days
----
What is Dask?
Dask is a free and open-source library for parallel computing in Python. Dask is a community project maintained by developers and organizations.
Share your feedback on this video in the comments and let us know:
- Did you find this video helpful?
- Have you used Dask before?
Learn more at dask.org
Views: 740
Video
Dask Demo Day - 2024-02-15
578 views · 4 months ago
Today's Talks: 00:00 Intro 01:18 One trillion row challenge - @mrocklin 06:20 Deploying Dask on Databricks - @jacobtomlinson 15:09 Deploying Prefect workflows on the cloud with Coiled - @jrbourbeau 29:22 Scaling embedding pipelines (LlamaIndex Dask) - @quasiben 46:45 Using AWS Cost Explorer to see the cost of public IPv4 addresses - @ntabris Next Demo Day is March 21st, sign up here: github.com...
Dask Demo Day - 2024-01-18
646 views · 5 months ago
Today's Talks: 00:00 Intro 00:47 Apache Beam DaskRunner - @cisaacstern 15:45 Array expressions - @mrocklin 26:27 One billion row challenge - @scharlottej13 What is Dask Demo Day? Each month we solicit 5-10 minute demos that show off ongoing and/or lesser-known work. Meetings will be recorded and advertised on social. Hopefully, this helps educate folks on some of the great work people do. If yo...
Dask Demo Day - 2023.10.19
777 views · 8 months ago
October 19th, 2023 Today's Talks: 00:00 Intro 00:31 @jacobtomlinson - "Who uses RAPIDS?" 10:51 @mrocklin - TPC-H benchmarks for Spark, Dask, Polars, DuckDB 24:27 @jhamman Dask - Arraylake integration 37:24 @mrchtr - Fondant We'd like to solicit 5-10 minute demos that show off ongoing or lesser-known work. I hope to have 3-5 of these during the meeting. Meetings will be recorded and advertised o...
Dask Demo Day - 2023-09-21
394 views · 9 months ago
Today's Talks 00:00 Intro 00:21 @fjetter - Performance with P2P array rechunking 14:04 @phofl - Dask expressions 27:07 @sjcharlotte13 @dcherian - Processing a quarter petabyte geospatial dataset in the cloud We'd like to solicit 5-10 minute demos that show off ongoing or lesser-known work. I hope to have 3-5 of these during the meeting. Meetings will be recorded and advertised on social. Hopefu...
Dask Demo Day - 2023-08-17
355 views · 10 months ago
Last Dask Demo Day of the summer! Today's Talks: @fjetter - Memray Integration for Memory Management @mrocklin - Some new updates and news @jrbourbeau - Analyzing Sea Levels in the Cloud with Earthaccess and Coiled We'd like to solicit 5-10 minute demos that show off ongoing or lesser-known work. I hope to have 3-5 of these during the meeting. Meetings will be recorded and advertised on social....
How to Install Dask
856 views · 10 months ago
Learn how to install Dask and the Dask JupyterLab extension with either conda or pip. This video walks through setting up a clean working environment with Dask. 00:00 Introduction 00:51 Pip install Dask 02:21 Create LocalCluster 03:27 Use Dashboard in JupyterLab
Dask Demo Day - 2023-07-20
549 views · 11 months ago
Today's talks @hendrikmakait - Shuffle resilience @Matt711 - Dask-Kubernetes update @GueroudjiAmal - External tasks in Dask distributed (github.com/GueroudjiAmal/distributed) @skrawcz Dask - Hamilton integration We'd like to solicit 5-10 minute demos that show off ongoing or lesser-known work. I hope to have 3-5 of these during the meeting. Meetings will be recorded and advertised on social. Ho...
Dask Demo Day - 2023-06-15
517 views · 1 year ago
Today's Talks dask-geopandas demo by @martinfleis Fine performance dask metrics and spans @crusaderky (10-15 min) GIL monitoring on dask @milesgranger We'd like to solicit 5-10 minute demos that show off ongoing or lesser-known work. I hope to have 3-5 of these during the meeting. Meetings will be recorded and advertised on social. Hopefully, this helps educate folks on some of the great work p...
Dask Demo Day 2023-05-18
443 views · 1 year ago
These are 5-10 minute demos that show off ongoing or lesser-known work. We hope to have 3-5 of these during the meeting. Meetings will be recorded and advertised on social. Hopefully, this helps to educate folks on some of the great work people are up to. Meetings are 3rd Thursday of every month at 11am EDT on zoom, Zoom link: us06web.zoom.us/j/89383035703?pwd=WkRJSzNnRTh4T2R1ZjJuVVdJWlMxQT09 W...
Dask Demo Day 2023-04-20
483 views · 1 year ago
Talks: Lindsey Gray - dask-awkward and dask-histogram for high energy physics analysis Amine Diro - daskqueue : a dask-based distributed task queue James Bourbeau - Pyarrow strings in Dask DataFrames Jacob Tomlinson - Launching a Jupyter/Dask cluster on NVIDIA Base Command Platform Want to present in one of the upcoming Dask Demo Days? Sign up here: github.com/dask/community/issues/307 Key Mome...
Dask Demo Day - 2023-03-16
443 views · 1 year ago
Dask Demo Days Talks: Analyzing Terabytes of Ocean Simulation model output with Xarray, xgcm and xhistogram - Tom Nicholas P2P shuffling - Hendrik Makait Scaling weather radar data analysis with Dask - Max Grover Automatic package synchronization in Coiled Dask Clusters - David Chudzicki Graph Neural Networks training with Dask - Vibhu Jawa Want to present at one of the upcoming Dask Demo Days?...
Dask Demo Day - 2023-02-16
482 views · 1 year ago
Monthly Dask Demo Day: February 2023 Talks: 00:00 Intro 00:28 New Dask integration in Flyte - Bernhard Stadlbauer 11:37 Parallelizing FTP downloads from a janky government server - Paul Hobson 22:45 Configurable Dataframe backends - Rick Zamora 34:36 Parallelize HPO of XGBoost with Optuna and Dask (multi-cluster) - Guido Imperiale 43:20 Accelerated Jaccard similarity using RAPIDS and Dask - Jiw...
Dask Demo Day - 2022-11-16
1K views · 1 year ago
Monthly demo day for Dask for November 2022 Github Issue: github.com/dask/community/issues/286 Talks: 00:00 Intro 03:05 2,000,000,000 lightning flashes - @ktyle 14:44 Dask CLI - @douglasdavis 21:44 Optuna - @jrbourbeau 32:00 Community Interlude - @mrocklin 34:02 Dask Awkward - @douglasdavis 46:02 Dask PySpy - @gjoseph92 01:03:30 Closing Follow us on twitter @dask_dev or sign up for the newslett...
Dask Demo Day - 2022-10-27
1.5K views · 1 year ago
Dask Demo Days - October 2022. Five quick talks using and developing Dask. Talks: 00:00 Intro 01:43 Scraping arXiv to determine Matplotlib popularity - Matthew Rocklin 08:36 Reducing memory use with task queuing - Florian Jetter 20:54 Kubernetes Operator and KubeFlow - Jacob Tomlinson 33:23 Prometheus - Nat Tabris 42:46 Apache Beam on Dask - Alex Merose 54:52 Conclusion github.com/dask/communit...
Dask in Production | How Dask Can Help in Production
523 views · 1 year ago
Dask Use Case | Who Uses Dask: CapitalOne
289 views · 2 years ago
Dask Use Case | Who Uses Dask: Geophysical Sciences Studying Ocean Currents
297 views · 2 years ago
Dask Use Case | Who Uses Dask: UK Meteorology Office
183 views · 2 years ago
Dask Use Case | Who Uses Dask: WalMart
300 views · 2 years ago
Dask Use Case | CapitalOne: Adding Dask to Your Existing Pipeline
313 views · 2 years ago
Dask Scientific Libraries | Scaling Science | Genevieve Buckley
332 views · 2 years ago
New Dask Branding | Dask Gets an Upgrade
1.1K views · 2 years ago
Dask Use Case | Who Uses Dask: Financial Institutions
496 views · 2 years ago
Dask Best Practices | Scaling Up Science | Genevieve Buckley
3K views · 2 years ago
Dask for Science | Dask Example | Genevieve Buckley
324 views · 2 years ago
Scientific Computing & Dask | Leveraging Dask for Life Sciences | Genevieve Buckley
632 views · 2 years ago
Scalable Machine Learning with Data Scientist Eric Ma
263 views · 2 years ago
Gemini 1.5 Pro: The video mentions that group-by operations can fail because of large datasets and unsorted data. Here are the reasons for failure and how to compensate for them:

* **Large datasets:** When dealing with large datasets, it is recommended to tune the `split_out` parameter. This parameter controls how many partitions the result is split across, and therefore how large each one is; a good starting point is to target 100 megabyte partitions. You can estimate the `split_out` value by considering the number of groups in your data and the size of each group.
* **Unsorted data:** Dask performs better when the data is sorted by the group-by fields. If your data is not sorted, Dask will shuffle the data to group it, which can be expensive. There are two ways to address this:
  * Sort your data before performing the group-by operation.
  * Use `map_partitions`. This works when your data is already sorted by an index matching one of your group-by fields. In this case, Dask can perform the group-by operation on each partition without shuffling the data.

Here are additional tips to improve the performance of group-by operations in Dask:

* **Optimize memory usage:**
  * Use the pandas string dtype instead of the object dtype for strings.
  * Use categorical data types when applicable. Categoricals are efficient when you have a small number of unique strings and the strings are large.
  * Drop unnecessary columns before performing the group-by operation.
* **Repartition your data:** Repartitioning your data ensures that the partitions are uniform in size. This can improve the performance of group-by operations by avoiding situations where some partitions are significantly larger than others.
* **Prioritize reductions before group-by:** Perform any filtering or data reduction operations before the group-by operation. This reduces the amount of data that needs to be shuffled or grouped.
Gemini 1.5 Pro: This video is about Dask Bag, a library for processing large datasets in parallel.

The video starts with a basic introduction to Dask Bag. It explains that Dask Bag is useful for embarrassingly parallel analyses and for a lot of pre-processing, especially of text, JSON, or Avro data.

Then the video dives into the details with an example. The speaker constructs a bag with ten elements split into four partitions to demonstrate what a bag is. A bag is like a bunch of lists. Users can perform map, filter, and reduce operations on the bag. For instance, the speaker uses the `map` function to square every element in the bag and the `filter` function to keep only the even elements.

Next, the video shows how to use Dask Bag on real data. The data used in the example is a set of JSON files from a web service called MyBinder. The speaker reads the data using the `read_text` function from Dask Bag, then uses `map` to convert the JSON-encoded text into Python dictionaries. After that, the speaker uses the `frequencies` function to count how many times each GitHub repository shows up. The result shows that ipython is the most common repository in the data.

The video then talks about how to use Dask Bag to pre-process data. The speaker filters out data that does not have "task" in the "spec" field and converts the data back into JSON format. Finally, the speaker writes the data to a text file.

The last part of the video talks about data frames. The speaker mentions that Dask Bag may not be the right choice for complex analyses, and that Dask DataFrame might be a better option in such cases. A Dask Bag can be converted to a Dask DataFrame using the `to_dataframe` function.
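
The workflow described above can be sketched roughly as follows. This is not the code from the video: the JSON records are made up, the `provider` and `spec` field layout is an assumption, and the lines are held in memory rather than read from files with `read_text`. The synchronous scheduler is chosen only to keep the small example simple to run.

```python
import json
import dask.bag as db

# A bag of ten integers; npartitions is a request for how to split them.
numbers = db.from_sequence(range(10), npartitions=4)

# map and filter are lazy; compute() runs them and returns a list.
squares = numbers.map(lambda x: x ** 2).compute(scheduler="synchronous")
evens = numbers.filter(lambda x: x % 2 == 0).compute(scheduler="synchronous")

# Made-up JSON lines standing in for the launch logs described above.
lines = [
    '{"provider": "GitHub", "spec": "ipython/ipython-in-depth/master"}',
    '{"provider": "GitHub", "spec": "jupyterlab/jupyterlab-demo/master"}',
    '{"provider": "GitHub", "spec": "ipython/ipython-in-depth/master"}',
]
records = db.from_sequence(lines, npartitions=2).map(json.loads)

# frequencies() yields (value, count) pairs for each distinct value.
counts = dict(
    records.pluck("spec").frequencies().compute(scheduler="synchronous")
)

# Pre-processing: keep matching records and encode them back to JSON text.
ipython_json = (
    records.filter(lambda r: "ipython" in r["spec"])
    .map(json.dumps)
    .compute(scheduler="synchronous")
)
```

For data on disk, `db.read_text` would replace `from_sequence`, and `to_textfiles` would write the re-encoded lines back out.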
The resolution is very bad.
Excellent video, I wish all tech videos were this good.
Very interesting. Thank you for this view on the new dask_databricks functionalities.
Dask on Databricks is really cool. There's so many times you're on Databricks doing Python data science and don't want to use Spark.
Question regarding Array Expressions: how do they play together with the Dask (high-level) graph? A concrete xarray example: a problem with very large arrays is that even just their computational graph is too large to be materialized. A strategy is to read them without Dask (chunks=None), slice, and then again turn them into a dask-backed array by chunking. Would Array Expression simplify this, pushing the slicing before the graph materialization, or are those operating at different levels?
Expressions will eventually replace high-level graphs. They generate low-level task graphs directly. Slicing is definitely pushed through before graph generation, which will likely help reduce overall graph generation overhead. It's still possible to create large graphs though, just less likely. We're also shipping the expressions directly to the scheduler, so there will be less pain to large graphs (they won't have to travel over a wire).
@Coiled Thanks for the answer! That actually sounds great, would help our workflows quite a bit.
Show its use with xarray
Thank you for having Portuguese subtitles.
Where can I get Paul Hobson's source code ?
Awesome video, Trevor. Do you know of any resources that I can use to learn more about Zarr and its built-in configurations? I have seen the documentation, but it seems a little overwhelming to me.
Nice video. Is there a detailed review of how your colleagues analyze billions of records? You mentioned it here: czcams.com/video/8aQ3xcX8e9Y/video.htmlsi=0FRQOT9TEnDz9FUs&t=1621
@martinfleis can we access your notebook?
Had some issues with Ray, but Dask worked out of the Box! Congratulations to the Developers!
What is the name of the environment where you are running these commands?
It's a Jupyter notebook
Great intro. Also, how do I show the additional panes on the right, shown at 2:05, that display memory usage, progress, etc.? That is pretty awesome. Thanks so much
Great work you guys
1:08:00
Could I use async/await with dask?
55:15
Is there an official Dask community channel?
Hi Matt. Amazing stuff as always. Do you know if there is something similar for VScode? Thank you!
pin me
Kept checking my Slack because I didn't realize it was coming from the video...
Is the notebook for the local GPU available?
Thank you!
Dask is the bomb.
Hiya, you mentioned Xarray in passing. Is there a multi-dimensional equivalent to cudf?
Please correct me if I am wrong, but it may be better to open the file for writing at 4:48 in 'a' mode; otherwise every worker will overwrite the data inside and you will have only the result of the last worker to run.
where can we download the CSV files?
Highest resolution available is 360p. It’s hard to read the code
These videos are fantastic but sometimes difficult to hear (even with my volume set to max)
Hi, since this video was posted, the dask-report.html page has an extra tab called "Summary" - is there a doc where I can read what the various stats in that summary mean?
Can't even create a DataFrame from a Python list. You need to create a pandas DataFrame first, which kind of defeats the whole purpose.
Thank you for this recorded Dask Demo Day! Are these Jupyter notebooks available to users?
We don't have a single repo for this, yet. My notebooks are available here github.com/fjetter/dask-demo
Hi! Thank you for this! Regards
Does Dask have some kind of linter?
Really a great talk!
Thank you for the explanation. Now it clears up my confusion on compute() vs persist()
Where can I get access to the notebooks used here?
Hi Carlos, you can find them here: github.com/quasiben/rapids-dask-summit-2021
The quality of this video makes it impossible to read the code
This lib is awesome!!! Thanks a lot 😍😍
what a boring speaker, such a disgusting english!
Thanks for the great explanation!
This is without doubt the best short guide on dask futures. Been reading lots of documentation but this video makes it so simple yet so powerful. Thanks a lot!
Thanks!
Thank you Dask Team, will explore this and join the community
That's a quality video, well done
Hello. Are the notebooks available somewhere?
Dask and all the Python magic aside, Matt should hold master classes in delivering public lectures ♥ Also +100 on the "mature deployment" issue.