Distributed Training with PyTorch: complete tutorial with cloud infrastructure and code

  • Published 5 Sep 2024

Comments • 57

  • @rachadlakis1
    @rachadlakis1 7 days ago +1

    That's an amazing resource! It's great to see you sharing such detailed information on a complex topic. Your effort to explain everything clearly will really help others understand and apply these concepts. Keep up the great work!

  • @chiragjn101
    @chiragjn101 8 months ago +9

    Great video, thanks for creating this. I have used DDP quite a lot, but seeing the visualizations of the communication overlap helped me build a very good mental model.
    Would love to see more content around distributed training - DeepSpeed ZeRO, Megatron DP + TP + PP.

  • @tharunbhaskar6795
    @tharunbhaskar6795 1 month ago +1

    Dang. Never thought learning DDP would be this easy. Another great piece of content from Umar. Looking forward to FSDP.

  • @amishasomaiya9891
    @amishasomaiya9891 3 months ago +2

    Starting to watch my 3rd video on this channel, after the transformer from scratch and quantization ones. Thank you for the great content, and also for the code and notes to look back at. Thank you.

  • @abdallahbashir8738
    @abdallahbashir8738 5 months ago +4

    I really love your videos. You have a natural talent for simplifying logic and code, in the same capacity as Andrej.

  • @user-td8vz8cn1h
    @user-td8vz8cn1h 5 months ago +2

    This is the second video I've watched from this channel, after "quantization". Frankly, I wanted to express my gratitude for your work, as it is very easy to follow and the level of abstraction makes it possible to understand the concepts holistically.

  • @karanacharya18
    @karanacharya18 3 months ago +2

    Super high quality lecture. You have a gift of teaching, man. Thank you!

  • @vasoyarutvik2897
    @vasoyarutvik2897 4 months ago +2

    this channel is a hidden gem

  • @thuann2cats
    @thuann2cats 1 month ago +1

    absolutely amazing! You made these concepts so accessible!

  • @Maximos80
    @Maximos80 1 month ago +1

    Incredible content, Umar! Well done! 🎉

  • @maleekabakhtawar3892
    @maleekabakhtawar3892 1 month ago

    Each and every detail is well explained. Great work, great explanation 👍
    Can you make this type of detailed video on distributed training through tensor parallelism? It would be very helpful. Thank you!

  • @vimukthirandika872
    @vimukthirandika872 26 days ago +1

    Really impressive!

  • @cken27
    @cken27 8 months ago +3

    Amazing content! Thanks for sharing.

  • @oliverhitchcock8436
    @oliverhitchcock8436 8 months ago +3

    Another great video, Umar. Nice work

  • @user-jf6li8mn3l
    @user-jf6li8mn3l 7 months ago

    The video was very interesting and useful. Please make a similar video on DeepSpeed functionality and, in general, on how to train large models (for example, LLaMA SFT) on distributed multi-server systems where the GPUs are located on different machines.

  • @user-wm5xv5ei8o
    @user-wm5xv5ei8o 6 months ago +1

    very nice and informative video. Thanks

  • @user-od3ig9qt6h
    @user-od3ig9qt6h 8 months ago +2

    Thank you very much for your wonderful video. Can you make a video on how to use the Accelerate library with DDP?

  • @prajolshrestha9686
    @prajolshrestha9686 8 months ago +1

    Thank you so much for this amazing video. It is really informative.

  • @810602jay
    @810602jay 8 months ago +1

    Amazing learning material! Many thanks! 🥰🥰🥰

  • @nova2577
    @nova2577 6 months ago +1

    You deserve many more likes and subscribers!

  • @Engrbilal143
    @Engrbilal143 6 months ago

    Awesome video. Please make a tutorial on FSDP as well.

  • @riyajatar6859
    @riyajatar6859 6 months ago +1

    For broadcast, if we are sending a copy of the file from the rank 0 and rank 4 nodes to the other nodes, how is the total time still 10 seconds, given that I still have the same internet speed of 1 MB/s?
    Could anyone explain? I am a bit confused.
    Also, what happens if there is an odd number of nodes?

  • @manishsharma2211
    @manishsharma2211 8 months ago +1

    you teach soooooooo well

  • @d.s.7857
    @d.s.7857 8 months ago +1

    Thank you so much for this

  • @hellochli
    @hellochli 8 months ago +1

    Thanks!

    • @umarjamilai
      @umarjamilai 8 months ago

      Thank you! Let's connect on LinkedIn.

  • @loong6127
    @loong6127 5 months ago +1

    Great video

  • @SaurabhK9012
    @SaurabhK9012 28 days ago

    Please create a video on model parallelism and FSDP.

  • @madhusudhanreddy9157
    @madhusudhanreddy9157 8 months ago

    If time permits, please make a video covering GPUs and TPUs and how to use them effectively, since most of us don't know how.
    Please also create a PyTorch playlist for beginners and intermediates.
    Thanks for reading.

  • @mdbayazid6837
    @mdbayazid6837 8 months ago +1

    Federated learning basics please.❤

  • @svkchaitanya
    @svkchaitanya 2 months ago +1

    You rock always 😂

  • @felipemello1151
    @felipemello1151 4 months ago +1

    I wish I could like it twice

    • @umarjamilai
      @umarjamilai 4 months ago

      You can share it on social media. That's the best way to thank me 😇

    • @felipemello1151
      @felipemello1151 4 months ago

      @umarjamilai Not sure if it's in your plans, but if you are open to suggestions, I would love to watch a video on multimodal models. Again, awesome work!

    • @umarjamilai
      @umarjamilai 29 days ago

      Check my latest video!

  • @user-el4uh3uk2k
    @user-el4uh3uk2k 6 months ago +1

    fantastic

  • @rohollahhosseyni8564
    @rohollahhosseyni8564 6 months ago +1

    great video

  • @Yo-rw7mq
    @Yo-rw7mq 4 months ago +1

    Great!

  • @mandarinboy
    @mandarinboy 7 months ago

    Great intro video. Do you have any plans to also cover other parallelism strategies: model, pipeline, tensor, etc.?

  • @ramprasath6424
    @ramprasath6424 8 months ago +1

    Please do something related to large audio models like Conformer, QuartzNet, etc.

  • @waynelau3256
    @waynelau3256 4 months ago

    Working with FSDP and Megatron now, and I really want to figure this out from scratch haha. It sounds fun, but it's a big headache.

  • @khoapham7303
    @khoapham7303 8 months ago +2

    I'm always confused by DP and DDP. Can you please tell me the difference between them, since both belong to the data parallelism family?

    • @umarjamilai
      @umarjamilai 8 months ago +6

      DP only works on a single machine, while DDP can work across multiple machines. However, PyTorch now recommends using DDP even for single-machine setups. (See the sketch after this thread.)

    • @khoapham7303
      @khoapham7303 8 months ago

      @umarjamilai thank you for your reply
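
Editor's note: a minimal, hand-written sketch of the single-process-per-GPU DDP setup described in the reply above, assuming a torchrun launch. The model and data are placeholders, not the code from the video.

```python
# Launch with: torchrun --nproc_per_node=<num_gpus> ddp_sketch.py
# torchrun sets RANK, LOCAL_RANK and WORLD_SIZE for every process it spawns.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")        # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(10, 1).to(local_rank)  # placeholder model
    model = DDP(model, device_ids=[local_rank])    # gradient sync happens on backward()

    x = torch.randn(32, 10, device=local_rank)     # placeholder batch
    loss = model(x).sum()
    loss.backward()                                # DDP all-reduces the gradients here

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

The same script runs on one machine or many; only the torchrun arguments change, which is the practical difference from DP.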

  • @user-fw5sg5mx4m
    @user-fw5sg5mx4m 3 months ago

    Could you provide more videos on model parallelism and pipeline parallelism? Thanks.

  • @tryit-wv8ui
    @tryit-wv8ui 8 months ago

    another banger

  • @sounishnath513
    @sounishnath513 8 months ago +1

    SUUUPERRRR

  • @madhusudhanreddy9157
    @madhusudhanreddy9157 8 months ago

    Hi Umar, great video and I enjoyed it thoroughly, but I have one question: why are we using the approach of sum(grad1 + grad2 + ... + gradN)? Why can't we use the average of the gradients?

    • @umarjamilai
      @umarjamilai 8 months ago +2

      Of course you can (but you don't have to) use the average of the gradients. Actually, people usually do take the average. The reason the average is used is that we want the loss to be (more or less) the same as in the non-distributed model, so you can compare the plots of the two. I don't know whether PyTorch internally takes the average of the gradients automatically; I'd have to check the documentation/source. (See the sketch after this thread.)

    • @madhusudhanreddy9157
      @madhusudhanreddy9157 8 months ago

      @umarjamilai thanks for the info.
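
Editor's note: a purely illustrative sketch of what "averaging the gradients" amounts to if done by hand, i.e. an all-reduce sum across ranks followed by a division by the world size. This is not the video's code, and DDP users normally do not write this themselves.

```python
import torch
import torch.distributed as dist

def average_gradients(model: torch.nn.Module) -> None:
    """Sum each parameter's gradient over all processes, then divide by
    the number of processes to turn the sum into an average."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)  # sum over ranks
            param.grad /= world_size                           # sum -> mean
```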

  • @ai__76
    @ai__76 4 months ago

    How would you do this in Kubernetes? Please explain it.

  • @Erosis
    @Erosis 8 months ago

    Wouldn't the accumulated gradient need to be divided by the total number of individual gradients summed (or the learning rate divided by this value) to make it equivalent?

    • @umarjamilai
      @umarjamilai 8 months ago +2

      Yes, if you want to treat the "cumulative gradient" as a big batch, then you'd usually divide it by the number of items to keep it equivalent to the single-item setup. But it's not mandatory: as a matter of fact, loss functions in PyTorch have a "reduction" parameter, which is usually set to "mean" (dividing the loss by the number of items) but can also be set to "sum".
      One reason we usually calculate the "mean" loss is that we want to compare models with different hyperparameters (batch size), so the loss should not depend on the batch size.
      But remember that, mathematically, you don't have to. (See the sketch after this thread.)
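
Editor's note: a small illustrative sketch of the "reduction" parameter mentioned above, using nn.MSELoss as an arbitrary example. The two reductions differ only by the batch-size factor.

```python
import torch
import torch.nn as nn

pred = torch.randn(8, 1)    # placeholder predictions
target = torch.randn(8, 1)  # placeholder targets

loss_mean = nn.MSELoss(reduction="mean")(pred, target)  # sum of squared errors / number of items
loss_sum = nn.MSELoss(reduction="sum")(pred, target)    # plain sum of squared errors

# Dividing the "sum" loss by the batch size recovers the "mean" loss.
assert torch.allclose(loss_sum / pred.shape[0], loss_mean)
```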

  • @milonbhattacharya4097
    @milonbhattacharya4097 6 months ago

    Shouldn't the loss be accumulated? loss += (y_pred - y_actual)^0.5

    • @user-pt7gs2ei1r
      @user-pt7gs2ei1r 6 months ago

      In my understanding, yes, theoretically the loss is accumulated over one batch, and the gradients are computed based on this accumulated loss too. But in the parallel implementation, both the loss computed in the forward pass and the gradients computed in backpropagation are executed in parallel. Here @umarjamilai uses a for loop to illustrate the de facto parallel mechanism. (See the sketch after this thread.)
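
Editor's note: a small illustrative sketch of the accumulation being discussed. PyTorch adds new gradients into each parameter's .grad on every backward() call, so the per-step losses do not need to be summed into a single tensor by hand. The model and data below are placeholders.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                                 # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

# Fake "micro-batches"; in the distributed setting these would live on different GPUs.
micro_batches = [(torch.randn(4, 10), torch.randn(4, 1)) for _ in range(3)]

optimizer.zero_grad()
for x, y in micro_batches:
    loss = loss_fn(model(x), y)
    loss.backward()   # gradients are *added* into param.grad on each call
optimizer.step()      # single update from the accumulated gradients
```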

  • @user-ze3ok8hh6c
    @user-ze3ok8hh6c 8 months ago

    Do you have a Discord channel?

  • @Allen-TAN
    @Allen-TAN 8 months ago +1

    Always great to watch your video, excellent work