OpenAI Sora and DiTs: Scalable Diffusion Models with Transformers

  • Published Aug 22, 2024
  • Sora: openai.com/sora
    Sora paper (Video generation models as world simulators): openai.com/res...
    DiTs - Scalable Diffusion Models with Transformers paper: arxiv.org/abs/...
    My notes: drive.google.c...

Comments • 15

  • @AbhishekSingh-qp5xk • 5 months ago • +3

    Incredible explanations. Love the clarity of thought and illustrations to visualize the concepts.

  • @yuxiangzhang2343 • 6 months ago • +2

    All concepts beautifully explained! Very intuitive but accurate at the same time! Thank you so much!

  • @progzyy • 4 months ago

    Hey!
    I'd already watched some of your videos before, but randomly landed on this one again when I needed to learn about DiTs.
    I love how well and deeply you explain things; even when you cover basic material, it helps reinforce the learning.
    Even though it's an hour long, it feels like everything in it is needed.

  • @xplained6486 • 4 months ago

    Great explanation, not too much detail and not too little. You strike a very good balance, which makes the concepts easy to follow :)

  • @systemdesignstudygroup315 • 6 months ago

    I was just looking for this on your channel! Thanks!

  • @signitureDGK • 6 months ago • +1

    Great explanation. I could see them using a ViViT-style model for Sora: ViViT has separate spatial and temporal self-attention encoders, so perhaps two DiT blocks (a factorized-encoder ViViT); a rough sketch of that factorization follows below.
    Also, when would the multi-head cross-attention version be used? Say, for generating images from text prompts with more than 1000 classes, or for conditioning on even more modalities such as audio; would the DiT block with cross-attention be preferred then?
    Great video!
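
    A minimal PyTorch sketch of the factorized spatial/temporal attention described above, assuming video tokens shaped (batch, frames, patches, dim); the module name and all sizes are hypothetical, not taken from the video or the Sora report:

        import torch
        import torch.nn as nn

        class FactorizedSpaceTimeBlock(nn.Module):
            """ViViT-style factorization: attend over patches within each
            frame, then over frames at each patch position."""
            def __init__(self, dim: int, heads: int = 8):
                super().__init__()
                self.norm1 = nn.LayerNorm(dim)
                self.norm2 = nn.LayerNorm(dim)
                self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
                self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

            def forward(self, x: torch.Tensor) -> torch.Tensor:
                B, T, P, D = x.shape               # batch, frames, patches/frame, dim
                s = x.reshape(B * T, P, D)         # each frame attends over its patches
                h = self.norm1(s)
                s = s + self.spatial(h, h, h, need_weights=False)[0]
                x = s.reshape(B, T, P, D).permute(0, 2, 1, 3)      # (B, P, T, D)
                t = x.reshape(B * P, T, D)         # each patch slot attends over frames
                h = self.norm2(t)
                t = t + self.temporal(h, h, h, need_weights=False)[0]
                return t.reshape(B, P, T, D).permute(0, 2, 1, 3)   # back to (B, T, P, D)

        x = torch.randn(2, 8, 16, 64)              # 2 clips, 8 frames, 16 patches, d=64
        print(FactorizedSpaceTimeBlock(64)(x).shape)  # torch.Size([2, 8, 16, 64])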

  • @johntanchongmin • 4 months ago

    I enjoy your content! Keep it up!

  • @maxziebell4013 • 6 months ago

    Great walkthrough.

  • @adidevbhattacharya9220 • a month ago

    That was indeed a great explanation.
    Could you please explain how we get the hidden dimension d @28:19?
    E.g., if the image in latent space is 128x128x3 and we use a patch size of 32,
    then the number of tokens = (128/32)^2 = 16.
    Is the dimension d then p^2 = 32^2?
    Please clarify this (see the sketch below).
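
    Following the DiT paper's patchify step, each p x p patch is flattened (p^2 * C = 32 * 32 * 3 = 3072 values in this example) and then linearly embedded to the hidden dimension d, so d is a separate model hyperparameter (e.g. d = 1152 for DiT-XL), not p^2. A minimal PyTorch sketch, with a strided convolution standing in for the per-patch flatten-plus-linear:

        import torch
        import torch.nn as nn

        B, C, H, W = 1, 3, 128, 128    # the comment's example latent
        p, d = 32, 1152                # patch size; d is a chosen hyperparameter

        # A p-strided conv == flatten each p x p x C patch and apply one linear layer.
        embed = nn.Conv2d(C, d, kernel_size=p, stride=p)

        z = torch.randn(B, C, H, W)
        tokens = embed(z).flatten(2).transpose(1, 2)  # (B, (H/p)*(W/p), d)
        print(tokens.shape)                           # torch.Size([1, 16, 1152])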

  • @bibiworm • 6 months ago • +1

    11:58 Are you talking about an ODE solver, i.e., an ordinary differential equation? Thanks.

  • @regrefree • 6 months ago

    Good explanation of the background part. Question: when you explained cross-attention, you said Q = z and K, V = [c; t]. They don't give that detail in the paper, but I think it should be the other way around, Q = [c; t] and K, V = z, right?

    • @gabrielmongaras • 6 months ago

      Usually the conditioning goes into the keys and values, as in the Stable Diffusion paper. If Q, K, V have shapes (N, d), (M, d), and (M, d), where N is the sequence length and M is the context length, then the output shape is SM[(N, d)(d, M)](M, d) -> (N, M)(M, d) -> (N, d). However, if we invert this, so Q, K, V have shapes (M, d), (N, d), and (N, d), then the output shape is SM[(M, d)(d, N)](N, d) -> (M, N)(N, d) -> (M, d), which is a sequence in terms of the conditioning sequence. (A shape check follows below.)
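
      A quick single-head shape check of the above in PyTorch, with Q from z and K, V from the conditioning [c; t]; the sizes are arbitrary toy values:

          import torch

          N, M, d = 16, 4, 64                          # latent length, context length, dim
          Q = torch.randn(N, d)                        # queries from z
          K, V = torch.randn(M, d), torch.randn(M, d)  # keys/values from [c; t]

          weights = torch.softmax(Q @ K.T / d ** 0.5, dim=-1)  # SM[(N,d)(d,M)] -> (N, M)
          out = weights @ V                                    # (N, M)(M, d) -> (N, d)
          print(out.shape)                                     # torch.Size([16, 64])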

    • @regrefree • 6 months ago

      @gabrielmongaras Yep, I agree. I just read the Stable Diffusion paper carefully, and you are right: they use Q = z and K, V = c. I would have guessed the reverse, since the output of the U-Net is z_{T-1} computed from z_T. Also, their cross-attention weights' shape doesn't make sense; I am sure they made a mistake. They should have said Q, V = z and K = c.

  • @thebgEntertainment1 • 6 months ago

    Great video

  • @bibiworm • 6 months ago

    14:13 I don't quite understand the equation for x_{t-1}.
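
    The step at 14:13 is presumably the standard DDPM reverse (sampling) update; this reconstruction follows the DDPM paper and is an assumption about what the video shows:

        x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \, \epsilon_\theta(x_t, t) \right) + \sigma_t z, \qquad z \sim \mathcal{N}(0, I)

    where \alpha_t = 1 - \beta_t, \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s, and \epsilon_\theta is the learned noise predictor.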