Vision Transformer for Image Classification
- added 29. 08. 2024
- Vision Transformer (ViT) is the new state-of-the-art for image classification. ViT was posted on arXiv in Oct 2020 and officially published in 2021. On all the public datasets, ViT beats the best ResNet by a small margin, provided that ViT has been pretrained on a sufficiently large dataset. The bigger the dataset, the greater the advantage of the ViT over ResNet.
Slides: github.com/wan...
Reference:
- Dosovitskiy et al. An image is worth 16×16 words: transformers for image recognition at scale. In ICLR, 2021.
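The paper's title ("an image is worth 16×16 words") refers to the patch-embedding step: the image is split into 16×16 patches, each flattened into a vector and treated as a token. A minimal numpy sketch of that step (the function name and shapes are illustrative, not from the paper's code):

```python
import numpy as np

def patchify(image, patch=16):
    """Split an (H, W, C) image into flattened patch vectors of length patch*patch*C."""
    H, W, C = image.shape
    patches = []
    for i in range(0, H, patch):
        for j in range(0, W, patch):
            patches.append(image[i:i + patch, j:j + patch, :].reshape(-1))
    return np.stack(patches)  # shape: (H*W / patch**2, patch*patch*C)

# A 224x224 RGB image yields 14*14 = 196 patches of 16*16*3 = 768 values each.
img = np.zeros((224, 224, 3))
tokens = patchify(img)
print(tokens.shape)  # (196, 768)
```

In the actual model each flattened patch is then multiplied by a learned projection matrix and a positional embedding is added before the tokens enter the Transformer encoder.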
Great explanation with detailed notation. Most of the videos on CZcams are purely oral explanations, but this kind of symbolic notation is very helpful for grasping the real picture, especially if anyone wants to re-implement it or build new ideas on top of it. Thank you so much. Please continue helping us by making these kinds of videos.
Can't stress enough how easy to understand you made it
Great explanation! Good for you! Don't stop giving ML guides!
Clear, concise, and overall easy to understand for a newbie like me. Thanks!
The best ViT explanation available. Also key to understand this for understanding Dino and Dino V2
The best video so far. The animation is easy to follow and the explanation is very straightforward.
Amazing video. It helped me to really understand the vision transformers. Thanks a lot.
These are some of the best, hands-on and simple explanations I've seen in a while on a new CS method. Straight to the point with no superfluous details, and at a pace that let me consider and visualize each step in my mind without having to constantly pause or rewind the video. Thanks a lot for your amazing work! :)
You have explained ViT in simple words. Thanks
Amazing, I am in a rush to implement a vision transformer as an assignment, and this saved me so much time!
lol, same
15 minutes of heaven 🌿. Thanks a lot understood clearly!
Wonderful explanation!👏
Man, you made my day! These lectures were golden. I hope you continue to make more of these
Thank you. Best ViT video I found.
Best ViT explanation ever!!!!!!
This is a great explanation video.
One nit: you are misusing the term 'dimension'. If a classification vector is linear with 8 values, that's not '8-dimensional' -- it is a 1-dimensional vector with 8 values.
This was a great video. Thanks for your time producing great content.
Very good explanation, better than many other videos on CZcams, thank you!
Thank you for your Attention Models playlist. Well explained.
This reminds me of Encarta encyclopedia clips when I was a kid lol! Good job mate!
Thank you, your video is way underrated. Keep it up!
Great explanation. Thank you!
Very clear, thanks for your work.
Very nice job, Shusen, thanks!
Thank you for the clear explanation!!☺
Thank you so much for this amazing presentation. You have a very clear explanation, I have learnt so much. I will definitely watch your Attention models playlist.
Good video, what a splendid presentation. Wang Shusen is the GOAT.
Nicely explained. Appreciate your efforts.
amazing precise explanation
Awesome Explanation.
Thank you
Thank you for the clear explanation
Brilliant explanation, thank you.
Amazing video. Please do one for Swin Transformers if possible. Thanks a lot
thank you so much for the clear explanation
Awesome explanation man thanks a tonne!!!
Excellent explanation 👌
that was educational!
Really great explanation, thank you
Very good explanation
subscribed!
If we ignore the outputs c1 ... cn, what do c1 ... cn represent then?
Brilliant. Thanks a million
CNN on images + positional info = Transformers for images
Great explanation
Wonderful talk
@9:30 Why do we discard c1... cn and use only c0? How is it that all the necessary information from the image gets collected & preserved in c0? Thanks
Hey, did you get answer to your question?
The class token c0 is in the embedding dimension; does that mean we should add a linear layer from the embedding dimension to the number of classes before the softmax for classification?
How is the model trained? I mean, what is the loss function? Does it use only the encoder, or both the encoder and decoder?
Nice video!! Just a question: what is the argument behind getting rid of the vectors c1 to cn and keeping only c0? Thanks
Good job! Thanks
The simplest and most interesting explanation, many thanks. A question about object detection models: have you explained them before?
That was great and helpful 🤌🏻
Great explanation :)
great video!
Really good, thx.
In the job market, do data scientists use transformers?
great
Super clear explanation! Thanks! I want to understand how attention is applied to the images. I mean, with a CNN you can "see" where the neural network is focusing, but with transformers?
The concept has similarities to TCP protocol in terms of segmentation and positional encoding. 😅😅😅
great video. thanks. could u plz explain swin transformer too?
Amazing video. It helped me to really understand vision transformers. Thanks a lot. But I have a question: why do we only use the cls token for the classifier?
Looks like, thanks to the attention layers, the cls token is able to extract all the data it needs for a good classification from the other tokens. Using all tokens for classification would just unnecessarily increase computation.
@@NeketShark that's a good answer. At 9:40, any idea how a softmax function was able to increase (or decrease) the dimension of vector "c" into "p"? I thought softmax would only change the entries of a vector, not its dimension.
@@Darkev77 I think it first goes through a linear layer and then through a softmax, so it's the linear layer that changes the dimension. In the video this info was probably omitted for simplification.
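The thread above is right that softmax alone cannot change a vector's dimension: the class-token output c0 first passes through a linear head that maps the embedding dimension to the number of classes, and softmax then normalizes those logits. A minimal numpy sketch (random weights; the dimensions 768 and 1000 are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
embed_dim, num_classes = 768, 1000

c0 = rng.standard_normal(embed_dim)                      # class-token output from the encoder
W = rng.standard_normal((num_classes, embed_dim)) * 0.01 # linear head: embed_dim -> num_classes
b = np.zeros(num_classes)

logits = W @ c0 + b
p = np.exp(logits - logits.max())  # subtract max for numerical stability
p /= p.sum()                       # softmax: entries are non-negative and sum to 1
print(p.shape)  # (1000,)
```

Fine-tuning on a new dataset typically means replacing only this head with one sized to the new number of classes while the encoder weights are kept (and optionally further trained).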
Can you please explain the paper "Your classifier is secretly an energy based model and you should treat it like one"? I want to understand these energy-based models.
Actually, I think it would be better if the creator spoke Chinese 🥰🤣
There is also a Chinese version ( czcams.com/video/BbzOZ9THriY/video.html ); different languages reach different audiences.
If you remove the positional encoding step, the whole thing is almost equivalent to a CNN, right?
I mean, those dense layers act just like the filters of a CNN.
1) You mentioned the pretrained model: it uses a large-scale dataset, and then a smaller dataset for fine-tuning. Does that mean c0 stays almost the same, except the last softmax layer is adjusted to the number of classes and then trained on the fine-tuning dataset? Or are there other different settings? 2) Another doubt for me: there's no masking at all in ViT, right? Since it comes from MLM ... um ...
Great great great
Why does the transformer require so many images to train? And why doesn't ResNet keep improving with more training data the way ViT does?
👏
Very good explanation! Can you please explain how we can fine-tune these models on our own dataset? Is it possible on a local computer?
Unfortunately, no. Google has TPU clusters. The amount of computation is insane.
@@ShusenWangEng Actually I have my project proposal due today. I was proposing this on the Food-101 dataset; it has 101,000 images.
So it can’t be done?
What size dataset can we train on our local PC
Can you please reply?
Stuck at the moment..
Thanks
@@parveenkaur2747 If your dataset is very different from ImageNet, Google's pretrained model may not transfer well to your problem. The performance can be bad.
Why do the authors evaluate and compare their results with the old ResNet architecture? Why not use EfficientNets for comparison? Looks like not the best result...
ResNet is a family of CNNs, and many tricks are applied to make ResNet work better. The reported numbers are indeed the best accuracies that CNNs can achieve.
Not All Heroes Wear Capes
This English is something else
Is this supposed to be English?