ConvNeXt: A ConvNet for the 2020s - Paper Explained (with animations)

AI Coffee Break with Letitia

20,814 views

Added on 17 July 2024
Can a ConvNet outperform a Vision Transformer? What kind of modifications do we have to apply to a ConvNet to make it as powerful as a Transformer? Spoiler: it’s not attention.
► SPONSOR: Weights & Biases 👉 wandb.me/ai-coffee-break
The official ConvNeXt repo has a W&B integration! Also, W&B built the CIFAR10 training colab linked there: 🥳 / 1486325233711828996
❓ Check out our daily #MachineLearning Quiz Questions: / aicoffeebreak
➡️ AI Coffee Break Merch! 🛍️ aicoffeebreak.creator-spring....
Explained Paper 📜: Liu, Zhuang, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. “A ConvNet for the 2020s.” arXiv preprint arXiv:2201.03545 (2022). arxiv.org/abs/2201.03545
🔗 Tweet by Lucas Beyer (ViT author): / 1481054929573888005
🔗 Depthwise convolutions image and explanation: eli.thegreenplace.net/2018/de...
Referenced videos:
📺 An image is worth 16x16 words: • An image is worth 16x1...
📺 Swin Transformer: • Swin Transformer paper...
📺 This is how Transformers can process both image and text: • Transformers can do bo...
📺 ViLBERT explained: • Transformer combining ...
📺 DeiT explained: • Data-efficient Image T...
📺 Transformers sequence length: • Do Transformers proces...
Referenced papers:
📜 “Image Transformer” Paper: Parmar, Niki, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. “Image transformer.” In International Conference on Machine Learning, pp. 4055-4064. PMLR, 2018. arxiv.org/abs/1802.05751
📜 “ViLBERT” paper: Lu, Jiasen, Dhruv Batra, Devi Parikh, and Stefan Lee. “Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks.” arXiv preprint arXiv:1908.02265 (2019). arxiv.org/abs/1908.02265
Outline:
00:00 A ConvNet for the 2020s
01:58 Weights & Biases (Sponsor)
03:10 Why bother?
04:40 The perks of ConvNets (CNNs)
06:51 Pros and cons of Transformers
09:54 From ConvNets to ConvNeXts
15:54 Lessons?
Thanks to our Patrons who support us in Tier 2, 3, 4: 🙏
donor, Dres. Trost GbR, banana.dev
▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
🔥 Optionally, buy us a coffee to help with our Coffee Bean production! ☕
Patreon: / aicoffeebreak
Ko-fi: ko-fi.com/aicoffeebreak
▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
🔗 Links:
AICoffeeBreakQuiz: / aicoffeebreak
Twitter: / aicoffeebreak
Reddit: / aicoffeebreak
YouTube: / aicoffeebreak
#AICoffeeBreak #MsCoffeeBean #MachineLearning #AI #research
Science & Technology

Comments • 30

@user-js9qb7hz5e 2 years ago ⁺⁸
Hello Letitia, thank you for this great video!
Correct me if I am wrong, but I am pretty sure that I have seen the 1/4 size ratio you talk about at 12:38 in both the original ViT paper and the "Training data-efficient image transformers & distillation through attention" paper that I have read.
In the original ViT paper they use this MLP block ratio in almost all of their experiments without mentioning it explicitly, whilst in the second one they mention the 1/4 ratio of the MLP block on page 5 of the paper. I am a newbie in deep learning and transformers though, so take everything I say with a grain of salt 😅
@AICoffeeBreak 2 years ago ⁺⁸
Thanks! Yes, it's Table 1 in the ViT paper. Then we totally misunderstood what that factor 4 was referring to while making the video. 🙈
@charlesfoster6326 2 years ago ⁺³
Tacking on, an expansion ratio of 3 or 4 in the MLP is also pretty standard in transformers for natural language tasks.
@DerPylz 2 years ago ⁺¹¹
First coffee bean of the year!! 🎉 congrats on the 11k subs!
@CristianGarcia 2 years ago ⁺¹⁶
They went all in with the storytelling on this paper, they even extracted the core design choices as "wisdom bits". I really don't believe they achieved the final architecture this way but reading the "linear improvement story" was very entertaining.
@Kartik_C 2 years ago ⁺⁷
Thank you Miss Coffee Bean! The 60 sec explanation of translational equivariance was amazing!
@marverickbin 1 year ago ⁺³
Can't wait to try a U-Net with a ConvNeXt backbone
@ElQaheryProductions 2 years ago ⁺³
This is a really nice way of reviewing papers! Keep it up!
@MrMadmaggot 3 months ago
3:32 I love how SKEWED that fookin graph is maam is just fkn nuts.
@hannesstark5024 2 years ago ⁺⁶
10K subscriber congrats! ^^
@AICoffeeBreak 2 years ago ⁺⁴
Yes! Thank you! 🤝 Means a lot from an early subscriber like yourself.
@pourmohammaddeveloper2034 2 years ago ⁺³
many many thanks from Iran
@butterkaffee910 2 years ago ⁺⁹
LeCun must be so happy right now
@AICoffeeBreak 2 years ago ⁺⁶
Absolutely. 😆
@edwardbrown2873 2 years ago ⁺⁵
Love this. Superb. Keep it up!
@AICoffeeBreak 2 years ago ⁺⁴
Thank you! Will do! 😀
@RodrigoCh 2 years ago ⁺³
Fastest 20 min ever! Thank you for the clear explanation. I especially like how you animate the explanation!
May I ask what you use to make the animations? Maybe you could add a FAQ section; I can imagine you get this question a lot.
@AICoffeeBreak 2 years ago ⁺⁷
Thanks, this comment makes us very happy!
I do not want to make a FAQ section: comments and questions are good for making the Algorithm believe it should push us further up into your recommendations.
I animate everything but Ms. Coffee Bean in good old PowerPoint (yeah, tools are what you make of them 🙈).
Ms. Coffee Bean is animated in the video editing software kdenlive (open source and available for all operating systems).
@hyunkim2172 2 years ago ⁺³
Many thanks!
@TimScarfe 2 years ago ⁺⁵
Awesome 🔥🔥😎😎
@giantbee9763 2 years ago ⁺³
Great point on how we jumped right into transformers and forgot to pin down exactly the effect of small tweaks.
Great video again! :D
@giantbee9763 2 years ago ⁺⁴
I think what they might have meant by inverted bottleneck: key, value, query and the residual connections :D Though would you call that an inverted bottleneck? What do you think @letitia?
@AICoffeeBreak 2 years ago ⁺³
No, it is a tiny detail that concerns how the MLP layer is built: d -> 4d -> d. Here is Alexa explaining this (link with the right time stamp: czcams.com/video/idiIllIQOfU/video.html)
@AICoffeeBreak 2 years ago ⁺²
I missed the point there in the video when talking about inverted bottlenecks. I thought about the Swin Transformer 🙈
@giantbee9763 2 years ago ⁺¹
@AICoffeeBreak That's right! I forgot how the position-wise feed-forward layer is constructed, which indeed is an inverted bottleneck.
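For readers following this thread, here is a minimal sketch of the d -> 4d -> d MLP in plain Python. The function names (`init_linear`, `inverted_bottleneck_mlp`), the toy width d = 8, the random initialisation, and the choice of GELU as the activation are illustrative assumptions, not code from the ConvNeXt or ViT repositories.

```python
import math
import random


def gelu(x):
    # Exact GELU written with the Gaussian error function.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(x, weight, bias):
    # x has d_in entries; weight has d_out rows of d_in values each.
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def init_linear(d_in, d_out, rng):
    # Hypothetical initialisation, only here to make the sketch runnable.
    weight = [[rng.gauss(0.0, 0.02) for _ in range(d_in)]
              for _ in range(d_out)]
    bias = [0.0] * d_out
    return weight, bias


def inverted_bottleneck_mlp(x, params):
    (w1, b1), (w2, b2) = params
    hidden = [gelu(h) for h in linear(x, w1, b1)]  # expand: d -> 4d
    out = linear(hidden, w2, b2)                   # project back: 4d -> d
    return out, len(hidden)


rng = random.Random(0)
d, ratio = 8, 4  # toy width and the expansion ratio from the thread
params = (init_linear(d, ratio * d, rng), init_linear(ratio * d, d, rng))
x = [rng.gauss(0.0, 1.0) for _ in range(d)]
y, hidden_width = inverted_bottleneck_mlp(x, params)
print(len(x), hidden_width, len(y))  # prints: 8 32 8
```

The block is called "inverted" because the wide layer sits in the middle, whereas a classic bottleneck narrows in the middle and is wide at both ends.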
@hararani 11 months ago ⁺¹
Hello Letitia, thank you so much for your video; it's a great inspiration for my thesis. If you don't mind, can I ask you a question? In your opinion, is it feasible for me, as a newbie, to write a research paper comparing ViT, DeiT and ConvNeXt for image classification on 10,000 images? The models are fairly new and not many papers have implemented them yet. Thank you.
@JapiSandhu 2 years ago ⁺³
Can ConvNeXt be used for video classification with time series data?
Can there be a 3D ConvNeXt, like there is a 3D CNN?
@AICoffeeBreak 2 years ago ⁺³
I do not see why this wouldn't be extendable to video. :)
@gabrieldealca4829 1 year ago
What is the best state-of-the-art architecture for regression tasks involving images?
