Masked Autoencoders Are Scalable Vision Learners - Paper explained and animated!
- published 7 Jul 2024
- “Masked Autoencoders Are Scalable Vision Learners” paper explained by Ms. Coffee Bean. Say goodbye to contrastive learning and say hello (again) to autoencoders in #ComputerVision! Love the simple, yet elegant idea!
► Check out our sponsor: Weights & Biases 👉 wandb.me/ai-coffee-break
📺 Vision Transformer explained: • Vision Transformers ex...
Thanks to our Patrons who support us in Tier 2, 3, 4: 🙏
donor, Dres. Trost GbR, Yannik Schneider
➡️ AI Coffee Break Merch! 🛍️ aicoffeebreak.creator-spring....
Paper 📜: He, Kaiming, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross B. Girshick. “Masked Autoencoders Are Scalable Vision Learners.” (2021). arxiv.org/abs/2111.06377
References:
🔗 blog.keras.io/building-autoen...
🔗 www.deeplearningbook.org/
🔗 / 1462446494766837773
📺 ViT video: • An image is worth 16x1...
📺 DeiT: • Data-efficient Image T...
📺 Swin Transformer: • Swin Transformer paper...
Outline:
00:00 Intro
00:41 Weights & Biases (Sponsor)
02:10 What are autoencoders?
05:03 Differences between vision and language masked autoencoding
07:02 How does masked autoencoding work for images?
▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
🔥 Optionally, pay us a coffee to help with our Coffee Bean production! ☕
Patreon: / aicoffeebreak
Ko-fi: ko-fi.com/aicoffeebreak
▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
----------------
🔗 Links:
AICoffeeBreakQuiz: / aicoffeebreak
Twitter: / aicoffeebreak
Reddit: / aicoffeebreak
YouTube: / aicoffeebreak
#AICoffeeBreak #MsCoffeeBean #MachineLearning #AI #research - Science & Technology
55 views, I am an early bird! I hope you get enough money for coffee from sponsors :) I am not mocking; I am really happy that even young channels are supported by sponsors, and so happy that this sponsor can be helpful for most of the viewers.
Thanks! I can totally relate to your point. I feel the same when it comes to small YouTubers I love.
@@AICoffeeBreak Could you list a couple of small YouTubers you love? I am into 3blue1brown, Yannik and 2min papers, but they are all pretty huge.
Small but sponsored? No (except Sabine Hossenfelder, but she is not small).
Just small: Machine Learning Street Talk, Alfredo Canziani, Henry AI Labs, Jay Alammar, The AI Epiphany, Aladdin Persson, Gradient Dude, vcubingx
@@AICoffeeBreak Wow, you made my weekend. Instead of watching Monster Hunter with Milla Jovovich, I am going to watch Sabine Hossenfelder's protein folding videos.
I have been procrastinating reading the paper until now and you just made a video, perfect.
You were not procrastinating. You were waiting for us to make the video. 😂
The animation is awesome. Thank you for taking the effort!
Glad you liked it!
Cool stuff. Thanks for keeping us up to date on papers outside of our domain
Can't thank you enough; I have to present this paper in my class and this helps me a lot.
In addition, I love the sound effects of the layer growing! Nice video!
Thanks! Sound effects don't add much most of the time, but at the right spots they can trigger a sort of 3D effect.
Wonderful explanation, amazing narration, elegant editing.
Awesome video! :) Didn't have the paper on my radar yet; now I'll have to read it.
your content organisation is very good
Thanks! Glad we did a thing right.
The first time I heard the sound effects you used when expanding stuff (parameters, encoder size), I literally thought it was my stomach growling. Darn it, right when it was getting serious :D
Lol 😂 You are nominated for the funniest comment award.
I'm old enough!
Awesome explanation :)
Thanks!
thank you from japan
Hey, it's you! Missed your comments. :)
Great video, thanks! I'm a bit confused about how transfer learning/downstream tasks will work with the encoder if its sequence length now needs to be increased. Or is the encoder sequence length set to the total number of patches, and attention masking/padding used during pretraining?
certified classic
That's a certified hood classic
I've been working on ViTMAE for 2 days. Thanks for this video, very interesting.
Glad it was helpful! Keen to share what you are planning to do with it? :)
@@AICoffeeBreak I'm very interested in processing audio, particularly spectrograms. Ideally I think we need the equivalent of an LLM for acoustics: a really good embedding model for time series.
👍
I am curious why non-overlapping patches were chosen. I would think that would lead to reconstruction errors.
Thanks for the question. But could you please elaborate a little bit why this would cause errors and why overlapping patches would ameliorate the problem? The patches are non-overlapping but tile the entire image. And attention allows for patches to be informed about their fellow patches.
@@AICoffeeBreak I don't really have a rigorous answer, but my intuition tells me that forcing the model to predict every boundary between patches is less accurate than a model that actually gets to see the boundary as data.
Thinking more about it though, I do understand that more patches mean more work for the attention, and thus would counter the advantage gained from removing patches through masking...
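For reference, the non-overlapping tiling discussed above can be sketched in a few lines. This is a minimal numpy illustration (image shape and patch size are illustrative, not from the paper's code): the patches tile the image exactly, with no overlap, and each one becomes a flattened token.

```python
import numpy as np

def patchify(img, p):
    """Split an (H, W, C) image into non-overlapping p x p patches.

    Returns an (N, p*p*C) array with N = (H//p) * (W//p),
    mirroring the flattened-patch tokens a ViT-style encoder consumes.
    """
    h, w, c = img.shape
    assert h % p == 0 and w % p == 0, "image must tile exactly"
    x = img.reshape(h // p, p, w // p, p, c)
    x = x.transpose(0, 2, 1, 3, 4)          # (h/p, w/p, p, p, c)
    return x.reshape(-1, p * p * c)

# toy 32x32 RGB image -> four 16x16 patches, each a 768-dim token
img = np.arange(32 * 32 * 3).reshape(32, 32, 3).astype(np.float32)
patches = patchify(img, 16)
print(patches.shape)  # (4, 768)
```

Since the tiles jointly cover every pixel, boundary pixels are still seen by the model; they just land in different tokens, and attention lets neighboring patches inform each other.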
05:18 Are there any references where one can look in more detail into the phenomenon of artifacts introduced by the use of masking in CNN autoencoders? At first glance I couldn't see the authors taking care to highlight this fact.
P.S. The animations are great as always.
Hey there, thanks for the video!
I'm late to the party, but I don't understand something:
How is this architecture useful for downstream tasks like classification? I understand you can ditch the decoder and put your downstream classifier instead.
However, the encoder reads only 25% of the input (75% being masked). Won't this seriously lower the quality of the system compared to a classical autoencoder?
Hmm, you wouldn't do the masking for classification tasks where one is interested in representations, would you? The masking is just for training.
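To make that point concrete, here is a minimal numpy sketch of MAE-style random masking, which happens only during pretraining (the 75% ratio matches the paper; sizes and the exact shuffling trick are illustrative). At fine-tuning or inference time this step is simply skipped and all patch tokens go to the encoder.

```python
import numpy as np

def random_masking(tokens, mask_ratio, rng):
    """Keep a random subset of patch tokens, as in MAE pretraining.

    tokens: (N, D) patch embeddings. Returns the kept tokens,
    a binary mask (1 = masked/removed) in original patch order,
    and the indices that restore that order for the decoder.
    """
    n, _ = tokens.shape
    len_keep = int(n * (1 - mask_ratio))
    noise = rng.random(n)                    # one random score per patch
    ids_shuffle = np.argsort(noise)          # random permutation of patches
    ids_restore = np.argsort(ids_shuffle)    # inverse permutation
    kept = tokens[ids_shuffle[:len_keep]]    # encoder only sees these
    mask = np.ones(n)
    mask[:len_keep] = 0
    mask = mask[ids_restore]                 # back to original patch order
    return kept, mask, ids_restore

rng = np.random.default_rng(0)
tokens = rng.standard_normal((196, 768))     # 14x14 patches, ViT-Base dim
kept, mask, ids_restore = random_masking(tokens, 0.75, rng)
print(kept.shape, int(mask.sum()))           # (49, 768) 147
```

Because the encoder is a plain ViT operating on whatever token sequence it gets, no attention masking or padding is needed: during pretraining the sequence is just shorter (49 tokens here), and downstream it is the full 196.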
Thank you for your work, but nonetheless I still struggle to capture the idea of mask tokens, which seems crucial. I'm new to the field of transformers, but used to good old CNN autoencoders, and what bothers me is: how can the masked tokens be fed directly into the decoder even though their latent representations haven't been computed? From what I understood, it isn't the masked tokens that are fed, but some learnable shared vector. Am I right?
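That reading matches the paper: the decoder never sees the masked patches themselves, only a single learnable vector, shared across all masked positions, placed where the missing patches belong (plus positional embeddings, omitted here). A minimal numpy sketch of assembling the decoder input (toy sizes; `mask_token` stands in for the learned parameter, and `ids_restore` is the inverse permutation produced by the masking step):

```python
import numpy as np

def build_decoder_input(encoded, mask_token, ids_restore):
    """Insert one shared 'mask token' at every masked position.

    encoded:     (len_keep, D) encoder outputs (visible patches only)
    mask_token:  (D,) shared vector (learned in training; fixed here
                 purely for illustration)
    ids_restore: (N,) indices restoring the original patch order
    """
    n = ids_restore.shape[0]
    n_masked = n - encoded.shape[0]
    mask_tokens = np.tile(mask_token, (n_masked, 1))   # same vector, repeated
    x = np.concatenate([encoded, mask_tokens], axis=0) # still shuffled order
    return x[ids_restore]                              # unshuffle to image order

encoded = np.zeros((49, 8))          # 49 visible patches, toy dim 8
mask_token = np.full(8, 7.0)         # stand-in for the learned vector
ids_restore = np.random.default_rng(1).permutation(196)
dec_in = build_decoder_input(encoded, mask_token, ids_restore)
print(dec_in.shape)                  # (196, 8): full sequence for the decoder
```

The decoder then has to reconstruct each masked patch from that shared placeholder plus its position and the context of the visible patches, which is what makes the vector worth learning.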
Thanks for the video. Do you know why BERT would not use this strategy and just give the encoder the non-masked words?
Because the masked words have to be predicted, meaning a representation has to be computed for them, which in Transformers (as much goes in as comes out) means that BERT has to process the mask tokens too.
Not even the paper presented in the video gets away from that curse, because the decoder has to see the masks again.
@@AICoffeeBreak Ok, and could BERT do this like in this paper (or why don't they use this same strategy)? I.e., give the non-masked/swapped words to the encoder, and give the decoder the embedded words + the masked words (which would be learned, like in this paper).
This would also allow for a bigger encoder during training.
@@Youkouleleh Ah, now I see the confusion: BERT does not actually have a (heavyweight) decoder. The "decoder" is just an MLP performing classification *on the MASK tokens* after they have been encoded. The decoder you just presented is, in a sense, already the BERT encoder.
See first answer to this question: stackoverflow.com/questions/60382793/what-are-the-inputs-to-the-transformer-encoder-and-decoder-in-bert
But it also might be that I am confused. Or Ms. Coffee Bean. If I am right, it is me. If I am wrong, it is Ms. Coffee Bean. 😅
@@AICoffeeBreak thanks for your answer, I had this idea that BERT was some kind of autoencoder, but not really.
But it is quite close to an AE + the matching-sentence task. If the classification of non-masked words counted in the loss, I think it would be an autoencoder + matching-sentence task.
The video is informative and supported by good animations, but you need to speak a little more slowly and take some breaks in your speech, because sometimes there is too much information in one sentence. Thank you for your effort, and I hope you will take this feedback. I discovered your channel today and I subscribed.
Any assistance on how to use this model just for encoding, without masking, like she suggests at 12:02? The huggingface implementation seems to perform some masking.
Idk what will happen by the time I get into a PhD , AI will be crazy
Where are you at the moment?
@@AICoffeeBreak Bachelors lol
I pity you.
Hold on there.
Thank you for your nice explanation, but I would like to point out that MAE is not the first to propose this idea. In April 2021, much earlier than MAE, we proposed "SiT: Self-supervised Vision Transformers" and showed its merit on small datasets, because as a small group we cannot afford training on ImageNet. Despite the fact that we contacted the authors of MAE to acknowledge the original research, they did not respond to us! Similarly, Microsoft also used the same idea in "SimMIM: A Simple Framework for Masked Image Modeling" and did not acknowledge us. I would really appreciate it if you supported the original research and mentioned this story on your channel. Nowadays, research is only acceptable and acknowledged if it comes from these tech giants, and there is no place for small groups anymore.
As a member of a small group myself, I really feel your pain. I usually do criticize in my videos that the huge companies are dominating. Often they just use larger resources without much in terms of ideas, and it looks more like engineering at scale and less like research.
It's a pity they did not cite you even after pointing this out. This is bad practice.
Feels so bad hearing about this... It hurts enough to think of something and see that it already exists, but this is worse. In general it really feels like David vs. Goliath at some point... Even aside from not getting visibility, not having the resources hurts, especially when (as it seems) most of the recent cool papers (Pathways, DALL-E 2, etc.) seem to stem from having vast amounts of data and computation power, not from cool new ideas :( When even evaluation is so bloody expensive, even on simple datasets, it can completely knock you out of the competition...