Masked Autoencoders Are Scalable Vision Learners - Paper explained and animated!

  • Published 7 Jul 2024
  • “Masked Autoencoders Are Scalable Vision Learners” paper explained by Ms. Coffee Bean. Say goodbye to contrastive learning and say hello (again) to autoencoders in #ComputerVision! Love the simple, yet elegant idea!
    ► Check out our sponsor: Weights & Biases 👉 wandb.me/ai-coffee-break
    📺 Vision Transformer explained: • Vision Transformers ex...
    Thanks to our Patrons who support us in Tier 2, 3, 4: 🙏
    donor, Dres. Trost GbR, Yannik Schneider
    ➡️ AI Coffee Break Merch! 🛍️ aicoffeebreak.creator-spring....
    Paper 📜: He, Kaiming, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár and Ross B. Girshick. “Masked Autoencoders Are Scalable Vision Learners.” (2021). arxiv.org/abs/2111.06377
    References:
    🔗 blog.keras.io/building-autoen...
    🔗 www.deeplearningbook.org/
    🔗 / 1462446494766837773
    📺 ViT video: • An image is worth 16x1...
    📺 DeiT: • Data-efficient Image T...
    📺 Swin Transformer: • Swin Transformer paper...
    Outline:
    00:00 Intro
    00:41 Weights & Biases (Sponsor)
    02:10 What are autoencoders?
    05:03 Differences between vision and language masked autoencoding
    07:02 How does masked autoencoding work for images?
    ▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
    🔥 Optionally, pay us a coffee to help with our Coffee Bean production! ☕
    Patreon: / aicoffeebreak
    Ko-fi: ko-fi.com/aicoffeebreak
    ▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
    ----------------
    🔗 Links:
    AICoffeeBreakQuiz: / aicoffeebreak
    Twitter: / aicoffeebreak
    Reddit: / aicoffeebreak
    YouTube: / aicoffeebreak
    #AICoffeeBreak #MsCoffeeBean #MachineLearning #AI #research​
  • Science & Technology

Comments • 54

  • @harumambaru
    @harumambaru 2 years ago +9

    55 views, I am an early bird! I hope you get enough money for coffee from sponsors :) I am not mocking, I am really happy that even young channels are supported by sponsors, and so happy that this sponsor can be helpful for most of the viewers

    • @AICoffeeBreak
      @AICoffeeBreak  2 years ago +6

      Thanks! I can totally relate to your point. I feel the same when it comes to small YouTubers I love.

    • @harumambaru
      @harumambaru 2 years ago +3

      @@AICoffeeBreak Could you list a couple of small YouTubers you love? I am into 3blue1brown, Yannik and 2min papers, but they are all pretty huge

    • @AICoffeeBreak
      @AICoffeeBreak  2 years ago +4

      Small but sponsored? No (except Sabine Hossenfelder, but she is not small).
      Just small: Machine Learning Street Talk, Alfredo Canziani, Henry AI Labs, Jay Alammar, The AI Epiphany, Aladdin Persson, Gradient Dude, vcubingx

    • @harumambaru
      @harumambaru 2 years ago +2

      @@AICoffeeBreak wow, you made my weekend. Instead of watching Monster Hunter with Milla Jovovich, I am going to watch Sabine Hossenfelder's protein folding videos

  • @user-js9qb7hz5e
    @user-js9qb7hz5e 2 years ago +16

    I have been procrastinating reading the paper until now and you just made a video, perfect.

    • @AICoffeeBreak
      @AICoffeeBreak  2 years ago +9

      You were not procrastinating. You were waiting for us to make the video. 😂

  • @beizhou4025
    @beizhou4025 2 years ago +19

    The animation is awesome. Thank you for taking the effort!

  • @michaellellouch3682
    @michaellellouch3682 2 years ago +3

    Cool stuff. Thanks for keeping us up to date on papers outside of our domain

  • @prajwalsood1350
    @prajwalsood1350 2 years ago +5

    Can't thank you enough, I have to present this paper in my class and this helps me a lot

  • @cipritom
    @cipritom 2 years ago +3

    In addition, I love the sound effects of the layer growing! Nice video!

    • @AICoffeeBreak
      @AICoffeeBreak  2 years ago +2

      Thanks! Including sound effects doesn't mean much most of the time. But at the right spots, it can trigger a sort of 3D effect.

  • @deoabhijit5935
    @deoabhijit5935 2 years ago +3

    wonderful explanation, amazing narration, elegant editing

  • @nilsmuller9286
    @nilsmuller9286 2 years ago +4

    Awesome video! :) Didn't have the paper on my radar yet, now I'll have to read it.

  • @soumyasarkar4100
    @soumyasarkar4100 2 years ago +5

    your content organisation is very good

  • @Mrbits01
    @Mrbits01 2 years ago +3

    The first time I heard the sound effects you used when expanding stuff (parameters, encoder size), I literally thought it was my stomach growling. Darn it, right when it was getting serious :D

    • @AICoffeeBreak
      @AICoffeeBreak  2 years ago +2

      Lol 😂 You are nominated for the funniest comment award.

  • @DerPylz
    @DerPylz 2 years ago +9

    I'm old enough!

  • @MengJiun_Chiou
    @MengJiun_Chiou 2 years ago +3

    Awesome explanation :)

  • @arigato39000
    @arigato39000 2 years ago +2

    thank you from japan

  • @mattcoleman2819
    @mattcoleman2819 2 years ago +7

    Great video, thanks! I'm a bit confused how transfer learning/downstream tasks will work with the encoder if its sequence length now needs to be increased. Or is the encoder sequence length set to the total # of patches, and attention masking/padding used during pretraining?

  • @sadface7457
    @sadface7457 2 years ago +7

    certified classic

    • @Agrover112
      @Agrover112 2 years ago

      That's a certified hood classic

  • @garyhuntress6871
    @garyhuntress6871 1 year ago +1

    I've been working on VITMAE for 2 days. Thanks for this video, very interesting.

    • @AICoffeeBreak
      @AICoffeeBreak  1 year ago +2

      Glad it was helpful! Keen to share what you are planning to do with it? :)

    • @garyhuntress6871
      @garyhuntress6871 1 year ago +1

      @@AICoffeeBreak I'm very interested in processing audio, particularly spectrograms. Ideally I think we need the equivalent of a LLM for acoustics. A really good embedding model for time series.

  • @pohsoonchang6127
    @pohsoonchang6127 2 years ago +3

    👍

  • @terryr9052
    @terryr9052 2 years ago +4

    I am curious why non-overlapping patches were chosen. I would think that would lead to reconstruction errors.

    • @AICoffeeBreak
      @AICoffeeBreak  2 years ago +2

      Thanks for the question. But could you please elaborate a little bit why this would cause errors and why overlapping patches would ameliorate the problem? The patches are non-overlapping but tile the entire image. And attention allows for patches to be informed about their fellow patches.

    • @terryr9052
      @terryr9052 2 years ago +3

      @@AICoffeeBreak I don't really have a rigorous answer, but my intuition tells me that forcing the model to predict every boundary between patches is less accurate than a model that actually gets to see the boundary as data.
      Thinking more about it, though, I understand that more patches means more work for the attention, which would counter the advantage gained from removing patches through masking...

  • @Tondo95
    @Tondo95 2 years ago +1

    05:18 Are there any references where one can look in more detail at the phenomenon of artifacts introduced by masking in CNN autoencoders? At first glance, I couldn't see the authors taking care to highlight this fact.
    P.S. The animations are great as always.

  • @antoinegar.638
    @antoinegar.638 4 months ago +1

    Hey there, thanks for the video!
    I'm late to the party, but I don't understand something:
    How is this architecture useful for downstream tasks like classification? I understand you can ditch the decoder and put your downstream classifier instead.
    However, the encoder reads 25% of the input (75% being masked). Won't this seriously lower the quality of the system compared to a classical autoencoder?

    • @AICoffeeBreak
      @AICoffeeBreak  4 months ago +1

      Hmm, you wouldn't do the masking for classification tasks where one is interested in representations, would you? The masking is just for training.
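      The point in this reply can be sketched in a few lines (an illustrative simplification, not the paper's code; the function name and shapes are made up): during pretraining a random 75% of patch tokens are dropped before the encoder, while for classification fine-tuning the mask ratio is simply 0 and the encoder sees every patch.

      ```python
      import numpy as np

      def visible_patches(patches, mask_ratio=0.75, rng=None):
          """Return the randomly kept patch tokens and their (sorted) indices.

          patches: (num_patches, dim) array of patch embeddings.
          With mask_ratio=0.75 (pretraining) only 25% reach the encoder;
          with mask_ratio=0.0 (fine-tuning/inference) all patches do.
          """
          rng = rng if rng is not None else np.random.default_rng(0)
          n = patches.shape[0]
          n_keep = n - int(n * mask_ratio)
          ids_keep = np.sort(rng.permutation(n)[:n_keep])
          return patches[ids_keep], ids_keep

      x = np.random.randn(16, 8)                    # 16 patches, embedding dim 8
      kept, ids = visible_patches(x)                # pretraining: 4 of 16 patches kept
      full, _ = visible_patches(x, mask_ratio=0.0)  # fine-tuning: all 16 patches
      print(kept.shape, full.shape)                 # (4, 8) (16, 8)
      ```

      The masking is purely a pretraining trick to make the encoder cheap and the task hard; nothing about the encoder itself changes at fine-tuning time.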

  • @nicolettileo
    @nicolettileo 1 year ago

    Thank you for your work, but nonetheless I still struggle to capture the idea of mask tokens, which seems crucial. I'm new to the field of transformers but used to good old CNN autoencoders, and what bothers me is: how can the masked tokens be directly fed into the decoder even though their latent representations haven't been computed? From what I understood, it isn't the masked tokens that are fed but some learnable shared vector. Am I right?
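    The mechanism this comment describes can be sketched as follows (a toy illustration based on the paper's description; function and variable names are made up): no latent is ever computed for a masked patch. Every masked slot is filled with copies of one shared, learnable vector, and in the real model positional embeddings are then added so the decoder knows which patch each slot corresponds to.

    ```python
    import numpy as np

    def decoder_input(encoded_visible, ids_keep, num_patches, mask_token):
        """Assemble the decoder's input sequence from encoder latents.

        encoded_visible: (num_visible, dim) latents of the visible patches.
        mask_token: (dim,) -- ONE shared learnable vector; every masked
        position receives this same vector (no per-patch latent exists).
        """
        seq = np.tile(mask_token, (num_patches, 1))  # start: all slots = mask token
        seq[ids_keep] = encoded_visible              # scatter visible latents into place
        return seq                                   # (+ positional embeddings in the real model)

    mask_tok = np.zeros(8)                 # a learnable parameter in practice
    latents = np.ones((4, 8))              # 4 visible-patch latents, dim 8
    seq = decoder_input(latents, np.array([1, 5, 9, 13]), 16, mask_tok)
    print(seq.shape)                       # (16, 8): 4 latents + 12 copies of the mask token
    ```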

  • @Youkouleleh
    @Youkouleleh 2 years ago +4

    Thanks for the video. Do you know why BERT would not use this strategy and just give the encoder the unmasked words?

    • @AICoffeeBreak
      @AICoffeeBreak  2 years ago +3

      Because the masked words have to be predicted, meaning a representation has to be computed there, which in transformers (as much as goes in, goes out again) means that BERT has to process the mask tokens too.
      Not even the paper presented in the video gets away from that curse, because the decoder has to see the masks again.

    • @Youkouleleh
      @Youkouleleh 2 years ago +2

      @@AICoffeeBreak Ok, and could BERT do this like in this paper (or why don't they use the same strategy)? I.e., give the (not masked/swapped) words to the encoder, and give the decoder the embedded words + the masked words (which would be learnable, like in this paper).
      This would also allow a bigger encoder during training.

    • @AICoffeeBreak
      @AICoffeeBreak  2 years ago +3

      @@Youkouleleh Ah, now I see the confusion: BERT does not actually have a (heavyweight) decoder. The "decoder" is just an MLP performing classification *on the MASK tokens* after they have been encoded. The decoder you just described is, in a sense, already the BERT encoder.
      See first answer to this question: stackoverflow.com/questions/60382793/what-are-the-inputs-to-the-transformer-encoder-and-decoder-in-bert

    • @AICoffeeBreak
      @AICoffeeBreak  2 years ago +2

      But it also might be that I am confused. Or Ms. Coffee Bean. If I am right, it is me. If I am wrong, it is Ms. Coffee Bean. 😅

    • @Youkouleleh
      @Youkouleleh 2 years ago +2

      @@AICoffeeBreak Thanks for your answer, I had this idea that BERT was some kind of autoencoder, but not really.
      But it is quite close to an AE + the matching-sentence task. If the classification of non-masked words counted in the loss, I think it would be an autoencoder + matching-sentence task
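      The BERT point discussed in this thread (no real decoder; just a classifier applied to the encoded [MASK] positions) can be illustrated like this (a schematic sketch, not BERT's actual code; the single linear projection and all names are simplifying assumptions):

      ```python
      import numpy as np

      def mlm_predictions(encoder_output, mask_positions, W, b):
          """BERT's 'decoder': a vocabulary classifier applied to the
          encoder output at the [MASK] positions. The encoder has already
          processed the mask tokens -- that is the 'curse' mentioned above."""
          logits = encoder_output @ W + b      # (seq_len, vocab_size)
          return logits[mask_positions]        # only masked slots enter the MLM loss

      hidden = np.random.randn(10, 16)                 # seq_len 10, hidden dim 16
      W, b = np.random.randn(16, 100), np.zeros(100)   # toy vocabulary of 100 words
      preds = mlm_predictions(hidden, np.array([2, 7]), W, b)
      print(preds.shape)                               # (2, 100): one score vector per masked token
      ```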

  • @youssefprojects7757
    @youssefprojects7757 1 year ago +1

    The video is informative and supported by good animations, but you need to speak a little more slowly and take some breaks in your speech, because sometimes there is too much information in one sentence. Thank you for your effort, and I hope you will take this feedback. I discovered your channel today and subscribed.

  • @aishik11
    @aishik11 2 years ago +2

    Any assistance on how to use this model for just encoding, without masking, like she suggests at 12:02? The huggingface implementation seems to be performing some masking.

  • @Agrover112
    @Agrover112 2 years ago +2

    Idk what will happen by the time I get into a PhD , AI will be crazy

  • @Easyy-Peasyy-Cooking
    @Easyy-Peasyy-Cooking 2 years ago +4

    Thank you for your nice explanation, but I would like to point out that MAE is not the first to propose this idea. In April 2021, much earlier than MAE, we proposed "SiT: Self-supervised Vision Transformers" and showed its merit on small datasets, because as a small group we cannot afford training on ImageNet. Despite the fact that we contacted the authors of MAE to acknowledge the original research, they did not respond to us! Similarly, Microsoft also used the same idea in "SimMIM: A Simple Framework for Masked Image Modeling" and did not acknowledge us. I would really appreciate it if you supported the original research and mentioned this story on your channel. Nowadays, research is only acceptable and acknowledged if it comes from these tech giants, and there is no place for small groups anymore.

    • @AICoffeeBreak
      @AICoffeeBreak  2 years ago +3

      As a member of a small group myself, I really feel your pain. I usually do criticize in my videos that the huge companies are dominating. Oftentimes they just use larger resources and not much in terms of ideas, and it looks more like engineering at scale and less like research.
      It's a pity they did not cite you even after you pointed this out. This is bad practice.

    • @Phenix66
      @Phenix66 2 years ago +3

      Feels so bad hearing about this... It hurts enough to think of something and see that it already exists, but this is worse. In general it really feels like David vs. Goliath at some point... Even aside from not getting visibility, not having the resources sucks, especially when (as it seems) most of the recent cool papers (Pathways, DALL-E 2, etc.) seem to stem from having vast amounts of data & computation power, not from cool new ideas :( When even evaluation is so bloody expensive, even on simple datasets, it can completely knock you out of the competition...