How do Vision Transformers work? - Paper explained | multi-head self-attention & convolutions

  • Published 17 Jul 2024
  • It turns out that multi-head self-attention and convolutions are complementary. So, what makes multi-head self-attention different from convolutions? How and why do Vision Transformers work? In this video, we will find out by explaining the paper "How Do Vision Transformers Work?" by Park & Kim, 2021.
    SPONSOR: Weights & Biases 👉 wandb.me/ai-coffee-break
    ⏩ Vision Transformers explained playlist: • Vision Transformers ex...
    📺 ViT: An image is worth 16x16 pixels: • An image is worth 16x1...
    📺 Swin Transformer: • Swin Transformer paper...
    📺 ConvNext: • ConvNeXt: A ConvNet fo...
    📺 DeiT: • Data-efficient Image T...
    📺 Adversarial attacks: • Adversarial Machine Le...
    ❓Check out our daily #MachineLearning Quiz Questions: ►
    / aicoffeebreak
    ➡️ AI Coffee Break Merch! 🛍️ aicoffeebreak.creator-spring....
    Thanks to our Patrons who support us in Tier 2, 3, 4: 🙏
    Don Rosenthal, Dres. Trost GbR, banana.dev -- Kyle Morris, Joel Ang
    Paper 📜:
    Park, Namuk, and Songkuk Kim. "How Do Vision Transformers Work?" In International Conference on Learning Representations, 2021. openreview.net/forum?id=D78Go...
    🔗 Official implementation: github.com/xxxnell/how-do-vit...
    Outline:
    00:00 Transformers vs ConvNets
    01:04 Sponsor: Weights & Biases
    02:21 Convolutions explained in a nutshell
    03:35 Multi-Head Self-Attention explained
    06:46 Why we thought that MSA is cool
    09:56 Paper insights
    15:26 MSA vs. Convs (more insight)
    16:07 Low-pass filters (MSA) and high-pass filters (Convs)
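    The contrast in the last outline point can be illustrated with a toy example: a box blur (a low-pass filter, the behaviour the paper attributes to MSAs) suppresses high-frequency signal, while a Laplacian-style kernel (high-pass, closer to what convolutions learn) suppresses the constant component. This is my own NumPy sketch of the general filtering idea, not code from the paper:

```python
import numpy as np

def conv1d(signal, kernel):
    """Same-size 1-D convolution."""
    return np.convolve(signal, kernel, mode="same")

low  = np.ones(8)                   # constant = lowest-frequency signal
high = np.array([1.0, -1.0] * 4)    # alternating = highest-frequency signal

box = np.array([1/3, 1/3, 1/3])     # box blur: a low-pass filter
lap = np.array([-1.0, 2.0, -1.0])   # Laplacian-like: a high-pass filter

# Low-pass: keeps the constant signal, damps the alternating one.
print(conv1d(low, box)[1:-1])       # interior values: all 1.0
print(conv1d(high, box)[1:-1])      # magnitudes shrink to 1/3
# High-pass: removes the constant signal, amplifies the alternating one.
print(conv1d(low, lap)[1:-1])       # interior values: all 0.0
print(conv1d(high, lap)[1:-1])      # magnitudes grow to 4
```

    The edge entries are dropped because "same"-mode convolution zero-pads there; only the interior shows the pure filtering behaviour.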
    ▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
    🔥 Optionally, pay us a coffee to help with our Coffee Bean production! ☕
    Patreon: / aicoffeebreak
    Ko-fi: ko-fi.com/aicoffeebreak
    ▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
    🔗 Links:
    AICoffeeBreakQuiz: / aicoffeebreak
    Twitter: / aicoffeebreak
    Reddit: / aicoffeebreak
    YouTube: / aicoffeebreak
    #AICoffeeBreak #MsCoffeeBean #MachineLearning #AI #research​
    Music 🎵 : Bella Bella Beat by Nana Kwabena
  • Science & Technology

Comments • 63

  • @AICoffeeBreak • 2 years ago +19

    Okay bye!

    • @devstuff2576 • 1 year ago

      This was so confusing, and when you said "do not be confused, we are talking about inference here..." you proceeded to make it even more confusing.

  • @MrMIB983 • 2 years ago +9

    Love your channel, best ml videos, you are so kind

  • @DerPylz • 2 years ago +7

    Thank you for summarising this paper!

  • @HoriaCristescu • 2 years ago +5

    Very good video, especially the mention of the Hessian eigenvalues relation to the curvature of the loss landscape.

  • @willsmithorg • 2 years ago +3

    Thank you. Interesting and well summarised as always.

  • @vladimirtchuiev2218 • 2 years ago +6

    From my experience, rather than going full ViT, I think that the structure of a ConvNet backbone that extracts features, which then are fed to MSA (a la DETR) is a stronger structure for vision, as evident in current SOTA approaches for image classification.
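    The hybrid structure described above (a convolutional feature extractor whose outputs are fed to self-attention, roughly in the spirit of DETR) can be sketched in a few lines of NumPy. Everything here (the toy strided convolution, the single attention head, all shapes) is a simplified assumption for illustration, not any actual SOTA architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def conv_backbone(img, w, stride=4):
    """Toy strided convolution: one scalar feature per window position."""
    k = w.shape[0]
    n = (img.shape[0] - k) // stride + 1
    feats = np.array([[np.sum(img[i*stride:i*stride+k, j*stride:j*stride+k] * w)
                       for j in range(n)] for i in range(n)])
    return feats.reshape(-1, 1)               # (positions, feature_dim=1)

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention over the backbone's feature tokens."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    att = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return att @ v

img = rng.standard_normal((16, 16))
w_conv = rng.standard_normal((4, 4))
w_q, w_k, w_v = (rng.standard_normal((1, 8)) for _ in range(3))

tokens = conv_backbone(img, w_conv)           # local feature extraction (convs)
out = self_attention(tokens, w_q, w_k, w_v)   # global mixing (MSA)
print(out.shape)                              # (16, 8)
```

    The division of labour is the point: the convolution handles local patterns, and attention then mixes the resulting tokens globally.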

  • @MachineLearningStreetTalk

    Amazing video! 😎

  • @edd36 • 2 years ago +3

    Thank you very much for the video!

  • @ronen300 • 1 year ago +3

    I wonder if the high-frequency details would remain in MSA if they did not patchify the image into (16 by 16) or (8 by 8) patches, and instead used each pixel individually...
    Because it seems to me that the high-frequency robustness of MSA could be related to this process.
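    For context on the patchification the comment refers to: ViT splits the image into non-overlapping patches and treats each one as a token. Per-pixel tokens as suggested are possible in principle, but the token count, and with it the quadratic attention cost, explodes. A minimal NumPy sketch of the patchify step (illustrative, not ViT's actual implementation):

```python
import numpy as np

def patchify(img, p):
    """Split an (H, W) image into flattened, non-overlapping p x p patches."""
    h, w = img.shape
    assert h % p == 0 and w % p == 0
    patches = img.reshape(h // p, p, w // p, p).swapaxes(1, 2)
    return patches.reshape(-1, p * p)   # (num_patches, p*p)

img = np.arange(32 * 32, dtype=float).reshape(32, 32)

print(patchify(img, 16).shape)  # (4, 256): few tokens, cheap attention
print(patchify(img, 1).shape)   # (1024, 1): per-pixel tokens, 256x as many
```

    Since self-attention cost grows quadratically with the number of tokens, going from 4 tokens to 1024 here multiplies the attention cost by roughly (1024/4)^2, which is the practical obstacle to the pixel-level variant.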

  • @PritishMishra • 2 years ago +4

    I guess I can collect all the voice samples of your videos and try to train a model to predict which kind of expression Coffee Bean will make 😆

    • @AICoffeeBreak • 2 years ago +3

      That would be so cool. Let me know if you really try it out. 🤝 I want to know how well it works!

    • @PritishMishra • 2 years ago +2

      ​@@AICoffeeBreak sure, I will let you know :-)

    • @AICoffeeBreak • 2 years ago +3

      So excited about this! 😄

  • @mr_tpk • 2 years ago +3

    Thank you for sharing ❤️

  • @_bustion_1928 • 1 year ago +3

    Very nice video and very nice paper. I've had an idea of combining convs and msa for a long time now...

  • @muhammadsalmanali1066 • 2 years ago +3

    Thank you so much for all the hard work
    Regards
    A struggling PhD student

    • @AICoffeeBreak • 2 years ago +3

      Thanks, and I wish you all the best!
      A fellow struggling PhD student. 🙃

  • @AD173 • 2 years ago +6

    Why all this emphasis on local minima at the end? You are highly likely to reach them anyhow.
    The bigger issue in higher-dimensional spaces is saddle points; it has been shown that momentum and the like can help avoid those. The Dauphin paper referenced in this paper is about saddle points, not local minima. Heck, its title is "Identifying and attacking the saddle point problem in high-dimensional non-convex optimization".
    It's something of a myth that you are aiming for the global minimum in deep learning. The point of using deep learning is that it has been shown empirically, by Yann LeCun and others, that NNs with a higher number of neurons tend to have more "good" local minima (that is, local minima whose losses are not much higher than the global one). See, for instance, "The Loss Surfaces of Multilayer Networks" by Choromanska et al.
    I would love to be shown a paper that definitively proves or demonstrates that any method, momentum or not, can skip over a local minimum to a better one, or somehow find the global one. Several papers contain speculative sentences implying, or straight up claiming, that, but I have never seen one actually prove it mathematically or empirically. Do you know something I don't?
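    The saddle-vs-minimum distinction discussed here can be checked numerically: at a critical point, mixed-sign Hessian eigenvalues indicate a saddle, all-positive ones a local minimum. A small finite-difference sketch (my own illustration, not from any of the papers cited):

```python
import numpy as np

def hessian(f, x, eps=1e-5):
    """Finite-difference Hessian of a scalar function f at point x."""
    n = x.size
    h = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i = np.zeros(n); e_i[i] = eps
            e_j = np.zeros(n); e_j[j] = eps
            h[i, j] = (f(x + e_i + e_j) - f(x + e_i) - f(x + e_j) + f(x)) / eps**2
    return h

saddle  = lambda x: x[0]**2 - x[1]**2   # gradient vanishes at the origin
minimum = lambda x: x[0]**2 + x[1]**2   # gradient also vanishes at the origin

# The gradient is zero in both cases; only the Hessian's eigenvalue
# signs tell the two situations apart.
print(np.linalg.eigvalsh(hessian(saddle, np.zeros(2))))   # approx [-2, 2]: saddle
print(np.linalg.eigvalsh(hessian(minimum, np.zeros(2))))  # approx [2, 2]: minimum
```

    This is exactly why a stalled loss is ambiguous in practice: a first-order view cannot distinguish the two, and computing full Hessian spectra is infeasible for large networks.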

    • @AICoffeeBreak • 2 years ago +3

      I was simplistically referring to bad local minima. Better: bad optima.
      Even better: your comment.

    • @AICoffeeBreak • 2 years ago +2

      Edit: found the quiz question about this! czcams.com/users/postUgysXVJwPznKibv1LR14AaABCQ
      Original: I remember reading about this problem that saddle points are more common than local minima in high dimensions in the "Deep Learning" book by Goodfellow et al. 2016. I could swear we also had a quiz question on this but now I cannot find it and I assume I only dreamed about it. 😬

    • @AD173 • 2 years ago +3

      @@AICoffeeBreak Hehe, thank you, but I kinda hate my comment because I feel a bit pedantic.
      Can I ask you for a favor...?
      Look, the internet is full of bad information regarding our field, and there are tons of people covering the research done by the giants. I get that their results probably drive traffic to your channel, but they're already being covered by everyone.
      The weakness of a lot of those resources is that they get a lot of the basics wrong; results like the ones I mentioned in my previous comment get ignored. You'll hardly find an "AI expert" here on YouTube talking about how finding good local minima is the reason we create deeper and deeper NN architectures.
      How about covering the more rigorous side of deep learning? Probably not something super-theoretical like the well-posedness of inverse Bayesian problems, but just providing deeper insights into deep learning. That probably requires going through the papers cited by the giants rather than the giants' papers themselves. No one seems to do that, and it leads to a lot of misconceptions in the community (which is full of people who hate mathematics and thus do not read papers, but instead learn through simplified intros on YouTube, Towards Data Science and the like).
      It would be good to have a resource focused on rectifying misconceptions within the community.

    • @AD173 • 2 years ago +3

      ​@@AICoffeeBreak Yeah, the problem isn't just that they exist, but also that they are hard to overcome. Essentially, the situation is "my loss has stopped improving.. Am I at the local minimum now? Or at a saddle point? How can I tell? I mean, the gradient is practically zero in all directions." That's a huge part of the reason behind the creation, and success, of momentum.

    • @AICoffeeBreak • 2 years ago +4

      I knew it: Here is the question! Buried so deep that YT did not first show it to me. czcams.com/channels/obqgqE4i5Kf7wrxRxhToQA.htmlcommunity?lb=UgysXVJwPznKibv1LR14AaABCQ

  • @anadianBaconator • 2 years ago +3

    Fantastic!

  • @user-nr7uw1ve9n • 2 years ago +4

    Thank you very much for sharing high-quality videos!
    17:42 As far as I know, the blue region of Figure 9 (left) refers to the pooling layers of ConvNets, not MSA. They also reduce the variance of feature maps by subsampling, which behaves like the spatial smoothing of MSA layers.
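    The point about subsampling acting like spatial smoothing can be seen in a toy example: average pooling a feature map reduces its variance, just as the paper argues MSA's spatial smoothing does. A minimal NumPy sketch (illustrative only; the shapes and random feature map are my own assumptions):

```python
import numpy as np

def avg_pool_2x2(x):
    """2x2 average pooling of an (H, W) feature map with even H, W."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

rng = np.random.default_rng(0)
fmap = rng.standard_normal((8, 8))   # a stand-in feature map
pooled = avg_pool_2x2(fmap)          # subsampling = spatial smoothing

# Averaging neighbouring activations shrinks the spread of values.
print(pooled.shape)                  # (4, 4)
print(pooled.var() < fmap.var())     # True
```

    For roughly independent activations, averaging 4 neighbours cuts the variance by about a factor of 4, which is the smoothing effect the comment describes.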

    • @AICoffeeBreak • 2 years ago +2

      Yes, correct! I exactly remember that I added these pen highlights last minute and meant to put the MSA on the grey parts of the right hand side of the figure (for ViT). Somehow (the inexplicability of doing something and not looking twice) it landed on the ResNet side. 🙈

    • @AICoffeeBreak • 2 years ago +2

      Thanks for pointing this out! 🎯

  • @flamboyanta4993 • 2 years ago +3

    Good luck with the PhD, Letitia (and Ms. Coffee Bean, too)!
    Stay strong!

  • @bradleypliam110 • 2 years ago +3

    Ms. Coffee Bean isn't animating herself .... yet 😉

  • @LermanProductions • 2 years ago +3

    The graphics and animations here are so good. I'm looking to make deep learning videos. How do you produce these animations, like the one with the convolution kernel?? Not going to rip off your channel btw haha, just surprised to see such fancy graphics in an AI technical video.

    • @PritishMishra • 2 years ago +3

      I am not promoting my channel here... but you can look at my latest video on convolutional neural networks, which I made with Manim, the package created by Grant Sanderson aka 3blue1brown.

    • @LermanProductions • 2 years ago +3

      @@PritishMishra Thanks

    • @AICoffeeBreak • 2 years ago +4

      Definitely have a look at manim if you plan to visualise maths.
      I animated that conv with PowerPoint 🙈🙈

  • @bielmonaco • 1 year ago +3

    Where are you, Letitia? We need you to explain more papers to us! Please come back 😊

    • @AICoffeeBreak • 1 year ago +3

      It's been hard to find time to publish more than one video per month these days, sorry. :(

  • @wolfganggro • 2 years ago +5

    Quick question: how are you animating your videos? Sorry if this has been answered before.

    • @AICoffeeBreak • 2 years ago +3

      Everything but the Coffee Bean is just good old Powerpoint. 🙈
      What are your plans? :) The last time I got this question, someone wanted to start a channel himself. (Outlier: czcams.com/video/wcqLFDXaDO8/video.html )

    • @wolfganggro • 2 years ago +2

      @@AICoffeeBreak Hmm.. ja 😅 good question. Nothing in particular; sometimes I think this could be fun, but doing it like 3Blue1Brown with manim seems like a lot of work. The way you are doing it seems more approachable, and you still manage to create a great video and get the information across.

    • @AICoffeeBreak • 2 years ago +2

      @@wolfganggro You can save even more effort by working with the paper directly, see Yannic. That works well too! :)
      Manim is great for math stuff, but I am not sure how well it scales for less math-centric visualisations.

    • @wolfganggro • 2 years ago +2

      @@AICoffeeBreak Thanks for the input. I'm actually more interested in the mathy topics, but I guess I just have to give it a try at some point :-)

    • @AICoffeeBreak • 2 years ago +2

      Exactly. Try it out. And send me a link when you did. 😶‍🌫️

  • @muhammadwaseem_ • 10 months ago +2

    I would like to learn more about the Hessian and the interpretation of its eigenvalues on loss landscapes. Could anyone suggest any good materials?

    • @AICoffeeBreak • 10 months ago +2

      The Deep Learning book by Goodfellow et al. is a good start. And it is freely available.

    • @muhammadwaseem_ • 10 months ago +1

      @@AICoffeeBreak Thank you so much!

  • @saeednuman • 2 years ago +4

    Thank you for such a detailed video! I have a question regarding the loss landscapes. Recently, I read the paper "When Vision Transformers Outperform ResNets without Pre-training or Strong Data Augmentations," which also talks about the loss landscapes of ViTs and ResNets. If I compare it to this paper, the conclusion is different. Can you help me understand what I am missing? Note: Sorry, I am reposting because I cannot see my previous comment.

    • @namukpark • 2 years ago +4

      Hi, I'm the author of the paper "How Do Vision Transformers Work?" Thanks for the thoughtful feedback!
      The difference between the loss landscape visualization in the paper "When Vision Transformers Outperform ResNets..." and our empirical results is due to the following aspects: (1) They only visualized cross-entropy (NLL) landscapes, but we visualize loss (NLL + L2) landscapes on augmented datasets. Since the NN optimizes "NLL + L2" on "augmented datasets" -- not "NLL" on "clean datasets" -- we believe it is appropriate to visualize NLL + L2 on augmented datasets. (2) They used training configurations that are significantly different from standard practice, while we use a DeiT-style configuration. Since the DeiT-style configuration is the de facto standard in ViT training, we believe our insights apply to a larger number of studies. (3) Other evidence: a box blur (the simplest low-pass filter) also flattens the losses (arxiv.org/abs/2105.12639); the hybrid model has a flat loss; learning trajectories and Hessian spectra, a *set* of Hessian eigenvalues, also lead to the same conclusion; ViT (flat loss) is robust against data perturbations; and so on.
      As pointed out in our paper, loss-landscape smoothing methods can improve optimization by reducing negative Hessian eigenvalues.

  • @muhammadsalmanali1066 • 2 years ago +1

    We perform computations on the key (K) and query (Q) for MSA. We get their values by multiplying our feature values (data-dependent) with the weights learned during training. So even during inference the data might vary, but the weight values used to obtain K and Q are fixed. How, then, are MSAs data-agnostic?

    • @AICoffeeBreak • 2 years ago +3

      Sorry, I am confused now. 😕 We were making this point for convolutions, not for MSA. MSAs are dynamic w.r.t. the data even at inference time. Could you post the timestamp where you understood it otherwise?
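      A tiny NumPy sketch of the distinction made in this thread: the projection matrices W_q and W_k are fixed after training, but the attention map softmax(QK^T / sqrt(d)) is recomputed from each input, whereas a convolution kernel applies the same fixed weights to every input. All names and shapes here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4

# Learned projections: fixed once training is done.
w_q = rng.standard_normal((d, d))
w_k = rng.standard_normal((d, d))

def attention_weights(x):
    """Attention map for a (tokens, d) input under the fixed projections."""
    q, k = x @ w_q, x @ w_k
    scores = q @ k.T / np.sqrt(d)
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

x1 = rng.standard_normal((3, d))
x2 = rng.standard_normal((3, d))

# Same fixed weights, different inputs: the attention maps differ,
# i.e. the token-mixing pattern is data-dependent even at inference.
print(np.allclose(attention_weights(x1), attention_weights(x2)))  # False
```

      A conv layer, by contrast, would mix spatial positions with the same kernel values for x1 and x2; only the mixed *values* change, not the mixing *pattern*.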

  • @Murphyalex • 2 years ago +3

    "We are talking about inference, here, or about one iteration during training" (5:18): I know you know that people use the term "inference" differently in ML and statistics. In statistics it means learning something about a system from a sample, while in ML it means putting new data through a trained model. Here, you seem to imply that inference is connected to training, which jars with all my previous experience with the term. I'd do what many other researchers do and simply avoid the term, in favour of less ambiguous, less contentious phrasing.

    • @AICoffeeBreak • 2 years ago +3

      Sorry, I do not use any new meaning here. I mean inference as in making a prediction by running a sample through a model (forward pass).

    • @AICoffeeBreak • 2 years ago +3

      I should have just said "forward pass". It encompasses both inference (the prediction step at test time) and a snapshot of one training step.

    • @Murphyalex • 2 years ago +3

      @@AICoffeeBreak That would definitely have been much clearer to me. I didn't mean it as criticism (because I love your stuff); it was just a tip, and sometimes it's good to suggest these things so that future content is even more clear, concise and easy to follow.

    • @AICoffeeBreak • 2 years ago +3

      @@Murphyalex constructive criticism is my favourite, thanks!

    • @AICoffeeBreak • 2 years ago +4

      @@Murphyalex "ML" and a "good terminology" are quite low in overlap and I try my best not to add to the confusion. 😅

  • @simsonyee • 1 year ago

    QK^T at 4:43!!