An image is worth 16x16 words: ViT | Vision Transformer explained

  • Date added: 29. 08. 2024

Comments • 56

  • @yemiyesufu5745
    @yemiyesufu5745 3 years ago +15

    I recently found this channel and I've been binge-watching your videos ever since. Great Job!

    • @AICoffeeBreak
      @AICoffeeBreak  3 years ago +4

      Welcome aboard! Binge-watching is arguably the best approach to this.

  • @jonatan01i
    @jonatan01i 3 years ago +6

    The first layer of this model is still a convolution.

    • @AICoffeeBreak
      @AICoffeeBreak  3 years ago +3

      Good observation!
      @speed100mph made the same point two months ago, see below. I responded there. 😀

  • @dianai988
    @dianai988 3 years ago +11

    Great video! Especially relevant for me because I was just talking with a professor about how transformers seem to dominate everything in NLP these days. And I think I have an inkling of who these anonymous authors are (looking at you, TPUs) 😂

  • @JohnDoe-ft5mq
    @JohnDoe-ft5mq 2 years ago +3

    Love the humor. Keep up the good work!

  • @user-or7ji5hv8y
    @user-or7ji5hv8y 3 years ago +5

    wow, another great video!

  • @sagarsurendran9710
    @sagarsurendran9710 3 years ago +4

    Hahaha, I loved your explanations!!

  • @Youkouleleh
    @Youkouleleh 3 years ago +6

    thanks for the video

  • @sarahjamal86
    @sarahjamal86 2 years ago +2

    Great job lady! Watching your videos while in the gym :-)

  • @ashwinjayaprakash7991
    @ashwinjayaprakash7991 3 years ago +4

    Coffee bean looks awesome 👌

  • @nguyenanhnguyen7658
    @nguyenanhnguyen7658 3 years ago +6

    It took ViT 400M images to achieve roughly what a CNN does with ImageNet's 1M, and while a CNN needs only 10-20M parameters, ViT needs an order of magnitude more. Simply put, in NLP there are at most a few hundred thousand words, whereas in imaging you can imagine the far wilder diversity of images; that is why CNNs work.

  • @arigato39000
    @arigato39000 3 years ago +4

    thank you

  • @bartlomiejkubica1781
    @bartlomiejkubica1781 6 months ago +1

    What can I say other than a simple "Thank you!"... 🙂

  • @ShubhamYadav-xr8tw
    @ShubhamYadav-xr8tw 3 years ago +17

    The realm dominated for centuries by CNNs. Lol. :P
    Nice video, Letitia!
    Do you make your own animations for the explanations of the algorithm?

    • @AICoffeeBreak
      @AICoffeeBreak  3 years ago +12

      Thanks for the kind words and for the question! Yes, I do try to have everything self-made if possible! I made all the animations of the algorithm, and I also drew Ms. Coffee Bean.
      As mentioned in the video description, I use emojis designed by OpenMoji (the bomb, the weightlifter, the feather...), because this saves me time and lets me learn more and become better at the important things, like the algorithm explanation animations.

  • @speed100mph
    @speed100mph 3 years ago +2

    Btw, the first linear projection on patches of 16x16 pixels is mathematically a convolution with kernel size 16 and stride 16. So the anonymous authors are not proposing anything new :P, it is essentially very similar to non-local neural networks. (A small code sketch of this equivalence follows after this thread.)

    • @AICoffeeBreak
      @AICoffeeBreak  3 years ago +5

      Good observation! You are totally right, that linear projection works like a convolution. But as I see it (and I think our opinions diverge here), this is a non-essential design choice. It could be any kind of transformation that turns 2D image patches into 1D vectors for the Transformer to work with.

    • @jonatan01i
      @jonatan01i 3 years ago +2

      @@AICoffeeBreak We could do anything to map the 2D patches to 1D vectors, but as long as we touch the numbers and the same computation is applied to all the patches, that's a convolution.

    • @AICoffeeBreak
      @AICoffeeBreak  3 years ago +6

      @@jonatan01i It is technically a convolution! Motivated by the prior that, without any knowledge about image regions, we should vectorize all patches the same way. But this is an arbitrary design choice, almost like a hyperparameter: in a self-driving car setting, where the sky is always at the top and the street in the lower image region, one might choose to vectorize the upper and lower regions differently.

    • @jonatan01i
      @jonatan01i 3 years ago +3

      @@AICoffeeBreak Good point on the horizon.
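
A minimal sketch (in PyTorch, which is an assumption; the video and paper ship no code here) of the equivalence discussed in this thread: flattening each 16x16 patch and applying one shared linear projection gives exactly the same tokens as a Conv2d with kernel_size=16 and stride=16 whose kernel is the reshaped linear weight.

```python
import torch
import torch.nn as nn

patch, channels, dim = 16, 3, 768
linear = nn.Linear(channels * patch * patch, dim)             # ViT-style patch projection
conv = nn.Conv2d(channels, dim, kernel_size=patch, stride=patch)

# Copy the linear weights into the conv kernel; nn.Unfold flattens each block
# in (C, kH, kW) order, which matches this reshape of the weight matrix.
conv.weight.data = linear.weight.data.view(dim, channels, patch, patch)
conv.bias.data = linear.bias.data

x = torch.randn(1, channels, 224, 224)                        # one RGB image

# Route 1: cut into 16x16 patches, flatten, apply the shared linear layer.
patches = nn.Unfold(kernel_size=patch, stride=patch)(x)       # (1, 768, 196)
tokens_linear = linear(patches.transpose(1, 2))               # (1, 196, 768)

# Route 2: strided convolution, then flatten the 14x14 grid into 196 tokens.
tokens_conv = conv(x).flatten(2).transpose(1, 2)              # (1, 196, 768)

print(torch.allclose(tokens_linear, tokens_conv, atol=1e-5))  # True
```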

  • @sujits3458
    @sujits3458 3 years ago +4

    Good video, thanks :-)

  • @DerPylz
    @DerPylz 3 years ago +6

    Good video

  • @keroldjoumessi
    @keroldjoumessi 2 years ago +2

    Nice video. However, there is something I didn't understand: at 3:45, when you said that "the given pattern can be a limitation", are you talking about the transformer or the CNNs?

    • @christyjestin
      @christyjestin 2 years ago +1

      The CNN, since the convolutions and pooling only allow you to consider a small patch of the image at a time (although this "patch" does grow to cover the full image as you go through the layers). The given pattern is just the size of the kernel and the pooling.

  • @Freeak6
    @Freeak6 3 years ago +4

    Nice video!! :) However, the last sentence of the abstract is: "Vision Transformer attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train", but you seem to say the opposite in the video. Did I miss something? Thanks.

    • @AICoffeeBreak
      @AICoffeeBreak  3 years ago +2

      Good observation! The abstract compares the Transformer to a CNN-based model on the same (HUGE) amount of data, in which case you are entirely right: the Transformer is more efficient (computationally).
      However, I do not see where I negate the abstract. The sentence you might be referring to is "Why does it do better than CNNs? Because the anonymous authors can train on an awful lot of training data...". In hindsight I see better formulations, because that sentence is really about what follows it: I compare Transformers to CNNs in general and explain that a CNN can deal with less data (its design bias helps it find the right optimum), but a Transformer cannot, since it has more degrees of freedom; however, with the right amount of data, the Transformer can find original and better solutions than the CNN.
      Does this address your question? :)

    • @Freeak6
      @Freeak6 3 years ago +2

      @@AICoffeeBreak You're right. So, to summarize, you're saying that Transformers can be more computationally efficient than CNNs if we train them with a HUGE amount of data, but that CNNs actually don't require as much data as Transformers to be trained. Is that right? Thank you.

    • @AICoffeeBreak
      @AICoffeeBreak  3 years ago +2

      I think you understood it well!

  • @user-or7ji5hv8y
    @user-or7ji5hv8y 3 years ago +4

    But once trained, can it be used as part of transfer learning?

  • @efexzium
    @efexzium 3 months ago

    Thanks, great video.

  • @leecaste
    @leecaste 3 years ago +4

    Why is it anonymous?

    • @AICoffeeBreak
      @AICoffeeBreak  3 years ago +6

      It was anonymous at the time of making the video. It was under double blind peer review at ICLR. Now it is not anymore and I have already updated the video description. 😊

    • @leecaste
      @leecaste 3 years ago +4

      Oh I see, thank you 😊
      By the way, would it be possible to use an image that starts at low resolution and increases its resolution, instead of dividing a high-resolution image into sections?
      Sorry if that's a stupid question.

    • @AICoffeeBreak
      @AICoffeeBreak  3 years ago +7

      @@leecaste Not a stupid question at all! Neural nets (like GANs, where PULSE got a lot of notoriety lately because of biases) have been used to increase the resolution of images before, and it is just a matter of time until this kind of processing will be done with transformers.
      Why they do not start with low resolution here: low-resolution images carry less information than high-resolution ones. The high frequencies of the image are lost, i.e. the edges are smeared out.
      So the ViT of the presented paper would have to recover the lost information, which is a task in itself. Because the purpose of this ViT is image recognition, it uses all the information it can get (so high resolution). This is why they split the image into processable regions rather than just downsampling. Does this make sense?

    • @leecaste
      @leecaste 3 years ago +4

      Yes, thank you very much 🙂

  • @efexzium
    @efexzium 3 months ago

    Can you make a video on how to run ViT?
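
A minimal sketch of running ViT for image classification, assuming the Hugging Face transformers library and the publicly available google/vit-base-patch16-224 checkpoint (the image path is a placeholder):

```python
import torch
from PIL import Image
from transformers import ViTForImageClassification, ViTImageProcessor

# Pretrained ViT-Base with 16x16 patches, fine-tuned on ImageNet-1k.
checkpoint = "google/vit-base-patch16-224"
processor = ViTImageProcessor.from_pretrained(checkpoint)
model = ViTForImageClassification.from_pretrained(checkpoint)

image = Image.open("example.jpg")                       # placeholder path to any RGB image
inputs = processor(images=image, return_tensors="pt")   # resize, normalize, to tensor

with torch.no_grad():
    logits = model(**inputs).logits                     # (1, 1000) class scores

predicted = logits.argmax(-1).item()
print(model.config.id2label[predicted])                 # human-readable ImageNet label
```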

  • @user-xw9cp3fo2n
    @user-xw9cp3fo2n 2 years ago +1

    Thanks, your explanation is amazing.
    But can you explain it in more detail?

  • @HaiderAli-nm1oh
    @HaiderAli-nm1oh 10 months ago +1

    Can this vision transformer be used on audio spectrograms, and for my specific related task?

    • @AICoffeeBreak
      @AICoffeeBreak  10 months ago +2

      It's worth a try.

    • @HaiderAli-nm1oh
      @HaiderAli-nm1oh 10 months ago +1

      @@AICoffeeBreak I looked into this and found out that there is an implementation by Hugging Face in PyTorch for my specific use case: the Audio Spectrogram Transformer, which is inspired by the vision transformer and processes audio spectrograms as images. Sadly this is done in PyTorch :(( and all of my work is in TensorFlow. (A small usage sketch follows after this thread.)

    • @AICoffeeBreak
      @AICoffeeBreak  10 months ago +2

      @@HaiderAli-nm1oh oh no. I feel your pain. ☹️

    • @HaiderAli-nm1oh
      @HaiderAli-nm1oh 10 months ago +1

      @@AICoffeeBreak :(( I think I have to shift my work to PyTorch sooner or later, lol, it's like changing religion XD 😂

    • @AICoffeeBreak
      @AICoffeeBreak  8 months ago +2

      Have you found new faith? 😅
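
A minimal sketch of the Hugging Face usage mentioned in this thread, assuming the transformers library and the MIT/ast-finetuned-audioset-10-10-0.4593 Audio Spectrogram Transformer checkpoint (the waveform here is a silent placeholder; in practice you would load real audio):

```python
import numpy as np
import torch
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification

# AST: a ViT-style transformer operating on mel-spectrogram patches.
checkpoint = "MIT/ast-finetuned-audioset-10-10-0.4593"
extractor = AutoFeatureExtractor.from_pretrained(checkpoint)
model = AutoModelForAudioClassification.from_pretrained(checkpoint)

waveform = np.zeros(16000, dtype=np.float32)            # 1 s of silence as a stand-in
inputs = extractor(waveform, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits                     # AudioSet class scores

print(model.config.id2label[int(logits.argmax(-1))])
```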

  • @NasirAlipro1
    @NasirAlipro1 a month ago

    With a small custom dataset of ultrasound images, how can we achieve state-of-the-art performance?

    • @AICoffeeBreak
      @AICoffeeBreak  a month ago +1

      I'm not sure a transformer is the right choice for small datasets. Better to use architectures with more inductive bias, or use the representations of an already pretrained transformer and just carefully fine-tune it on your data (see the sketch after this thread).

    • @NasirAlipro1
      @NasirAlipro1 a month ago

      @@AICoffeeBreak That's also the conclusion I have reached.
      I don't know anything about fine-tuning transformers; any help would be great 🫶🏻.
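
A minimal sketch of the advice above, assuming Hugging Face transformers and PyTorch: load an already pretrained ViT backbone, freeze it, and train only a small classification head on the ultrasound data (the dummy batch and the two-class setup are placeholders for a real dataset):

```python
import torch
from transformers import ViTForImageClassification

# Pretrained backbone only (ImageNet-21k weights, no fine-tuned head);
# the classifier head below is freshly initialised for the new task.
model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224-in21k",
    num_labels=2,                                   # e.g. normal vs. abnormal (placeholder)
)

# Freeze everything except the new classification head.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("classifier")

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)

# One dummy batch as a stand-in for a real ultrasound dataloader.
pixel_values = torch.randn(4, 3, 224, 224)
labels = torch.tensor([0, 1, 0, 1])

loss = model(pixel_values=pixel_values, labels=labels).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
print(float(loss))
```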

  • @franciscobrunodias7526
    @franciscobrunodias7526 6 months ago +1

    Here after OpenAI announced Sora.

    • @AICoffeeBreak
      @AICoffeeBreak  6 months ago +1

      Sora making patches and ViTs interesting again. 😅

  • @efexzium
    @efexzium 3 months ago

    That sounds like Puerto Rican music.