I recently found this channel and I've been binge-watching your videos ever since. Great Job!
Welcome aboard! Binge-watching is indisputably the best approach to this.
The first layer of this model is still a convolution.
Good observation!
@speed100mph made the same observation two months ago, see below. I responded there. 😀
Great video! Especially relevant for me because I was just talking with a professor about how transformers seem to dominate everything in NLP these days. And I think I have an inkling of who these anonymous authors are... looking at you, TPUs 😂
Love them humors. Keep up the good work
wow, another great video!
Hahaha, I loved your explanations!!
thanks for the video
Great job lady! Watching your videos while in the gym :-)
Coffee bean looks awesome 👌
It took ViT 400M images to achieve just about what a CNN does on ImageNet's 1M, and where the CNN needs only 10-20M params, ViT takes an order of magnitude more. Simply put, in NLP there are at most a few hundred thousand words. In imaging, you can imagine the wild diversity of images; that is why CNNs work.
thank you
What Can I say other than a simple `Thank you!'... 🙂
Thank You for your wonderful comment!
The realm dominated for centuries by CNNs. Lol. :P
Nice video Letitia!
Do you make your own animations for the explanations of the algorithm?
Thanks for the kind words and for the question! Yes, I do try to have everything self-made if possible! I made all the animations of the algorithm, and I also drew Ms. Coffee Bean.
As mentioned in the video description, I use emojis designed by OpenMoji (the bomb, the weightlifter, the feather...), because this saves me time and lets me learn more and become better at the important things, like the algorithm explanation animations.
Btw, the first linear projection on patches of 16x16 pixels is essentially, or mathematically, a convolution with kernel size 16 and stride 16. So the anonymous authors are not proposing anything new :P, it is essentially very similar to non-local neural networks.
Good observation! You are totally right that the linear projection works like a convolution. But as I see it (and I think our opinions diverge here), this is a non-essential design choice. It could be any kind of transformation that gets a 1D vector from 2D image patches for the Transformer to work with.
@@AICoffeeBreak We could do anything to map the 2D patches to 1D vectors, but as long as we touch the numbers and the same computation is done to all the patches, that's convolution.
@@jonatan01i It is technically a convolution! It is motivated by the prior that, without any knowledge about image regions, we should vectorize all patches the same way. But this is an arbitrary design choice, almost like a hyperparameter: in a self-driving-car setting, where the sky is always at the top and the street in the lower image region, one might choose to vectorize up vs. down differently.
@@AICoffeeBreak Good point on the horizon.
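To make the equivalence concrete, here is a tiny NumPy sketch (with toy sizes, not the actual ViT dimensions of patch size 16 and embedding dim 768) showing that flattening non-overlapping patches and applying one shared linear map gives exactly the output of a convolution with kernel size and stride equal to the patch size:

```python
import numpy as np

rng = np.random.default_rng(0)
P, D = 4, 8               # patch size and embedding dim (16 and 768 in ViT)
H = W = 8                 # tiny single-channel "image"
img = rng.standard_normal((H, W))
weight = rng.standard_normal((D, P * P))   # shared linear projection

# Route 1: ViT-style. Cut the image into non-overlapping PxP patches,
# flatten each, and apply the same linear map to all of them.
patches = img.reshape(H // P, P, W // P, P).transpose(0, 2, 1, 3).reshape(-1, P * P)
vit_tokens = patches @ weight.T            # (num_patches, D)

# Route 2: a convolution with kernel size P and stride P, written out naively.
kernel = weight.reshape(D, P, P)
conv_out = np.zeros((H // P, W // P, D))
for i in range(H // P):
    for j in range(W // P):
        window = img[i * P:(i + 1) * P, j * P:(j + 1) * P]
        conv_out[i, j] = (kernel * window).sum(axis=(1, 2))

# Both routes produce the same token embeddings.
assert np.allclose(vit_tokens, conv_out.reshape(-1, D))
```

So the "linear projection of flattened patches" and a stride-P convolution are the same computation; the design-choice question is only whether sharing the weights across all patch positions is the prior you want.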
Good video, thanks :-)
Good video
Thanks for the visit and the comment! ;)
Nice video! However, I misunderstood something: at 3:45, when you said "the given pattern can be a limitation", were you talking about the Transformer or the CNNs?
The CNNs, since convolutions and pooling only let you consider a small patch of the image at a time (although this "patch" does grow to cover the full image as you go through the layers). The given pattern is just the size of the kernel and the pooling.
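If it helps, how fast that "patch" (the receptive field) grows is easy to compute by hand: each layer adds (kernel - 1) times the product of all previous strides. A small Python sketch with a hypothetical conv/pool stack:

```python
def receptive_field(layers):
    """Receptive field of the last layer's output, for a stack of
    (kernel_size, stride) layers applied in order."""
    r, jump = 1, 1              # field size, and distance between output pixels
    for kernel, stride in layers:
        r += (kernel - 1) * jump
        jump *= stride
    return r

# A hypothetical stack: three 3x3 convs with 2x2 / stride-2 pooling in between.
layers = [(3, 1), (2, 2), (3, 1), (2, 2), (3, 1)]
print(receptive_field(layers))  # 18 — each output pixel "sees" an 18x18 region
```

This is exactly the "given pattern": the field grows layer by layer, but its shape and growth rate are fixed by the chosen kernel and pooling sizes, whereas a Transformer's attention can connect any two patches from layer one.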
Nice video!! :) However, the last sentence of the abstract is: "Vision Transformer attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train", but you seem to say the opposite in the video. Did I miss something? Thanks.
Good observation! The abstract compared the Transformer to a CNN-based model on the same (HUGE) amount of data, in which case you are entirely right: The Transformer is more efficient (computationally).
However, I do not see where I negate the abstract. The related sentence might be "Why does it do better than CNNs? Because the anonymous authors can train on an awful lot of training data...". In hindsight I see better formulations, because that sentence is all about what follows it: I compare Transformers to CNNs in general and explain that a CNN can deal with less data (its design bias helps it find the right optimum), but a Transformer cannot, since it has more degrees of freedom; with the right amount of data, however, it can find original and better solutions than the CNN.
Does this address your question? :)
@@AICoffeeBreak You're right. So, to summarize, you're saying that Transformers can be more computationally efficient than CNNs if we train them on a HUGE amount of data, but that CNNs don't require as much data as Transformers to be trained. Is that right? Thank you.
I think you understood it well!
but once trained, can it be used as part of transfer learning?
Sure! I do not see why not.
Thanks, great video!
Why is it anonymous?
It was anonymous at the time of making the video. It was under double blind peer review at ICLR. Now it is not anymore and I have already updated the video description. 😊
Oh I see, thank you 😊
By the way, would it be possible to use an image that starts at low resolution and increases its resolution, instead of dividing a high-resolution image into sections?
Sorry if that's a stupid question.
@@leecaste Not a stupid question at all! Neural nets (like GANs, where PULSE got a lot of notoriety lately because of its biases) have been used to increase the resolution of images before, and it is just a matter of time until this kind of processing is done with Transformers.
Why they do not start with low resolution here: a low-resolution image has less information than a high-resolution one. The high frequencies of the image are lost, i.e. the edges are smeared out.
So the ViT of the presented paper would have to recover the lost information, which is a task in itself. Because the purpose of this ViT is image recognition, it uses all the information it can get (so, high resolution). This is why they split the image into processable regions rather than just downsampling. Does this make sense?
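A quick back-of-the-envelope check with the standard ViT numbers (224x224 input, 16x16 patches) shows why patching keeps everything while downsampling throws detail away:

```python
# Standard ViT-Base setup: a 224x224 image split into 16x16 patches.
H = W = 224
P = 16

num_patches = (H // P) * (W // P)
print(num_patches)                  # 196 tokens fed to the Transformer

# Patching is lossless: the tokens together contain every pixel.
print(num_patches * P * P == H * W) # True (50176 pixels either way)

# Downsampling the whole image to a single 16x16 thumbnail instead
# would keep only 256 of those 50176 pixels — the high frequencies are gone.
print(P * P)
```

So splitting into regions is just a way to make the full-resolution image digestible as a sequence, without discarding any information up front.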
Yes, thank you very much 🙂
Can you make a video on how to run ViT?
Thanks, your explanation is amazing.
But can you explain it in some more detail?
Can this Vision Transformer be used on audio spectrograms, and for my specific related task?
It's worth a try.
@@AICoffeeBreak I looked into this and found out that there is a Hugging Face implementation in PyTorch for my specific use case: the Audio Spectrogram Transformer, which is inspired by the Vision Transformer to process audio spectrogram images. Sadly this is done in PyTorch :(( and all of my work is in TensorFlow.
@@HaiderAli-nm1oh oh no. I feel your pain. ☹️
@@AICoffeeBreak :(( I think I'll have to shift my work to PyTorch sooner or later, lol. It's like changing religion XD 😂
Have you found new faith? 😅
With a small custom dataset of ultrasound images, how can we achieve state-of-the-art performance?
I'm not sure a transformer is the right choice for small datasets. Better to use architectures with more inductive bias, or use the representations of an already pretrained transformer and just carefully fine-tune it on your data.
@@AICoffeeBreak That's also the conclusion I have reached.
I don't know anything about fine-tuning transformers, any help would be great 🫶🏻.
Here after OpenAI announced Sora.
Sora making patches and ViTs interesting again. 😅
That sounds like Puerto Rican music.