DINO: Emerging Properties in Self-Supervised Vision Transformers (Facebook AI Research Explained)

  • Published 29. 08. 2024

Comments • 150

  • @YannicKilcher
    @YannicKilcher 3 years ago +16

    OUTLINE:
    0:00 - Intro & Overview
    6:20 - Vision Transformers
    9:20 - Self-Supervised Learning for Images
    13:30 - Self-Distillation
    15:20 - Building the teacher from the student by moving average
    16:45 - DINO Pseudocode
    23:10 - Why Cross-Entropy Loss?
    28:20 - Experimental Results
    33:40 - My Hypothesis why this works
    38:45 - Conclusion & Comments
    Paper: arxiv.org/abs/2104.14294
    Blog: ai.facebook.com/blog/dino-paws-computer-vision-with-self-supervised-transformers-and-10x-more-efficient-training
    Code: github.com/facebookresearch/dino
    My Video on ViT: czcams.com/video/TrdevFK_am4/video.html
    My Video on BYOL: czcams.com/video/YPfUiOMYOEE/video.html

    • @samanthaqiu3416
      @samanthaqiu3416 3 years ago

      From the paper it is not clear AT ALL that they detach the gradients of the teacher via the center (C) variable. I'll have to look at their repo to see what is going on. Typically things like mean still propagate gradients in PyTorch

    • @samanthaqiu3416
      @samanthaqiu3416 3 years ago

      Yep, it didn't help much that they seem to code like 9-year-olds, but from line 304 of main_dino.py ( github.com/facebookresearch/dino/blob/a15f6afee2f3b868f44a5021a0951f718c8e2dd5/main_dino.py#L304 ) it seems clear they are NOT DETACHING all gradients from the teacher network via the `update_center` method

    • @samanthaqiu3416
      @samanthaqiu3416 3 years ago

      It will not be a problem, since they don't seem to be using those gradients anywhere, although I haven't verified it
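
      For reference, in the repo the center update is wrapped in torch.no_grad() and the teacher's parameters have requires_grad set to False, so no gradients flow through C or the teacher at all. A minimal single-GPU sketch of the pattern, simplified from DINOLoss in main_dino.py (the repo additionally all-reduces the batch mean across GPUs):

          import torch

          class DINOLoss(torch.nn.Module):
              # Simplified sketch; exact details are in main_dino.py.
              def __init__(self, out_dim, center_momentum=0.9):
                  super().__init__()
                  self.center_momentum = center_momentum
                  # Running center C: a buffer, not a trainable parameter.
                  self.register_buffer("center", torch.zeros(1, out_dim))

              @torch.no_grad()  # no gradient ever flows through C
              def update_center(self, teacher_output):
                  batch_center = teacher_output.mean(dim=0, keepdim=True)
                  # Exponential moving average of teacher output means.
                  self.center = self.center * self.center_momentum \
                      + batch_center * (1 - self.center_momentum)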

  • @mathildecaron1821
    @mathildecaron1821 3 years ago +134

    Thanks a lot Yannic for covering DINO, that's really an honor! I'm a big fan of your channel :D

    •  2 years ago

      Hi. Enjoyed the paper and the explanation given in this video. Thank you both.
      Are you aware of any robustness analysis (in the context of adversarial examples) done for DINO?

    • @tiro0oO5
      @tiro0oO5 1 year ago +1

      I know version 2 is out. Still, congrats on this breakthrough work!

  • @fatmaguney3598
    @fatmaguney3598 3 years ago +45

    "learning to predict cat from cat ear" is a good summary of this paper.

  • @patf9770
    @patf9770 3 years ago +76

    As often said, what a time to be alive!

    • @GeekProdigyGuy
      @GeekProdigyGuy 3 years ago +7

      wrong channel xd

    • @michaelwangCH
      @michaelwangCH 3 years ago +1

      If you are the guy from "Two Minute Papers", excellent work - we are living in an extraordinary time of human history.

    • @vsiegel
      @vsiegel 3 years ago

      @@michaelwangCH Let's enjoy it - the progress is exponential, and we are at a steep region! I decided to ignore the question of whether we are near the end of human history, at the time the progress curve goes to infinity...
      Actually, I'm not afraid: progress is exponential, making progress means adding knowledge, tools and scientists, and that allows us to make faster progress.
      But I think it is actually a logistic development, which looks very much like exponential growth, but instead of the singularity, the curve begins to get less steep. That happens when a finite resource is involved. But as a physicist, I say: no problem, the observable universe is finite.

    • @michaelwangCH
      @michaelwangCH 3 years ago

      @@vsiegel We have a log curve between scientific output (progress) and the resources we put in - the work of a researcher is getting harder, and the hardness increases every year; in other words, the hardness grows exponentially with time. That is bad for scientific progress and for the distribution of societal resources. E.g. CERN, with over 6000 scientists and $14B+ in fixed costs per year: those resources could probably be used more productively in other areas of science.

    • @gaypaul5635
      @gaypaul5635 6 months ago

      @@vsiegel That assumes that knowledge can be added like chocolate cakes, but that is a wrong hypothesis for individual humans and for humanity. Focus and more knowledge on one topic mean other topics are given less attention and are forgotten. This is why the definition of "progress" must be chosen, and according to certain definitions, DINO does not represent progress in itself, as it can have negative indirect effects, like any digital technology.

  • @sohamroy9868
    @sohamroy9868 1 year ago +7

    I am super impressed by how you nailed the pronunciation of every single author's name on the paper.

  • @rahuldeora5815
    @rahuldeora5815 3 years ago +10

    Surprisingly fluent pronunciation of the authors' names... bet that took more takes than one would expect :)

  • @sabyasachibandyopadhyay8558

    Your comment on augmentations is spot on! I have worked with BYOL on clinical images for a while now, and choosing the correct augmentations makes a heck of a difference - and there is no way to know the right augmentation without trial and error! I think that's a major downside of BYOL, which will obviously percolate to DINO as well. Thanks for your presentation of the paper!

  • @jaakjpn
    @jaakjpn 3 years ago +15

    Cool paper, thanks for the review!
    About centering vs sharpening, you are right: centering avoids the collapse, as each unsupervised class gets pushed to an equal running average, i.e., each unsupervised class should pick up 1/K of the images because the means of their logits are 0. This way, the model cannot collapse to picking the same class each time. Sharpening makes sure that each time one class is actually picked (otherwise a flat uniform distribution could be the result).

    • @rahuldeora5815
      @rahuldeora5815 3 years ago

      It can still collapse at 0, as the output of a neuron can be 0 (or a very small value) and its running mean also 0. If most of the neurons have very small means and outputs, then isn't it possible for a few classes to always dominate? (This wouldn't happen if we divided by the std deviation, btw)
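
      To make the two mechanisms concrete, here is a minimal sketch of the loss following the paper's pseudocode (temperatures roughly match the paper's defaults; the real code also averages the loss over multiple crop pairs):

          import torch
          import torch.nn.functional as F

          def dino_loss(student_out, teacher_out, center, tau_s=0.1, tau_t=0.04):
              # Teacher: centered (subtract the running mean C), then
              # sharpened by a low temperature; detached from the graph.
              t = F.softmax((teacher_out - center) / tau_t, dim=-1).detach()
              # Student: plain tempered softmax.
              log_s = F.log_softmax(student_out / tau_s, dim=-1)
              # Cross-entropy between teacher and student distributions.
              return -(t * log_s).sum(dim=-1).mean()

      Centering stops any single logit from dominating on average (guarding against one-class collapse), while the small tau_t keeps the teacher from settling on a flat uniform distribution; either mechanism alone can still collapse.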

  • @nauman.mustafa
    @nauman.mustafa 3 years ago +41

    I like how they include PyTorch code, which makes it so easy to implement compared to heavy LaTeX math

    • @herp_derpingson
      @herp_derpingson 3 years ago +4

      Why don't more papers do this?

    • @Metalwrath2
      @Metalwrath2 3 years ago +21

      @@herp_derpingson Because most papers don't have reproducible results.

    • @kanal7523
      @kanal7523 2 years ago +1

      @@Metalwrath2 sheeeeeeeesh

  • @shivamshrirao2374
    @shivamshrirao2374 3 years ago +14

    Was just going through the paper and there's already a video. Noiceeee!!

  • @mfpears
    @mfpears 2 years ago +1

    The last 10 minutes are really a great explanation of a few concepts

  • @astudent8885
    @astudent8885 6 months ago

    Thank you for this presentation. You made sure to explain all background concepts so that someone with limited ML knowledge can still understand. I found that really helpful. Thank you so much!

  • @florianjug
    @florianjug 3 years ago +8

    Thanks! I hoped you’d be fast to cover DINO... and you delivered! :)

  •  3 years ago +45

    One point to note about this paper is that the dataset consists of object-centred images and the augmentation method relies on cropping, which means learning to represent the images invariantly to the cropping position. This forms a strong inductive prior that produces representations focused on the objects of interest in the image. The main learning signal that guides the self-supervised learning process comes from the cropping augmentation, so I don't see how such a method could be trained without augmentation. My hypothesis is that this method would not work with datasets that don't have object-centred images, e.g. a dataset of images of rooms, since in that case crops would contain different objects that have little in common, which would effectively eliminate the learning signal.

    • @redseventyfiveprime5018
      @redseventyfiveprime5018 3 years ago +3

      In reinforcement learning, the similarity of teacher and student responses could probably be used to move an agent into a position where an object is centered in its view.

    • @oncedidactic
      @oncedidactic 3 years ago +2

      I think you could extend the system by pre-training on object-centered images and then expanding to more natural imagery, such as the scenes you mention. But the cropping augmentation would probably still need adjustment.

    • @mdrayedbinwahed2172
      @mdrayedbinwahed2172 2 years ago +3

      Excellent insight. Sounds like a good follow-up to this paper.

    • @kaveh_shh
      @kaveh_shh 1 year ago

      Yeah. In this case we cannot expect the model to give us the same representation for, e.g., the "sky" and a "cat" cropped from different parts of the same image.

    • @susdoge3767
      @susdoge3767 4 months ago

      What an insight! Thanks for making me think!
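
    To make the cropping discussion in this thread concrete, here is a minimal sketch of the global/local multi-crop scheme (scale ranges and crop sizes follow the repo's defaults as I recall them; the repo additionally applies color jitter, blur and solarization):

        import torchvision.transforms as T

        # Global crops cover large regions; local crops are small patches.
        global_crop = T.Compose([
            T.RandomResizedCrop(224, scale=(0.4, 1.0)),
            T.RandomHorizontalFlip(),
            T.ToTensor(),
        ])
        local_crop = T.Compose([
            T.RandomResizedCrop(96, scale=(0.05, 0.4)),
            T.RandomHorizontalFlip(),
            T.ToTensor(),
        ])

        def multi_crop(img, n_local=8):
            # The teacher sees only the two global views;
            # the student sees all views (global + local).
            return ([global_crop(img) for _ in range(2)]
                    + [local_crop(img) for _ in range(n_local)])

    On a scene-centric dataset, a local crop and a global crop of the same image can contain entirely different objects, which is exactly the failure mode hypothesized above.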

  • @iandanforth
    @iandanforth 3 years ago +2

    This video made clear to me the strong occlusion prior being introduced by the local/global student/teacher training method. I hadn't picked up on that in my first read-through. Thank you!

  • @Kram1032
    @Kram1032 3 years ago +2

    This is super cool! A really clever way to kinda do contrastive stuff without doing contrastive stuff, and the results speak for themselves

  • @dinoscheidt
    @dinoscheidt 3 years ago +6

    I really like the acronym of this method. 👀

    • @mathildecaron1821
      @mathildecaron1821 3 years ago +1

      🦖

    • @dinoscheidt
      @dinoscheidt 3 years ago +2

      Yeah... maybe not. Already getting messages like "See! DINO has attention issues"... 😶 thanks fb

    • @saurabheights
      @saurabheights 3 years ago

      @@dinoscheidt Could you expand on those messages? Interested in "DINO has attention issues"!

  • @EyedMoon
    @EyedMoon 3 years ago

    God dammit, every time there's a new method/architecture I want to try out, I can't find the time to really use it. Thanks for the video - I read the paper, but the insights and small pieces of knowledge you give us about these methods and why they work are reaaaally good.

  • @originalsingh
    @originalsingh 3 years ago +15

    Yannic : Nobody takes a picture of dirt and grass and posts it on SM
    GameDev artists : Woah look at this dirt patch!

  • @justwiredme
    @justwiredme 1 year ago

    Great presentation. I like how you show the visual part - the best part for me as a beginner. I'm very excited to learn this algorithm. This is very useful information for me, because sometimes in everyday life I can't read along, so the audio is so helpful. Thank you

  • @XX-vu5jo
    @XX-vu5jo 3 years ago +14

    Stupid - I submitted a similar concept before and I was rejected because I am not a well-known person. Now, just because FB made it, they are glorified! This is crazy!

    • @herp_derpingson
      @herp_derpingson 3 years ago +1

      Sad

    • @samanthaqiu3416
      @samanthaqiu3416 3 years ago

      That's why there is arXiv. Didn't you think about publishing there?

    • @herp_derpingson
      @herp_derpingson 3 years ago +3

      @@samanthaqiu3416 For many PhD programs, publishing on arXiv is not good enough.

  • @ensabinha
    @ensabinha 5 months ago

    29:15 - It achieves better results with ViT when compared to the "best ResNet," of course, but it's 3.6 times larger in the number of parameters.
    They're comparing a ~3.6x LARGER modern architecture (which probably employs an arsenal of training tricks) with ResNet. Shocking, truly groundbreaking, you can get better results with a larger model.

  • @DamianReloaded
    @DamianReloaded 3 years ago +5

    The attention maps look really good, especially the ones on video. It'd be interesting to see what it does when you occlude the thing in the scene it attended to the most - how many things in the scene would it be capable of telling apart as you remove the ones it already attended to?
    Regarding the cooking video, I think it would have been better if it had been 90% about the language model and 10% about cooking. I personally would like to see more programming, and possibly interviews with the authors of the papers you review. my2c

    • @oncedidactic
      @oncedidactic 3 years ago

      I had a similar idea. If you paint out the object of attention using another system, what happens? Like Yannic's comment about pictures of roads and grass 😂

    • @zebrg
      @zebrg 3 years ago

      czcams.com/video/h3ij3F3cPIk/video.html

  • @neworldemancer
    @neworldemancer 2 years ago

    tnx Dr. Kilcher, what you do is useful af! ;)

  • @DistortedV12
    @DistortedV12 3 years ago +10

    Yannic some constructive feedback, turn up your volume!

  • @_tnk_
    @_tnk_ 3 years ago +1

    Very interesting and amazing results

  • @chndrl5649
    @chndrl5649 1 year ago

    I love these paper summaries!!!

  • @ivanr7725
    @ivanr7725 3 years ago +1

    Thanks a lot! A dinosaur should be on the cover.

  • @pensiveintrovert4318
    @pensiveintrovert4318 3 years ago +4

    It pays attention to patches with maximal change. Of course we, the erect monkeys, also pay attention to visual fields with maximal change, to get food or escape danger. Why? Because it works, and that is how we have evolved - because it worked.

  • @tzjtjktzjtzjztjztj
    @tzjtjktzjtzjztjztj 3 years ago +1

    Great insight and comments, thanks Yannic

  • @anassbairouk953
    @anassbairouk953 3 years ago

    The data augmentation is important to avoid using clustering, which is not scalable with a huge dataset, because you get a huge cluster-centroid matrix that you need to store and update each time.

  • @robertgirard5659
    @robertgirard5659 2 years ago

    Saved for later! Yannic dude love your vids!

  • @yaoweili681
    @yaoweili681 3 years ago

    great video, mate! The segmentation results are so good!

  • @samdavidson5511
    @samdavidson5511 3 years ago

    Awesome vid, thanks! And I see they are linking this video of yours in their git repo!

  • @miladaghajohari2308
    @miladaghajohari2308 3 years ago +2

    well done!

  • @francoisplessier9913
    @francoisplessier9913 2 years ago

    Great explanations, thank you for this quality video!
    I loved the 34:38 insight on augmentations!
    And I found your concern about the meme culture quite funny :-)

  • @kiachi470
    @kiachi470 3 years ago

    Amazing explanation, and an amazing paper too. Very interesting

  • @nahakuma
    @nahakuma 3 years ago +1

    Nice final comments. I totally agree that augmentations should be internal to the learning process. As I see it, we humans do something similar by remembering previously seen patterns, as well as by imagining how things would change if we perturbed them (or by actually performing the perturbation). With respect to the global and local crops, does the teacher really only see global crops? Because according to the pseudocode, both x_1 and x_2 go into both models.

  • @zenchiassassin283
    @zenchiassassin283 9 months ago

    Very interesting hypothesis !

  • @mobilexia6285
    @mobilexia6285 3 years ago +1

    Two quick notes:
    1. The video can replace CVPR
    2. If the cat can be recognised by its ear, would that mean some 'generative power' has been created within the student?

  • @JamesAwokeKnowing
    @JamesAwokeKnowing 3 years ago

    For augmentation, we could substitute noisy input. For the dataset, a reconstructive loss and a world model should give basic objects and cause the model to prefer images that have more significant (less random) semantic meaning. Then at dream time it can train on the meaningful images.

  • @_arshadm
    @_arshadm 3 years ago

    Great explainer video. I'm not sure I agree with your conclusion that augmentation may be a major source of the signal that the approach is latching onto; my own suspicion is that the cropping is the main reason this approach works.

  • @alastairfinlinson898
    @alastairfinlinson898 3 years ago +1

    Love the videos! Will you be providing valuable insight for the papers "Multiscale Vision Transformers", "Vision Transformers for Remote Sensing Image Classification" and "Vision Transformers for Dense Prediction"?

  • @pauljones9150
    @pauljones9150 3 years ago +1

    Have my updoot. I loved the cooking video btw
    Maybe have a separate channel for cooking-type videos so you don't get tanked by the algo

  • @yimml4246
    @yimml4246 3 years ago

    The cooking video did not really do "terribly." Yes, perhaps a bit less than the average video, but I watched it and it was adequate. Nonetheless, sometimes we need to try random things to prevent getting stuck in a local maximum. Keep it up!

  • @odin12
    @odin12 3 years ago +4

    When will the code for Generative Minimization Networks: Training GANs Without Competition be released?

  • @sheggle
    @sheggle 3 years ago +3

    Would love to see time changes in natural video instead of augmentations, to see if "why AI is harder than we think" holds any water

    • @danielalorbi
      @danielalorbi 3 years ago +1

      @Robert w No it isn't. We invented a whole new term and everything.

    • @danielalorbi
      @danielalorbi 3 years ago

      @Robert w Your comment changed. I don't recall exactly what it was initially but the meaning has changed.

  • @mrwu6565
    @mrwu6565 1 year ago

    Thank you Yannic!!! Can you do a video about CutLER? :)

  • @yb801
    @yb801 9 days ago

    Clearly explained, thanks.

  • @Hydroslyde
    @Hydroslyde 3 years ago +1

    Great video! So are we going to get a PAWS video next? Pretty please???

  • @odin12
    @odin12 3 years ago +1

    This paper looks insane

  • @yesno3071
    @yesno3071 2 years ago

    Keep going :) Very well done

  • @oliverchalkley1187
    @oliverchalkley1187 1 year ago

    Great video, thanks! Surely the reason for the softmax is that the cross-entropy equation requires probabilities, and the softmax function turns the outputs into probabilities?

  • @momeho
    @momeho 9 months ago

    Thanks for your great video. Do you have any video on DINOv2?

  • @sanj1772
    @sanj1772 4 months ago

    Amazing video! Can you please make one on DINOv2?

  • @444haluk
    @444haluk 3 years ago

    Augmentations are so simple in nature that they could be part of the evolutionary dynamics of how human perception develops over time. Maybe in your sleep different crops of the occipital cortex play this game of augmentation. Maybe you weren't born tabula rasa, but born with augmentation dynamics.

  • @SakvaUA
    @SakvaUA 3 years ago

    Thanks for the video! Enlightening as always. The audio volume is a bit too low though.

  • @Amin-wd4du
    @Amin-wd4du 11 months ago

    super

  • @harambe2552
    @harambe2552 3 years ago

    The softmax bounds the outputs to the probability simplex. Otherwise your embedding space is unbounded and gives you an infinite projection space.

  • @vsiegel
    @vsiegel 3 years ago

    Confusing to me was:
    TL;DR: It seems like it requires video as input, but it works on still images.
    In the intro at 0:55, there are examples shown, and all of them are videos. At first sight, it seemed obvious to me that it is detecting the moving object. Looking more closely, something more is going on - the movement of the waves is ignored, in a clean way. But still, the information for the separation is available in a very salient way. It took a while until I understood that it is about still images. Now I think the frames of the example videos are processed individually.

  • @32121452145225255658
    @32121452145225255658 3 years ago

    Your video did make it into the YT "algorithm" for me and showed up in my recommendations, so I think your YT skills are just fine!
    I can't speak for everyone, but I didn't watch your cooking video for 2 reasons.
    1. I watch your content to learn, not for entertainment, and it didn't look like a learning video. (Though writing this, I could see you offering interesting insights into GPT-3's output.)
    And 2. I had seen it done before and personally found the results disappointing.
    All that being said, I'll prolly go give it a watch after this lol.

  • @TechyBen
    @TechyBen 3 years ago +4

    Terminator misspelt "Facebook" in the movies.

  • @vasylcf
    @vasylcf 2 years ago

    Thanks, it's really interesting.

  • @DanFrederiksen
    @DanFrederiksen 3 years ago +5

    If it's truly unsupervised, why is it blind to vegetation and ocean waves? It seems they somehow managed to impose the simplistic notion that an image has only one classification.

    • @jeroenput258
      @jeroenput258 3 years ago

      Exactly. One of the images shows a dog on a sofa and it only pays attention to the dog. What if I'm more interested in the sofa than the dog? It seems to impose a very subjective notion of importance on the image content. Besides, segmentation is highly task-dependent, so how could it know whether to segment the dog or its limbs, for instance? If you ask me, it just seems to learn from ImageNet to predict the most salient object and then use the features to perform a segmentation.

    • @randomthoughts3009
      @randomthoughts3009 3 years ago +1

      This is a visual artifact due to plot normalization. The central object has heatmap values that are relatively much higher than the background's. Check the running-dog example on the project page and look at the last frame, where the dog is absent.

    • @DanFrederiksen
      @DanFrederiksen 3 years ago

      @@randomthoughts3009 Well, that it has only very faint recognition of other things isn't really an excuse. But I guess it can be a simple result of the focus in the training set. The initial dog video tracks the dog, so there is naturally a heavy bias towards single-object classification.

  • @Niels1234321
    @Niels1234321 3 years ago

    Maybe we should try to use consecutive frames of a video as augmentations of the same thing; it requires less augmentation engineering, and you could argue that it resembles the data that humans learn from as children.

  • @iftekharniloy913
    @iftekharniloy913 3 years ago +1

    I am just curious to see people use self-supervision on images which have multiple classes of interest.

  • @jonatan01i
    @jonatan01i 3 years ago

    Right now the images for the student model are sampled from the image at different x,y coordinates. What we could also do is sample them from different timestamps of a video.

  • @rakshithv5073
    @rakshithv5073 3 years ago

    Looking at the pseudocode, the block diagram (Figure 2) isn't a good representation of what's actually happening, right?
    At first sight, I thought x2 only goes through the teacher network and x1 only through the student network

  • @andrewcutler4599
    @andrewcutler4599 3 years ago +1

    ViT for augmentations when?

  • @lannguyende
    @lannguyende 3 years ago +2

    I've read the paper, and sadly I didn't find anything new. They just gathered some techniques that already existed and implemented them in a self-supervised way. The funny thing is DINO: DIstill NO labels - but normal distillation training doesn't use any labels at all 😂

    • @louislouis7388
      @louislouis7388 3 years ago +2

      Many papers are done this way. Although it is very simple, they try to dress it up to make it look complicated and plausible. I found this paper not impressive at all.

  • @0_0bserver27
    @0_0bserver27 3 years ago

    I don't exactly understand how distillation prevents collapse in this model, from the explanation of it at 13:53. At 19:59 it is mentioned again that the student cannot output the same thing every time because it is prevented, but how exactly? Would someone like to elaborate?

  • @Bryan-jb6xu
    @Bryan-jb6xu 2 years ago

    Please make a video explaining EsViT. Thanks!

  • @RoboticusMusic
    @RoboticusMusic 3 years ago +1

    What's the framerate for 1080p? Is it realtime?

  • @piku1920
    @piku1920 3 years ago

    Hi - what does it mean to threshold the self-attention maps to keep 60% of the mass? What does "mass" represent here?
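
    Roughly: the attention weights from the [CLS] token form a probability distribution over the image patches, so their total "mass" sums to 1, and thresholding keeps the smallest set of patches whose weights add up to 60% of it. A sketch with a hypothetical helper (the repo's visualize_attention.py does the equivalent per attention head):

        import torch

        def threshold_attention(attn, keep=0.6):
            # attn: 1-D tensor of [CLS]-to-patch attention weights (sums to 1).
            vals, idx = torch.sort(attn, descending=True)
            cum = torch.cumsum(vals, dim=-1)
            # Keep each patch whose preceding cumulative mass is below `keep`,
            # i.e. the top patches that together hold ~60% of the attention.
            mask = torch.zeros_like(attn, dtype=torch.bool)
            mask[idx[cum - vals < keep]] = True
            return mask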

  • @akhilezai
    @akhilezai 3 years ago +1

    There's no temporal aspect to it?

  • @iiiiaaaa4548
    @iiiiaaaa4548 2 years ago

    Which model is used for downstream tasks? The student or the teacher?

  • @anibalgonzalez7990
    @anibalgonzalez7990 1 year ago

    Could anyone tell me how the teacher knows there are 'k' classes to be identified in a picture?
    Cheers!

  • @michaelwangCH
    @michaelwangCH 3 years ago +1

    What is the intuition behind this?
    How does it work so well without labels?
    Yannic, can you explain the intuition?

    • @susdoge3767
      @susdoge3767 4 months ago +1

      The intuition is that you try to make the network learn that an image of a cat's ear and a complete image of the cat should have the same representation. The hypothesis is that by forcing the model to learn consistent representations across scales (patch vs. whole image), it can grasp transferable features that are generally useful for computer vision tasks.

    • @michaelwangCH
      @michaelwangCH 4 months ago

      @@susdoge3767 Thank you. Unsupervised learning is only possible if the latent-space representations are similar to each other (minimizing the distance in latent space). That is the reason we can observe emergent properties in LLMs, e.g. a Google translator trained on English can surprisingly translate Hindi or other languages it wasn't trained on. The only reason it works is that human languages have a similar structure, which is related to human biology, resp. brain function - those processes in the brain are similar for all humans, independent of color, gender, nationality or race.

    • @susdoge3767
      @susdoge3767 4 months ago

      @@michaelwangCH That's another cool insight I didn't know!

    • @michaelwangCH
      @michaelwangCH 4 months ago

      @@susdoge3767 Happy to help - knowledge belongs to the entire human race, not to a small group of people.

  • @rahuldeora5815
    @rahuldeora5815 3 years ago

    The paper says: "We observe that this teacher has better performance than the student throughout the training, and hence, guides the training of the student by providing target features of higher quality. This dynamic was not observed in previous works [28, 56]." How does this make sense, given that the teacher is updated much more slowly than the student?
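
    One way to read it: the teacher is an exponential moving average of past student weights, which the paper likens to Polyak-Ruppert averaging - effectively an ensemble of recent student checkpoints, and such averages often outperform any single iterate. A minimal sketch of the momentum update (m is the paper's base value; the paper schedules it towards 1 with a cosine):

        import torch

        @torch.no_grad()
        def ema_update(student, teacher, m=0.996):
            # Teacher weights drift slowly towards the student's:
            # theta_t <- m * theta_t + (1 - m) * theta_s
            for ps, pt in zip(student.parameters(), teacher.parameters()):
                pt.data.mul_(m).add_((1.0 - m) * ps.data)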

  • @susdoge3767
    @susdoge3767 4 months ago

    I didn't properly understand sharpening and centering - can anyone help me understand it intuitively?

  • @freemind.d2714
    @freemind.d2714 3 years ago +2

    Basically: DINO = BYOL + Transformers

  • @godsondeep241
    @godsondeep241 3 years ago

    Can we use this for object detection tasks?

  • @larrybird3729
    @larrybird3729 3 years ago +14

    Would I rather watch Gordon Ramsay review the latest AI paper, or would I rather watch Yannic? That might answer your question, Yannic 😆

  • @roughr4044
    @roughr4044 3 years ago

    What clustering algo does it use on the features?

  • @calaldred2526
    @calaldred2526 3 years ago

    Yannic “Lightspeed” Kilcher strikes again

  • @sebbecht
    @sebbecht 3 years ago

    WTF. I found this and was just about to suggest it to you over LinkedIn, and thought... what if I just check whether there are any YouTube videos on it first...

  • @JoshuaGAlbert
    @JoshuaGAlbert 2 years ago

    Volume is low in this video.

  • @ssssssstssssssss
    @ssssssstssssssss 3 years ago +1

    This seems to be an unsupervised clustering algorithm to me. I guess calling it "self-supervised" sounds sexier.

  • @user-xv4us2ll2s
    @user-xv4us2ll2s 3 years ago +1

    So fast~

  • @cunningham.s_law
    @cunningham.s_law 3 years ago

    seems like attention is all you need

  • @lwang9175
    @lwang9175 3 years ago +1

    You can see the stripes of the horse... sorry, it's a zebra 🦓 hahaha

  • @HughesPerreault
    @HughesPerreault 3 years ago

    Commenting for the algo.

  • @djfl58mdlwqlf
    @djfl58mdlwqlf 3 years ago

    The cooking was a good video
    lol

  • @444haluk
    @444haluk 3 years ago

    The dataset argument is weak as well, because every human you know had a parent or somebody who looked after them in their childhood; no human grows up alone with the wolves. Hence "where to look" may be a social aspect of the human species - hell, of every species. I know cows have a type of attention and understanding which we would refer to as autistic: wherever they walk, if some unknown thing is in proximity, they freeze and freak out. Maybe they are not good cow-culture teachers after all.

  • @aminabbasloo
    @aminabbasloo 2 years ago

    Seems like a game theory problem to me!

  • @preethamgali3023
    @preethamgali3023 3 years ago

    It looks like double-Q learning. What do you think?

  • @laurenpinschannels
    @laurenpinschannels 3 years ago

    Off-topic thing - would you be open to adding donation options in a proof-of-stake coin? I don't have strong opinions about which one; I'd convert to whatever you think is a good option. I don't want to fund GPU demand with my donation :)

  • @scottmiller2591
    @scottmiller2591 3 years ago

    "Cooking video" - Wat.

  • @444haluk
    @444haluk 3 years ago +1

    Dude, the cooking video did terribly because in the thumbnail there is a "brown" object on the plate and it is pixelated. People may have associated it with, I don't know, LITERAL SHIT?

  • @nurkleblurker2482
    @nurkleblurker2482 3 years ago +3

    Yannic, your cooking video did terribly because this is an AI channel. None of your viewers want to see you cook, even if the recipe was written by an AI.

    • @oncedidactic
      @oncedidactic 3 years ago +1

      This is probably accurate. You know what I think would work better? A collaboration video with a cooking channel! You should get in touch with Andong

    • @tostupidforname
      @tostupidforname 3 years ago

      I mean, of course that's what happens if you make content aimed at a different audience. At the same time, branching out is necessary for channel growth, and most big channels went through a phase where they "changed audiences".
      I personally liked the cooking video