OpenAI’s CLIP explained! | Examples, links to code and pretrained model

  • Published 29. 08. 2024

Comments • 58

  • @xxlvulkann6743
    @xxlvulkann6743 24 days ago +1

    Very well and succinctly explained! This channel is a great educational resource!

  • @SinanAkkoyun
    @SinanAkkoyun 1 year ago +7

    It's so cute that the coffee bean takes a pause when you take a breath!
    Thank you, this video was more conclusive than anything I've seen on CLIP; it really explained the intuition behind embedding image-text pairs as vectors and what that means.

  • @ravivarma5703
    @ravivarma5703 3 years ago +11

    This channel is Gold - Excellent

  • @pixoncillo1
    @pixoncillo1 3 years ago +10

    Wow, Letiția, what a piece of gold! Love your channel

  • @nasibullah1555
    @nasibullah1555 3 years ago +8

    Great job again. Thanks to Ms. Coffee Bean ;-)

    • @AICoffeeBreak
      @AICoffeeBreak  3 years ago +4

      Our pleasure! Or rather Ms. Coffee Bean's pleasure. I am just strolling along. 😅

  • @anilaxsus6376
    @anilaxsus6376 11 months ago +1

    I like the fact that you talked about the Ingredients they used, thank you very much for that.

  • @satishgoda
    @satishgoda 4 months ago +1

    Thank you so much for this succinct and action packed overview of CLIP.

    • @AICoffeeBreak
      @AICoffeeBreak  4 months ago +2

      Thank you for visiting! Hope to see you again.

  • @OguzAydn
    @OguzAydn 3 years ago +5

    underrated channel

  • @EpicGamer-ux1tu
    @EpicGamer-ux1tu 2 years ago +3

    Amazing video! This definitely deserves more views/likes. Congratulations. Much love.

  • @vince943
    @vince943 3 years ago +4

    Thank you for your continued research. ☕😇

    • @AICoffeeBreak
      @AICoffeeBreak  3 years ago +4

      Any time! Except when I do not have the time to make a video. 🤫

  • @Youkouleleh
    @Youkouleleh 3 years ago +5

    Thanks for the video :)

    • @AICoffeeBreak
      @AICoffeeBreak  3 years ago +3

      As always, it was Ms. Coffee Bean's pleasure! 😉

  • @cogling57
    @cogling57 2 years ago +4

    Wow, such amazing clear, succinct explanations!

  • @talk2yuvraj
    @talk2yuvraj 2 years ago +3

    This is an excellent video, congrats.

  • @harumambaru
    @harumambaru 3 years ago +3

    Thanks for teaching me something new today! I will try to return the favour and point out that "dog race" should be "dog breed" :) But as a non-native English speaker, it made perfect sense to me

    • @harumambaru
      @harumambaru 3 years ago +2

      Hunderasse is a pretty good word :)

    • @AICoffeeBreak
      @AICoffeeBreak  3 years ago +1

      You're right! Hunderasse was a false friend for me, thanks for pointing it out. 😅
      Do you also speak German?

    • @harumambaru
      @harumambaru 3 years ago +2

      @@AICoffeeBreak I am only learning it. After I moved to Israel for work and learned Hebrew, I decided not to stop the fun and to keep learning new languages. I made the bold guess that living in Heidelberg makes you speak German, then went to the Wikipedia page for "dog breed", found the German version of the page, and my guess was confirmed.

    • @AICoffeeBreak
      @AICoffeeBreak  3 years ago +2

      @@harumambaru True detective work! :) It's great you are curious and motivated enough to learn new languages. Keep going!

  • @romeoleon1118
    @romeoleon1118 3 years ago +4

    Amazing content ! Thanks for sharing :)

  • @user-vm4sv5cf9y
    @user-vm4sv5cf9y 5 months ago +1

    Thanks! Very clear!👍

  • @mikewise992
    @mikewise992 7 months ago +1

    Thanks!

  • @lewingtonn
    @lewingtonn 1 year ago +1

    LEGENDARY!!!

  • @mishaelthomas3176
    @mishaelthomas3176 3 years ago +5

    Thank you very much, ma'am, for such an insightful video tutorial. But I have one doubt. Suppose I train the CLIP model on a dataset consisting of two classes, i.e. dog and cat. After training, I test my model on two new classes, for example horses and elephants, in the same way as described in OpenAI's CLIP blog. Will it give me a satisfactory result, since you said it can perform zero-shot learning?

    • @AICoffeeBreak
      @AICoffeeBreak  3 years ago +5

      Hi Mishael, this is a little more complicated than that. If you train CLIP from scratch on two classes (dog and cat), it will not recognize elephants, no.
      The zero-shot capabilities of CLIP do not come from a magical understanding of the world and generalization capabilities, but from the immense amount of data CLIP has seen during pretraining. In my humble opinion, true zero-shot does not exist in current models (yet). It is just our human surprise that "the model has learned how to read" combined with our ignorance of the fact that the model had a lot of optical character recognition (reading) to do during pre-training. Or: look, it can make something out of satellite images, while its training data was full of those, just with a slightly different objective.
      The current state of zero-shot in machine learning is that a model has been trained on task A (e.g. aligning images containing text with the text transcription) and can then do another, similar task B (e.g. distinguishing writing styles or fonts).
      I am sorry this didn't come across so well in the video and that it left the impression that zero-shot is more than it is. Experts in the field know the limitations of this very well but like to exaggerate it a little bit to get funding and papers accepted; but also because even this limited type of zero-shot merits enthusiasm, because models had not been capable of this at all until recently.
      I might make a whole video about "how zero-shot is zero-shot". A tangential video on the topic is this one czcams.com/video/xqdHfLrevuo/video.html where it becomes clear how the wrong interpretation of the "magic of zero-shot" led to mislabeling some behavior of CLIP as an "adversarial attack", which it is not.

    • @AICoffeeBreak
      @AICoffeeBreak  3 years ago +5

      One thing to add: if you *fine-tune* CLIP on dogs and cats and test the model on elephants, it might recognize elephants; not because of your fine-tuning, but because of all the pre-training that has been done beforehand. But even this is not guaranteed: while fine-tuning, the model might catastrophically forget everything from pre-training.

  • @gkaplan93
    @gkaplan93 8 months ago +1

    Couldn't find the link to the Colab that lets us experiment, can you please attach it to the description?

    • @AICoffeeBreak
      @AICoffeeBreak  8 months ago +1

      The Colab link has become obsolete since the video went up (a lot is happening in ML). Now you can use CLIP much more easily since it has been integrated into Hugging Face: huggingface.co/docs/transformers/model_doc/clip This is the link included in the description right now.
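
      In case it helps, a minimal usage sketch with the transformers integration linked above. The checkpoint name "openai/clip-vit-base-patch32" and the image file name are illustrative assumptions, not something from the video:

      from PIL import Image
      from transformers import CLIPModel, CLIPProcessor

      model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
      processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

      captions = ["a photo of a dog", "a photo of a cat", "a photo of an elephant"]
      # The processor tokenizes the captions and preprocesses the image in one call.
      inputs = processor(text=captions, images=Image.open("example.jpg"),
                         return_tensors="pt", padding=True)
      outputs = model(**inputs)

      # logits_per_image holds one image-text similarity score per caption.
      probs = outputs.logits_per_image.softmax(dim=1)
      print(dict(zip(captions, probs[0].tolist())))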

  • @henkjekel4081
    @henkjekel4081 3 months ago +1

    Thank you so much :) So the vectors T from the text encoder and I from the image encoder are the latent representations of the last word/pixel that would normally be used to predict the next word?

    • @AICoffeeBreak
      @AICoffeeBreak  3 months ago +2

      Not the last word/image region, but a summary of the entire picture / text sentence. :)

    • @henkjekel4081
      @henkjekel4081 3 months ago +1

      @@AICoffeeBreak Thank you for your quick reply :) Let me see, so I do understand the transformer architecture very well. The text encoder will just consist of the decoder part of a transformer. Due to all the self-attention going on, the latent representation of the last word at the end of the decoder will contain the meaning of the entire sentence. That is why the model is able to predict the next word based on just the last word's latent representation. So could you elaborate on what you mean by the summary of the text sentence? Which latent representations are you talking about?

    • @henkjekel4081
      @henkjekel4081 3 months ago +1

      Hmm, I'm reading something about a CLS token, maybe that's it?

    • @AICoffeeBreak
      @AICoffeeBreak  3 months ago +2

      @@henkjekel4081 Yes, exactly! So, the idea of CLIP is that they need a summary vector for the image and one for the text to compare them via inner product.
      It is a bit architecture-dependent how exactly to get them. CLIP in its latest versions uses a ViT where the entire image is summarised in the CLS token. But the authors experimented with convolutional backbones as well, as the original implementation had two variants: one using a ResNet image encoder and the other using a Vision Transformer. The ViT variant became more widely available and popular.
      And yes, the text encoder happens to be a decoder-only autoregressive (causal attention) LLM, but it could just as well have been a bidirectional encoder transformer. The authors chose a decoder LLM so that future variants of CLIP could generate language too.
      But for CLIP as it is in the paper, all one needs is a neural net that outputs an image summary vector, and another one that outputs a text summary vector of the same dimensionality as the image vector.
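
      To make the summary-vector idea concrete, here is a minimal sketch using the Hugging Face transformers integration mentioned elsewhere in this thread; the checkpoint name and the file "example.jpg" are assumptions for illustration:

      import torch
      from PIL import Image
      from transformers import CLIPModel, CLIPProcessor

      model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
      processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

      with torch.no_grad():
          # One summary vector for the image (from the ViT CLS token in this variant).
          image_inputs = processor(images=Image.open("example.jpg"), return_tensors="pt")
          img_vec = model.get_image_features(**image_inputs)

          # One summary vector for the sentence, same dimensionality as the image vector.
          text_inputs = processor(text=["a photo of a dog"], return_tensors="pt", padding=True)
          txt_vec = model.get_text_features(**text_inputs)

      # Normalize, then the inner product is the cosine similarity CLIP is trained on.
      img_vec = img_vec / img_vec.norm(dim=-1, keepdim=True)
      txt_vec = txt_vec / txt_vec.norm(dim=-1, keepdim=True)
      print((img_vec @ txt_vec.T).item())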

  • @compilations6358
    @compilations6358 1 year ago +2

    What's that music at the end?

  • @y.l.deepak5107
    @y.l.deepak5107 10 months ago +1

    The Colab isn't working, ma'am, please kindly check it once

    • @AICoffeeBreak
      @AICoffeeBreak  10 months ago +1

      Thanks for noticing! The Colab link has become obsolete since the video went up (a lot is happening in ML). Now you can use CLIP much more easily since it has been integrated into Hugging Face: huggingface.co/docs/transformers/model_doc/clip
      I've updated the video description as well. :)

  • @andresredondomercader2023

    Hello, CLIP is impressive :) Is there a listing of all possible tags/results it can return?

    • @AICoffeeBreak
      @AICoffeeBreak  2 years ago +1

      CLIP can compute image-text similarity for any piece of text you input, as long as its vocabulary has been seen during training. I do not know the entire list exactly, but you can think of at least 30k English words.

    • @andresredondomercader2023
      @andresredondomercader2023 2 years ago +1

      @@AICoffeeBreak Many thanks Letitia. I think we are trying to use CLIP the other way around: It seems that the algorithm is great if you provide keywords to identify images containing objects related to those keywords. But we are trying to obtain keywords from a given image, and then categorise those keywords to understand what is in the image. Maybe I'm a bit lost in how CLIP works?

    • @AICoffeeBreak
      @AICoffeeBreak  2 years ago +3

      @@andresredondomercader2023 CLIP computes similarities between image and text. So what you can do is take the image and compute similarities to every word of interest. When the similarity is high, then the image is likely to contain that word and you have an estimate for what is in the image, right?

    • @andresredondomercader2023
      @andresredondomercader2023 2 years ago +1

      @@AICoffeeBreak Thanks so much for taking the time to respond. In our project, we have about 300 categories: "Motor", "Beauty", "Electronics", "Sports"... Each category could be defined by a series of keywords; for instance "Sports" is made up of keywords like "Soccer", "Basketball", "Athlete"..., whilst "Motor" is made of keywords such as "Motorbike", "Vehicle", "Truck"... Our goal would be to take an image and obtain the related keywords (items in the image) that would help us associate the image with one or more categories.
      I guess we could invert the process, i.e. pushing into CLIP the various keywords we have for each category and then analysing the results to see which sets of keywords resulted in the highest probability, hence identifying the related category, but that seems very inefficient, since for each image we'd do 300 iterations (we have 300 categories).
      However, if given an image CLIP returned the matching keywords most appropriate to it, we could more easily match those keywords returned by CLIP with our category keywords.
      Not sure if I'm missing something or maybe CLIP is just not suitable in this case.
      Thanks so much!

    • @AICoffeeBreak
      @AICoffeeBreak  2 years ago +3

      @@andresredondomercader2023 You are right, doing 300 iterations per image just to use it out of the box without changing much would be inefficient.
      But I would argue that:
      1. Inference is not that costly, and you can do the following optimizations:
      2. For one image: since the image stays the same during the 300 queries, you only have to run the visual branch once. That saves you a lot of compute. Then you only have to encode the text 300 times for the 300 labels, but that is quite fast because your textual sequence length is so small (mostly one word).
      3. For all images: you only have to compute the textual representations (run the textual branch) 300 times. Then you have the encodings.
      So a tip would be to compute the 300 textual representations (vectors) and store them. For each image, run the visual backbone and take the dot product of the image representation with the 300 stored textual representations, as in the sketch below.
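
      A rough sketch of that recipe with the Hugging Face integration; the keyword list, prompt template, and file names are placeholders, not from your project:

      import torch
      from PIL import Image
      from transformers import CLIPModel, CLIPProcessor

      model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
      processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

      keywords = ["soccer", "basketball", "motorbike", "truck"]  # ... extend to your ~300
      prompts = [f"a photo of {k}" for k in keywords]

      # 1) Run the text branch once and cache the normalized vectors.
      with torch.no_grad():
          text_inputs = processor(text=prompts, return_tensors="pt", padding=True)
          text_vecs = model.get_text_features(**text_inputs)
      text_vecs = text_vecs / text_vecs.norm(dim=-1, keepdim=True)

      # 2) Per image: run only the visual branch, then a single matrix product.
      def top_keywords(image_path, k=3):
          with torch.no_grad():
              image_inputs = processor(images=Image.open(image_path), return_tensors="pt")
              img_vec = model.get_image_features(**image_inputs)
          img_vec = img_vec / img_vec.norm(dim=-1, keepdim=True)
          scores = (img_vec @ text_vecs.T).squeeze(0)
          best = scores.topk(min(k, len(keywords)))
          return [(keywords[i], s.item()) for i, s in zip(best.indices.tolist(), best.values)]

      print(top_keywords("example.jpg"))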

  • @joaquinpunales4365
    @joaquinpunales4365 2 years ago +6

    Hi everybody :), we have been working locally with CLIP and exploring what we can achieve with the model. However, we are still not sure whether CLIP can be used in a production environment, I mean commercial usage. We have read CLIP's licence doc, but it's still not clear, so if someone has a clear idea whether that's allowed or not, I'd be more than grateful!

  • @arrozenescau1539
    @arrozenescau1539 7 months ago

    Great video

  • @renanmonteirobarbosa8129
    @renanmonteirobarbosa8129 3 years ago +3

    Make LSTMs great again, they are sad :/

    • @AICoffeeBreak
      @AICoffeeBreak  10 months ago +2

      I've been prompted by someone to think about whether LSTMs should still be part of fundamental neural network courses. What do you think?
      Is it CNNs and then Transformers directly? Or are LSTMs more than a historical digression?

    • @renanmonteirobarbosa8129
      @renanmonteirobarbosa8129 10 months ago +1

      @@AICoffeeBreak The concepts, and understanding why it works, are more important. LSTMs are fun

  • @ashikkamal7912
    @ashikkamal7912 2 years ago

    I have subscribed, brother