OpenAI CLIP model explained

  • Published 13. 09. 2024
  • CLIP: Contrastive Language-Image Pre-training
    In this video, I describe the CLIP model published by OpenAI. CLIP is based on natural language supervision for pre-training. Natural language supervision is not a new idea; in fact, there are two approaches to it. One approach tries to predict the exact caption for each image, whereas the other is based on a contrastive loss: instead of predicting the exact caption, it tries to increase the similarity of correct (image, text) pairs.
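
    As a minimal sketch (not from the video or the official CLIP code) of the contrastive objective described above: the encoders themselves are left out, `image_features` and `text_features` stand for batches of features from a hypothetical image encoder and text encoder, and the fixed `temperature` value is illustrative.

    ```python
    import torch
    import torch.nn.functional as F

    def clip_contrastive_loss(image_features: torch.Tensor,
                              text_features: torch.Tensor,
                              temperature: float = 0.07) -> torch.Tensor:
        """image_features, text_features: [batch, dim] outputs of the two encoders."""
        # L2-normalize so the dot product becomes a cosine similarity.
        image_features = F.normalize(image_features, dim=-1)
        text_features = F.normalize(text_features, dim=-1)

        # Pairwise similarity matrix: entry (i, j) compares image i with caption j.
        logits = image_features @ text_features.t() / temperature

        # Correct (image, caption) pairs sit on the diagonal; cross-entropy pushes
        # their similarity up and all mismatched pairs down, in both directions.
        targets = torch.arange(logits.size(0), device=logits.device)
        loss_images = F.cross_entropy(logits, targets)      # images -> captions
        loss_texts = F.cross_entropy(logits.t(), targets)   # captions -> images
        return (loss_images + loss_texts) / 2
    ```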

Comments • 8

  • @AI_For_Scientists · 2 days ago

    Great video series on ViT and its derivatives, watched all of it. Thank you very much for sharing.

  • @SebastianRaschka · 3 months ago

    Very nice video! I can also imagine that predicting the caption text exactly isn't only more difficult, but would also be more likely to result in (more) overfitting if it is learned this way.
    At 5:43, the pair-wise similarities: are they basically like cross-attention scores?

    • @PyMLstudio · 3 months ago · +1

      Yes, in a way it's analogous to cross-attention: we take the dot product between the features from the text encoder and the image encoder. This dot-product similarity is used as the final output of the model to determine whether an image and a text caption are related.
      Good question, thanks for the comment.
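
      A tiny illustrative sketch of what the reply describes at inference time (not code from the video): `image_feature` and `text_features` are assumed to be outputs of the image and text encoders, and the caption with the highest dot-product similarity is taken as the match.

      ```python
      import torch
      import torch.nn.functional as F

      def best_caption(image_feature: torch.Tensor, text_features: torch.Tensor) -> int:
          """image_feature: [dim]; text_features: [num_captions, dim]; returns index of best match."""
          # Normalize so the dot product becomes a cosine similarity.
          image_feature = F.normalize(image_feature, dim=-1)
          text_features = F.normalize(text_features, dim=-1)
          # One similarity score per candidate caption; the highest one wins.
          scores = text_features @ image_feature
          return int(scores.argmax())
      ```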

  • @fouziaanjums6475 · 2 months ago · +2

    Please cover the FasterViT model too...

    • @PyMLstudio · 2 months ago

      Absolutely, I'll cover that. I have a few other topics lined up, then I'll get to FasterViT.
      Thanks for the suggestion!

  • @randomstuff39280 · 1 month ago

    Thank you for explaining! Very clear!
    But I'm wondering how you know the WiT dataset is based on 50000 queries and 20000 pairs for each query? I can't find it in the paper.

    • @PyMLstudio · 26 days ago · +1

      Thanks for the comment!
      Please see page 3, Section 2.2, "Creating a sufficiently large dataset".
      But it's 500,000 queries, balancing up to 20,000 (image, text) pairs per query.