Vision Transformers (ViT) Explained + Fine-tuning in Python

  • Added 6 Sep 2024

Comments • 48

  • @zoelav1398
    @zoelav1398 10 months ago +6

    This was extremely clear and I was able to understand ViTs better. Thank you so much!

  • @fidelodok4393
    @fidelodok4393 a year ago +5

    Really enjoyed every bit. I'm trying to set up the transformer for an audio regression task; the ViT has shown amazing performance in classification.

    • @msg2clash
      @msg2clash a year ago

      Hi, can you suggest any articles on using transformers for audio classification? I would appreciate any help.

  • @philipplagrange314
    @philipplagrange314 a year ago +4

    Great video! I've watched quite a few videos and read papers about Transformers, but your video really made me understand the concept.

  • @blueaquilae
    @blueaquilae a year ago +2

    The clarity of your discourse is unmatched and it's always a pleasure to follow your videos. Kind of a side effect of your passion for the domain!?

    • @jamesbriggs
      @jamesbriggs a year ago +1

      thanks a ton, I'm glad it helps - and yep it's definitely a bonus doing the videos in such a cool domain

    • @blueaquilae
      @blueaquilae a year ago

      @jamesbriggs I think the content is so clear that you could imagine a full course bundle ^^

    • @jamesbriggs
      @jamesbriggs a year ago +1

      It is part of a (free) course/ebook :) www.pinecone.io/learn/image-search/

  • @zappist751
    @zappist751 a year ago +2

    James is the top G in deep learning

  • @aradhyadhruv9084
    @aradhyadhruv9084 11 months ago +1

    This is by far the best explanation of the paper that I could find. Thanks a lot!

  • @antient_atlas
    @antient_atlas a year ago +1

    Great explanation, unique on YT. Thanks!

  • @NikolaosTsarmpopoulos

    Very good introductory video. Thanks for sharing.

  • @lechavs
    @lechavs 10 months ago

    Oh man, really great explanation, easy to digest. Keep it up!

  • @matheusrdgsf
    @matheusrdgsf a year ago +2

    Incredible content! Thx James!

  • @salehahmad5625
    @salehahmad5625 10 months ago

    Great explanation. Very fine details. Great work.

  • @leonardvanduuren8708
    @leonardvanduuren8708 a year ago

    Another great video of yours. So clear and clarifying. Thx!

  • @fabianaltendorfer11
    @fabianaltendorfer11 11 months ago

    You are an inspiration, James.

  • @knorkeize
    @knorkeize 6 months ago

    At 5:10 it seems that the max pooling and conv layers are accidentally swapped. A max pooling layer has a smaller dimension than the preceding layer and usually comes after a convolution.
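    [Editor's note: a minimal PyTorch sketch, not taken from the video, illustrating the ordering the comment describes: a convolution followed by max pooling, with the pooled output having a smaller spatial size than the preceding conv output.]

    ```python
    import torch
    import torch.nn as nn

    x = torch.randn(1, 3, 224, 224)                    # one RGB image
    conv = nn.Conv2d(3, 64, kernel_size=3, padding=1)  # padding=1 keeps the spatial size
    pool = nn.MaxPool2d(kernel_size=2)                 # halves height and width

    features = conv(x)
    pooled = pool(features)
    print(features.shape)  # torch.Size([1, 64, 224, 224])
    print(pooled.shape)    # torch.Size([1, 64, 112, 112]) - smaller than the conv output
    ```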

  • @pranaymathur997
    @pranaymathur997 a year ago +1

    Thank you so much for this video :)

  • @EkShunya
    @EkShunya a year ago

    Thank you for the effort you're putting into your explanations. :)

  • @user-hx3hn1ni1o
    @user-hx3hn1ni1o 10 months ago

    Great video man. Keep it up👍👍

  • @PauClimentPerez
    @PauClimentPerez a year ago

    Well, Bag of Words and Bag of Visual Words WERE a merger of NLP and Computer Vision, back in the day (the 2010s).

  • @amanalok4647
    @amanalok4647 a year ago

    Thanks a lot for this! Amazing, amazing explanation!

  • @conairebyrne7298
    @conairebyrne7298 a year ago +2

    Great video man, cheers! Do you have a video about using a dataset made up of your own images with the vision transformer?

  • @user-wr4yl7tx3w
    @user-wr4yl7tx3w a year ago +1

    Is there anything similar to word embeddings? Or do you simply take your pixel data as patches and run it through a dense layer to get projections?
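    [Editor's note: a minimal sketch, assuming PyTorch and not taken from the video, of the patch-embedding step the question refers to: the image is cut into patches, each patch is flattened, and a learned linear projection maps it to the model dimension, playing roughly the role that a word-embedding layer plays in NLP (ViT additionally adds learned position embeddings).]

    ```python
    import torch
    import torch.nn as nn

    image = torch.randn(1, 3, 224, 224)       # (batch, channels, height, width)
    patch_size, dim = 16, 768

    # split into non-overlapping 16x16 patches and flatten each one
    patches = image.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
    patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, -1, 3 * patch_size * patch_size)
    print(patches.shape)                       # torch.Size([1, 196, 768]) -> 14*14 patches

    # learned linear projection: the ViT analogue of a word-embedding lookup
    projection = nn.Linear(3 * patch_size * patch_size, dim)
    patch_embeddings = projection(patches)
    print(patch_embeddings.shape)              # torch.Size([1, 196, 768])
    ```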

  • @Sara-he1fz
    @Sara-he1fz a year ago +1

    In this video, there is no explanation of the output of a vision transformer. In NLP transformers, the output is a probability distribution over the vocab, but in vision transformers I guess it is over a codebook. What this codebook is and how it is aligned to the input image is not clear. Thanks a lot for this video, but it is incomplete.

    • @jamesbriggs
      @jamesbriggs a year ago +1

      The output from an NLP transformer is a set of token-level embeddings, not a probability distribution over the vocab.
      The probability distribution over the vocab that you're referring to is actually an extra component (a head) that is used for Masked Language Modeling (MLM). ViT doesn't use MLM for pretraining (unlike NLP transformers), so an equivalent head isn't used.
      So the output of the ViT is actually the same as that of an NLP transformer: a set of token-level (for ViT, patch-level) embeddings. [A short code sketch follows this thread.]
      I hope that makes sense? Thanks!

    • @Sara-he1fz
      @Sara-he1fz a year ago +1

      @jamesbriggs Yes, it is very useful. You are right, the output is a token embedding, but the output for MLM is a probability distribution over the vocab in NLP. I guess MIM is used for pretraining in vision transformers; in that case there should be a codebook, if I am not mistaken.
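      [Editor's note: a short sketch of the point above, assuming the Hugging Face transformers library and the commonly used google/vit-base-patch16-224-in21k checkpoint (neither is named in the thread). The plain ViT encoder returns patch-level embeddings plus a [CLS] token, with no vocabulary or codebook head attached.]

      ```python
      import torch
      import requests
      from PIL import Image
      from transformers import ViTImageProcessor, ViTModel

      # pretrained ViT encoder only - no classification or MLM-style head
      processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
      model = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")

      url = "http://images.cocodataset.org/val2017/000000039769.jpg"
      image = Image.open(requests.get(url, stream=True).raw)

      inputs = processor(images=image, return_tensors="pt")
      with torch.no_grad():
          outputs = model(**inputs)

      # 197 = 1 [CLS] token + 14*14 = 196 patch embeddings, each of size 768
      print(outputs.last_hidden_state.shape)  # torch.Size([1, 197, 768])
      ```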

  •  a year ago +1

    Excellent video, James. Thank you!
    I have a question: how do you compute the 9.8 MM comparisons at 10:09?

  • @achukstok
    @achukstok a year ago

    Hey, thanks a lot. I've come from TensorFlow, so could you please answer: does this train the whole ViT model on our dataset, or does it freeze the pretrained ViT part and train only the classification head (like trainable=False in TF)?
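    [Editor's note: a hedged sketch of the second option asked about above, assuming the Hugging Face ViTForImageClassification class (not confirmed as the video's exact setup): freeze the pretrained ViT body so that only the new classification head is trained.]

    ```python
    from transformers import ViTForImageClassification

    model = ViTForImageClassification.from_pretrained(
        "google/vit-base-patch16-224-in21k", num_labels=10
    )

    # freeze the pretrained ViT encoder; only the classification head stays trainable
    for param in model.vit.parameters():
        param.requires_grad = False

    trainable = [name for name, p in model.named_parameters() if p.requires_grad]
    print(trainable)  # ['classifier.weight', 'classifier.bias']
    ```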

  • @scottkorman4953
    @scottkorman4953 a year ago

    Thanks a lot for the video. I can't find any precise explanation of what the self-attention layer and the MLP layer do in the encoder modules. Could you maybe add some information about that? [A brief sketch follows this thread.]

    • @dhaneshr
      @dhaneshr a year ago

      Go watch the nanoGPT video by Andrej Karpathy.
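      [Editor's note: a minimal PyTorch sketch addressing the question above (an illustration, not the video's code) of a single pre-norm transformer encoder block as used in ViT: self-attention lets every patch embedding gather information from every other one, and the MLP then transforms each embedding independently.]

      ```python
      import torch
      import torch.nn as nn

      class EncoderBlock(nn.Module):
          def __init__(self, dim=768, heads=12, mlp_dim=3072):
              super().__init__()
              self.norm1 = nn.LayerNorm(dim)
              self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
              self.norm2 = nn.LayerNorm(dim)
              self.mlp = nn.Sequential(
                  nn.Linear(dim, mlp_dim), nn.GELU(), nn.Linear(mlp_dim, dim)
              )

          def forward(self, x):
              h = self.norm1(x)
              attn_out, _ = self.attn(h, h, h)   # each position attends to all positions
              x = x + attn_out                   # residual connection
              x = x + self.mlp(self.norm2(x))    # position-wise MLP + residual
              return x

      block = EncoderBlock()
      tokens = torch.randn(1, 197, 768)          # [CLS] + 196 patch embeddings
      print(block(tokens).shape)                 # torch.Size([1, 197, 768])
      ```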

  • @diasposangare1154
    @diasposangare1154 2 months ago

    Please, can I have access to your PowerPoint?

  • @Diego0wnz
    @Diego0wnz a year ago

    It currently gives the error "No module named 'datasets'"; does anybody have a fix?
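    [Editor's note: this error usually just means the Hugging Face datasets package is missing from the environment; installing it is the likely fix (an assumption, since the full traceback isn't shown).]

    ```python
    # install the package first, e.g. from a terminal or notebook cell:
    #   pip install datasets
    from datasets import load_dataset  # imports cleanly once installed
    ```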

  • @suchiralaknath7576
    @suchiralaknath7576 a year ago

    This video is really helpful. Thank you!

  • @shaheerzaman620
    @shaheerzaman620 a year ago

    great stuff!

  • @RAZZKIRAN
    @RAZZKIRAN a year ago

    thank you sir

  • @dhaneshr
    @dhaneshr a year ago +2

    No fun using the Hugging Face transformers library; you should have explained vision transformers using a more basic implementation rather than a high-level library.

  • @rockwellthivierge9193

    Nice one..! This content desperately needs "Promo SM"!