Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (paper illustrated)

  • Added 29. 08. 2024

Comments • 63

  • @phattailam9814 • 1 year ago +1

    Thank you so much for the explanation!

  • @mahmoudimus • 6 months ago +1

    Great explanation. Love the music + the voice :)

    • @AIBites • 6 months ago

      Thanks. Glad you liked it!

  • @kalluriramakrishna5732 • 1 year ago +1

    Thank you for your fabulous explanation.

  • @muhammadsalmanali1066 • 2 years ago

    Thank you so much for the explanation. Please keep the videos coming.

  • @robosergTV • 2 months ago +1

    Huh? ViT was the first backbone Transformer arch for vision, not Swin.

    • @AIBites • 1 day ago

      Awesome spot, and thanks for this info.

  • @user-ev8be1lk3x • 5 months ago +1

    This is brilliant!

  • @suke933 • 2 years ago +3

    Thanks for the video, dear AI Bites. I was struggling to understand the Swin architecture, and this elaborated it very clearly and to the point. But I would like to ask about "the motivation for different C value selection". Why is it important? If you could explain, it would give me a more meaningful understanding.
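A quick note on the C question above: C is just the channel width of the first stage, and later stages use 2C, 4C and 8C, so the paper's variants pick different C (and depths) mainly to trade accuracy against compute. A small illustrative Python snippet (variant values as reported in the paper; the dict and loop are only a sketch, not library code):

```python
# Swin variants from the paper; C is the stage-1 channel width and the
# per-stage widths are always C, 2C, 4C, 8C, so a larger C simply means a
# wider (and costlier) model at every stage.
swin_variants = {
    "Swin-T": {"C": 96,  "depths": (2, 2, 6, 2)},
    "Swin-S": {"C": 96,  "depths": (2, 2, 18, 2)},
    "Swin-B": {"C": 128, "depths": (2, 2, 18, 2)},
    "Swin-L": {"C": 192, "depths": (2, 2, 18, 2)},
}
for name, cfg in swin_variants.items():
    print(name, [cfg["C"] * 2 ** i for i in range(4)])
    # Swin-T -> [96, 192, 384, 768], Swin-L -> [192, 384, 768, 1536]
```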

  • @JC-ru4bp • 3 years ago +1

    Very clear explanation of the paper idea, thanks.

    • @AIBites • 3 years ago

      very encouraging to keep making videos :)

    • @JC-ru4bp • 3 years ago

      @AIBites Keep it up, man!

  • @tonywang7933 • 4 months ago +1

    Thank you!! So nicely explained

    • @AIBites • 4 months ago

      You're welcome. Would you like to see more papers explained, or more coding videos?

  • @deadbeat_genius_daydreamer

    This is seriously underrated. I enjoyed this visual approach. Thanks and regards for your efforts in making this explanation. Cheers🎊👍

    • @AIBites • 1 year ago

      Thank you so much Harshad! 😊

  • @garyhuntress6871 • 2 years ago +1

    Excellent review, thanks. I've subscribed for future papers! Do you use manim for your animations?

    • @AIBites • 2 years ago

      Hi Gary, Thanks for your comments! In some places I use manim but not always. :)

  • @manub.n2451 • 2 years ago +1

    Thank you so much

  • @tensing2009 • 2 years ago

    Great Video!
    Thanks for making it! :)

  • @arpita0608 • 2 years ago +1

    Thank you for illustrating this architecture. Can you please make more videos on the segmentation algorithms that are being used nowadays? Thanks.

    • @AIBites • 2 years ago +2

      Sure. Will plan to make one on SegFormers.

    • @arpita0608 • 2 years ago

      @AIBites Cool ❤️
      And thanks for this presentation.

  • @keroldjoumessi • 2 years ago +1

    Thanks for the video. It was awesome and easy to follow. That said, even if the window architecture reduces the complexity of computing the self-attention, I think we still have this computational issue for the overall image, and the attention becomes local as in CNNs instead of global as in RNNs. Anyway, thanks for your explanation.

    • @readera84 • 2 years ago +1

      How are you saying such complex things so easily 😫 I couldn't even understand what he said 🤕

    • @keroldjoumessi9597 • 2 years ago

      @readera84 What don't you understand? Maybe I can give you a hand.

    • @readera84 • 2 years ago

      @keroldjoumessi9597 The windows shifting diagonally... can you make it clearer to me?
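To illustrate the "shifting diagonally" question above: the paper implements the shift as a cyclic roll of the whole feature map by half a window before the usual window partition, so patches that sat on window borders in one block end up grouped together in the next. A minimal PyTorch sketch with toy sizes (my own illustration, not the official code):

```python
import torch

B, H, W, C = 1, 8, 8, 96        # toy feature map: batch, height, width, channels
window_size = 4
shift_size = window_size // 2   # the paper shifts by half the window size

x = torch.randn(B, H, W, C)

# Cyclic shift: roll the map up and to the left ("diagonally") by shift_size.
shifted = torch.roll(x, shifts=(-shift_size, -shift_size), dims=(1, 2))

# Then partition into the same non-overlapping windows as before and run
# attention inside each window; the roll is undone afterwards. (The official
# implementation also masks attention so patches wrapped in from the opposite
# edge cannot attend to each other.)
windows = shifted.view(B, H // window_size, window_size,
                       W // window_size, window_size, C)
windows = windows.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size ** 2, C)
print(windows.shape)            # (num_windows * B, tokens per window, C) = (4, 16, 96)
```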

  • @triminh3849 • 2 years ago

    great video with excellent visualization, thanks a lot

  • @muhammadwaseem_ • 1 year ago +1

    Good explanation

  • @harutmargaryan9980 • 2 years ago

    Thank you, well done!

  • @user-gy9ef7mr7g • 1 year ago +1

    Great explanation

  • @rybdenis • 3 years ago +1

    cool, thank you

  • @kashishbansal2651 • 3 years ago

    AMAZING EXPLANATION!

  • @TheMomentumhd • 2 years ago

    Do you think these Swin Transformers would be useful in real-time object detection (are they fast enough)?

  • @anonymous-random • 3 years ago

    The video is awesome! Thanks a lot!

  • @sanjeetpatil1249 • 1 year ago

    Can you kindly explain this line in the paper, related to the patch merging layer, "The first patch merging layer concatenates the features of each group of 2 × 2 neighboring patches, and applies a linear layer on the 4C-dimensional concatenated features".
    Thank you for the video
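That quoted sentence maps almost one-to-one onto a few lines of PyTorch. A rough sketch of the patch merging step (shapes and the class name PatchMerging are illustrative; this follows the description in the paper rather than the official code):

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Concatenate each 2x2 group of neighbouring patches (C -> 4C), then
    apply a linear layer that projects 4C -> 2C, halving the resolution."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):           # x: (B, H, W, C), H and W assumed even
        x0 = x[:, 0::2, 0::2, :]    # top-left patch of every 2x2 group
        x1 = x[:, 1::2, 0::2, :]    # bottom-left
        x2 = x[:, 0::2, 1::2, :]    # top-right
        x3 = x[:, 1::2, 1::2, :]    # bottom-right
        x = torch.cat([x0, x1, x2, x3], dim=-1)   # (B, H/2, W/2, 4C)
        return self.reduction(self.norm(x))       # (B, H/2, W/2, 2C)

out = PatchMerging(dim=96)(torch.randn(1, 56, 56, 96))
print(out.shape)                    # torch.Size([1, 28, 28, 192])
```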

  • @djeros666 • 3 years ago

    Thank you for the great effort.

  • @saeedataei269 • 2 years ago +1

    Thanks for the explanation. Please review more SOTA papers.

    • @AIBites • 2 years ago +1

      Sure will do Saeed! Thx. 🙂

  • @jialima8298 • 2 years ago

    Love the voice!

  • @parveenkaur2747 • 3 years ago +1

    Very informative video!

    • @AIBites • 3 years ago

      Thanks! Glad you liked it.

  • @taoufiqelfilali2224 • 3 years ago

    Great explanation, thank you!

    • @AIBites • 3 years ago

      Thanks for your positive comment! :)

  • @EngRiadAlmadani • 2 years ago +2

    Thanks for this great video. Just one question: why do we use a linear layer in patch merging when we could reshape the input patches directly using the reshape method?

    • @AIBites • 2 years ago +2

      Great question. One thing I can think of is efficiency. I also believe it is challenging to propagate gradients backwards through a plain reshape.

    • @Deshwal.mahesh • 2 years ago +1

      Maybe they're trying to make the model learn how to merge with knowledge? Just like solving a graphical puzzle?

    • @suke933 • 2 years ago

      @AIBites Can we use a convolution in this scenario?
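On the reshape-vs-linear question in this thread: a reshape alone only regroups the existing values into 4C channels, with no learned parameters and no reduction back to 2C, whereas the linear layer both mixes the four patches and halves the channel count. And on the convolution question, a 2x2 stride-2 convolution applies the same kind of per-block linear map, so the merge could in principle be written that way. A short sketch under those assumptions (toy shapes, my own illustration):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 56, 56, 96)                   # (B, H, W, C)

# Reshape only: regroup each 2x2 block into 4C channels. No parameters,
# nothing learned, and the channel count stays at 4C.
merged = (x.view(1, 28, 2, 28, 2, 96)
           .permute(0, 1, 3, 2, 4, 5)
           .reshape(1, 28, 28, 4 * 96))          # (1, 28, 28, 384)

# Reshape followed by a learned projection 4C -> 2C, as in patch merging.
reduction = nn.Linear(4 * 96, 2 * 96, bias=False)
out = reduction(merged)                          # (1, 28, 28, 192)

# A Conv2d with kernel 2 and stride 2 applies an equivalent learned linear
# map over each non-overlapping 2x2 block (channels-first layout here).
conv = nn.Conv2d(96, 192, kernel_size=2, stride=2, bias=False)
out_conv = conv(x.permute(0, 3, 1, 2))           # (1, 192, 28, 28)
print(out.shape, out_conv.shape)
```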

  • @harshkumaragarwal8326 • 3 years ago

    great work, thanks :)

  • @rajatayyab7737 • 3 years ago +1

    Next should be Dynamic Head: Unifying Object Detection Heads with Attentions.

    • @rybdenis • 3 years ago

      agreed

    • @AIBites • 3 years ago

      Thanks Raja for pointing out. We will try to prioritise the paper at some point.

  • @anhminhtran7609 • 3 years ago

    Can you cover a bit more on using Swin for object detection, please?

  • @peddisaivivek6676 • 2 years ago

    Great video. But could you refrain from putting music in the background while explaining? It's a little distracting when viewing at higher speed.

    • @AIBites • 2 years ago

      Sure will take it on board when making the future ones 👍

  • @nguyenanhnguyen7658 • 3 years ago

    In NLP, you have at most 100,000 words to permute and train with. With images? Well, ViT with 400m images can hardly manage to match ImageNet :)