Swin Transformer paper animated and explained

  ‱ Added 17 July 2024
  • Swin Transformer paper explained, visualized, and animated by Ms. Coffee Bean. Find out what the Swin Transformer proposes to do better than the ViT vision transformer.
    đŸ“ș ViT explained: ‱ An image is worth 16x1...
    đŸ“ș Transformer explained: ‱ The Transformer neural...
    đŸ“șâ–ș Positional embeddings (playlist): ‱ Positional encodings i...
    ▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
    Thanks to our Patrons who support us in Tier 2, 3, 4: 🙏
    donor, Dres. Trost GbR, Yannik Schneider
    âžĄïž AI Coffee Break Merch! đŸ›ïž aicoffeebreak.creator-spring....
    đŸ”„ Optionally, pay us a coffee to help with our Coffee Bean production! ☕
    Patreon: / aicoffeebreak
    Ko-fi: ko-fi.com/aicoffeebreak
    ▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
    Paper discussed:
    📜 Liu, Ze, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. "Swin transformer: Hierarchical vision transformer using shifted windows." arXiv preprint arXiv:2103.14030 (2021). arxiv.org/abs/2103.14030
    đŸ’» Swin Transformer code on GitHub: github.com/microsoft/Swin-Tra...
    Outline:
    00:00 Problems with ViT / Swin Motivation
    04:16 Swin Transformer explained
    06:00 Shifted Window based Self-attention
    08:58 Positional embeddings in the Swin Transformer
    09:29 Task performance of the Swin Transformer
    Music đŸŽ” : Bay Street Millionaires by Squadda B
    ---------------------
    🔗 Links:
    AICoffeeBreakQuiz: / aicoffeebreak
    Twitter: / aicoffeebreak
    Reddit: / aicoffeebreak
    YouTube: / aicoffeebreak
    #AICoffeeBreak #MsCoffeeBean #MachineLearning #AI #research​
    Video and thumbnail contain emojis designed by OpenMoji - the open-source emoji and icon project. License: CC BY-SA 4.0
    16x16 pixels comprehensible artificial intelligence
  ‱ Science & Technology

Comments ‱ 99

  • @SomexGupta
    @SomexGupta 1 year ago +29

    Awesome video, the concept is explained in a very easy-to-understand way.
    Small query: at 2:21, when we divide 256*256 pixels into 16*16-pixel patches, the total number of tokens should be 256, since (256*256)/(16*16) = 256 tokens, but the explanation mentions 16 tokens. Can you guide me on this?

    • @AICoffeeBreak
      @AICoffeeBreak  11 months ago +4

      Hi, you are right, my mistake. Pinned your comment, thanks!

  • @CristianGarcia
    @CristianGarcia 2 years ago +33

    Alternative title for the paper:
    Convolutional Transformer.

  • @astroferreira
    @astroferreira 2 years ago +30

    Great video! I think the passage in the abstract is related to the fact that text has a fixed scale compared to images. The smallest piece of text you can have is a single character while for images, a single pixel can represent wildly different scales and can't really be considered the 'smallest scale possible'. In microscopy a single pixel can have scales of 1e-4 m while for astronomy a single pixel can represent kiloparsecs or ~1e19 m.

  • @minhquanao7492
    @minhquanao7492 2 years ago +14

    I think the idea of applying the Transformer over a small window also appears in "Deformable DETR: Deformable transformers for end-to-end object detection". However, like deformable convolution, that paper lets the model learn the locations each patch attends to rather than fixing the attention window (e.g. the immediate 3x3 neighborhood).

  • @AnilKeshwani
    @AnilKeshwani 2 years ago +8

    My gosh these video explainers are good. Fantastically clear and intuitively presented

  • @visintel
    @visintel 1 year ago +3

    I love the low-key comparison to simple convolution. Looks like we came full circle, lol.

  • @tane_ma
    @tane_ma 2 years ago +11

    I am a new fan of the channel. Always good and quick explanations, solid logic/storytelling, animations, segmented sections, a good video length, and links to the paper and repo in the description ❀

    • @AICoffeeBreak
      @AICoffeeBreak  2 years ago +2

      Hey, thanks for the kind words! Happy to have you here.

  • @alouped
    @alouped 2 years ago +1

    Nice videos, thanks for putting in the work.

  • @anirudhthatipelli8765
    @anirudhthatipelli8765 1 year ago +1

    Thanks a lot! This was very detailed!

  • @littlevu735
    @littlevu735 2 years ago +2

    Great channel, keep going!

  • @SuperShadowmasterZ
    @SuperShadowmasterZ 2 years ago +7

    I saw a similar transformer usage in Fastformer: Additive Attention Can Be All You Need

  • @nilsmuller9286
    @nilsmuller9286 2 years ago +5

    Great content as always. :)

  • @RAZZKIRAN
    @RAZZKIRAN 1 year ago +1

    Great channel, thank you!

  • @madhavjariwala4548
    @madhavjariwala4548 2 years ago +2

    Thank you for this video. You're the best!

  • @Peebuttnutter
    @Peebuttnutter 2 years ago +3

    thanks!!

  • @erdemakagunduz2078
    @erdemakagunduz2078 2 years ago +6

    Great video. But if we must compare a Fyodor Dostoevsky novel to something in vision, it is not a single image, it is an Andrei Tarkovsky movie. So, moral of the story: vision still rocks! :)

  • @Harry-jx2di
    @Harry-jx2di 1 year ago +1

    Thanks!

  • @amreamer362
    @amreamer362 2 years ago +1

    Very awesome

  • @fast_harmonic_psychedelic
    @fast_harmonic_psychedelic 2 years ago +2

    Is it able to encode text, or can the image projection be compared via cosine similarity like CLIP? Can this replace CLIP? Let me know in the comments below

    • @AICoffeeBreak
      @AICoffeeBreak  2 years ago +2

      It's a transformer, so sure, you can have the two branches in CLIP replaced by two Swin Transformers.

    • @fast_harmonic_psychedelic
      @fast_harmonic_psychedelic 2 years ago +1

      @@AICoffeeBreak I tried it but I can't figure it out. There are so many outputs from Swin with different shapes that are incompatible. I tried to have it encode separately, side by side with CLIP, and then maybe take a mean of both encodings, but there were just too many errors and parameters to change, so I ended up giving up.
      What I don't understand is: what's the point of this without some sort of text module? Like, what does it do? Lol, it just takes the image and outputs the same image?

    • @fast_harmonic_psychedelic
      @fast_harmonic_psychedelic 2 years ago +1

      Like, I can understand that if this were replacing CLIP's ViT it would be magical to get attention at all these different scales. But alone, with no understanding of token-embedding similarity to image patches, is it just good for benchmarking, or what? Lol

    • @AICoffeeBreak
      @AICoffeeBreak  2 years ago

      @@fast_harmonic_psychedelic I get your problem. So, no. This image-only transformer (in its current form) basically autoencodes the image, yes. But there are special [CLS] tokens to solve tasks like image recognition.

    • @fast_harmonic_psychedelic
      @fast_harmonic_psychedelic 2 years ago

      @@AICoffeeBreak is there some sort of map of CLS tokens that someone could refer to in order to activate certain features?

  • @debanjanchakraborty9946
    @debanjanchakraborty9946 1 year ago +2

    Really love your content. I actually switched algorithms because they don't run on my system and I wanted more accurate results.

  • @ThamizhanDaa1
    @ThamizhanDaa1 2 years ago +4

    I think Swin Transformer performance should be compared with other ConvNets for semantic segmentation, including regular-size DeiT... you're right, it's pretty deceptive to ignore those results haha. But then again, this is a good idea for self-attention regardless.

  • @keroldjoumessi
    @keroldjoumessi 2 years ago +4

    Very nice video. I really enjoyed it, as it was quite easy to follow with no prior knowledge. However, I don't quite understand why we still need to transform the patch vectors (feature dimensionality) into another dimensionality C. In other words, what is the idea behind this transformation (from the initial feature dimensionality to another C-dimensional feature space)?

    • @philip2.042
      @philip2.042 1 year ago

      Because we're merging multiple vectors from the self-attention layer into one, we enlarge our representation vector (C) under the hypothesis that it will better capture the extra information coming from larger patches.
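
      For concreteness, a minimal sketch of that patch-merging step, assuming PyTorch and the usual (B, H*W, C) token layout; the class and variable names here are illustrative, not taken from the official Microsoft repo:

      import torch
      import torch.nn as nn

      class PatchMergingSketch(nn.Module):
          # Merge every 2x2 neighborhood of tokens and project 4C -> 2C,
          # halving the spatial resolution while widening the channels.
          def __init__(self, dim):
              super().__init__()
              self.norm = nn.LayerNorm(4 * dim)
              self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

          def forward(self, x, H, W):
              B, L, C = x.shape                 # L == H * W tokens
              x = x.view(B, H, W, C)
              x0 = x[:, 0::2, 0::2, :]          # top-left token of each 2x2 block
              x1 = x[:, 1::2, 0::2, :]          # bottom-left
              x2 = x[:, 0::2, 1::2, :]          # top-right
              x3 = x[:, 1::2, 1::2, :]          # bottom-right
              x = torch.cat([x0, x1, x2, x3], dim=-1)        # (B, H/2, W/2, 4C)
              x = x.view(B, (H // 2) * (W // 2), 4 * C)
              return self.reduction(self.norm(x))            # (B, H*W/4, 2C)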

  • @soumyasarkar4100
    @soumyasarkar4100 2 years ago +17

    Isn't shifted-window-based self-attention similar to local attention in the Longformer?

    • @AICoffeeBreak
      @AICoffeeBreak  2 years ago +16

      đŸ€« you're diminishing the novelty.

    • @LKRaider
      @LKRaider 2 years ago +3

      @@AICoffeeBreak LOL

  • @paoloceric6464
    @paoloceric6464 2 years ago +16

    Nice video, but I think you made a mistake when calculating the number of patches (in both the 256x256 and the 1920x1920 example). 16x16 patches would produce 256 patches in the first image and 14400 in the second, not 16 and 120.

    • @AICoffeeBreak
      @AICoffeeBreak  2 years ago +2

      It's totally possible I made a mistake, but for the moment, I do not get it. We said that a 256^2 pixel image would need 16 of those 16^2 patches. A 1920^2 pixel image would need 14400 of those 16^2 patches. How do you calculate this?

    • @spongemeryl
      @spongemeryl 2 years ago

      Same comment/doubt here, maybe I didn't quite get it right, but isn't 256^2/16^2 = 256, and 1920^2/16^2 = 14400?

    • @paoloceric6464
      @paoloceric6464 2 years ago +3

      @@AICoffeeBreak Okay, then it seems I don't get what a patch/image vector actually is. You said "if the image is 256x256 pixels then extracting 16x16 patches would lead to 16 patches", but why only 16? If we divide a 256x256 image into 16x16 squares, we get 256 squares, that's my only point. If we indeed only use 16 of those 256 squares then my question is - why?

    • @AICoffeeBreak
      @AICoffeeBreak  2 years ago +2

      @@paoloceric6464 An image of 256x256 pixels has 256 pixels in width and 256 pixels in height. A 16x16 patch is a pixel tile of 16 pixels in width and 16 pixels in height. How many of these patches do you need to achieve a complete tiling of the image?
      16. Because 256/16 = 16. So we need 16 patches to tile the image.

    • @patakk8145
      @patakk8145 2 years ago +8

      @@AICoffeeBreak Are you sure? You can't just divide 256 by 16; that only gives you the number of patches along one dimension (e.g. the width). To fill the whole area you need 256 patches.
      Or you can think of it as 256x256 = 65536 total pixels that you're filling with patches of 16x16 = 256 pixels each. There are obviously 256 of them in the whole image.
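
      A quick sanity check of the counts discussed in this thread, as a plain-Python sketch (the image and patch sizes are the ones from the video):

      def num_patches(image_size, patch_size=16):
          # Non-overlapping tiling: patches per side, squared.
          per_side = image_size // patch_size
          return per_side * per_side

      print(num_patches(256))   # 256 patches for a 256x256 image
      print(num_patches(1920))  # 14400 patches for a 1920x1920 image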

  • @giantbee9763
    @giantbee9763 2 years ago +3

    Very nice video :) !

    • @AICoffeeBreak
      @AICoffeeBreak  2 years ago +4

      Thanks for the visit!
      I saw you commented something on the "Generalization - Interpolation - Extrapolation video" but the comment is no longer available. Either:
      1. you removed it
      2. YT removed it (did you have a link in there?)
      But I did not remove it. I am actually quite curious to know what you had to say there. :)
      I am mentioning this because I have previously had good comments removed by YT without any action on my part, and people were a little perplexed and confused about why I was censoring them. đŸ€

    • @giantbee9763
      @giantbee9763 2 years ago +3

      @@AICoffeeBreak Hi Letitia, yup, I did comment on the video, but I ended up removing it, so it wasn't the YouTube algorithm this time. :D
      That's all :D I had been living under the rock of "not using Twitter", so I'm probably quite late to the party anyway.

    • @AICoffeeBreak
      @AICoffeeBreak  2 years ago +4

      Haha, great to hear then that YT is not messing with comments this time. :) Still curious what you had to say. I guess it will stay forever a mystery. đŸ€«

  • @shubhamsuryavanshi1461
    @shubhamsuryavanshi1461 1 year ago

    Great work 😃. Could you please make a video on deformable transformers for end-to-end object detection? â˜ș

  • @Jack-gb1nw
    @Jack-gb1nw 2 years ago

    Was it potentially the Longformer or the Reformer NLP papers that reminded you of localised attention?

  • @syedadzha362
    @syedadzha362 1 year ago

    Amazing video

  • @asn9329
    @asn9329 2 years ago +2

    Can this transformer be used for super-resolution tasks with unpaired data?

  • @veggeata1201
    @veggeata1201 1 year ago +1

    I'm not sure of the origin of windowed attention, but it is used in BigBird along with other sparse attention methods.

  • @sachinlodhi8542
    @sachinlodhi8542 1 year ago +3

    At 2:23, how would the 16x16-pixel patches generated from a 256x256 image sum up to 16? Wouldn't there be a total of 256 patches of 16x16?

    • @AICoffeeBreak
      @AICoffeeBreak  11 months ago +1

      Hi, you are right, my mistake. I've pinned a comment explaining this, thanks!

  • @kristoferkrus
    @kristoferkrus 10 months ago +1

    Great video! And I know you published it close to two years ago, but about the window-limited self attention, I guess that's pretty standard in generative LLMs nowadays, such as Llama or the GPT family by OpenAI?

    • @kristoferkrus
      @kristoferkrus 10 months ago +1

      But maybe I'm diminishing the novelty now 😁

    • @AICoffeeBreak
      @AICoffeeBreak  8 months ago +2

      Yes, it is the case for long-context Transformers. But the problem there is that, by the end, the network forgets what was said at the beginning. The paper on attention sinks is a simple, hacky solution to that.

    • @kristoferkrus
      @kristoferkrus 8 months ago +1

      @@AICoffeeBreak Thanks; I will check it out!

  • @lucasbeyer2985
    @lucasbeyer2985 2 years ago +3

    That paper where you've seen this before is either HaloNet or SaSaNet (standalone self-attention)

  • @undefined-mj6oi
    @undefined-mj6oi 2 years ago +3

    2:28
    Could you please explain why 16*16 patches lead to 16 image tokens here?

    • @AICoffeeBreak
      @AICoffeeBreak  2 years ago +5

      😅 No, I can't, because it leads to 256 image tokens. See the whole comment thread by @Paolo Čerić here, where he was the first to make me realize this mistake.

    • @undefined-mj6oi
      @undefined-mj6oi 2 years ago +1

      @@AICoffeeBreak Got it! Thanks!

  • @reasoning9273
    @reasoning9273 1 year ago +1

    I think you forgot to square 120. A 1920x1920 resolution will generate 14.4k image tokens of size 16x16, which is about 3164 times more computation compared to the 256x256 case when calculating dot-product attention. I don't think any single GPU can manage this calculation.
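
    A quick back-of-the-envelope check of that factor, as a plain-Python sketch (it assumes the cost of dense dot-product attention grows with the square of the token count):

    tokens_small = (256 // 16) ** 2    # 256 tokens for a 256x256 image
    tokens_large = (1920 // 16) ** 2   # 14400 tokens for a 1920x1920 image

    # Dense self-attention cost scales roughly quadratically with the token count.
    cost_ratio = (tokens_large / tokens_small) ** 2
    print(round(cost_ratio))           # ~3164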

  • @sachinlodhi8542
    @sachinlodhi8542 1 year ago +2

    At 3:30, how do 256x256 pixels result in 63504?

    • @AICoffeeBreak
      @AICoffeeBreak  11 months ago

      Hi, you are right, my mistake. I've pinned a comment explaining this, thanks!

  • @kaustavdas6550
    @kaustavdas6550 3 months ago

    Casa? Cascading Self attention seems similar?

  • @Jose-pq4ow
    @Jose-pq4ow 2 years ago +5

    The tricks needed to efficiently run these models on computer vision tasks seem to be too "complex" in comparison to standard CNNs....

    • @AICoffeeBreak
      @AICoffeeBreak  2 years ago +14

      Yeah, it looks quite messy at the moment. On the other hand, the tricks needed to get CNNs to work were complex in their time too (pooling, dropout, fully convolutional architectures, batch norm, etc.). It's just that we got used to them (and educated about them).
      After the current proliferation of tricks to make the transformer more data-efficient and get it to work on long sequences, half a dozen of them will stick and will be taught to posterity as actually quite simple tricks. It looks like quite a mess because we are not there yet.

  • @mrigankanath7337
    @mrigankanath7337 11 months ago +1

    If the image size is 256 x 256 and the patch size is 16 x 16, shouldn't there be 256 tokens? (256 x 256) / (16 x 16) = 256

    • @AICoffeeBreak
      @AICoffeeBreak  11 months ago

      Hi, you are right, my mistake. I've pinned a comment explaining this, thanks!

  • @ishaqkhan5418
    @ishaqkhan5418 8 months ago +1

    It's a really great video, but maybe you could have explained the architecture in a little more detail; 3-4 more minutes would have made it the best.
    Anyway, thank you for the great content!

    • @AICoffeeBreak
      @AICoffeeBreak  8 months ago +1

      Thanks for your feedback! :) Appreciate it!

  • @chez8990
    @chez8990 9 months ago +1

    Longformer restricts the attention window to expand the token limit.

    • @AICoffeeBreak
      @AICoffeeBreak  9 months ago +1

      Thanks for this; Longformer is a great reference. Even before the Swin Transformer, there were papers restricting the attention window. This idea has since become even more widespread.

  • @toyuyn
    @toyuyn 2 years ago +10

    Isn't that just local attention?
    "Yeah, but you can achieve global attention at later layers because of the receptive fields."
    Isn't that what CNNs do? Then why bother with transformers?
    "..."
    Something something attention, something something dynamic convolutions.

    • @elinetshaaf75
      @elinetshaaf75 2 years ago +5

      Whatever buzzword makes a publication these days.

    • @AICoffeeBreak
      @AICoffeeBreak  2 years ago +4

      It seems that a lot of research nowadays is about introducing some of the inductive biases of CNNs into the transformer.
      What is better than a complete related-work section? An incomplete one and a paper that claims to be the first to have invented the wheel. :)

    • @VVi11
      @VVi11 2 years ago +2

      pretty much

    • @cipritom
      @cipritom 2 years ago +1

      My thoughts exactly. So the gains (over ConvNets) must come from somewhere else. And indeed, a few months later, we have ConvNeXt showing that the gains do indeed come from other parts, not from attention.
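
      To make the "isn't this just local attention" point above concrete, here is a minimal sketch of the two tensor operations behind shifted-window attention, window partitioning and the cyclic shift, assuming PyTorch; the function name and toy sizes are illustrative, and the real Swin code additionally masks attention across the wrap-around boundary and adds a relative position bias:

      import torch

      def window_partition(x, window_size):
          # (B, H, W, C) feature map -> (num_windows * B, window_size * window_size, C)
          B, H, W, C = x.shape
          x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
          return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)

      x = torch.randn(1, 8, 8, 96)                  # toy 8x8 token grid, C = 96
      regular = window_partition(x, window_size=4)  # 4 windows of 16 tokens each

      # Cyclic shift by half a window before the next attention layer, so tokens that
      # sat on a window border now share a window with their former neighbors.
      shifted = torch.roll(x, shifts=(-2, -2), dims=(1, 2))
      shifted_windows = window_partition(shifted, window_size=4)  # again 4 windows of 16 tokens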

  • @DerPylz
    @DerPylz 2 years ago +21

    Shifted WINDOWS transformer by Microsoft research đŸ€”đŸ€”đŸ€”đŸ€”

    • @AICoffeeBreak
      @AICoffeeBreak  2 years ago +18

      It's because they could not shift the Linux, lol.

    • @DerPylz
      @DerPylz 2 years ago +5

      @@AICoffeeBreak I prefer shifted Apple transformers. Even though they are often confused with pizza...

  • @yusufani8
    @yusufani8 1 year ago +2

    I am putting a counter here for how many times I forget what the Swin Transformer does.
    Counter = 1

  • @lucasbeyer2985
    @lucasbeyer2985 2 years ago +2

    Haha no need to be triggered. By "scale of visual entities" they mean "size of things in the picture", so that sometimes an orange covers just 10 pixels and sometimes it covers 1000 pixels. This effect indeed does not really exist in language.

  • @yimingqu2403
    @yimingqu2403 2 years ago +6

    ICCV 2021 best paper

    • @AICoffeeBreak
      @AICoffeeBreak  2 years ago +3

      Really? You're attending?

    • @yimingqu2403
      @yimingqu2403 2 years ago +3

      @@AICoffeeBreak not me, but my colleagues at MSR

    • @AICoffeeBreak
      @AICoffeeBreak  2 years ago +3

      @æ›Č侀龣 Well then, congrats to the authors! 👏

  • @shinkai791
    @shinkai791 11 months ago +1

    A little bit like "local attention Transformer"?

  • @subhanshubansal4704
    @subhanshubansal4704 10 months ago +2

    Local Attention ? (Shifted Windows)

  • @erengurses123
    @erengurses123 1 year ago

    A 1920x1920 image has 120 image tokens with a patch size of 16x16???? At the very least, 120 should be the square of something.

  • @gauravlochab9614
    @gauravlochab9614 2 years ago +1

    Using DETR for face recognition

  • @CyrusVatankhah
    @CyrusVatankhah 1 year ago +1

    Can you get rid of the coffee bean? Or if it is your "brand", at least don't change/move it throughout the video. It is super distracting!

    • @AICoffeeBreak
      @AICoffeeBreak  1 year ago +1

      Thanks for sharing your feedback. We had this discussion in a video before, so I did a poll on this: czcams.com/users/postUgkxU0F0Y69SrC6HhZ6uD97gVxrANlH1CElk
      I personally am quite attached to her.