Why Does Diffusion Work Better than Auto-Regression?

  • Published 6 Jun 2024
  • Have you ever wondered how generative AI actually works? Well, the short answer is: in exactly the same way as regular AI!
    In this video I break down the state of the art in generative AI - Auto-regressors and Denoising Diffusion models - and explain how this seemingly magical technology is all the result of curve fitting, like the rest of machine learning.
    Come learn the differences (and similarities!) between auto-regression and diffusion, why these methods are needed to perform generation of complex natural data, and why diffusion models work better for image generation but are not used for text generation.
    The following generative models were featured as demos in this video:
    Images: Adobe Firefly (www.adobe.com/products/firefl...)
    Text: ChatGPT (chat.openai.com)
    Audio: Suno.ai (suno.ai)
    Code: Gemini (gemini.google.com/app)
    Video: Lumiere (Lumiere-video.github.io)
    Chapters:
    00:00 Intro to Generative AI
    02:40 Why Naïve Generation Doesn't Work
    03:52 Auto-regression
    08:32 Generalized Auto-regression
    11:43 Denoising Diffusion
    14:19 Optimizations
    14:30 Re-using Models and Causal Architectures
    16:35 Diffusion Models Predict the Noise Instead of the Image
    18:19 Conditional Generation
    19:08 Classifier-free Guidance

Comments • 209

  • @doku7335
    @doku7335 13 days ago +178

    At first I thought "oh, another random video explaining the same basics and not adding anything new", but I was so wrong. It's an incredibly clear explanation of diffusion, and starting with the basics makes the full picture much clearer. Thank you for the video!

    • @gonfpv
      @gonfpv 6 days ago +4

      You should check the rest of his videos. All are of sublime quality

    • @pvic6959
      @pvic6959 1 day ago +1

      > makes the full picture much clearer
      hehe did it help denoise

  • @jupiterbjy
    @jupiterbjy 15 days ago +99

    Kinda sorry to my professors and seniors, but this is the single best explanation of the logic behind each of these models. A dozen-minute video > 2 years of confusion in university.

  • @erfanasgari21
    @erfanasgari21 2 days ago +5

    This is literally the best explanation of the diffusion models I have ever seen.

  • @algorithmicsimplicity
    @algorithmicsimplicity 3 months ago +187

    Next video will be on Mamba/SSM/Linear RNNs!

    • @benjamindilorenzo
      @benjamindilorenzo 3 months ago

      Great! Also maybe think about the tradeoff between scaling and incremental improvements, in case your perspective is that LLMs also always approximate the data set and therefore memorize rather than exhibit any "emergent capabilities", so that ChatGPT also does "only" curve fitting.

    • @harshvardhanv3873
      @harshvardhanv3873 19 days ago +2

      I am a student pursuing a degree in AI, and we want more of your videos for even the simplest concepts in AI. Trust me, this channel will be a huge deal in the near future. Good luck!!

    • @QuantenMagier
      @QuantenMagier 9 days ago

      Well take my subscription then!!1111

    • @atishayjain1141
      @atishayjain1141 5 days ago

      Where did you learn all of this? Have you also tried coding these yourself?

  • @user-my3dd4lu2k
    @user-my3dd4lu2k 1 month ago +118

    Man, I love the fact that you present the fundamental idea with an intuitive approach, and then discuss the optimizations.

  • @jasdeepsinghgrover2470
    @jasdeepsinghgrover2470 25 days ago +37

    This is a much better explanation than the diffusion paper itself. They just went all around variational inference to get the same result!

  • @yqisq6966
    @yqisq6966 20 days ago +53

    The clearest and most concise explanation of a diffusion model I've seen so far. Well done.

  • @user-fh7tg3gf5p
    @user-fh7tg3gf5p 3 months ago +40

    This genius only makes videos occasionally, but they are not to be missed.

  • @pw7225
    @pw7225 14 days ago +16

    The way you tell the story is fantastic! I am surprised that all AI/ML books are so terrible at didactics. We should always start at the intuition, the big picture, the motivation. The math comes later when the intuition is clear.

    • @dustinandrews89019
      @dustinandrews89019 7 days ago +4

      I have seen the "math-first, intuition later or never" approach in a lot of teaching. High school and college math, physics and programming classes are rife with this approach. I agree it's sub-optimal for most students. I have some vague ideas about why this approach perpetuates itself, and I have seen a lot of gatekeeping around learning in a bottom-up way. It's lovely to see some educators like Algorithmic Simplicity and 3Blue1Brown break things down in a much more intuitive way that then allows us to understand the maths.

  • @poipoi300
    @poipoi300 1 day ago +1

    This is refreshing to watch in a sea of people who don't know what they're talking about and decide to make "educational" videos on the subject anyways. The simplifications are often harmful.

  • @rafa_br34
    @rafa_br34 22 days ago +23

    Such an underrated video, I love how you went from the basic concepts to complex ones and didn't just explain how it works but also the reason why other methods are not as good/efficient.
    I will definitely be looking forward to more of your content!

  • @themodernshoe2466
    @themodernshoe2466 2 days ago +1

    This has been on my watch later for 3 months. Finally got to watching it, glad I did. This is an exceptional explanation of the technologies at play here.

  • @Jack-gl2xw
    @Jack-gl2xw 19 days ago +16

    I have trained my own diffusion models and it required me to do a deep dive of the literature. This is hands down the best video on the subject and covers so much helpful context that makes understanding diffusion models so much easier. I applaud your hard work, you have earned a subscriber!

  • @Veptis
    @Veptis 13 days ago +7

    This is a great explanation of how image decoders work. I haven't seen this approach and narrative direction before.
    This is now my reference for explaining it to people who have no idea!

  • @riddhimanmoulick3407
    @riddhimanmoulick3407 1 day ago +3

    Kudos for an incredibly intuitive explanation! Really loved the visual representations too!!

  • @nasseral-bess564
    @nasseral-bess564 3 days ago +1

    This is actually one of the best, if not the best, deep-learning-related videos on YouTube.
    Thanks for your efforts

  • @RicardoRamirez-dr6gc
    @RicardoRamirez-dr6gc 20 days ago +10

    This is seriously one of the best explainer videos I've ever seen. I've spent a long time trying to understand diffusion models, and not a single video has come close to this one.

  • @pseudolimao
    @pseudolimao 13 days ago +22

    this is insane. I feel bad for getting this level of content for free

  • @HD-Grand-Scheme-Unfolds
    @HD-Grand-Scheme-Unfolds 24 days ago +8

    You truly understand how to simplify... to engage our imagination... to employ naive thoughts and ideas as comparisons that bring across deeper, more core principles and concepts, making the subject far easier to grasp and build an intuition for. Algorithmic Simplicity indeed... thank you for your style of presentation and teaching. Love it, love it... you make me realize what questions I want to ask but didn't know I wanted to ask. YouTube needs your contribution to ML education. Please don't forget that.

  • @Frdyan
    @Frdyan 11 days ago +4

    I have a graduate degree in this shit and this is by far the clearest explanation of diffusion I've seen. Have you thought about doing a video running over the NN Zoo? I've used that as a starting point for lectures on NN and people seem to really connect with that paradigm

  • @benjamindilorenzo
    @benjamindilorenzo 3 months ago +8

    Very good job.
    My suggestion is that you explain more about how it actually works: how the model learns to understand complete scenes just from text prompts.
    This could fill its own video.
    Also, it would be very nice to have a video about Diffusion Transformers, like OpenAI's Sora probably is.
    It would also be great to have a video about the paper "Learning in High Dimension Always Amounts to Extrapolation".
    Best wishes

    • @algorithmicsimplicity
      @algorithmicsimplicity 3 months ago +6

      Thanks for the suggestions, I was planning to make a video about why neural networks generalize outside their training set from the perspective of algorithmic complexity. That paper "Learning in High Dimension Always Amounts to Extrapolation" essentially argues that the interpolation vs extrapolation distinction is meaningless for high dimensional data, and I agree, I don't think it is worth talking about interpolation/extrapolation at all when explaining neural network generalization.

    • @benjamindilorenzo
      @benjamindilorenzo 3 months ago +2

      @@algorithmicsimplicity Yes, true. It would also be great because it links back to the LLM discussions: whether scaling up Transformers actually brings about "emergent capabilities", or whether this is more simply and less magically explainable by extrapolation.
      Or in other words: people tend to believe either that deep learning architectures like Transformers only approximate their training data set, or that seemingly unexplainable or unexpected capabilities emerge with scaling.
      I believe that extrapolation alone explains really well why LLMs work so well, especially when scaled up, AND that LLMs "just" approximate their training data (curve fitting). This is why I brought this up ;)

  • @lusayonyondo9111
    @lusayonyondo9111 3 hours ago +1

    wow, this is such an amazing resource. I'm glad I stuck around. This is literally the first time this is all making sense to me.

  • @TheTwober
    @TheTwober 17 hours ago +1

    The best explanation I have found on the internet so far. 👍

  • @GianlucaTruda
    @GianlucaTruda 7 days ago +3

    Holy shit, at 11:03 I suddenly realised what you were cooking! I've been trying to find a way to articulate this interesting relationship between autoregression and diffusion for ages (my thesis developed diffusion models for tabular data). This is such a brilliantly-visualised and intuitively explained video! Well done. And the classifier-free guidance explanation you threw in at the end has got to be some of the most high-ROI intuition pumping I've seen on CZcams.
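
For readers who want the formula behind the classifier-free guidance intuition mentioned above: the standard formulation from the classifier-free guidance literature (stated generally, not as this video's exact notation) blends the conditional and unconditional noise predictions with a guidance weight w:

```latex
\hat{\epsilon}(x_t, c) \;=\; \epsilon_\theta(x_t, \varnothing) \;+\; w\,\bigl(\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing)\bigr)
```

Here \epsilon_\theta(x_t, \varnothing) is the prediction with an empty condition, \epsilon_\theta(x_t, c) is the prediction conditioned on the prompt c, and w > 1 pushes samples more strongly toward the condition, trading diversity for fidelity.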

  • @karlnikolasalcala8208
    @karlnikolasalcala8208 18 days ago +4

    This channel is gold, I'm glad I've randomly stumbled across one of your vids

  • @banana_lemon_melon
    @banana_lemon_melon 16 days ago +1

    Bruh, I love your content. Other channels/videos usually explain general knowledge that can easily be found on the internet, but you go deeper into the intrinsic aspects of how the stuff works. This video, and your video about transformers, are really good.

  • @photamasan9661
    @photamasan9661 1 day ago +1

    You're him 🙌🏽. Thank you so much. Getting this kind of information, or such a good explanation, is not easy with all the "BREAKING AI NEWS!😮‼️" on YouTube now.

  • @MeriaDuck
    @MeriaDuck 10 days ago +1

    This must be one of the best and most concise explanations I've seen!

  • @londonl.5892
    @londonl.5892 7 days ago +1

    So glad this came across my recommended feed! Fantastic explanation and definitely cleared up a lot of confusion I had around diffusion models.

  • @CodeMonkeyNo42
    @CodeMonkeyNo42 16 days ago

    Great video. Love the pacing and how you distilled the material into such an easy-to-watch video. Great job!

  • @justanotherbee7777
    @justanotherbee7777 3 months ago +3

    A person with very little background can understand what he describes here. Commenting to help the YouTube algorithm so it gets recommended to others.
    Wonderful video! Really good one.

  • @neonelll
    @neonelll 1 day ago +1

    The best explanation I've seen. Great work.

  • @Matyanson
    @Matyanson 15 days ago +3

    Thank you for the explanation. I already knew a little bit about diffusion, but this is exactly the way I'd hope to learn: start from the simplest examples (usually historical) and progressively advance, explaining each optimisation!

  • @updated_autopsy_report
    @updated_autopsy_report 6 days ago +1

    I really enjoyed this video!! Took a lot of notes while watching it too. You have a god-tier ability to explain concepts in an easy-to-follow way.

  • @kkordik
    @kkordik 3 days ago +1

    Bro, this is amazing!!! Your explanation is so clear, I like it.

  • @shivamkaushik6637
    @shivamkaushik6637 9 days ago

    Never knew YouTube could randomly suggest videos like these. This was mind-blowing. The way you teach is a work of art.

  • @ecla141
    @ecla141 11 days ago +2

    Awesome video! I would love to see a video about graph neural networks

  • @mrdr9534
    @mrdr9534 15 days ago +1

    Thanks for taking the time and effort to make and share these videos and your knowledge.
    Kudos and best regards

  • @xaidopoulianou6577
    @xaidopoulianou6577 19 days ago +1

    Very nicely and simply explained! Keep it up

  • @JordanMetroidManiac
    @JordanMetroidManiac 15 days ago +1

    I finally understand how models like Stable Diffusion work now! I tried understanding them before but got lost at the equation (17:50), but this video describes that equation very simply. Thank you!

  • @abdelhakkhalil7684
    @abdelhakkhalil7684 17 days ago +1

    This was a good watch, thank you :)

  • @iestynne
    @iestynne 15 days ago +1

    Wow, fantastic video. Such clear explanations. I learned a great deal from this. Thank you so much!

  • @akashmody9954
    @akashmody9954 3 months ago +2

    Great video....already waiting for your next video

  • @jcorey333
    @jcorey333 3 months ago +7

    This is an amazing quality video! The best conceptual video on diffusion in AI I've ever seen.
    Thanks for making it!
    I'd love to see you cover RNNs.

  • @iancallegariaragao
    @iancallegariaragao 3 months ago +2

    Great video and amazing content quality!

  • @istoleyourfridgecall911
    @istoleyourfridgecall911 10 hours ago +1

    Hands down the best video that explains how these models work. I love that you explain these topics in a way that resembles how the researchers created these models. Your video shows the thinking process behind these models, and combined with great animated examples, it is so easy to understand. You really went all out. If only YouTube promoted these kinds of videos instead of brainrot, low-quality videos made by inexperienced teenagers.

  • @user-yj3mf1dk7b
    @user-yj3mf1dk7b 18 days ago +1

    Nice explanations, although I already knew about diffusion. The examples building up from the simplest model to the final diffusion model were a really nice touch.

  • @RezaJavadzadeh
    @RezaJavadzadeh 2 days ago +1

    Such complete explanations, keep it up. Thank you!

  • @RobotProctor
    @RobotProctor 20 days ago +2

    I like to think of ML as a funky calculator. Instead of a calculator where you give it inputs and an operation and it gives you an output, you give it inputs and outputs and it gives you an operation.
    You said it's like curve fitting, which is the same thing, but I like thinking the words funky calculator because why not

  • @zlatanonkovic2424
    @zlatanonkovic2424 1 day ago +1

    What a great explanation!

  • @vidishapurohit4709
    @vidishapurohit4709 3 days ago +1

    very nice visual explanations

  • @marcusbluestone2822
    @marcusbluestone2822 7 days ago +1

    Brilliant explanation. Thank you very much

  • @anatolyr3589
    @anatolyr3589 2 months ago +1

    Great explanation! 👍👍 I personally would like to see a video surveying all major types of neural nets with their distinctions, specifics, advantages, disadvantages, etc. The author explains very well 👏👏

  • @vasil_astrov
    @vasil_astrov 1 day ago +1

    Thank you! This is a great explanation ❤

  • @user-er9pw4qh6j
    @user-er9pw4qh6j 27 days ago +2

    Soooo Good!!! Thanks for making it!!!!

  • @art4eigen93
    @art4eigen93 2 days ago +1

    So simple ! Thank you.

  • @sanjeev.rao3791
    @sanjeev.rao3791 11 days ago +1

    Wow, that was a fantastic explanation.

  • @abhijeetvishwasrao
    @abhijeetvishwasrao 5 days ago +1

    Awesome explanation 👏

  • @1.4142
    @1.4142 3 months ago +4

    SoME2 really brought out some good channels.

  • @tkimaginestudio
    @tkimaginestudio 10 days ago +1

    Great explanations, thank you!

  • @khangvutien2538
    @khangvutien2538 18 days ago

    Thank you very much.
    I enjoyed the first part, the first 10 seconds.
    After that, there are too many shortcuts in the explanations, and I struggled to understand and be able to explain it again to myself. Still, I subscribed.
    As for suggestions for other videos, I'll check whether you have explained the U-Net already. If not, I'd appreciate the same kind of explanation of it.

  • @alaad1009
    @alaad1009 4 days ago +1

    Amazing video !

  • @Mhrn.Bzrafkn
    @Mhrn.Bzrafkn 20 days ago +3

    It was so easy to understand 👌🏻👌🏻

  • @atifadib
    @atifadib 6 days ago +1

    great video... loved it!

  • @ShubhamSinghYoutube
    @ShubhamSinghYoutube 10 days ago +1

    Love the conclusion

  • @RobotProctor
    @RobotProctor 20 days ago +1

    Thank you. This video is wonderful

  • @sobhhi
    @sobhhi 16 days ago +2

    I think it would help to mention that the auto-regressors may be viewing the image as a sequence of pixels (RGB vectors). Overall excellent video, extremely intuitive.

    • @algorithmicsimplicity
      @algorithmicsimplicity 16 days ago +1

      In general, auto-regressors do not view images as a sequence. For example, PixelCNN uses convolutional layers and treats inputs as 2d images. Only sequential models such as recurrent neural networks would view the image as a sequence.
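
To make the PixelCNN point concrete, here is a minimal PyTorch sketch (my own illustration, not code from the video or the PixelCNN paper) of a masked convolution: the layer stays a plain 2D convolution, but the kernel is zeroed over the current pixel and everything after it in raster order, so each output depends only on already-generated pixels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedConv2d(nn.Conv2d):
    """PixelCNN-style masked convolution. Mask type 'A' also hides the centre pixel."""
    def __init__(self, in_ch, out_ch, kernel_size, mask_type="A", **kwargs):
        super().__init__(in_ch, out_ch, kernel_size, padding=kernel_size // 2, **kwargs)
        kH, kW = self.weight.shape[-2:]
        mask = torch.ones(kH, kW)
        mask[kH // 2, kW // 2 + (mask_type == "B"):] = 0  # centre row: hide centre and right
        mask[kH // 2 + 1:, :] = 0                          # hide all rows below the centre
        self.register_buffer("mask", mask)

    def forward(self, x):
        return F.conv2d(x, self.weight * self.mask, self.bias,
                        self.stride, self.padding, self.dilation, self.groups)

# Each output pixel only sees pixels above it or to its left, so one forward pass
# over the 2D image yields every pixel's conditional distribution at once.
layer = MaskedConv2d(3, 64, kernel_size=5, mask_type="A")
out = layer(torch.randn(1, 3, 32, 32))  # -> (1, 64, 32, 32)
```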

    • @sobhhi
      @sobhhi 16 days ago

      @@algorithmicsimplicity of course, but I feel mentioning it may help with intuition as you’re walking through pixel by pixel image generation

  • @hmmmza
    @hmmmza 3 months ago +3

    What great, rare content!

  • @gabrielgraf2521
    @gabrielgraf2521 5 days ago +2

    Wow, what a good explanation. I was always wondering how these big NNs like ChatGPT and DALL-E work. Thank you.

  • @joaosousapinto3614
    @joaosousapinto3614 23 days ago +1

    Great video, congrats.

  • @oculuscat
    @oculuscat 19 days ago +4

    Diffusion doesn't necessarily work better than auto-regression. The "Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction" paper introduces an architecture they call VAR that autoregressively predicts an image coarse-to-fine, one resolution scale at a time, and on their benchmarks it currently outperforms diffusion models in both speed and image quality.

  • @vijayaveluss9098
    @vijayaveluss9098 18 days ago +1

    Great explanation

  • @paaabl0.
    @paaabl0. 16 days ago

    Great video! It focuses on the right elements.

  • @johnbolt2686
    @johnbolt2686 16 days ago

    I would recommend reading about active inference to possibly understand the role of generative models in intelligence.

  • @zephilde
    @zephilde 22 days ago +3

    Great visualisation! Good job!
    Maybe next video on LoRA or ControlNet ?

  • @pon1
    @pon1 19 days ago +1

    Still feels like magic to me 🙌🙌

  • @demohub
    @demohub 17 days ago +1

    Just subscribed. Great video

  • @ollie-d
    @ollie-d 12 days ago +1

    Solid video!

  • @meanderthalensis
    @meanderthalensis 19 days ago +1

    Great video!

  • @mojtabavalipour
    @mojtabavalipour 16 days ago +1

    Well done!

  • @marcinstrzesak346
    @marcinstrzesak346 25 days ago +1

    Very good video. Thank you

  • @aaronhandleman7277
    @aaronhandleman7277 2 days ago +1

    A paper about doing autoregression with images that seems to work pretty well dropped after this video - would be interested in your thoughts:
    Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction

    • @algorithmicsimplicity
      @algorithmicsimplicity 2 days ago

      Yep I read that paper recently. Seems like a really solid idea: instead of using noise to remove information, down sample (i.e. blur) the image to remove information. This also has the property that it removes information from everywhere in the image, so it should give near optimal compute-vs-quality trade off, but it has the advantage that the image size is smaller for the generation steps. I would wait to see a few more reproductions of it before claiming that it is better than diffusion, though.
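
A toy sketch of that "remove information by downsampling instead of adding noise" idea (my own illustration of the concept, not the VAR paper's actual tokenizer or training code):

```python
import torch
import torch.nn.functional as F

def scale_pyramid(img, sizes=(4, 8, 16, 32)):
    """Coarse-to-fine versions of `img` (B, C, 32, 32). Downsampling removes
    information evenly across the whole image, much like adding noise does in
    diffusion, but the early generation steps operate on much smaller images."""
    return [F.adaptive_avg_pool2d(img, s) for s in sizes]

img = torch.randn(8, 3, 32, 32)      # stand-in batch of training images
pyr = scale_pyramid(img)             # [4x4, 8x8, 16x16, 32x32]

# One (input, target) pair per generation step: upsample the coarser scale and
# train a model to predict the next, sharper scale from it.
pairs = [(F.interpolate(lo, size=hi.shape[-1], mode="bilinear", align_corners=False), hi)
         for lo, hi in zip(pyr[:-1], pyr[1:])]
```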

  • @anthonybernstein1626
    @anthonybernstein1626 1 month ago +3

    I had a good idea how diffusion models work but I still learned a lot from this video. Thanks!

  • @ArtOfTheProblem
    @ArtOfTheProblem 22 days ago +1

    great work

  • @AurL_69
    @AurL_69 20 days ago +1

    thanks for explaining

  • @robosergTV
    @robosergTV 5 days ago +1

    good stuff, thanks

  • @hjups
    @hjups 3 months ago +1

    Do you have a citation that supports your claim for eps vs x0 prediction?
    It's true that the first sampling step with x0 tends to produce a blurry / averaged result, but that's a result of the loss function used when training DDPMs. If you were to use something more complex or another NN, then you'd have a GAN, which doesn't produce blurry or averaged results on a single forward pass.
    Also, if you examine the output of x0 = noise - eps for the first step, it's both mathematically and visually equivalent to the first x0 prediction sample - a blurry / averaged result. The same thing is also true when predicting velocity, but velocity is arguably harder for a network to predict due to the phase transition.
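
For anyone following this thread, the standard DDPM forward process gives the textbook relation between the two parameterizations (stated here for context; this is the usual DDPM formulation rather than the video's exact notation):

```latex
x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon
\qquad\Longrightarrow\qquad
\hat{x}_0 = \frac{x_t - \sqrt{1-\bar{\alpha}_t}\,\hat{\epsilon}}{\sqrt{\bar{\alpha}_t}}
```

So the two targets carry the same information in principle; near t = T, where \bar{\alpha}_t is close to 0, the implied \hat{x}_0 is highly uncertain, which is why a first-step x0 estimate trained with an MSE loss looks blurry/averaged.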

  • @jayantdubey3025
    @jayantdubey3025 2 days ago +1

    In your neural network animations, the traveling highlight starts from the image, goes through the neural net, then to the output pixel. I understand this as information traveling forward. When the highlights reverse direction, does this represent back propagation at the regressed value of the pixel? Great video by the way!

    • @algorithmicsimplicity
      @algorithmicsimplicity 2 days ago

      Yep it's just meant to demonstrate the weights in the network changing based on the error in the predicted value.

  • @recklessroges
    @recklessroges 12 days ago +1

    Could you explain why the YOLO object detection model is/was so effective? Thank you.

  • @ChristProg
    @ChristProg 24 days ago +1

    Thank you so much, sir. Really interesting video. But I would like you to create a video on how the generative model uses the text prompt during training. Thank you, sir. I subscribed! 😊

  • @winstongraves8321
    @winstongraves8321 19 days ago +1

    Great video

  • @IceMetalPunk
    @IceMetalPunk 18 days ago +1

    And the newest/upcoming models seem to be tending more towards diffusion Transformers, which from my understanding is effectively a Transformer autoencoder with a diffusion model plugged in, applying diffusion directly to the latent space embeddings. Is that correct?

  • @morrisdehaan6679
    @morrisdehaan6679 7 days ago +1

    So good!

  • @ralusek
    @ralusek 3 days ago +1

    What are you using to make these visualizations? Great video.

    • @algorithmicsimplicity
      @algorithmicsimplicity 3 days ago

      I am using a mix of Manim (for rendering LaTeX) and my own 3D renderer written in PyTorch.

  • @craftydoeseverything9718

    This was genuinely such a great video. I honestly feel like I could come away from this video and implement an image generator myself :) /gen

  • @IsaOzer-lx7sn
    @IsaOzer-lx7sn 9 days ago +2

    I want to learn more about the causal architecture idea for auto regressors, but I can't seem to find anything about them anywhere. Do you know where I can read more about this topic?

    • @algorithmicsimplicity
      @algorithmicsimplicity 9 days ago +1

      I haven't seen any material that covers them really well. There are basically 2 types of causal architectures, causal CNNs and causal transformers, with causal transformers being much more widely used in practice now. Causal transformers are also known as "decoder-only transformers" ("encoders" use regular self-attention layers, "decoders" use causal self-attention). If you search for encoder vs decoder-only transformers you should find some resources that explain the difference.
      Basically, to make a self-attention layer causal you mask the attention scores (i.e. set some to 0), so that words can only attend to words that came before them in the input. This makes it so that every word's vector only contains information from before it. This means you can use every word's vector to predict the word that comes after it, and it will be a valid prediction because that word's vector never got to attend to (i.e. see) anything after it. So, it is as if you had applied the transformer to every subsequence of input words, except you only had to apply it once.
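
A minimal PyTorch sketch of that masking step (illustrative only, not any particular library's implementation): scores for future positions are set to -inf before the softmax, which zeroes their attention weights.

```python
import torch
import torch.nn.functional as F

def causal_self_attention(x, Wq, Wk, Wv):
    """x: (batch, seq_len, dim). Each position attends only to itself and to
    earlier positions, so position i's output summarises tokens 0..i and can
    be used to predict token i+1."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)        # (B, T, T)
    T = x.shape[1]
    future = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(future, float("-inf"))             # block attention to the future
    return F.softmax(scores, dim=-1) @ v                           # (B, T, dim)

# toy usage
B, T, D = 2, 5, 16
x = torch.randn(B, T, D)
Wq, Wk, Wv = (torch.randn(D, D) for _ in range(3))
out = causal_self_attention(x, Wq, Wk, Wv)                         # (2, 5, 16)
```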

  • @iwaniw55
    @iwaniw55 9 days ago +1

    Hi @algorithmicsimplicity, I am curious which papers/materials you referenced for the generalized auto-regressor? I cannot seem to find any info on using randomly spaced-out pixels to predict the next batch of pixels. Any help would be appreciated. Also, great videos!!!

    • @algorithmicsimplicity
      @algorithmicsimplicity 9 days ago +1

      It is more widely known as "any-order autoregression", see e.g. this paper arxiv.org/abs/2205.13554
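
A toy sketch of the training setup described in that paper and in the generalized auto-regression section of the video, as I understand it: reveal a random subset of pixels and train the model to predict the hidden ones (the `model` referenced in the final comment line is hypothetical, just to show the data flow).

```python
import torch

def any_order_training_batch(images):
    """images: (B, C, H, W). Reveal a random fraction of pixels per image and
    return (masked input, visibility mask, original) so a model can be trained
    to predict the hidden pixels from the randomly spaced visible ones."""
    B, C, H, W = images.shape
    keep_fraction = torch.rand(B, 1, 1, 1)                    # per-image amount of context
    mask = (torch.rand(B, 1, H, W) < keep_fraction).float()   # 1 = pixel is visible
    return images * mask, mask, images

images = torch.randn(8, 3, 32, 32)                            # stand-in training batch
masked, mask, target = any_order_training_batch(images)
# loss = (((model(masked, mask) - target) ** 2) * (1 - mask)).mean()  # score hidden pixels only
```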

    • @iwaniw55
      @iwaniw55 9 days ago

      @@algorithmicsimplicity Thank you so much! This is exactly what I was missing.

  • @alex65432
    @alex65432 3 months ago +1

    Can you make a video about the loss landscape? Like, what effects do different weight inits, optimizers, or architectures like ResNet have?

    • @algorithmicsimplicity
      @algorithmicsimplicity 3 months ago

      Thanks for the interesting suggestion! I was already planning to do a video about why neural networks generalize outside of their training set, I should be able to talk about the loss landscape in that video.

  • @infographie
    @infographie 17 days ago +1

    Excellent.

  • @yk4r2
    @yk4r2 10 days ago +2

    Hey, could you kindly recommend more on causal architectures?

    • @algorithmicsimplicity
      @algorithmicsimplicity 10 days ago

      I haven't seen any material that covers them really well. There are basically 2 types of causal architectures, causal CNNs and causal transformers, with causal transformers being much more widely used in practice now. Causal transformers are also known as "decoder-only transformers" ("encoders" use regular self-attention layers, "decoders" use causal self-attention). If you search for encoder vs decoder-only transformers you should find some resources that explain the difference.
      Basically, to make a self-attention layer causal you mask the attention scores (i.e. set some to 0), so that words can only attend to words that came before them in the input. This makes it so that every word's vector only contains information from before it. This means you can use every word's vector to predict the word that comes after it, and it will be a valid prediction because that word's vector never got to attend to (i.e. see) anything after it. So, it is as if you had applied the transformer to every subsequence of input words, except you only had to apply it once.