Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention (Paper Explained)

  • Published on 29. 08. 2024

Comments • 82

  • @dingleberriesify
    @dingleberriesify 4 years ago +35

    The one thing that irks me is kernels keep being referred to as "mapping to infinite dimensional space" in these vids. While true, it's kind of unhelpful. It's probably more helpful to think of it as mapping to a (potentially infinite) space where the (potentially infinite) bases aren't restricted to be vectors. The whole trick about the Hilbert space is it's just a space with a well-defined inner product, and that inner product can be between functions etc. You could have a function mapping to a space where the bases are the sin and cosine functions, for example, and you use that space to take the inner product between different sine/cosine wave combinations. Typically the kind of space you want to map to will express something about the problem you're trying to solve.
    I bring this up because I've seen kernels mentioned twice now, and both times the explanations are kind of lacking. Not Yannic's fault, I was lucky enough to spend a bunch of time around kernel gurus that cleared up a bunch of misconceptions for me...before that, I was making similar mistakes. Content otherwise very solid, as usual!
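    To make the inner-product-in-feature-space idea concrete with a finite-dimensional example (a sketch: the feature map below is the standard one for this particular kernel, and the input numbers are arbitrary):

    ```python
    import numpy as np

    # Explicit feature map phi for the homogeneous polynomial kernel
    # k(x, y) = (x . y)^2 on R^2. The kernel value equals an ordinary
    # inner product after mapping both inputs through phi.
    def phi(x):
        return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

    x = np.array([1.0, 2.0])
    y = np.array([3.0, 0.5])

    k_direct = np.dot(x, y) ** 2        # evaluate the kernel directly
    k_feature = np.dot(phi(x), phi(y))  # inner product in feature space

    print(np.isclose(k_direct, k_feature))  # True
    ```

    The RBF kernel plays the same game with an infinite-dimensional phi, which is why it can only ever be evaluated through the kernel function itself.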

    • @YannicKilcher
      @YannicKilcher  4 years ago +5

      Thank you :)

    • @zhonghuahe3556
      @zhonghuahe3556 3 years ago

      Hi, can you recommend some learning materials about "kernels"?

    • @brandomiranda6703
      @brandomiranda6703 2 years ago

      MIT's 9.520 with Lorenzo Rosasco ;)

    • @CosmiaNebula
      @CosmiaNebula 2 years ago +1

      This is a note for the mathematicians:
      When people say "vector" in computer science, they mean "an array with finitely many entries". In other words, they mean a function from {1, 2, ..., n} to some set X. This is a severe generalization/weakening of the typical vectors that we mean, where a vector lives in a vector space, where there are all kinds of symmetries (the group GL(n, K)).
      So when Alex Stenlake says "they aren't restricted to be vectors", he meant "they aren't restricted to be a finite-entry array". They are in fact, *exactly* what we mean when we say "vectors in a countable, separable vector space", in other words, any vector space we use in functional analysis. So as we see, this is a severe strengthening of a severe weakening (first, the CS scientist weakens the mathematician's idea of finite dimensional vector space, then they strengthen it to return to the mathematician's idea), which is why Alex had to point it out.

    • @andrewgjkgjk
      @andrewgjkgjk 2 months ago

      This probably comes from the idea that, practically, these bases are computed via an infinite polynomial series that can only be approximated in compute using techniques like random Fourier sampling, no?

  • @NicheAsQuiche
    @NicheAsQuiche 4 years ago +61

    I love that you go over and explain the topics each time you introduce them (e.g. transformers). Even though you cover them in other videos, it's so useful to me to hear different explanations and really clarify what's going on inside. Really helps me get a better understanding.

  • @utku_yucel
    @utku_yucel 4 years ago +14

    The idea of explaining the papers is awesome! Thanks!

  • @florianhonicke5448
    @florianhonicke5448 4 years ago +17

    I like that you explain some stuff again in different videos. It makes sure everyone is on track, and if someone is annoyed by the redundancy, they can just fast-forward.
    To me it was very helpful to get multiple explanations of transformers.

  • @good_user35
    @good_user35 4 years ago +9

    At first I thought your videos were a bit lengthy, but after watching this I now feel that I wouldn't be able to understand the contents without sacrificing that amount of time. And what you point out in the paper similarly tells me that the cost of the softmax function is the price to pay for what that function can do on an unknown task. Thanks, it was a good watch.

    • @YannicKilcher
      @YannicKilcher  4 years ago +1

      also, I suggest just changing the playback speed to 2x if it's too long :)

  • @darkmythos4457
    @darkmythos4457 4 years ago +15

    Thanks Yannic, soon enough you will be one of the requirements for ML jobs; Published at Neurips, ICML, and Yannic YT Channel.

    • @YannicKilcher
      @YannicKilcher  4 years ago +6

      Haha :D problem is I'm very corrupt so idk how that's going to work out ;)

  • @fotisj321
    @fotisj321 4 years ago +3

    Thanks for making a difficult paper accessible. I especially like that there are red threads in your selection of papers, which create almost something like a story-line - for example the one starting with the paper on the transformer architecture and the follow-up papers trying to remedy some of its shortcomings while keeping its achievements.

  • @andyblue953
    @andyblue953 4 years ago

    Great explanation. I really like the way you break down publications to the most essential parts. Never was "reading" articles that comfortable.

  • @veedrac
    @veedrac 4 years ago +8

    The natural question I have after this paper: why not both? Test a model with X% compute allocated towards ‘real’ attention and 1-X% towards linear attention, and see what's best.

    • @GeekProdigyGuy
      @GeekProdigyGuy 4 years ago +2

      You can always ensemble any two models, but unless they provide independent contributions, there isn't really any point. Plus in this case their proposed model's main advantage is reduced compute/memory, so it would seem entirely counterproductive to ensemble with plain attention.

    • @veedrac
      @veedrac 4 years ago +4

      @@GeekProdigyGuy The idea is that traditional attention is high quality, but struggles at reaching a large context. Clearly that high quality attention is necessary on short scales, but only a small reduction in the traditional attention would be needed to allow for a linear attention that reached very large context sizes. Assuming the far context needs less character-level detail (same thesis as Compressive Transformers), the lossier linear attention should suffice there.

    • @lucidraisin
      @lucidraisin 4 years ago +6

      Veedrac - I actually explore that here github.com/lucidrains/linear-attention-transformer

  • @OwenCampbellMoore
    @OwenCampbellMoore 4 years ago +4

    These are amazing, love the depth and really appreciate the amount of content!
    I know you’ve said you’re not planning on it, but I’d be super happy to support on Patreon if that helps support you in making so much content!

  • @that_guy4690
    @that_guy4690 4 years ago +1

    Your videos are pretty long, but they are really not long enough :). Each time the video ends, I get a feeling that I need an even deeper explanation e.g. their code review

  • @blue_bear_music
    @blue_bear_music 4 years ago +5

    Cool charts at 26:20. Interesting that Reformer is slower than the regular transformer at shorter sequences. You wouldn't know that just from the asymptotic time complexity. Looks like the paper conveniently omits that :)
    Linear attention seems super fast tho.
    Also it's probably obvious, but a two times higher slope on a log-log chart corresponds to squaring the function on a linear chart.

    • @YannicKilcher
      @YannicKilcher  4 years ago

      Yep, I'm just a moron when it comes to non-linear charts :D
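      For anyone double-checking the slope remark above, a quick computation (a sketch):

      ```python
      import math

      # On a log-log chart the slope is d(log y) / d(log x), so y = x^2 appears
      # as a straight line with slope 2: twice the slope of y = x.
      x1, x2 = 10.0, 100.0
      slope = (math.log(x2 ** 2) - math.log(x1 ** 2)) / (math.log(x2) - math.log(x1))
      print(slope)  # 2.0
      ```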

  • @sitioprueba2855
    @sitioprueba2855 2 years ago

    Thank you so much for all these great explanations!

  • @daryoushmehrtash7601
    @daryoushmehrtash7601 3 years ago

    I appreciate your comment about viewing different neural network architectures as different routings of information.

  • @davidsabatini9100
    @davidsabatini9100 2 years ago

    A really good video! I found it super informative, thank you!

  • @tomw4688
    @tomw4688 3 years ago

    Good shit! This video was very helpful and many ELI5 moments useful for students. Thanks!

  • @noreddine
    @noreddine 3 years ago +1

    Yannic: just let me know what you think.
    Me: I think you're awesome

  • @DanielHesslow
    @DanielHesslow 4 years ago +2

    This is super cool and I think we will see the performance improve down the line as people explore other feature mappings, elu + 1 seems a biiit arbitrary

  • @snippletrap
    @snippletrap 4 years ago +1

    Re: causal masking implementation @38:00, input[i] != target[i] because the target is shifted by one before being passed into the model.
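    A tiny sketch of the shift being described (the token strings are made up for illustration):

    ```python
    # At position i the model sees inputs[i] and is trained to predict
    # targets[i], which is the *next* token in the sequence.
    tokens = ["<s>", "the", "cat", "sat", "</s>"]
    inputs = tokens[:-1]   # what goes into the model
    targets = tokens[1:]   # what each position must predict
    print(list(zip(inputs, targets)))
    ```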

  • @welcomeaioverlords
    @welcomeaioverlords 6 months ago

    I'm very late, but well done. Good call out on the RNN caveat! I missed that. And one correction (or I misunderstand something): it seems that we're not actually going to a higher dimensional space through phi, but rather applying a point-wise non-linearity and maintaining the same dimensionality. Any comments on that choice? That's seemingly contrary to the point of the kernel trick as I understood it. Why not do the typical DL thing and simply make phi a learnable function?

  • @MrMIB983
    @MrMIB983 4 years ago +1

    Love it! Universal transformer please!

  • @kiyoonyoo4761
    @kiyoonyoo4761 4 years ago +1

    Thanks Yannic for the explanation.
    I was actually having some difficulty understanding how they used the associative property to go from Eq. 4 to Eq. 5. Glad to see you've already made a video!
    This wasn't explained in detail in the video either, maybe because it's trivial, but am I the only one who can't see why the two formulations (Eq. 4 and Eq. 5) are equivalent?
    The dimensions of the numerators definitely don't look the same: the former is a linear combination of column vectors, while the latter is a row vector multiplied by a square matrix. Also, the associative property does not hold in general for vector multiplications involving dot products.
    Any help or explanation from anyone would be very appreciated :)

    • @YannicKilcher
      @YannicKilcher  4 years ago

      Sometimes people are very loose with row/column vectors and don't properly apply transposes. Can you find a counter-example where the equation doesn't hold (even when glossing over transpose issues)?

    • @norik1616
      @norik1616 3 years ago

      @@YannicKilcher I had a hard time with the indices as well, so I wrote it out in numpy.
      It doesn't help with transpositions (numpy matches them *mostly correctly* automatically), but it helps with where to put which product - the equation holds true.

      import numpy as np

      shape = 5, 7
      K = np.random.rand(*shape)
      Q = np.random.rand(*shape)
      V = np.random.rand(*shape)

      # stripped-down classic attention (no softmax, no scaling)
      classic_attention = np.matmul(np.matmul(Q, K.T), V)
      assert classic_attention.shape == shape

      # Eq. 4: for each query i, sum over j of (q_i . k_j) * v_j
      res = []
      for i in range(shape[0]):
          res.append(sum(np.matmul(Q[i], K[j]) * V[j] for j in range(shape[0])))
      matmul_byhand_attention = np.array(res)
      print(np.allclose(classic_attention, matmul_byhand_attention))

      # Eq. 5: precompute sum_j outer(k_j, v_j) once (associativity and
      # distributivity), then multiply each query into it
      s = sum(np.outer(K[j], V[j]) for j in range(shape[0]))
      res = []
      for i in range(shape[0]):
          res.append(np.matmul(Q[i], s))
      matmul_byhand_attention2 = np.array(res)
      print(np.allclose(classic_attention, matmul_byhand_attention2))

  • @Erotemic
    @Erotemic 4 years ago +3

    That is quite an impressive speed increase, and the ability to perform RNN like inference is incredibly useful.
    I wish they had investigated other kernel functions (or maybe they did and I missed it) - I'd be interested in seeing how the choice of kernel impacts the results.
    Am I understanding the RNN autoregressive-transformer equivalence correctly? Is it the case that translation transformer models are not RNNs?

    • @YannicKilcher
      @YannicKilcher  4 years ago

      only transformers that are using the specific causal masking are RNNs

  • @Kerrosene
    @Kerrosene 4 years ago +1

    It seems like the closer we get to approximating the softmax operator with a finite-dimensional feature map, the better the performance of this linear transformer - which is what they suggest in the conclusion. Also, they don't use positional encodings.

  • @raminrasoulinezhad
    @raminrasoulinezhad 4 years ago

    Cool description. Thanks

  • @herp_derpingson
    @herp_derpingson 4 years ago +2

    12:33 Adding all the K[i] terms together means there is no way for the transformer to know the position of words in the sentence. Maybe the positional embeddings are helping it somehow. It would be interesting to see the performance of this transformer without the positional embeddings. Then again, RNNs don't have any explicit mechanism to store positions of words either.

    • @YannicKilcher
      @YannicKilcher  4 years ago +1

      True, the positional embeddings probably do a lot of work here. RNNs don't have explicit position, but they can theoretically learn to count the number of input tokens - or what's probably more relevant - learn to estimate distances between tokens, which is another thing that position embeddings in transformers are good for.

  • @samm9840
    @samm9840 4 years ago +1

    How about applying the kernel (O(n)) trick at inference time only, after training?
    I'm especially interested in whether the DETR model (Yannic already did a video on that), which uses a heavyweight transformer, could be modified to do object detection with this linear version. Maybe it's an idea for a new paper. Looking forward to it.

  • @muneebth5508
    @muneebth5508 4 years ago +1

    Sir, Thank You

  • @lucasnestler6933
    @lucasnestler6933 4 years ago +1

    As the paper states that transformers attending to one side only are equal to RNNs, doesn't it imply that full transformers are equal to residual bidirectional RNNs?

    • @YannicKilcher
      @YannicKilcher  4 years ago

      Only in an abstract sense. The computations are really different from a bidirectional RNN

  • @zhangcx93
    @zhangcx93 4 years ago +3

    According to their code, the main computation is like this:
    KV = torch.einsum("nshd,nshm->nhmd", K, values)
    Z = 1/(torch.einsum("nlhd,nhd->nlh", Q, K.sum(dim=1))+self.eps)
    V = torch.einsum("nlhd,nhmd,nlh->nlhm", Q, KV, Z)
    can someone translate this einsum into a normal pytorch way?

    • @eelcohoogendoorn8044
      @eelcohoogendoorn8044 4 years ago +3

      Better off just reading up on einsum and getting familiar with it. it is by far the most readable and intuitive way of expressing these statements; and I say that as a numpy veteran of 15 years.

    • @YannicKilcher
      @YannicKilcher  4 years ago +1

      yea the problem is that would translate to multiple matmuls, elementwise products, transposes and summations :D I also suggest you dive into einsum notation

    • @markdaoust4598
      @markdaoust4598 3 years ago

      I came to ask "what's the einsum for 21:03?" Thanks!
      So, dropping the "batch" and "head" indices, it's:
      out_lm = (sum_sd Q_ld K_sd V_sm) / (sum_sd Q_ld K_sd)
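      For readers who'd like the einsums spelled out, here is one possible translation into plain transposes and matmuls (a numpy sketch with made-up shapes; the variable names mirror the snippet above):

      ```python
      import numpy as np

      rng = np.random.default_rng(0)
      N, S, L, H, D, M = 2, 5, 5, 3, 4, 6  # batch, key len, query len, heads, key dim, value dim
      K = rng.random((N, S, H, D))
      Q = rng.random((N, L, H, D))
      values = rng.random((N, S, H, M))
      eps = 1e-6

      # einsum version, mirroring the PyTorch snippet above
      KV = np.einsum("nshd,nshm->nhmd", K, values)
      Z = 1 / (np.einsum("nlhd,nhd->nlh", Q, K.sum(axis=1)) + eps)
      out_einsum = np.einsum("nlhd,nhmd,nlh->nlhm", Q, KV, Z)

      # the same computation spelled out with transposes and matmuls
      Kt = np.transpose(K, (0, 2, 3, 1))        # (N, H, D, S)
      Vt = np.transpose(values, (0, 2, 1, 3))   # (N, H, S, M)
      KV2 = Kt @ Vt                             # (N, H, D, M): sum over s of k_s v_s^T
      Qt = np.transpose(Q, (0, 2, 1, 3))        # (N, H, L, D)
      num = Qt @ KV2                            # (N, H, L, M)
      den = Qt @ K.sum(axis=1)[..., None] + eps # (N, H, L, 1)
      out_matmul = np.transpose(num / den, (0, 2, 1, 3))  # back to (N, L, H, M)

      print(np.allclose(out_einsum, out_matmul))  # True
      ```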

  • @ShokhanBirlikov
    @ShokhanBirlikov 4 years ago

    Your intuition at 31:40 is great! Maybe a silly question, but is an architecture possible where the transformation function phi is learned as an N x (# of intermediate units) matrix of trainable parameters?

  • @rayaengazou2140
    @rayaengazou2140 4 months ago

    Hey! That's great! I would like to ask about the model used to produce the results in the paper - on GitHub there is only the library, not the model. Can you share the model with us?

  • @andyblue953
    @andyblue953 4 years ago

    I was wondering to what extent their idea is related to A² double-attention networks. In the latter, the outer product is likewise separated into a gathering step and a distribution step, and it's likewise shown to be on par with the original transformer. Anyway, the general mathematical idea can already be found in the SVD.

  • @mattiasfagerlund
    @mattiasfagerlund 4 years ago +1

    Hi - is the reason that the softmax version is O(N^2) that softmax (in the general case) has N^2 gradients before they're collapsed back? Or is there something else going on?

    • @YannicKilcher
      @YannicKilcher  4 years ago +1

      It's because you need to do N^2 inner products. Maybe if the softmax wasn't there, you could somehow optimize that away, so I think you do have a point

  • @sajjadayobi688
    @sajjadayobi688 3 years ago

    Thanks Yannic. We've seen a few papers on transformers for long sequences, like Reformer, Linformer, and Longformer, but which of them actually works well in the real world?
    If you made a video comparing these papers on faster or longer transformers, I would be very thankful.

  • @RobotProctor
    @RobotProctor 4 years ago +1

    If Q.K is usually highest for the same token, I wonder why Q and K need to be different at all for a transformer to work properly. Why not just output a K and a V?

    • @YannicKilcher
      @YannicKilcher  3 years ago +1

      interesting idea. some papers fuse k and v, but I've never heard of fusing k and q

  • @RuminRoman
    @RuminRoman 4 years ago +1

    I wonder why you do not have a video about Universal Transformers

  • @andres_pq
    @andres_pq 4 years ago +1

    So in practice you could stack way more transformer layers and get better performance?

  • @bhuvanb8156
    @bhuvanb8156 4 years ago +2

    At 14:17 the numerator and denominator are the same except for multiplying by Vj - how is that possible?

    • @YannicKilcher
      @YannicKilcher  4 years ago

      it characterizes a normalized distribution over V

  • @baqirhusain5652
    @baqirhusain5652 2 years ago

    The complexity works out well when N > D^2. So the paper is saying that if the dimensionality is 512, my sequence length should be greater than 512^2 = 262144. Is this really practical?!
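    For what it's worth, a back-of-the-envelope comparison under toy cost models (a sketch: constants are ignored and the feature and value dimensions are both taken to be D). Under these assumptions the crossover sits near N ≈ D rather than N ≈ D²:

    ```python
    # Rough leading-term FLOP counts:
    # softmax attention ~ N^2 * D, linear attention ~ N * D^2,
    # so linear attention becomes cheaper once N exceeds roughly D.
    def softmax_cost(n, d):
        return n * n * d

    def linear_cost(n, d):
        return n * d * d

    d = 512
    for n in (256, 512, 1024, 4096):
        cheaper = "linear" if linear_cost(n, d) < softmax_cost(n, d) else "softmax"
        print(f"N={n}: {cheaper} attention is cheaper")
    ```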

  • @couragefox
    @couragefox 4 years ago +1

    commenting for algo. love your channel

  • @seraphim9723
    @seraphim9723 4 years ago +1

    Wait a second: Figure 5c is showing PER over wall-clock time, and despite the significant speed-up the original softmax transformer is better AND faster (at reaching any given error rate).

    • @YannicKilcher
      @YannicKilcher  4 years ago

      Yes, that's because on this task it seems to be very important to have the quadratic attention

  • @myungchulkang5716
    @myungchulkang5716 4 years ago +1

    thank you :)

  • @Bencurlis
    @Bencurlis 3 years ago

    The generalized formulation of equation 3 looks like Bayes formula, has anyone noticed? Is it accidental or is there a deeper meaning here?

  • @minhoryu2786
    @minhoryu2786 4 years ago +1

    Can anyone summarize the difference between Linformer and this paper?

    • @meerkatj9363
      @meerkatj9363 4 years ago +10

      Linformer has linear complexity thanks to a constant dimension of the keys and values, because they are projected into a fixed lower dimension. This paper does not fix the dimension of the sequence; it only reorders the computation, and the reordering is only possible under conditions that require the kernel trick.

  • @siyn007
    @siyn007 4 years ago +2

    speed over accuracy for me tbh. Not everyone has Google's resources to run it on 112312412312423 TPUs

  • @first-thoughtgiver-of-will2456

    If I was rich I'd give you money.

  • @charudattamanwatkar8340
    @charudattamanwatkar8340 4 years ago +8

    Who else saw balls in the thumbnail?

  • @joery8290
    @joery8290 4 years ago +1

    I love your videos, but could you try to make them more compact? 50 minutes is a big time investment. I feel like most papers can be explained in high detail in half that time.