CoPE - Contextual Position Encoding: Learning to Count What's Important

  • Published 11 Sep 2024

Comments • 24

  • @marinepower
    @marinepower 3 months ago +4

    I wonder if this method could be improved by adding a new projection matrix of size [hidden_dim x 1] that computes a width for each token: take the sigmoid, then the cumulative sum, do the interpolation as described, but add the result to the queries and keys and then do normal attention.
    This requires a new ((small)) matrix but would allow us to use flash attention directly without needing a new CoPE kernel (rough sketch below).
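
    A rough, untested sketch of what I mean (my own helper names and toy sizes, not from the video or the CoPE paper): the [hidden_dim x 1] projection gives each token a width in (0, 1), the cumulative sum of widths is its contextual position, that fractional position is linearly interpolated into a learned PE table, and the result is added to Q and K before a standard (flash-attention-compatible) attention call.

    import torch
    import torch.nn.functional as F

    def contextual_positions(x, width_proj):
        # x: (batch, seq, dim) -> fractional positions (batch, seq)
        widths = torch.sigmoid(width_proj(x)).squeeze(-1)   # per-token width in (0, 1)
        return widths.cumsum(dim=-1)                        # contextual position

    def interpolate_pe(p, pe_table):
        # Linearly interpolate the PE table at fractional positions p.
        low = p.floor().long().clamp(max=pe_table.size(0) - 2)
        w = (p - low.float()).unsqueeze(-1)                 # differentiable weight
        return (1 - w) * pe_table[low] + w * pe_table[low + 1]

    batch, seq, dim, n_pe = 2, 16, 64, 128
    x = torch.randn(batch, seq, dim)
    width_proj = torch.nn.Linear(dim, 1)                    # the proposed [hidden_dim x 1] matrix
    pe_table = torch.nn.Parameter(torch.randn(n_pe, dim))

    q = k = v = torch.randn(batch, 1, seq, dim)             # (batch, heads, seq, head_dim)
    pos = interpolate_pe(contextual_positions(x, width_proj), pe_table).unsqueeze(1)
    out = F.scaled_dot_product_attention(q + pos, k + pos, v, is_causal=True)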

    • @gabrielmongaras
      @gabrielmongaras  3 months ago

      This makes perfect sense to me and should work just as well as the CoPE method! The only drawback I see is that CoPE recalculates all positions for each query token, while positions in this case would not change for past tokens. However, I'm guessing this recomputation is unneeded and that you'd get similar performance to CoPE while being a lot more efficient computationally. Are you thinking of implementing this?

    • @marinepower
      @marinepower 3 months ago

      @@gabrielmongaras All positions here would be recalculated every layer, so they would affect past tokens. For example, the width for token 1 might be computed as 0.8 in layer 1 and 0.5 in layer 2, etc. Since we're doing the cumulative sum, all positions get updated: every token relies on token 1 (and the same applies to every other token in the series).
      As far as actually implementing this goes, however, I ran into a small issue. What this method uses (floor, ceil) is non-differentiable. Instead, the embeddings would need to be calculated directly from the cumulative embedding position, which I think is fairly expensive since these embeddings use a lot of sines and cosines, etc. It seems possible; we might just want to simplify the embedding function a bit.

    • @Anonn724
      @Anonn724 3 months ago +1

      Sounds interesting. I'll definitely give implementing this a try. I also quickly implemented a CUDA version, as Gabriel mentioned, if anybody is interested: gh juvi21/CoPE-cuda

    • @gabrielmongaras
      @gabrielmongaras  3 months ago

      @@marinepower That makes sense. At a given layer your positions are calculated once, whereas in CoPE they're recalculated for each token, but I don't think that difference will matter much, so this method can probably work just as well!
      I think a detach can be used on the floor/ceil; the interpolation weight, on the other hand, will be the differentiable part (toy check below):
      For a given token, we have a positional value, say p, which is the sum of all positional values up to that token.
      For the positional embedding of a specific token, we would have: e = (1 - (p - floor(p)))*PE[floor(p)] + (p - floor(p))*PE[ceil(p)]
      where PE[i] is the ith absolute positional encoding.
      In this case, floor(p) and ceil(p) are not differentiable, but p is, so there's still gradient flow.
      So I think this idea should still work!
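
      A toy check of that (random PE table, made-up sizes, nothing specific to CoPE): floor/ceil only pick indices, while the interpolation weight carries the gradient, so p still receives one.

      import torch

      PE = torch.randn(10, 4)                  # random positional embedding table
      p = torch.tensor(2.3, requires_grad=True)

      low = p.detach().floor().long()          # index only; gradient intentionally blocked
      high = p.detach().ceil().long()
      w = p - p.detach().floor()               # the differentiable part
      e = (1 - w) * PE[low] + w * PE[high]

      e.sum().backward()
      print(p.grad)                            # equals (PE[high] - PE[low]).sum()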

    • @marinepower
      @marinepower 3 months ago +1

      @@gabrielmongaras Hmm, yeah, I think that works! I'm not really training any transformers right now that need variable positional encodings, so I probably won't test this method, but the next time I look into LLMs I'll try it out! (Although... training LLMs from scratch requires a tremendous amount of compute, so I mostly hope someone else notices this comment chain and tries it out lol)

  • @lienlrac7644
    @lienlrac7644 3 months ago

    When Gabriel drops I stop everything and listen

  • @jonatan01i
    @jonatan01i 3 months ago

    The .log on the mask is there to turn the attn_logits values that should be masked into negative infinity; that's how you make them go to 0 in the attn (quick example below).
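
    Quick illustration (toy tensors, not code from the video): the log of a 0/1 mask is 0/-inf, and adding that to the logits drives the masked entries to exactly 0 after the softmax.

    import torch

    mask = torch.tril(torch.ones(4, 4))      # causal 0/1 mask
    logits = torch.randn(4, 4)
    attn = torch.softmax(logits + mask.log(), dim=-1)
    print(mask.log())                        # 0 on/below the diagonal, -inf above
    print(attn)                              # masked (upper-triangular) entries are exactly 0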

    • @gabrielmongaras
      @gabrielmongaras  3 months ago +1

      Ah, makes sense: it turns a binary mask into a mask of -inf and 0. Usually I just pass a mask of -inf and 0 through the entire model.

    • @jonatan01i
      @jonatan01i 3 months ago +1

      @@gabrielmongaras btw, thanks for the upload!:)

  • @einsteinsapples2909
    @einsteinsapples2909 3 months ago

    Don't you have the dimensions wrong at 2:30? Shouldn't the Q, K, V matrices be Tokens x Channels, with each row representing a single token?

    • @gabrielmongaras
      @gabrielmongaras  3 months ago +2

      I usually like to transpose the matrices when drawing out the QKV matrices for attention. I feel like a sequence going left to right rather than top to bottom is more intuitive, but idk.

    • @einsteinsapples2909
      @einsteinsapples2909 3 months ago +1

      @@gabrielmongaras Interesting; for me it's more intuitive to have a token per line, probably because I think of it as a Python list. A matrix, to me, is a list of lists.

    • @gabrielmongaras
      @gabrielmongaras  3 months ago +1

      @@einsteinsapples2909 I'll keep this in mind for future videos! Idk why, but drawing a token vertically is the way I've always done it. I guess as long as the shape is written out correctly, it's fine.

    • @einsteinsapples2909
      @einsteinsapples2909 3 months ago

      @@gabrielmongaras Hi Gabriel, I don't know about your background, whether you're a student or not. I personally have never formally studied any of these topics, so I don't know what the conventions are (for all I know, what you're doing is the norm).
      I know in the "Attention Is All You Need" paper they write the function as QK^T (they transpose the keys matrix), the same way you write it in your video. The only way you can do QK^T and end up with an "S by S" matrix is if the Q and K matrices are S x d (each row represents a token). If you want the Q, K, V matrices to be "d x S" (columns representing tokens), then you should do K^TQ instead to get the attention scores.

    • @gabrielmongaras
      @gabrielmongaras  3 months ago +2

      @@einsteinsapples2909 I don't have a degree in ML, mostly just self-study. I think the notation you're suggesting would be correct from a linear algebra perspective, meaning I drew the keys right but need to flip the queries/values. Nonetheless, Q, K, and V are (S, d) regardless of how they're drawn out, or else the attention formula wouldn't work. I guess it's confusing when I draw the tokens as column vectors 😅 (shape check below)
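
      A quick shape check of the two conventions being discussed (toy sizes S=5, d=3, random matrices):

      import torch

      S, d = 5, 3
      Q, K = torch.randn(S, d), torch.randn(S, d)   # tokens as rows
      print((Q @ K.T).shape)                        # torch.Size([5, 5]): Q K^T works directly

      Qc, Kc = Q.T, K.T                             # (d, S): tokens as columns
      print((Kc.T @ Qc).shape)                      # torch.Size([5, 5]): need K^T Q instead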

  • @jonatan01i
    @jonatan01i 3 months ago

    It's not true that they don't have positional encoding "at all"; that's very far from true.
    Every token only gets access to what came before it, and that's a lot of information to rely on.

    • @gabrielmongaras
      @gabrielmongaras  3 months ago +1

      It still doesn't have information about position, though. Position tells the model where in the sequence a certain token is, differentiating the same token appearing at different positions in the sequence. Without positional encodings, the model operates on sets of information, not sequences.

    • @jonatan01i
      @jonatan01i 3 months ago

      @@gabrielmongaras Yes, I know what you mean, but the elements are not independent of their positions; they depend heavily on what comes before them in the sequence. If you switch two tokens' places, the tokens that come before the leftmost of the two switched tokens will remain the same, but from that point on in the sequence everything changes. We could say that, in some sense, they have contextual positional encodings built into them (given you have the future masking; for a vanilla encoder it doesn't work, that one has no position information at all, I agree with that, but not the future-masked one).

    • @jonatan01i
      @jonatan01i 3 months ago

      ​@@gabrielmongaras
      If we have a transformer model with, let's say, 5 tokens and without additional positional encodings, and its sole task is to be given a random permutation of these 5 tokens two times (the same permutation twice),
      then essentially its task is to copy the first five tokens one after the other in the same order. If you believe a transformer (with future mask!) has no positional information at all, then you realise you are suggesting that this task is impossible for the transformer to learn.
      Do you agree with this, or are we talking about two different things?

    • @gabrielmongaras
      @gabrielmongaras  3 months ago +1

      Oh yeah, I think I see what you're saying. Without PEs, the transformer cannot distinguish the positions of words; however, it can still get a sense of the position of the word it is currently generating based on the context of the tokens.
      I suppose it may be easier to look at this through the idea of sets. A normal transformer with PEs essentially operates on an ordered set where each element is unique due to the ordering. Without PEs, the transformer operates on an unordered set of data, but it can still get an idea of the size of the set and maybe also an idea of the "count" of duplicated items in the set.

    • @jonatan01i
      @jonatan01i 3 months ago

      @@gabrielmongaras And also an idea of: [2,3,(1,5,4)][2,3, ... and now what comes next?
      How would it know whether it was (1,4,5), (1,5,4), (4,1,5), … or (5,4,1)?
      I get that the addition operation is commutative and all, but the model will eventually rely on the context of previous tokens to come up with useful (relative) positional information for itself, because that is possible given the future mask it has (if there is no mask, then yes, the transformer becomes truly commutative: you can change the order of the tokens and it won't make any difference, given you don't use convolutions with kernels bigger than 1 or positional embeddings either).