Gabriel Mongaras
Deterministic Image Editing with DDPM Inversion, DDIM Inversion, Null Inversion and Prompt-to-Prompt
Null-text Inversion for Editing Real Images using Guided Diffusion Models: arxiv.org/abs/2211.09794
An Edit Friendly DDPM Noise Space: Inversion and Manipulations: arxiv.org/abs/2304.06140
Prompt-to-Prompt Image Editing with Cross Attention Control: arxiv.org/abs/2208.01626
00:00 Intro
01:24 Current image editing techniques
11:42 Deriving DDPM and DDIM
23:08 DDIM inversion
32:46 Null inversion
47:15 DDPM inversion
1:01:18 Prompt-to-prompt
1:10:52 Conclusion
587 views

Videos

Attending to Topological Spaces: The Cellular Transformer
560 views • 1 month ago
Paper here: arxiv.org/abs/2405.14094 Notes: drive.google.com/file/d/12g_KkHqXD6mEDILJzYbCC08i8cDHITfC/view?usp=drive_link 00:00 Intro 01:39 Cellular complexes 07:26 K-cochain 13:26 Defining structure on the cell 20:28 Cellular transformer 34:18 Positional encodings and outro
Learning to (Learn at Test Time): RNNs with Expressive Hidden States
1.9K views • 1 month ago
Paper here: arxiv.org/abs/2407.04620 Code!: github.com/test-time-training/ttt-lm-pytorch Notes: drive.google.com/file/d/127a1UBm_IN_WMKG-DmEvfJ8Pja-9BwDk/view?usp=drive_link 00:00 Intro 04:40 Problem with RNNs 06:38 Meta learning and method idea 09:13 Update rule and RNN inner loop 15:07 Learning the loss function outer loop 21:21 Parallelizing training 30:05 Results
WARP: On the Benefits of Weight Averaged Rewarded Policies
683 views • 1 month ago
Paper here: arxiv.org/abs/2406.16768 Notes: drive.google.com/file/d/11UK7mEZwNVUMYuXwvOTfaqHhN8zSYm5M/view?usp=drive_link 00:00 Intro and RLHF 17:30 Problems with RLHF 21:08 Overview of their method 23:47 EMA 28:00 Combining policies with SLERP 37:34 Linear interpolation towards initialization 40:32 Code 44:16 Results
CoDeF: Content Deformation Fields for Temporally Consistent Video Processing
711 views • 1 month ago
Paper: arxiv.org/abs/2308.07926 Paper page: qiuyu96.github.io/CoDeF/ Code: github.com/qiuyu96/CoDeF My notes: drive.google.com/file/d/10PMKdd5XBd6Y60HlRB9IW9naR2bWziDT/view?usp=drive_link 00:00 Intro 03:00 Method overview 08:40 Method details 15:24 Tricks done for training and how to actually train this thing 19:24 Flow loss and masking 25:10 Conclusion
Mamba 2 - Transformers are SSMs: Generalized Models and Efficient Algorithms Through SSS Duality
6K views • 2 months ago
Paper here: arxiv.org/abs/2405.21060 Code!: github.com/state-spaces/mamba/blob/main/mamba_ssm/modules/mamba2.py Notes: drive.google.com/file/d/1 XGPFeXQyx4CPxgYjzR4qrLd-baLWQC/view?usp=sharing 00:00 Intro 01:45 SSMs 08:00 Quadratic form of an SSM 15:02 Expanded form of an SSM 24:00 Attention - it's all you need?? 29:55 Kernel attention 32:50 Linear attention 34:32 Relating attention to SSMs 38:...
CoPE - Contextual Position Encoding: Learning to Count What's Important
1.2K views • 2 months ago
Paper: arxiv.org/abs/2405.18719 My notes: drive.google.com/file/d/1y9VHZc7MLqc6t2SHHdlVTYeW3czmmRbl/view?usp=sharing 00:00 Intro 02:44 Background 09:58 CoPE 24:50 Code 32:16 Results
NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models
742 views • 2 months ago
Paper: arxiv.org/abs/2403.03100 Demo: speechresearch.github.io/naturalspeech3/ Code: huggingface.co/spaces/amphion/naturalspeech3_facodec My notes: drive.google.com/file/d/1xnzErd_86B6eLwqpLckhoEQKqkxFPyM_/view?usp=drive_link 00:00 Intro 05:34 Architecture overview 18:45 GRL and subspace independence 24:45 Discrete diffusion Model 41:00 factorized diffusion model 44:00 Conclusion and results
xLSTM: Extended Long Short-Term Memory
1.8K views • 3 months ago
Paper: arxiv.org/abs/2405.04517 My notes: drive.google.com/file/d/1wFYvU_1oUWcCNuQ91zTpSGAeNUsPjlt3/view?usp=drive_link 00:00 Intro 05:44 LSTM 13:38 Problems paper addresses 14:12 sLSTM 23:00 sLSTM Memory mixing 27:08 mLSTM 35:14 Results and stuff
KAN: Kolmogorov-Arnold Networks
54K views • 3 months ago
Paper: arxiv.org/abs/2404.19756 Spline Video: m.czcams.com/video/qhQrRCJ-mVg/video.html My notes: drive.google.com/file/d/1twcIF13nG8Qc10_qeDqCZ4NaUh9tFsAH/view?usp=drive_link 00:00 Intro 00:45 MLPs and Intuition 05:12 Splines 19:02 KAN Formulation 28:00 Potential Downsides to KANs 32:09 Results
LADD: Fast High-Resolution Image Synthesis with Latent Adversarial Diffusion Distillation
784 views • 3 months ago
Paper: arxiv.org/abs/2403.12015 My notes: drive.google.com/file/d/1s1-nnWR_ZR26PNSAoZR1Xj3nuD9UZlvR/view?usp=sharing 00:00 Intro 01:31 Diffusion Models 08:08 Latent Diffusion Models 10:04 Distillation 12:02 Adversarial Diffusion Distillation (ADD) 17:06 Latent Adversarial Diffusion Distillation (LADD) 22:20 Results
Visual AutoRegressive Modeling: Scalable Image Generation via Next-Scale Prediction
1.7K views • 4 months ago
Paper: arxiv.org/abs/2404.02905 Demo: var.vision/ Code: github.com/FoundationVision/VAR My notes: drive.google.com/file/d/1qym3JG-0xqEgQhdvkt9N17o-ZzUWy2sn/view?usp=drive_link 00:00 Intro 00:53 DiTs 04:06 Autoregressive Image Transformers 06:23 Tokenization problem with AR ViTs 08:43 VAE 10:47 Discrete Quantization - VQGAN 16:42 Visual Autoregressive Modeling 21:31 Causal Inference with VAR 24:...
Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention
3.5K views • 4 months ago
Paper: arxiv.org/abs/2404.07143 My notes: drive.google.com/file/d/1plWJDwHTZkRK9PDdvaLMnZjFR6fVvNLH/view?usp=drive_link 00:00 Intro 07:17 Model intuition 11:00 Memory retrieval operation 16:29 Hidden state updates 21:58 Delta update 24:10 Is it causal? 25:26 Combining local attention and RNN 27:26 Results 30:25 Sampling and Conclusion
Mixture-of-Depths: Dynamically allocating compute in transformer-based language models
1.9K views • 4 months ago
Paper: arxiv.org/abs/2404.02258 My notes: drive.google.com/file/d/1o4v5te1yfuK_FQPvvS8SR55Sysg04dYK/view?usp=drive_link 00:00 Intro 06:02 Mixture of Experts (MoE) 15:12 Mixture of Depths (MoD) 17:04 The gradients must flow! 22:40 Autoregressive Sampling 33:58 Results
Q* AGI Achieved (Apr Fools)
777 views • 4 months ago
Q* paper link: link.springer.com/content/pdf/10.1007/BF00992698.pdf April fools 😏
Stable Diffusion 3: Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
3.2K views • 4 months ago
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
965 views • 5 months ago
The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits and BitNet
5K views • 5 months ago
DoRA: Weight-Decomposed Low-Rank Adaptation
1.8K views • 5 months ago
OpenAI Sora and DiTs: Scalable Diffusion Models with Transformers
11K views • 6 months ago
A Decoder-only Foundation Model For Time-series Forecasting
3.4K views • 6 months ago
Lumiere: A Space-Time Diffusion Model for Video Generation
640 views • 6 months ago
Exphormer: Sparse Transformers for Graphs
408 views • 6 months ago
Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads
1.4K views • 6 months ago
Boundary Attention: Learning to Find Faint Boundaries at Any Resolution
455 views • 7 months ago
Cached Transformers: Improving Transformers with Differentiable Memory Cache
835 views • 7 months ago
Translatotron 3: Speech to Speech Translation with Monolingual Data
760 views • 7 months ago
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
9K views • 8 months ago
Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference
1.9K views • 8 months ago
Adversarial Diffusion Distillation
1.7K views • 8 months ago

Comments

  • @szebike
    @szebike • 5 days ago

    Technically impressive that it's possible, but I only see limited applications for this. Practically, most models below 8-bit quantization are far less "aware" of input and context. If I alter a situation in an 8-bit model, it can adjust its output accordingly; any model below that is very rigid. That said, maybe you can mitigate those effects if you train at that quantization to begin with, rather than compressing a model that was trained at higher precision, if that makes sense.

  • @HenryDeng-o2x
    @HenryDeng-o2x • 13 days ago

    34:00 Wait, how? I don't think "(L ∘ QK)V" is equal to "Q*cumsum(KV)"

    • @gabrielmongaras
      @gabrielmongaras • 13 days ago

      If L is just the causal mask, then we can write the equation as Q @ cumsum(K^T @ V). This is because at each time t the query is unique and is multiplied by the sum of all the past keys and values. For example:
      O1 = Q1 @ K1^T @ V1 = Q1 @ [ K1^T @ V1 ]
      O2 = Q2 @ K1^T @ V1 + Q2 @ K2^T @ V2 = Q2 @ [ K1^T @ V1 + K2^T @ V2 ]
      The cumsum operation sums all the previous elements from timestep 1 to t, which is exactly the relationship above. It's just a compact way for me to write this recurrence in one equation.
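
A quick numerical sanity check of this equivalence (a minimal PyTorch sketch, not code from the paper or video; sizes are made up):

```python
import torch

S, d = 5, 4                               # sequence length, head dimension (arbitrary)
Q, K, V = torch.randn(S, d), torch.randn(S, d), torch.randn(S, d)

# Masked quadratic form: (L ∘ (Q @ K^T)) @ V, with L the lower-triangular causal mask
L = torch.tril(torch.ones(S, S))
out_masked = (L * (Q @ K.T)) @ V

# Recurrent / cumsum form: O_t = Q_t @ sum_{s <= t} K_s^T V_s
KV = torch.einsum('sd,se->sde', K, V)     # per-step outer products K_s^T V_s
out_cumsum = torch.einsum('sd,sde->se', Q, KV.cumsum(dim=0))

print(torch.allclose(out_masked, out_cumsum, atol=1e-5))  # True
```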

    • @HenryDeng-o2x
      @HenryDeng-o2x • 12 days ago

      @@gabrielmongaras Got it, and I kinda see the relationship between the causal mask and cumsum() now. Thank you :D

  • @user-fd5qz1wo7e
    @user-fd5qz1wo7e • 14 days ago

    How is the equality in DDPM established at 17:49?

    • @gabrielmongaras
      @gabrielmongaras • 13 days ago

      Looks like I forgot to write out the square root over the first term. As for the inner term that got turned into a fraction, I just multiplied sqrt{1-a_t} by the fraction sqrt{1-a_t}/sqrt{1-a_t}.
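
Written out, the fraction step described above is just a multiplication by 1 (my reconstruction of the algebra, using \(\bar{\alpha}_t\) for the cumulative noise schedule):

```latex
\sqrt{1-\bar{\alpha}_t}\,\epsilon
  = \sqrt{1-\bar{\alpha}_t}\cdot\frac{\sqrt{1-\bar{\alpha}_t}}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon
  = \frac{1-\bar{\alpha}_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon
```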

  • @AarreLisakki
    @AarreLisakki • 17 days ago

    A tiny correction: it's phOnEme, not, as you say, phEnOme.

    • @gabrielmongaras
      @gabrielmongaras • 13 days ago

      Thanks for letting me know! I have no idea where I got phOnEme from ha!

  • @TheCrmagic
    @TheCrmagic • 18 days ago

    Isn't the recurrence A_t @ h_(t-1)? So A_1 should be multiplied with h_0, and so on?

    • @gabrielmongaras
      @gabrielmongaras • 13 days ago

      It depends on how you define the hidden state. Since I'm defining it as h_t = h_{t-1} + K_t^T @ V_t, the output is defined at timestep t, not t-1. This means that the output at time t takes into account all past information and the current token. If the relation were on t-1, it would only consider previous information.
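
As a tiny sketch of that indexing convention (my own illustration, not code from the paper): the hidden state is updated with the current token's key/value before the output is read out, so the output at time t already includes the current token.

```python
import torch

S, d = 5, 4
Q, K, V = torch.randn(S, d), torch.randn(S, d), torch.randn(S, d)

h = torch.zeros(d, d)                  # hidden state h_0
outputs = []
for t in range(S):
    h = h + torch.outer(K[t], V[t])    # h_t = h_{t-1} + K_t^T V_t  (update first)
    outputs.append(Q[t] @ h)           # output at time t reads the updated state h_t
```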

  • @coc2912
    @coc2912 • 19 days ago

    Good video; we need this kind of advanced paper explanation.

  • @WesleyWilliams-x9y
    @WesleyWilliams-x9y • 22 days ago

    You did an excellent job in your explanations! I can tell you recently had to grasp the topic since it's pretty new! I'm curious how you think the architecture could be extended to larger problems.

  • @EkShunya
    @EkShunya • 22 days ago

    Great one, really liked it. Thanks!

  • @AM-yk5yd
    @AM-yk5yd • 23 days ago

    My only topology knowledge comes from "Experiments in Topology" by Barr, where one chapter was something like a court case about making holes, using them, and repairing them, or something like that. The book was written in the 1960s and I read it when I was a teen in the early 2000s. It covered only the very basics of topology but was quite fun even for teenage me, as it had fun exercises.

  • @jahovajenkins5947
    @jahovajenkins5947 • 27 days ago

    Huge thanks dude, this explanation has saved me much agony.

  • @farafr46
    @farafr46 • 27 days ago

    Thanks for the video, I didn't understand a thing.

  • @thisismambonumber5
    @thisismambonumber5 • 28 days ago

    This video was so cool, thanks ❤ I know it's diffusion-model specific, but it would be cool if you could explain LyCORIS like this.

  • @palfers1
    @palfers1 • 1 month ago

    I cannot listen to you because you continually interrupt yourself. Practise speaking.

  • @nozulani
    @nozulani • 1 month ago

    eli5?

    • @gabrielmongaras
      @gabrielmongaras • 1 month ago

      Sure! Typical graph data takes on a node-edge form, where the data is assigned to each node, with connections via edges. This notion of a graph can be generalized by considering a graph with faces formed by a set of connected edges, for example a triangle or a square. These graphs are called "cell complexes", as each face kind of looks like a cell. Instead of data being put only on vertices, we can also assign data to edges and faces. This makes the data representation and the model more expressive, since we look at higher-order connections on faces rather than only on edges. The authors build a transformer utilizing the properties of these higher-order structures and find good results on a few datasets.
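
As a rough illustration of "data on vertices, edges, and faces" (a toy, made-up representation, not the paper's code), a single filled triangle could be stored as:

```python
import torch

# Toy cell complex: one filled triangle. Feature vectors live on 0-cells (vertices),
# 1-cells (edges), and 2-cells (faces), not just on vertices as in a plain graph.
d = 8
triangle_complex = {
    "vertices": {v: torch.randn(d) for v in ["a", "b", "c"]},
    "edges":    {e: torch.randn(d) for e in [("a", "b"), ("b", "c"), ("a", "c")]},
    "faces":    {("a", "b", "c"): torch.randn(d)},
}
# A cellular transformer can then attend between cells that share a boundary,
# e.g. an edge attending to the face(s) it borders.
```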

    • @nozulani
      @nozulani • 1 month ago

      @@gabrielmongaras Thanks, that wasn't exactly ELI5, but I understand this is a difficult topic.

    • @gabrielmongaras
      @gabrielmongaras • 1 month ago

      Is there anything that's particularly confusing?

    • @nozulani
      @nozulani • 1 month ago

      @@gabrielmongaras Thanks, I don't want to take up your time; I'm going to study this myself.

  • @KaiSyunHou
    @KaiSyunHou • 1 month ago

    I understand that the inner loop updates W_t with the modified reconstruction loss, but then how are theta_K, theta_V, and theta_Q updated in the outer loop? Specifically, what loss is related to these three parameters?

    • @gabrielmongaras
      @gabrielmongaras • 1 month ago

      The "outer loop" uses the normal training strategy where we have the negative log likelihood or cross entropy loss for next token prediction. This gradient of this loss backpropogated to all layers like normal. So the "outer loop" is just the rest of the network.

    • @KaiSyunHou
      @KaiSyunHou • 1 month ago

      @@gabrielmongaras But shouldn't the normal cross entropy loss for next token prediction incur no gradient flow to theta_K and theta_V? They are for the reconstruction loss, so they are not involved in token prediction.

    • @gabrielmongaras
      @gabrielmongaras • 1 month ago

      The theta params are trained via the next token prediction loss. This can be thought of as the outer model querying the inner model for information by changing the loss function with these theta params. The outer loop actually differentiates through the inner loop (including the gradient of the loss), so the theta_K and theta_V params are updated by the outer loop in this way. The only param that's "trained" using the inner loop is the hidden state.
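
A minimal sketch of that gradient flow (a toy example of my own, not the paper's code; names follow the thread above): one differentiable inner-loop step lets an outer next-token-style loss reach theta_K and theta_V.

```python
import torch

d = 16
theta_K = torch.randn(d, d, requires_grad=True)   # outer-loop parameters
theta_V = torch.randn(d, d, requires_grad=True)
theta_Q = torch.randn(d, d, requires_grad=True)
W = torch.zeros(d, d, requires_grad=True)         # inner-loop hidden state
x = torch.randn(d)                                # current token's features

# Inner loop: one gradient step on the reconstruction loss, kept differentiable
# w.r.t. theta_K / theta_V via create_graph=True.
k, v = theta_K @ x, theta_V @ x
recon_loss = ((W @ k - v) ** 2).sum()
grad_W = torch.autograd.grad(recon_loss, W, create_graph=True)[0]
W_new = W - 0.1 * grad_W

# Outer loop: a stand-in for the next-token loss, computed from the updated state.
out = W_new @ (theta_Q @ x)
outer_loss = (out ** 2).sum()
outer_loss.backward()
print(theta_K.grad is not None, theta_V.grad is not None)  # True True
```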

  • @adidevbhattacharya9220
    @adidevbhattacharya9220 • 1 month ago

    That was indeed a great explanation. Can you please explain how we get the hidden dimension d at 28:19? For example, if the image in latent space is 128x128x3 and we consider a patch size of 32, then the number of tokens = (128/32)^2 = 16. Is the number of dimensions (d) then p^2 = 32^2? Please clarify this.

  • @lexer_
    @lexer_ • 1 month ago

    I really like seeing videos like this that are actually willing to explore the math and break it down to make it easier to catch onto how it actually works. Too many just discuss the abstract or skip past the math because it hampers audience maximization.

  • @gabrielmongaras
    @gabrielmongaras • 1 month ago

    I want to point out that after training, the inner loop in the RNN still needs to be "optimized", since the gradient update rule is the RNN update rule itself. The normal NLL training procedure just trains the model on how to "teach" the internal RNN layers. This isn't normal optimization though: it doesn't require an optimizer, since we know the functional form of the gradient w.r.t. the hidden state, so it's very efficient.

  • @hcp-s4k
    @hcp-s4k • 1 month ago

    Does each pixel in the image have a box? For a pixel, the paper says that the vertex is the center of the pixel, but the video says that the vertex can be outside the box. I don't quite understand. I hope there will be a more detailed description of the training details.

  • @hi_6546
    @hi_6546 • 1 month ago

    Hey, if the frame rate is equal to 50*64, does that mean that each second the model needs to predict 50*64 values (without counting the dim)?

    • @hi_6546
      @hi_6546 • 1 month ago

      The model works with the indices of the codebook, no?

    • @nickackerman8755
      @nickackerman8755 • 20 days ago

      Where do you get the 64? If we have e.g. 4 codebooks, and there are 50 time steps per one second of audio, we need to predict 4 * 50 values per second (4 codebook indices per time step), I believe.

  • @lexer_
    @lexer_ • 1 month ago

    There is a strange disconnect between the kinds of concepts you explain here. Some of them are at the absolute-beginner level, as if you have never coded before and don't know anything about machine learning, but the majority of the video requires at least several months of crash course, if not years of experience in the field, to understand. I think your videos could benefit from a more consistent assumption of the audience's competence. Or maybe approach it in sections for different audience competencies? I really appreciate these walk-throughs of papers. It makes it so much easier for me to concentrate than only reading on my own, so I really want anyone who does this to succeed.

    • @gabrielmongaras
      @gabrielmongaras • 1 month ago

      Thanks for the feedback!! Usually I try to keep most papers I cover a little more technical, as I'd otherwise find myself explaining things like transformers over and over again, which could lead to videos being unnecessarily long (they already feel way too long; I've been trying to reduce lengths). Some videos, like the Stable Diffusion one, I try to make a little more beginner friendly, assuming knowledge of CNNs, MLPs, and basic training. I should probably communicate this better and make a clear distinction between the two, perhaps in the title somewhere. With this video, I wanted it to be somewhere in the middle, which led to problems. I like the idea of approaching it in sections based on knowledge level. For example, I could've explained REINFORCE a little or pointed to a resource about it. Those who know it could skip that part, and those who don't could watch it or go to the resource. Will think about this some more for future videos!

    • @lexer_
      @lexer_ • 1 month ago

      @@gabrielmongaras I personally got annoyed at the very beginner-level explanations early on, which felt like they took forever, and I feel lucky that I stuck it out long enough to actually make it to the interesting parts later on. For a while I thought that was really all there would be in this video: explanations of some very basic concepts. But that is of course very different depending on where one might be coming from in terms of familiarity with these topics. I will watch out for future videos of yours for sure.

    • @gabrielmongaras
      @gabrielmongaras • 1 month ago

      @@lexer_ Got it! I didn't realize the intros were annoying 😅 Thanks again for your feedback, it's really helpful! From now on, I think I will directly mention when I am explaining a basic concept, say to skip it if one already knows it, and provide resources instead of explaining concepts I assume most should know. In this video, I could've probably just skipped the RLHF explanation altogether and pointed to one of the many explanations of RLHF. I'm going to experiment with this a little in future videos!

    • @codylane2104
      @codylane2104 • 1 month ago

      @@gabrielmongaras Yeah, pointing to good basic level explanatory material is a great idea! There's a ton of materials on the net, yet only a fraction is really worth studying. 🙂

  • @SOFTWAREMASTER
    @SOFTWAREMASTER • 1 month ago

    Great video. Thank you for covering this! Could you also have a look at the 'Terminator' paper and make a video on that? It looks like it's a new architecture.

    • @gabrielmongaras
      @gabrielmongaras • 1 month ago

      Will take a look at it! Not sure if I'll make a video on it yet though. I usually do if the paper proposes something really interesting or has really good results!

  • @nicEDITS_
    @nicEDITS_ • 1 month ago

    Really helpful, thanks!

  • @hakikatsingh6254
    @hakikatsingh6254 • 1 month ago

    dayummmmmmmm, pretty awesome dude

  • @yadav-r
    @yadav-r • 1 month ago

    I do not understand it, but can we use it for stock market forecasting or sports forecasting? How do I use the tool? Is there a tutorial for it? Thank you.

  • @PhucNguyenXuan-cn8rg
    @PhucNguyenXuan-cn8rg • 2 months ago

    Thank you so much for your clear explanation!

  • @danieloh4511
    @danieloh4511 • 2 months ago

    Great video. BTW, what is the note-taking app in the video?

    • @gabrielmongaras
      @gabrielmongaras • 2 months ago

      Thanks! I'm using the Samsung notes app. Pretty standard, but I like the style and ease of use of the app.

  • @acasualviewer5861
    @acasualviewer5861 • 2 months ago

    If they can convert between Transformers and Mamba2, why not convert GPT2 (which has published weights) to Mamba2 and prove that the converted model performs as well as the original? Or do the same with a bigger model such as Llama? Or does their method not allow importing Transformer weights into the equivalent Mamba2?

    • @gabrielmongaras
      @gabrielmongaras • 2 months ago

      GPT2 and models that use the original transformer architecture use softmax attention, which cannot be decomposed into an RNN or SSM due to the nonlinear dependence of the softmax function on the entire row. This forces you to compute an entire row in memory to get the output for that row. However, if you remove that dependence with the kernel trick or a similar method, such as using ReLU on the queries and keys, then you can decompose the attention mechanism into an RNN/SSM. This is because you no longer have to compute an entire row of the attention matrix to get the output value; rather, it's just a sum of values along that row, giving the RNN formulation. So original softmax transformer models can't be changed into an RNN/SSM, meaning they have to stay quadratic (unless flash attention is used, but that is just a CUDA trick). This is why I think we should move away from softmax attention. Changing to an SSM architecture or linear attention is way more efficient, and papers such as this show that linear methods are comparable to naive transformers in terms of accuracy.

    • @acasualviewer5861
      @acasualviewer5861 • 2 months ago

      @@gabrielmongaras Thank you for responding and for your very valuable videos. You are by far my favorite "paper explainer" channel. Back to the question: given that Transformers cannot be converted to Mamba2, I'm not sure it's convincing to say that Mamba2 is equivalent to GPT as the paper claims. If it is, then the proof should be in the results, and it should be able to compete with the state of the art. Though I get your point about softmax. Frankly I find it confusing that softmax counts as a non-linearity. But in the end, results should speak for themselves.

    • @gabrielmongaras
      @gabrielmongaras • 2 months ago

      Thank you! Glad you're finding my videos helpful! I think it may be better to say that SSMs, GPT, and RNNs are very similar, in that changing the nonlinearity in GPT and changing the A block in SSMs result in an equivalent RNN. This RNN formulation is what Mamba 2 is. The softmax "nonlinearity" also confuses me. If you try replacing it with other nonlinearities, you'll find that for some reason it has crazy good properties when used on the attention matrix that other nonlinearities don't have, giving very good accuracy. It also has the property that the values sum to 1, which is very nice for stability reasons. I'd be very happy to see something replace it, as it feels very naive to just slap a softmax onto a matrix. I think the authors make a convincing argument with their results. Since other papers have shown that a linear attention block can be comparable to softmax attention in terms of accuracy, Mamba 2 is pretty believable. They also develop a lot of theory and release code, making me believe this paper's claims more. I guess we'll see how Mamba 2 compares when tested on large-scale LLM cases.

    • @acasualviewer5861
      @acasualviewer5861 • 2 months ago

      @@gabrielmongaras Softmax seems to compare values, though layer norm does something similar. Maybe the comparison is what makes it special. With the output of softmax we're always creating a weighted sum of sorts, like an interpolation of vectors. ReLU seems incapable of doing the same.

  • @elon-69-musk
    @elon-69-musk • 2 months ago

    awesome 👍😎

  • @Ryu-ix8qs
    @Ryu-ix8qs • 2 months ago

    Thanks for the video. Very helpful to understand how it works.

  • @gabrielmongaras
    @gabrielmongaras • 2 months ago

    Sorry the audio got messed up on this one. Got a new mic and it acts a bit weird with my tablet sometimes. Tried to fix it in post processing, but wasn't able to get the wobbling sound out. Maybe I should just train an audio algorithm that takes my garbage audio in and spits out professional audio via consistency loss, ha!

  • @noadsensehere9195
    @noadsensehere9195 • 2 months ago

    How can I implement this paper?

  • @einsteinsapples2909
    @einsteinsapples2909 • 2 months ago

    Don't you have the dimensions wrong at 2:30? Shouldn't the Q, K, V matrices be Tokens x Channels, with each row representing a single token?

    • @gabrielmongaras
      @gabrielmongaras • 2 months ago

      I usually like to transpose the matrices when drawing the QKV matrices out for attention. I feel like a sequence going left to right rather than top to bottom is more intuitive, but idk.

    • @einsteinsapples2909
      @einsteinsapples2909 • 2 months ago

      @@gabrielmongaras Interesting. For me it's more intuitive to have a token per line, probably because I think of it as a Python list. A matrix, to me, is a list of lists.

    • @gabrielmongaras
      @gabrielmongaras • 2 months ago

      @@einsteinsapples2909 I'll keep this in mind for future videos! Idk why, but seeing a token be vertical is the way I've always drawn it. I guess as long as the shape is written out correctly, it's fine.

    • @einsteinsapples2909
      @einsteinsapples2909 • 2 months ago

      @@gabrielmongaras Hi Gabriel, I don't know about your background, whether you're a student or not. I personally have never formally studied any of these topics, so I don't know what the conventions are (for all I know, what you're doing is the norm). I know in the "Attention Is All You Need" paper they write the function as QK^T (they transpose the keys matrix), the same way you write it in your video. The only way you can do QK^T and end up with an "S by S" matrix is if the Q and K matrices are S x d (each row represents a token). If you want the Q, K, V matrices to be "d x S" (columns representing tokens), then you should do K^T Q instead to get the attention scores.

    • @gabrielmongaras
      @gabrielmongaras • 2 months ago

      @@einsteinsapples2909 I don't have a degree in ML, mostly just self-study. I think the notation you're suggesting is correct from a linear algebra perspective, meaning I drew the keys right but need to flip the queries/values. Nonetheless, Q, K, and V are (S, d) regardless of how they're drawn out, or else the attention formula wouldn't work. I guess it's confusing when I draw the tokens as column vectors 😅
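
A quick shape check of the "tokens as rows" convention discussed here (a throwaway sketch; the sizes are made up):

```python
import torch

S, d = 4, 8                        # sequence length, head dimension
Q = torch.randn(S, d)
K = torch.randn(S, d)
V = torch.randn(S, d)

scores = Q @ K.T                   # (S, S) attention scores
out = scores.softmax(dim=-1) @ V   # (S, d), one output row per token
print(scores.shape, out.shape)     # torch.Size([4, 4]) torch.Size([4, 8])
```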

  • @terenceAAA
    @terenceAAA • 2 months ago

    The paper says they get N next tokens from N heads using the hidden state h_t at position t, but I just don't understand the arrow from Head 1 to Head 2 in Figure 2. The distribution of tokens from Head 2 is never determined by the output of Head 1. I just can't see the connection between Head 1 and Head 2.

  • @KevinInPhoenix
    @KevinInPhoenix • 2 months ago

    Does this technology make Nvidia's tech and NPUs obsolete?

  • @jonatan01i
    @jonatan01i • 2 months ago

    The .log on the mask is there to make the attn_logits that are to be masked negative infinity; that's how you make them disappear to 0 in the attn.

    • @gabrielmongaras
      @gabrielmongaras • 2 months ago

      Ah, makes sense: it turns a binary mask into a mask of -inf and 0. Usually I just pass a mask of -inf and 0 through the entire model.
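
A small illustration of the point above (not the repo's code): taking .log() of a {0, 1} mask gives {0, -inf}, which can be added to the attention logits before the softmax so masked positions get zero weight.

```python
import torch

mask = torch.tensor([[1., 0.],
                     [1., 1.]])             # binary keep/mask pattern
attn_logits = torch.randn(2, 2)

attn = (attn_logits + mask.log()).softmax(dim=-1)
print(mask.log())                           # [[0., -inf], [0., 0.]]
print(attn)                                 # row 0 puts all of its weight on position 0
```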

    • @jonatan01i
      @jonatan01i • 2 months ago

      @@gabrielmongaras btw, thanks for the upload!:)

  • @jonatan01i
    @jonatan01i • 2 months ago

    It's not true that they don't have positional encoding "at all", it's very far from true. Every token only gets access to what came before, and that's a lot of info available to rely on.

    • @gabrielmongaras
      @gabrielmongaras • 2 months ago

      It still doesn't have information about position though. Position tells the model where in the sequence a certain token is, differentiating the same token at different positions in the sequence. Without positional encodings, the model operates on sets of information, not sequences.

    • @jonatan01i
      @jonatan01i • 2 months ago

      @@gabrielmongaras Yes, I know what you mean, but the elements are not independent of position; they depend heavily on what comes before them in the sequence. If you switch two tokens' places, the tokens before the leftmost of the two switched tokens remain the same, but from that point on in the sequence everything changes. In some sense they have contextual positional encodings built into them (given the future masking; for a vanilla encoder it doesn't work, that one has no position information at all. I agree with that, but not for the future-masked case).

    • @jonatan01i
      @jonatan01i • 2 months ago

      @@gabrielmongaras If we have a transformer model with, let's say, 5 tokens and no additional positional encodings, and its sole task is to be given a random permutation of these 5 tokens twice (the same permutation both times), then essentially its task is to copy the first five tokens one after the other in the same order. If you believe a transformer (with future mask!) has no positional information at all, then you realise you are suggesting that this task is impossible for the transformer to learn. Do you agree with this, or are we talking about two different things?

    • @gabrielmongaras
      @gabrielmongaras • 2 months ago

      Oh yeah, I think I see what you're saying. Without PEs, the transformer cannot distinguish the positions of words; however, it can still get a sense of the position of the word it is currently generating based on the context of the tokens. I suppose it may be easier to look at this through the idea of sets. A normal transformer with PEs essentially operates on an ordered set where each element is unique due to the ordering. Without PEs, the transformer operates on an unordered set of data, but it can still get an idea of the size of the set and maybe also an idea of the "count" of duplicated items in the set.

    • @jonatan01i
      @jonatan01i • 2 months ago

      @@gabrielmongaras and also an idea of [2,3,(1,5,4)][2,3, and now what comes next? How would it know if it was (1,4,5),(1,5,4),(4,1,5),…or(5,4,1)? I get it that the addition operation is commutative and all, but the model will eventually rely on context of previous tokens to get to come up with a useful (relative)positional information for itself because it is possible to do given there is the future mask given to it (if no mask, then the transformer yes becomes commutative truly, you can change the order of the tokens and it won’t make any difference, (given you don’t use convolutions with kernels bigger than 1 and positional embeddings neither)

  • @lienlrac7644
    @lienlrac7644 • 2 months ago

    When Gabriel drops I stop everything and listen

  • @marinepower
    @marinepower • 2 months ago

    I wonder if this method could be improved by having a new projection matrix of size [hidden_dim x 1] that computes the width of each token: take the sigmoid, then the cumulative sum, do the interpolation as described, but add it to the queries and keys and then do normal attention. This requires a new (small) matrix but would allow us to use flash attention directly without needing a new CoPE kernel.

    • @gabrielmongaras
      @gabrielmongaras • 2 months ago

      It makes perfect sense to me that this would work just as well as the CoPE method! The only drawback I see is that CoPE recalculates all positions for each token, while positions in this case would not change for past tokens. However, I'm guessing this recomputation is unneeded and that you would get similar performance to CoPE while being a lot more efficient computationally. Are you thinking of implementing this?

    • @marinepower
      @marinepower • 2 months ago

      @@gabrielmongaras All positions here would be recalculated every layer, so they would affect past tokens. For example, the width for token 1 might be computed as 0.8 in layer 1 and 0.5 in layer 2, etc. Since we're doing the cumulative sum, all positions are updated, since every token relies on token 1 (and this applies to every token in the series). As far as actually implementing this, however, I ran into a small issue. What this method uses (floor, ceil) is non-differentiable. Instead, the embeddings would need to be calculated directly from the cumulative embedding position, which I think is fairly expensive since these embeddings use a lot of sines and cosines, etc. It seems possible; it's just that we might want to simplify the embedding function a bit.

    • @Anonn724
      @Anonn724 • 2 months ago

      Sounds interesting. I'll definitely give implementing this a try. I quickly implemented some CUDA as Gabriel mentioned; if somebody is interested: gh juvi21/CoPE-cuda

    • @gabrielmongaras
      @gabrielmongaras • 2 months ago

      @@marinepower That makes sense, though within the same layer the positions are calculated once, while in CoPE they're recalculated for each token. I don't think this difference will matter much, so this method can probably work just as well! I think a detach can be used on the floor/ceil; the weight, on the other hand, is the differentiable part. For a given token we have a positional value, say p, which is the sum of all positional values up to that token. The positional embedding for that token would then be:
      e = (1 - (p - floor(p))) * PE[floor(p)] + (p - floor(p)) * PE[ceil(p)]
      where PE[i] is the ith absolute positional encoding. In this case, floor(p) and ceil(p) are not differentiable, but p is, so there's still gradient flow. So I think this idea should still work!

    • @marinepower
      @marinepower • 2 months ago

      @@gabrielmongaras Hmm, yeah, I think that works! I am not really training any transformers right now that need variable positional encodings, so I probably won't test this method, but the next time I look into LLMs I'll try this out! (Although... training LLMs from scratch requires a tremendous amount of compute, so I mostly hope someone else notices this comment chain and tries it out lol)
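
A minimal sketch of the interpolation idea from this thread (my reading of the discussion, not CoPE's code; names like width_proj and pe_table are hypothetical, and lo + 1 stands in for ceil):

```python
import torch

S, d, max_pos = 16, 32, 64
x = torch.randn(S, d)                         # token features
width_proj = torch.nn.Linear(d, 1)            # the new [hidden_dim x 1] projection
pe_table = torch.randn(max_pos, d)            # absolute positional embeddings PE[i]

# Fractional positions: sigmoid of per-token widths, then a cumulative sum.
p = torch.sigmoid(width_proj(x)).squeeze(-1).cumsum(dim=0)

lo = p.floor().long().clamp(max=max_pos - 2)  # non-differentiable index (.long() breaks gradient)
frac = (p - lo.float()).unsqueeze(-1)         # differentiable weight, keeps gradient flowing

e = (1 - frac) * pe_table[lo] + frac * pe_table[lo + 1]   # interpolated PE, shape (S, d)
# `e` would then be added to the queries/keys before running standard (flash) attention.
```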

  • @DataScienceGuy
    @DataScienceGuy • 2 months ago

    Great job! Very clear explanation indeed. What do you think: why does the ControlNet approach work worse with the SDXL model?

  • @darshanparsana7890
    @darshanparsana7890 • 2 months ago

    There is a correction: Q comes from the decoder while K and V come from the encoder, at the 51:36 timestamp. Ref: "Attention Is All You Need" paper -> Section 3.2.3 -> first bullet point.

  • @bindiberry6280
    @bindiberry6280 • 2 months ago

    Has anybody used this to do Cantonese?!

    • @gabrielmongaras
      @gabrielmongaras • 2 months ago

      I think they trained it only on English, but there's probably a large enough dataset to create one for Cantonese.

  • @ahamuffin4747
    @ahamuffin4747 • 2 months ago

    Hello, thanks for the explanations! Just a few words on the Greek letters: it's "psi", not "phi", here, and the "rho"_theta you mention is actually a "tau".

  • @objio
    @objio • 3 months ago

    Summary: while current networks work as universal approximators (probability machines), Kolmogorov networks allow you to leverage a more deterministic function. End point: the axiom of the universal approximator as a pillar of current networks could be replaced by networks of variable geometry. Think about the difference between searching or decoding information in 1 dimension versus adding a 2nd dimension: a data point vs a line of data. Like going from 480p to 8K HD resolution.

  • @SadmanAhmedShanto
    @SadmanAhmedShanto • 3 months ago

    Keep up the good work, man!

  • @eddiehead9794
    @eddiehead9794 • 3 months ago

    Wow! The extractability of solutions really catches my eye.

  • @mateuszk3210
    @mateuszk3210 • 3 months ago

    Interesting paper. I am having a similar problem: I am training on sequences of length 300 that I need to extend to 1000, and I am using RoPE. Do you know if this interpolation can be used with RoPE, or should I look into something like ALiBi? I recall reading that ALiBi also has some issues and its accuracy is worse. There is also LongRoPE.

  • @-slt
    @-slt • 3 months ago

    The constant movement of the screen makes my head (and surely many others') explode. Please move a little less and zoom in and out less; it helps the viewer focus on the text and your explanation. Thanks. :)

    • @gabrielmongaras
      @gabrielmongaras • 3 months ago

      Thanks for the feedback! Will keep this in mind next time I'm recording

  • @acasualviewer5861
    @acasualviewer5861 • 3 months ago

    Is it slow to train like LSTMs and RNNs are? A major benefit of Transformers is faster parallelized training. I would assume xLSTMs would be constrained by their sequential nature.

    • @gabrielmongaras
      @gabrielmongaras • 3 months ago

      Yep, should still be slow to train. I don't see any way to make one of the cells into something parallel like a transformer since the cells are so complicated.