Gabriel Mongaras
Deterministic Image Editing with DDPM Inversion, DDIM Inversion, Null Inversion and Prompt-to-Prompt
Null-text Inversion for Editing Real Images using Guided Diffusion Models: arxiv.org/abs/2211.09794
An Edit Friendly DDPM Noise Space: Inversion and Manipulations: arxiv.org/abs/2304.06140
Prompt-to-Prompt Image Editing with Cross Attention Control: arxiv.org/abs/2208.01626
00:00 Intro
01:24 Current image editing techniques
11:42 Deriving DDPM and DDIM
23:08 DDIM inversion
32:46 Null inversion
47:15 DDPM inversion
1:01:18 Prompt-to-prompt
1:10:52 Conclusion
587 views

Videos

Attending to Topological Spaces: The Cellular Transformer
560 views • 1 month ago
Paper here: arxiv.org/abs/2405.14094 Notes: drive.google.com/file/d/12g_KkHqXD6mEDILJzYbCC08i8cDHITfC/view?usp=drive_link 00:00 Intro 01:39 Cellular complexes 07:26 K-cochain 13:26 Defining structure on the cell 20:28 Cellular transformer 34:18 Positional encodings and outro
Learning to (Learn at Test Time): RNNs with Expressive Hidden States
1.9K views • 1 month ago
Paper here: arxiv.org/abs/2407.04620 Code!: github.com/test-time-training/ttt-lm-pytorch Notes: drive.google.com/file/d/127a1UBm_IN_WMKG-DmEvfJ8Pja-9BwDk/view?usp=drive_link 00:00 Intro 04:40 Problem with RNNs 06:38 Meta learning and method idea 09:13 Update rule and RNN inner loop 15:07 Learning the loss function outer loop 21:21 Parallelizing training 30:05 Results
WARP: On the Benefits of Weight Averaged Rewarded Policies
683 views • 1 month ago
Paper here: arxiv.org/abs/2406.16768 Notes: drive.google.com/file/d/11UK7mEZwNVUMYuXwvOTfaqHhN8zSYm5M/view?usp=drive_link 00:00 Intro and RLHF 17:30 Problems with RLHF 21:08 Overview of their method 23:47 EMA 28:00 Combining policies with SLERP 37:34 Linear interpolation towards initialization 40:32 Code 44:16 Results
CoDeF: Content Deformation Fields for Temporally Consistent Video Processing
711 views • 1 month ago
Paper: arxiv.org/abs/2308.07926 Paper page: qiuyu96.github.io/CoDeF/ Code: github.com/qiuyu96/CoDeF My notes: drive.google.com/file/d/10PMKdd5XBd6Y60HlRB9IW9naR2bWziDT/view?usp=drive_link 00:00 Intro 03:00 Method overview 08:40 Method details 15:24 Tricks done for training and how to actually train this thing 19:24 Flow loss and masking 25:10 Conclusion
Mamba 2 - Transformers are SSMs: Generalized Models and Efficient Algorithms Through SSS Duality
6K views • 2 months ago
Paper here: arxiv.org/abs/2405.21060 Code!: github.com/state-spaces/mamba/blob/main/mamba_ssm/modules/mamba2.py Notes: drive.google.com/file/d/1 XGPFeXQyx4CPxgYjzR4qrLd-baLWQC/view?usp=sharing 00:00 Intro 01:45 SSMs 08:00 Quadratic form of an SSM 15:02 Expanded form of an SSM 24:00 Attention - it's all you need?? 29:55 Kernel attention 32:50 Linear attention 34:32 Relating attention to SSMs 38:...
CoPE - Contextual Position Encoding: Learning to Count What's Important
1.2K views • 2 months ago
Paper: arxiv.org/abs/2405.18719 My notes: drive.google.com/file/d/1y9VHZc7MLqc6t2SHHdlVTYeW3czmmRbl/view?usp=sharing 00:00 Intro 02:44 Background 09:58 CoPE 24:50 Code 32:16 Results
NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models
742 views • 2 months ago
Paper: arxiv.org/abs/2403.03100 Demo: speechresearch.github.io/naturalspeech3/ Code: huggingface.co/spaces/amphion/naturalspeech3_facodec My notes: drive.google.com/file/d/1xnzErd_86B6eLwqpLckhoEQKqkxFPyM_/view?usp=drive_link 00:00 Intro 05:34 Architecture overview 18:45 GRL and subspace independence 24:45 Discrete diffusion Model 41:00 factorized diffusion model 44:00 Conclusion and results
xLSTM: Extended Long Short-Term Memory
1.8K views • 3 months ago
Paper: arxiv.org/abs/2405.04517 My notes: drive.google.com/file/d/1wFYvU_1oUWcCNuQ91zTpSGAeNUsPjlt3/view?usp=drive_link 00:00 Intro 05:44 LSTM 13:38 Problems paper addresses 14:12 sLSTM 23:00 sLSTM Memory mixing 27:08 mLSTM 35:14 Results and stuff
KAN: Kolmogorov-Arnold Networks
54K views • 3 months ago
Paper: arxiv.org/abs/2404.19756 Spline Video: m.czcams.com/video/qhQrRCJ-mVg/video.html My notes: drive.google.com/file/d/1twcIF13nG8Qc10_qeDqCZ4NaUh9tFsAH/view?usp=drive_link 00:00 Intro 00:45 MLPs and Intuition 05:12 Splines 19:02 KAN Formulation 28:00 Potential Downsides to KANs 32:09 Results
LADD: Fast High-Resolution Image Synthesis with Latent Adversarial Diffusion Distillation
784 views • 3 months ago
Paper: arxiv.org/abs/2403.12015 My notes: drive.google.com/file/d/1s1-nnWR_ZR26PNSAoZR1Xj3nuD9UZlvR/view?usp=sharing 00:00 Intro 01:31 Diffusion Models 08:08 Latent Diffusion Models 10:04 Distillation 12:02 Adversarial Diffusion Distillation (ADD) 17:06 Latent Adversarial Diffusion Distillation (LADD) 22:20 Results
Visual AutoRegressive Modeling: Scalable Image Generation via Next-Scale Prediction
1.7K views • 4 months ago
Paper: arxiv.org/abs/2404.02905 Demo: var.vision/ Code: github.com/FoundationVision/VAR My notes: drive.google.com/file/d/1qym3JG-0xqEgQhdvkt9N17o-ZzUWy2sn/view?usp=drive_link 00:00 Intro 00:53 DiTs 04:06 Autoregressive Image Transformers 06:23 Tokenization problem with AR ViTs 08:43 VAE 10:47 Discrete Quantization - VQGAN 16:42 Visual Autoregressive Modeling 21:31 Causal Inference with VAR 24:...
Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention
3.5K views • 4 months ago
Paper: arxiv.org/abs/2404.07143 My notes: drive.google.com/file/d/1plWJDwHTZkRK9PDdvaLMnZjFR6fVvNLH/view?usp=drive_link 00:00 Intro 07:17 Model intuition 11:00 Memory retrieval operation 16:29 Hidden state updates 21:58 Delta update 24:10 Is it causal? 25:26 Combining local attention and RNN 27:26 Results 30:25 Sampling and Conclusion
Mixture-of-Depths: Dynamically allocating compute in transformer-based language models
1.9K views • 4 months ago
Paper: arxiv.org/abs/2404.02258 My notes: drive.google.com/file/d/1o4v5te1yfuK_FQPvvS8SR55Sysg04dYK/view?usp=drive_link 00:00 Intro 06:02 Mixture of Experts (MoE) 15:12 Mixture of Depths (MoD) 17:04 The gradients must flow! 22:40 Autoregressive Sampling 33:58 Results
Q* AGI Achieved (Apr Fools)
777 views • 4 months ago
Q* paper link: link.springer.com/content/pdf/10.1007/BF00992698.pdf April fools 😏
Stable Diffusion 3: Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
3.2K views • 4 months ago
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
965 views • 5 months ago
The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits and BitNet
5K views • 5 months ago
DoRA: Weight-Decomposed Low-Rank Adaptation
1.8K views • 5 months ago
OpenAI Sora and DiTs: Scalable Diffusion Models with Transformers
11K views • 6 months ago
A Decoder-only Foundation Model For Time-series Forecasting
3.4K views • 6 months ago
Lumiere: A Space-Time Diffusion Model for Video Generation
640 views • 6 months ago
Exphormer: Sparse Transformers for Graphs
408 views • 6 months ago
Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads
1.4K views • 6 months ago
Boundary Attention: Learning to Find Faint Boundaries at Any Resolution
455 views • 7 months ago
Cached Transformers: Improving Transformers with Differentiable Memory Cache
835 views • 7 months ago
Translatotron 3: Speech to Speech Translation with Monolingual Data
760 views • 7 months ago
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
9K views • 8 months ago
Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference
1.9K views • 8 months ago
Adversarial Diffusion Distillation
1.7K views • 8 months ago

Comments

  • @szebike
    @szebike • 5 days ago

    Technically impressive that it's possible, but I only see limited applications for this. Practically, most models below 8-bit quantization are far less "aware" of input and context. If I alter a situation in an 8-bit model, it can adjust its output accordingly; any model below that is very rigid. That said, maybe you can mitigate those effects if you train at that quantization to begin with, rather than compressing a model that was trained at higher precision, if that makes sense.

  • @HenryDeng-o2x
    @HenryDeng-o2x • 13 days ago

    34:00 Wait, how? I don't think "(L ∘ QK)V" is equal to "Q*cumsum(KV)"

    • @gabrielmongaras
      @gabrielmongaras • 13 days ago

      If L is just the causal mask, then we can write the equation as Q @ cumsum(K^T @ V). This is because at each time t the query is unique and is multiplied by the sum of all the past keys and values. For example:
      O1 = Q1 @ K1^T @ V1 = Q1 @ [ K1^T @ V1 ]
      O2 = Q2 @ K1^T @ V1 + Q2 @ K2^T @ V2 = Q2 @ [ K1^T @ V1 + K2^T @ V2 ]
      The cumsum operation sums all the previous elements from timestep 1 to t, which is exactly the relationship above. It's just a compact way for me to write this recurrence in one equation.
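
A quick numerical sanity check of this equivalence (a minimal PyTorch sketch, not code from the paper or video; sizes are made up):

```python
import torch

S, d = 5, 4                               # sequence length, head dimension (arbitrary)
Q, K, V = torch.randn(S, d), torch.randn(S, d), torch.randn(S, d)

# Masked quadratic form: (L ∘ (Q @ K^T)) @ V, with L the lower-triangular causal mask
L = torch.tril(torch.ones(S, S))
out_masked = (L * (Q @ K.T)) @ V

# Recurrent / cumsum form: O_t = Q_t @ sum_{s <= t} K_s^T V_s
KV = torch.einsum('sd,se->sde', K, V)     # per-step outer products K_s^T V_s
out_cumsum = torch.einsum('sd,sde->se', Q, KV.cumsum(dim=0))

print(torch.allclose(out_masked, out_cumsum, atol=1e-5))  # True
```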

    • @HenryDeng-o2x
      @HenryDeng-o2x • 12 days ago

      @@gabrielmongaras Got it, and I kinda see the relationship between the causal mask and cumsum() now. Thank you :D

  • @user-fd5qz1wo7e
    @user-fd5qz1wo7e • 14 days ago

    How is the equality in DDPM established at 17:49?

    • @gabrielmongaras
      @gabrielmongaras • 13 days ago

      Looks like I forgot to write out the square root over the first term. As for the inner term that got turned into a fraction, I just multiplied sqrt{1-a_t} by the fraction sqrt{1-a_t}/sqrt{1-a_t}.
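
Written out, the fraction step described above is just a multiplication by 1 (my reconstruction of the algebra, using \(\bar{\alpha}_t\) for the cumulative noise schedule):

```latex
\sqrt{1-\bar{\alpha}_t}\,\epsilon
  = \sqrt{1-\bar{\alpha}_t}\cdot\frac{\sqrt{1-\bar{\alpha}_t}}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon
  = \frac{1-\bar{\alpha}_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon
```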

  • @AarreLisakki
    @AarreLisakki • 17 days ago

    A tiny correction: it's phOnEme, not, as you say, phEnOme.

    • @gabrielmongaras
      @gabrielmongaras • 13 days ago

      Thanks for letting me know! I have no idea where I got phOnEme from ha!

  • @TheCrmagic
    @TheCrmagic • 18 days ago

    Isn't the recurrence A_t @ h_(t-1)? So A_1 should be multiplied with h_0, and so on?

    • @gabrielmongaras
      @gabrielmongaras • 13 days ago

      It depends on how you define the hidden state. Since I'm defining it as h_t = h_{t-1} + K_t^T @ V_t, the output is defined at timestep t, not t-1. This means that the output at time t takes into account all past information and the current token. If the relation were on t-1, it would only consider previous information.
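
As a tiny sketch of that indexing convention (my own illustration, not code from the paper): the hidden state is updated with the current token's key/value before the output is read out, so the output at time t already includes the current token.

```python
import torch

S, d = 5, 4
Q, K, V = torch.randn(S, d), torch.randn(S, d), torch.randn(S, d)

h = torch.zeros(d, d)                  # hidden state h_0
outputs = []
for t in range(S):
    h = h + torch.outer(K[t], V[t])    # h_t = h_{t-1} + K_t^T V_t  (update first)
    outputs.append(Q[t] @ h)           # output at time t reads the updated state h_t
```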

  • @coc2912
    @coc2912 • 19 days ago

    Good video; we need this kind of advanced paper explanation.

  • @WesleyWilliams-x9y
    @WesleyWilliams-x9y • 22 days ago

    You did an excellent job in your explanations! I can tell you recently had to grasp the topic since it's pretty new! I'm curious how you think the architecture could be extended to larger problems.

  • @EkShunya
    @EkShunya • 22 days ago

    Great one, really liked it. Thanks!

  • @AM-yk5yd
    @AM-yk5yd • 23 days ago

    My only topology knowledge comes from "Experiments in Topology" by Barr, where one chapter was something like a court case about making holes, using them, and repairing them, or something like that. The book was written in the 1960s and I read it when I was a teen in the early 2000s. It covered only the very basics of topology but was quite fun even for teenage me, as it had fun exercises.

  • @jahovajenkins5947
    @jahovajenkins5947 • 27 days ago

    Huge thanks dude, this explanation has saved me much agony.

  • @farafr46
    @farafr46 • 27 days ago

    Thanks for the video, I didn't understand a thing.

  • @thisismambonumber5
    @thisismambonumber5 • 28 days ago

    This video was so cool, thanks ❤ I know it's diffusion-model specific, but it would be cool if you could explain LyCORIS like this.

  • @palfers1
    @palfers1 • 1 month ago

    I cannot listen to you because you continually interrupt yourself. Practise speaking.

  • @nozulani
    @nozulani • 1 month ago

    eli5?

    • @gabrielmongaras
      @gabrielmongaras • 1 month ago

      Sure! Typical graph data takes on a node-edge form, where the data is assigned to each node, with connections via edges. This notion of a graph can be generalized by considering a graph with faces formed by a set of connected edges, for example a triangle or a square. These graphs are called "cell complexes", as each face kind of looks like a cell. Instead of data being put only on vertices, we can also assign data to edges and faces. This makes the data representation and the model more expressive, since we look at higher-order connections on faces rather than only on edges. The authors build a transformer utilizing the properties of these higher-order structures and find good results on a few datasets.
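
As a rough illustration of "data on vertices, edges, and faces" (a toy, made-up representation, not the paper's code), a single filled triangle could be stored as:

```python
import torch

# Toy cell complex: one filled triangle. Feature vectors live on 0-cells (vertices),
# 1-cells (edges), and 2-cells (faces), not just on vertices as in a plain graph.
d = 8
triangle_complex = {
    "vertices": {v: torch.randn(d) for v in ["a", "b", "c"]},
    "edges":    {e: torch.randn(d) for e in [("a", "b"), ("b", "c"), ("a", "c")]},
    "faces":    {("a", "b", "c"): torch.randn(d)},
}
# A cellular transformer can then attend between cells that share a boundary,
# e.g. an edge attending to the face(s) it borders.
```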

    • @nozulani
      @nozulani • 1 month ago

      @@gabrielmongaras Thanks, that wasn't exactly ELI5, but I understand this is a difficult topic.

    • @gabrielmongaras
      @gabrielmongaras • 1 month ago

      Is there anything that's particularly confusing?

    • @nozulani
      @nozulani • 1 month ago

      @@gabrielmongaras Thanks, I don't want to take up your time; I'm going to study this myself.

  • @KaiSyunHou
    @KaiSyunHou • 1 month ago

    I understand that the inner loop updates W_t with the modified reconstruction loss, but then how are theta_K, theta_V, and theta_Q updated in the outer loop? Specifically, what loss is related to these three parameters?

    • @gabrielmongaras
      @gabrielmongaras • 1 month ago

      The "outer loop" uses the normal training strategy where we have the negative log likelihood or cross entropy loss for next token prediction. This gradient of this loss backpropogated to all layers like normal. So the "outer loop" is just the rest of the network.

    • @KaiSyunHou
      @KaiSyunHou • 1 month ago

      @@gabrielmongaras But shouldn't the normal cross entropy loss for next token prediction incur no gradient flow to theta_K and theta_V? They are for the reconstruction loss, so they are not involved in token prediction.

    • @gabrielmongaras
      @gabrielmongaras • 1 month ago

      The theta params are trained via the next token prediction loss. This can be thought of as the outer model querying the inner model for information by changing the loss function with these theta params. The outer loop actually differentiates through the inner loop (including the gradient of the loss), so the theta_K and theta_V params are updated by the outer loop in this way. The only param that's "trained" using the inner loop is the hidden state.
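
A minimal sketch of that gradient flow (a toy example of my own, not the paper's code; names follow the thread above): one differentiable inner-loop step lets an outer next-token-style loss reach theta_K and theta_V.

```python
import torch

d = 16
theta_K = torch.randn(d, d, requires_grad=True)   # outer-loop parameters
theta_V = torch.randn(d, d, requires_grad=True)
theta_Q = torch.randn(d, d, requires_grad=True)
W = torch.zeros(d, d, requires_grad=True)         # inner-loop hidden state
x = torch.randn(d)                                # current token's features

# Inner loop: one gradient step on the reconstruction loss, kept differentiable
# w.r.t. theta_K / theta_V via create_graph=True.
k, v = theta_K @ x, theta_V @ x
recon_loss = ((W @ k - v) ** 2).sum()
grad_W = torch.autograd.grad(recon_loss, W, create_graph=True)[0]
W_new = W - 0.1 * grad_W

# Outer loop: a stand-in for the next-token loss, computed from the updated state.
out = W_new @ (theta_Q @ x)
outer_loss = (out ** 2).sum()
outer_loss.backward()
print(theta_K.grad is not None, theta_V.grad is not None)  # True True
```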

  • @adidevbhattacharya9220
    @adidevbhattacharya9220 • 1 month ago

    That was indeed a great explanation. Can you please explain how we get the hidden dimension d at 28:19? For example, if the image in latent space is 128x128x3 and we consider a patch size of 32, then the number of tokens = (128/32)^2 = 16. Is the number of dimensions (d) then p^2 = 32^2? Please clarify this.

  • @lexer_
    @lexer_ • 1 month ago

    I really like seeing videos like this that are actually willing to explore the math and break it down to make it easier to catch onto how it actually works. Too many just discuss the abstract or skip past the math because it hampers audience maximization.

  • @gabrielmongaras
    @gabrielmongaras • 1 month ago

    I want to point out that after training, the inner loop in the RNN still needs to be "optimized", since the gradient update rule is the RNN update rule itself. The normal NLL training procedure just trains the model on how to "teach" the internal RNN layers. This isn't normal optimization though: it doesn't require an optimizer, since we know the functional form of the gradient w.r.t. the hidden state, so it's very efficient.

  • @hcp-s4k
    @hcp-s4k • 1 month ago

    Does each pixel in the image have a box? For a pixel, the paper says that the vertex is the center of the pixel, but the video says that the vertex can be outside the box. I don't quite understand. I hope there will be a more detailed description of the training details.

  • @hi_6546
    @hi_6546 • 1 month ago

    Hey, if the frame rate is equal to 50*64, does that mean that each second the model needs to predict 50*64 values (without counting the dim)?

    • @hi_6546
      @hi_6546 • 1 month ago

      The model works with the indices of the codebook, no?

    • @nickackerman8755
      @nickackerman8755 • 20 days ago

      Where do you get the 64? If we have e.g. 4 codebooks, and there are 50 time steps per one second of audio, we need to predict 4 * 50 values per second (4 codebook indices per time step), I believe.

  • @lexer_
    @lexer_ • 1 month ago

    There is a strange disconnect between the kinds of concepts you explain here. Some of them are at the absolute-beginner level, as if you have never coded before and don't know anything about machine learning, but the majority of the video requires at least several months of crash course, if not years of experience in the field, to understand. I think your videos could benefit from a more consistent assumption of the audience's competence. Or maybe approach it in sections for different audience competencies? I really appreciate these walk-throughs of papers. It makes it so much easier for me to concentrate than only reading on my own, so I really want anyone who does this to succeed.

    • @gabrielmongaras
      @gabrielmongaras • 1 month ago

      Thanks for the feedback!! Usually I try to keep most papers I cover a little more technical, as I'd otherwise find myself explaining things like transformers over and over again, which could lead to videos being unnecessarily long (they already feel way too long; I've been trying to reduce lengths). Some videos, like the Stable Diffusion one, I try to make a little more beginner friendly, assuming knowledge of CNNs, MLPs, and basic training. I should probably communicate this better and make a clear distinction between the two, perhaps in the title somewhere. With this video, I wanted it to be somewhere in the middle, which led to problems. I like the idea of approaching it in sections based on knowledge level. For example, I could've explained REINFORCE a little or pointed to a resource about it. Those who know it could skip that part, and those who don't could watch it or go to the resource. Will think about this some more for future videos!

    • @lexer_
      @lexer_ • 1 month ago

      @@gabrielmongaras I personally got annoyed at the very beginner-level explanations early on, which felt like they took forever, and I feel lucky that I stuck it out long enough to actually make it to the interesting parts later on. For a while I thought that was really all there would be in this video: explanations of some very basic concepts. But that is of course very different depending on where one might be coming from in terms of familiarity with these topics. I will watch out for future videos of yours for sure.

    • @gabrielmongaras
      @gabrielmongaras • 1 month ago

      @@lexer_ Got it! I didn't realize the intros were annoying 😅 Thanks again for your feedback, it's really helpful! From now on, I think I will directly mention when I am explaining a basic concept, say to skip it if one already knows it, and provide resources instead of explaining concepts I assume most should know. In this video, I could've probably just skipped the RLHF explanation altogether and pointed to one of the many explanations of RLHF. I'm going to experiment with this a little in future videos!

    • @codylane2104
      @codylane2104 • 1 month ago

      @@gabrielmongaras Yeah, pointing to good basic level explanatory material is a great idea! There's a ton of materials on the net, yet only a fraction is really worth studying. 🙂

  • @SOFTWAREMASTER
    @SOFTWAREMASTER • 1 month ago

    Great video. Thank you for covering this! Could you also have a look at the 'Terminator' paper and make a video on that? It looks like it's a new architecture.

    • @gabrielmongaras
      @gabrielmongaras • 1 month ago

      Will take a look at it! Not sure if I'll make a video on it yet though. I usually do if the paper proposes something really interesting or has really good results!

  • @nicEDITS_
    @nicEDITS_ • 1 month ago

    Really helpful, thanks!

  • @hakikatsingh6254
    @hakikatsingh6254 • 1 month ago

    dayummmmmmmm, pretty awesome dude

  • @yadav-r
    @yadav-r • 1 month ago

    I do not understand it, but can we use it for stock market forecasting or sports forecasting? How do I use the tool? Is there a tutorial for it? Thank you.

  • @PhucNguyenXuan-cn8rg
    @PhucNguyenXuan-cn8rg • 2 months ago

    Thank you so much for your clear explanation!

  • @danieloh4511
    @danieloh4511 • 2 months ago

    Great video. BTW, what is the note-taking app in the video?

    • @gabrielmongaras
      @gabrielmongaras • 2 months ago

      Thanks! I'm using the Samsung notes app. Pretty standard, but I like the style and ease of use of the app.

  • @acasualviewer5861
    @acasualviewer5861 • 2 months ago

    If they can convert between Transformers and Mamba2, why not convert GPT2 (which has published weights) to Mamba2 and prove that the converted model performs as well as the original? Or do the same with a bigger model such as Llama? Or does their method not allow importing Transformer weights into the equivalent Mamba2?

    • @gabrielmongaras
      @gabrielmongaras • 2 months ago

      GPT2 and models that use the original transformer architecture use softmax attention, which cannot be decomposed into an RNN or SSM due to the nonlinear dependence of the softmax function on the entire row. This forces you to compute an entire row in memory to get the output for that row. However, if you remove that dependence with the kernel trick or a similar method, such as using ReLU on the queries and keys, then you can decompose the attention mechanism into an RNN/SSM. This is because you no longer have to compute an entire row of the attention matrix to get the output value; rather, it's just a sum of values along that row, giving the RNN formulation. So original softmax transformer models can't be changed into an RNN/SSM, meaning they have to stay quadratic (unless flash attention is used, but that is just a CUDA trick). This is why I think we should move away from softmax attention. Changing to an SSM architecture or linear attention is way more efficient, and papers such as this show that linear methods are comparable to naive transformers in terms of accuracy.

    • @acasualviewer5861
      @acasualviewer5861 • 2 months ago

      @@gabrielmongaras Thank you for responding and for your very valuable videos. You are by far my favorite "paper explainer" channel. Back to the question: given that Transformers cannot be converted to Mamba2, I'm not sure it's convincing to say that Mamba2 is equivalent to GPT as the paper claims. If it is, then the proof should be in the results, and it should be able to compete with the state of the art. Though I get your point about softmax. Frankly I find it confusing that softmax counts as a non-linearity. But in the end, results should speak for themselves.

    • @gabrielmongaras
      @gabrielmongaras • 2 months ago

      Thank you! Glad you're finding my videos helpful! I think it may be better to say that SSMs, GPT, and RNNs are very similar, in that changing the nonlinearity in GPT and changing the A block in SSMs result in an equivalent RNN. This RNN formulation is what Mamba 2 is. The softmax "nonlinearity" also confuses me. If you try replacing it with other nonlinearities, you'll find that for some reason it has crazy good properties when used on the attention matrix that other nonlinearities don't have, giving very good accuracy. It also has the property that the values sum to 1, which is very nice for stability reasons. I'd be very happy to see something replace it, as it feels very naive to just slap a softmax onto a matrix. I think the authors make a convincing argument with their results. Since other papers have shown that a linear attention block can be comparable to softmax attention in terms of accuracy, Mamba 2 is pretty believable. They also develop a lot of theory and release code, making me believe this paper's claims more. I guess we'll see how Mamba 2 compares when tested on large-scale LLM cases.

    • @acasualviewer5861
      @acasualviewer5861 • 2 months ago

      @@gabrielmongaras Softmax seems to compare values, though layer norm does something similar. Maybe the comparison is what makes it special. With the output of softmax we're always creating a weighted sum of sorts, like an interpolation of vectors. ReLU seems incapable of doing the same.

  • @elon-69-musk
    @elon-69-musk • 2 months ago

    awesome 👍😎

  • @Ryu-ix8qs
    @Ryu-ix8qs • 2 months ago

    Thanks for the video. Very helpful to understand how it works.

  • @gabrielmongaras
    @gabrielmongaras • 2 months ago

    Sorry the audio got messed up on this one. Got a new mic and it acts a bit weird with my tablet sometimes. Tried to fix it in post processing, but wasn't able to get the wobbling sound out. Maybe I should just train an audio algorithm that takes my garbage audio in and spits out professional audio via consistency loss, ha!

  • @noadsensehere9195
    @noadsensehere9195 • 2 months ago

    How can I implement this paper?

  • @einsteinsapples2909
    @einsteinsapples2909 • 2 months ago

    Don't you have the dimensions wrong at 2:30? Shouldn't the Q, K, V matrices be Tokens x Channels, with each row representing a single token?

    • @gabrielmongaras
      @gabrielmongaras • 2 months ago

      I usually like to transpose the matrices when drawing the QKV matrices out for attention. I feel like a sequence going left to right rather than top to bottom is more intuitive, but idk.

    • @einsteinsapples2909
      @einsteinsapples2909 • 2 months ago

      @@gabrielmongaras Interesting. For me it's more intuitive to have a token per line, probably because I think of it as a Python list. A matrix, to me, is a list of lists.

    • @gabrielmongaras
      @gabrielmongaras • 2 months ago

      @@einsteinsapples2909 I'll keep this in mind for future videos! Idk why, but seeing a token be vertical is the way I've always drawn it. I guess as long as the shape is written out correctly, it's fine.

    • @einsteinsapples2909
      @einsteinsapples2909 • 2 months ago

      @@gabrielmongaras Hi Gabriel, I don't know about your background, whether you're a student or not. I personally have never formally studied any of these topics, so I don't know what the conventions are (for all I know, what you're doing is the norm). I know in the "Attention Is All You Need" paper they write the function as QK^T (they transpose the keys matrix), the same way you write it in your video. The only way you can do QK^T and end up with an "S by S" matrix is if the Q and K matrices are S x d (each row represents a token). If you want the Q, K, V matrices to be "d x S" (columns representing tokens), then you should do K^T Q instead to get the attention scores.

    • @gabrielmongaras
      @gabrielmongaras • 2 months ago

      @@einsteinsapples2909 I don't have a degree in ML, mostly just self-study. I think the notation you're suggesting is correct from a linear algebra perspective, meaning I drew the keys right but need to flip the queries/values. Nonetheless, Q, K, and V are (S, d) regardless of how they're drawn out, or else the attention formula wouldn't work. I guess it's confusing when I draw the tokens as column vectors 😅
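
A quick shape check of the "tokens as rows" convention discussed here (a throwaway sketch; the sizes are made up):

```python
import torch

S, d = 4, 8                        # sequence length, head dimension
Q = torch.randn(S, d)
K = torch.randn(S, d)
V = torch.randn(S, d)

scores = Q @ K.T                   # (S, S) attention scores
out = scores.softmax(dim=-1) @ V   # (S, d), one output row per token
print(scores.shape, out.shape)     # torch.Size([4, 4]) torch.Size([4, 8])
```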

  • @terenceAAA
    @terenceAAA • 2 months ago

    The paper says they get N next tokens from N heads using the hidden state h_t at position t, but I just don't understand the arrow from Head 1 to Head 2 in Figure 2. The distribution of tokens from Head 2 is never determined by the output of Head 1. I just can't see the connection between Head 1 and Head 2.

  • @KevinInPhoenix
    @KevinInPhoenix • 2 months ago

    Does this technology make Nvidia's tech and NPUs obsolete?

  • @jonatan01i
    @jonatan01i • 2 months ago

    The .log on the mask is there to make the attn_logits that are to be masked negative infinity; that's how you make them disappear to 0 in the attn.

    • @gabrielmongaras
      @gabrielmongaras • 2 months ago

      Ah, makes sense: it turns a binary mask into a mask of -inf and 0. Usually I just pass a mask of -inf and 0 through the entire model.
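
A small illustration of the point above (not the repo's code): taking .log() of a {0, 1} mask gives {0, -inf}, which can be added to the attention logits before the softmax so masked positions get zero weight.

```python
import torch

mask = torch.tensor([[1., 0.],
                     [1., 1.]])             # binary keep/mask pattern
attn_logits = torch.randn(2, 2)

attn = (attn_logits + mask.log()).softmax(dim=-1)
print(mask.log())                           # [[0., -inf], [0., 0.]]
print(attn)                                 # row 0 puts all of its weight on position 0
```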

    • @jonatan01i
      @jonatan01i • 2 months ago

      @@gabrielmongaras btw, thanks for the upload!:)

  • @jonatan01i
    @jonatan01i • 2 months ago

    It's not true that they don't have positional encoding "at all", it's very far from true. Every token only gets access to what came before, and that's a lot of info available to rely on.

    • @gabrielmongaras
      @gabrielmongaras • 2 months ago

      It still doesn't have information about position though. Position tells the model where in the sequence a certain token is, differentiating the same token at different positions in the sequence. Without positional encodings, the model operates on sets of information, not sequences.

    • @jonatan01i
      @jonatan01i • 2 months ago

      @@gabrielmongaras Yes, I know what you mean, but the elements are not independent of position; they depend heavily on what comes before them in the sequence. If you switch two tokens' places, the tokens before the leftmost of the two switched tokens remain the same, but from that point on in the sequence everything changes. In some sense they have contextual positional encodings built into them (given the future masking; for a vanilla encoder it doesn't work, that one has no position information at all. I agree with that, but not for the future-masked case).

    • @jonatan01i
      @jonatan01i • 2 months ago

      @@gabrielmongaras If we have a transformer model with, let's say, 5 tokens and no additional positional encodings, and its sole task is to be given a random permutation of these 5 tokens twice (the same permutation both times), then essentially its task is to copy the first five tokens one after the other in the same order. If you believe a transformer (with future mask!) has no positional information at all, then you realise you are suggesting that this task is impossible for the transformer to learn. Do you agree with this, or are we talking about two different things?

    • @gabrielmongaras
      @gabrielmongaras • 2 months ago

      Oh yeah, I think I see what you're saying. Without PEs, the transformer cannot distinguish the positions of words; however, it can still get a sense of the position of the word it is currently generating based on the context of the tokens. I suppose it may be easier to look at this through the idea of sets. A normal transformer with PEs essentially operates on an ordered set where each element is unique due to the ordering. Without PEs, the transformer operates on an unordered set of data, but it can still get an idea of the size of the set and maybe also an idea of the "count" of duplicated items in the set.

    • @jonatan01i
      @jonatan01i • 2 months ago

      @@gabrielmongaras and also an idea of [2,3,(1,5,4)][2,3, and now what comes next? How would it know if it was (1,4,5),(1,5,4),(4,1,5),…or(5,4,1)? I get it that the addition operation is commutative and all, but the model will eventually rely on context of previous tokens to get to come up with a useful (relative)positional information for itself because it is possible to do given there is the future mask given to it (if no mask, then the transformer yes becomes commutative truly, you can change the order of the tokens and it won’t make any difference, (given you don’t use convolutions with kernels bigger than 1 and positional embeddings neither)

  • @lienlrac7644
    @lienlrac7644 • 2 months ago

    When Gabriel drops I stop everything and listen

  • @marinepower
    @marinepower • 2 months ago

    I wonder if this method could be improved by having a new projection matrix of size [hidden_dim x 1] that computes the width of each token: take the sigmoid, then the cumulative sum, do the interpolation as described, but add it to the queries and keys and then do normal attention. This requires a new (small) matrix but would allow us to use flash attention directly without needing a new CoPE kernel.

    • @gabrielmongaras
      @gabrielmongaras • 2 months ago

      It makes perfect sense to me that this would work just as well as the CoPE method! The only drawback I see is that CoPE recalculates all positions for each token, while positions in this case would not change for past tokens. However, I'm guessing this recomputation is unneeded and that you would get similar performance to CoPE while being a lot more efficient computationally. Are you thinking of implementing this?

    • @marinepower
      @marinepower • 2 months ago

      @@gabrielmongaras All positions here would be recalculated every layer, so they would affect past tokens. For example, the width for token 1 might be computed as 0.8 in layer 1 and 0.5 in layer 2, etc. Since we're doing the cumulative sum, all positions are updated, since every token relies on token 1 (and this applies to every token in the series). As far as actually implementing this, however, I ran into a small issue. What this method uses (floor, ceil) is non-differentiable. Instead, the embeddings would need to be calculated directly from the cumulative embedding position, which I think is fairly expensive since these embeddings use a lot of sines and cosines, etc. It seems possible; it's just that we might want to simplify the embedding function a bit.

    • @Anonn724
      @Anonn724 • 2 months ago

      Sounds interesting. I'll definitely give implementing this a try. I quickly implemented some CUDA as Gabriel mentioned; if somebody is interested: gh juvi21/CoPE-cuda

    • @gabrielmongaras
      @gabrielmongaras • 2 months ago

      @@marinepower That makes sense, though within the same layer the positions are calculated once, while in CoPE they're recalculated for each token. I don't think this difference will matter much, so this method can probably work just as well! I think a detach can be used on the floor/ceil; the weight, on the other hand, is the differentiable part. For a given token we have a positional value, say p, which is the sum of all positional values up to that token. The positional embedding for that token would then be:
      e = (1 - (p - floor(p))) * PE[floor(p)] + (p - floor(p)) * PE[ceil(p)]
      where PE[i] is the ith absolute positional encoding. In this case, floor(p) and ceil(p) are not differentiable, but p is, so there's still gradient flow. So I think this idea should still work!

    • @marinepower
      @marinepower • 2 months ago

      @@gabrielmongaras Hmm, yeah, I think that works! I am not really training any transformers right now that need variable positional encodings, so I probably won't test this method, but the next time I look into LLMs I'll try this out! (Although... training LLMs from scratch requires a tremendous amount of compute, so I mostly hope someone else notices this comment chain and tries it out lol)
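
A minimal sketch of the interpolation idea from this thread (my reading of the discussion, not CoPE's code; names like width_proj and pe_table are hypothetical, and lo + 1 stands in for ceil):

```python
import torch

S, d, max_pos = 16, 32, 64
x = torch.randn(S, d)                         # token features
width_proj = torch.nn.Linear(d, 1)            # the new [hidden_dim x 1] projection
pe_table = torch.randn(max_pos, d)            # absolute positional embeddings PE[i]

# Fractional positions: sigmoid of per-token widths, then a cumulative sum.
p = torch.sigmoid(width_proj(x)).squeeze(-1).cumsum(dim=0)

lo = p.floor().long().clamp(max=max_pos - 2)  # non-differentiable index (.long() breaks gradient)
frac = (p - lo.float()).unsqueeze(-1)         # differentiable weight, keeps gradient flowing

e = (1 - frac) * pe_table[lo] + frac * pe_table[lo + 1]   # interpolated PE, shape (S, d)
# `e` would then be added to the queries/keys before running standard (flash) attention.
```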

  • @DataScienceGuy
    @DataScienceGuy • 2 months ago

    Great job! Very clear explanation indeed. What do you think: why does the ControlNet approach work worse with the SDXL model?

  • @darshanparsana7890
    @darshanparsana7890 • 2 months ago

    There is a correction: Q comes from the decoder while K and V come from the encoder, at the 51:36 timestamp. Ref: "Attention Is All You Need" paper -> Section 3.2.3 -> first bullet point.

  • @bindiberry6280
    @bindiberry6280 • 2 months ago

    Has anybody used this to do Cantonese?!

    • @gabrielmongaras
      @gabrielmongaras • 2 months ago

      I think they trained it only on English, but there's probably a large enough dataset to create one for Cantonese.

  • @ahamuffin4747
    @ahamuffin4747 • 2 months ago

    Hello, thanks for the explanations! Just a few words on the Greek letters: it's "psi", not "phi", here, and the "rho"_theta you mention is actually a "tau".

  • @objio
    @objio • 3 months ago

    Summary: while current networks work as universal approximators (probability machines), Kolmogorov networks allow you to leverage a more deterministic function. End point: the axiom of the universal approximator as a pillar of current networks could be replaced by networks of variable geometry. Think about the difference between searching or decoding information in 1 dimension versus adding a 2nd dimension: a data point vs a line of data. Like going from 480p to 8K HD resolution.

  • @SadmanAhmedShanto
    @SadmanAhmedShanto • 3 months ago

    Keep up the good work, man!

  • @eddiehead9794
    @eddiehead9794 • 3 months ago

    Wow! The extractability of solutions really catches my eye.

  • @mateuszk3210
    @mateuszk3210 • 3 months ago

    Interesting paper. I am having a similar problem: I am training on sequences of length 300 that I need to extend to 1000, and I am using RoPE. Do you know if this interpolation can be used with RoPE, or should I look into something like ALiBi? I recall reading that ALiBi also has some issues and its accuracy is worse. There is also LongRoPE.

  • @-slt
    @-slt • 3 months ago

    The constant movement of the screen makes my head (and surely many others') explode. Please move a little less and zoom in and out less; it helps the viewer focus on the text and your explanation. Thanks. :)

    • @gabrielmongaras
      @gabrielmongaras • 3 months ago

      Thanks for the feedback! Will keep this in mind next time I'm recording

  • @acasualviewer5861
    @acasualviewer5861 • 3 months ago

    Is it slow to train like LSTMs and RNNs are? A major benefit of Transformers is faster parallelized training. I would assume xLSTMs would be constrained by their sequential nature.

    • @gabrielmongaras
      @gabrielmongaras • 3 months ago

      Yep, should still be slow to train. I don't see any way to make one of the cells into something parallel like a transformer since the cells are so complicated.