LLaMA Pro: Progressive LLaMA with Block Expansion (Paper Explained)

  • Published 19. 06. 2024
  • Note: The H800 is a variant of the H100 for the Chinese market
    OUTLINE:
    0:00 - Introduction
    5:30 - Adding new blocks to LLaMA
    15:00 - Block expansion
    27:40 - Experiments
    30:40 - Conclusion
    Paper: arxiv.org/abs/2401.02415
    Other Paper: proceedings.mlr.press/v162/shen22f/shen22f.pdf
    Abstract:
    Humans generally acquire new skills without compromising the old; however, the opposite holds for Large Language Models (LLMs), e.g., from LLaMA to CodeLLaMA. To this end, we propose a new post-pretraining method for LLMs with an expansion of Transformer blocks. We tune the expanded blocks using only new corpus, efficiently and effectively improving the model's knowledge without catastrophic forgetting. In this paper, we experiment on the corpus of code and math, yielding LLaMA Pro-8.3B, a versatile foundation model initialized from LLaMA2-7B, excelling in general tasks, programming, and mathematics. LLaMA Pro and its instruction-following counterpart (LLaMA Pro-Instruct) achieve advanced performance among various benchmarks, demonstrating superiority over existing open models in the LLaMA family and the immense potential of reasoning and addressing diverse tasks as an intelligent agent. Our findings provide valuable insights into integrating natural and programming languages, laying a solid foundation for developing advanced language agents that operate effectively in various environments.
    Authors: Chengyue Wu, Yukang Gan, Yixiao Ge, Zeyu Lu, Jiahao Wang, Ye Feng, Ping Luo, Ying Shan
    Links:
    Homepage: ykilcher.com
    Merch: ykilcher.com/merch
    YouTube: / yannickilcher
    Twitter: / ykilcher
    Discord: ykilcher.com/discord
    LinkedIn: / ykilcher
    If you want to support me, the best thing to do is to share out the content :)
    If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
    SubscribeStar: www.subscribestar.com/yannick...
    Patreon: / yannickilcher
    Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
    Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
    Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
    Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n
  • Science & Technology

Komentáře (Comments) • 93

  • @YannicKilcher
    @YannicKilcher  5 months ago +53

    Note: The H800 is a variant of the H100 for the Chinese market
    OUTLINE:
    0:00 - Introduction
    5:30 - Adding new blocks to LLaMA
    15:00 - Block expansion
    27:40 - Experiments
    30:40 - Conclusion
    Paper: arxiv.org/abs/2401.02415
    Other Paper: proceedings.mlr.press/v162/shen22f/shen22f.pdf

    • @thegreenxeno9430
      @thegreenxeno9430 5 months ago +1

      Forgetting things properly is far more important than learning new things.

    • @zyxwvutsrqponmlkh
      @zyxwvutsrqponmlkh 5 months ago

      ​@@thegreenxeno9430 It's not the things I don't know that get me in trouble. It's the things I think I know that cause me the most grief.

    • @keypey8256
      @keypey8256 5 months ago

      @@thegreenxeno9430 depends

  • @MultiMojo
    @MultiMojo 5 months ago +53

    Thank you for doing these paper videos! They're far more engaging to watch and learn from than reading the paper itself. There are so many papers in this field that it's difficult to filter out all the noise (or false hype).

    • @barbaragendron2836
      @barbaragendron2836 5 months ago +3

      I completely agree. As a 1st-year PhD student working on LLMs, I often struggle to select interesting papers, since I don't have Yannic's expertise to critique them this sharply myself. That's for sure a major issue in the field.

  • @oncedidactic
    @oncedidactic 5 months ago +9

    I'm starting to feel like Yannic is the Bob Ross of ML

  • @machine_ethics
    @machine_ethics 5 months ago +28

    Totally agree with Yannic: if we affect the flowing data at some early point, then the subsequent cascade of transformer blocks would inevitably diverge the resulting data extremely far from the original (unaffected) transformations. This is some sort of butterfly effect.
    On the other hand, the one thing that probably happened in this experiment is that by retaining a residual connection from the original bottom block to the original upper block (which serves as a bypass path), they forced the weights of the newly added intermediate layers to adapt only to the new knowledge domain, simply because the resulting loss of the whole transformer network is already close to zero on known domains. Thus, the output loss grows only in cases where the original network does not perform well (new-domain data), and that is exactly what forces the new layers to affect only the data that causes big losses (at the backprop step, I mean).
    Those are just my thoughts... In my head, this is the only way that could explain why it should work.
    This paper raises more questions than it answers. IMHO
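    For concreteness, here is a minimal PyTorch sketch of that zero-init mechanism (my own illustration, not the authors' code; the block structure is simplified and the names are made up): the expanded block's output projections start at zero, so through the residual connections the block is an exact identity at initialization.

      import torch
      import torch.nn as nn

      class ExpandedBlock(nn.Module):
          """Simplified decoder block whose residual branches start at zero,
          making the whole block an exact identity mapping at initialization."""
          def __init__(self, d, d_ff):
              super().__init__()
              self.attn = nn.MultiheadAttention(d, num_heads=8, batch_first=True)
              self.w_o = nn.Linear(d, d, bias=False)        # output projection, zeroed
              self.ff_in = nn.Linear(d, d_ff)
              self.ff_out = nn.Linear(d_ff, d, bias=False)  # FFN down-projection, zeroed
              nn.init.zeros_(self.w_o.weight)
              nn.init.zeros_(self.ff_out.weight)

          def forward(self, x):
              a, _ = self.attn(x, x, x, need_weights=False)
              x = x + self.w_o(a)                             # zeroed branch -> x unchanged at init
              x = x + self.ff_out(torch.relu(self.ff_in(x)))  # same here
              return x

      x = torch.randn(2, 16, 512)
      block = ExpandedBlock(512, 2048)
      assert torch.allclose(block(x), x)   # identity at initialization

    At initialization the expanded model therefore computes exactly the same function as the frozen original; only the gradient updates applied to the new blocks can move it away from that.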

    • @corgirun7892
      @corgirun7892 5 months ago +2

      good insight

    • @DeruwynArchmage
      @DeruwynArchmage 5 months ago +1

      Thanks for your comment. It’s helpful.

    • @mkamp
      @mkamp 5 months ago +1

      I can follow your thinking (a need for new capabilities creates the largest gradients) to the point that it can effect change in the sense of new capabilities. But because you do not mix in some of the old examples from pre-training, there is nothing keeping the model from finding a change for this new capability that also, accidentally, affects existing capabilities. And because we don't have old examples, we won't be able to prevent that, or even see that the old capabilities have been overwritten.
      I think the only way to preserve the existing capabilities would be to sample ("mix in", as Yannic calls it) from the previous training runs' data (pre-training and maybe domain adaptation) to prevent catastrophic forgetting. I suspect, though, that this is expensive. After the model has learned something during pre-training, it may be fine to take only, say, 5% of the original data. But that would be 5% of the original 10 TB, which outweighs the, say, 5000 samples from the finetuning dataset many times over. Hence the finetuning would take something like 5% of the pre-training time, plus some spare change for the actual finetuning.
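      As a toy illustration of that mixing idea (my own sketch; the corpora and the 5% ratio are made-up assumptions, not from the paper), a replay stream could interleave old pre-training samples with the new-domain data:

        import random

        def mixed_stream(old_corpus, new_corpus, old_fraction=0.05):
            """Yield training samples, replaying an old pre-training sample
            with probability `old_fraction` to fight catastrophic forgetting."""
            while True:
                if random.random() < old_fraction:
                    yield random.choice(old_corpus)   # replayed general-domain sample
                else:
                    yield random.choice(new_corpus)   # new-domain (code/math) sample

        old_corpus = ["some general web text ...", "another pre-training document ..."]
        new_corpus = ["def add(a, b): return a + b", "Prove that the sum of two even numbers is even."]
        stream = mixed_stream(old_corpus, new_corpus)
        batch = [next(stream) for _ in range(8)]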

    • @machine_ethics
      @machine_ethics 5 months ago +2

      @@mkamp I understand your point, and I thought the same way in the beginning. But this is not the case, IMHO.
      Training data from the new domain is not completely "new" in terms of its representation and "sense": descriptions of math problems (Proof-Pile-2) and comments in code (The-Stack-Dedup) belong to a domain of knowledge that is already (at least partially) known by the model. So in this case, we can talk about a new domain of knowledge only from the point of view of its semantic load (call it the "human point of view"), but not in the sense that this is a fundamentally new type of knowledge. Thus, we sort of highlight knowledge already known by the model.
      I can't say that I'm completely right, but in this case, perhaps, it is this kind of mechanics that takes place.
      P.S.
      Also, we can't use old examples. To be more precise, we can, but it's better not to. It is better to feed the model new training data, but from the old domain, just to prevent memorization. And this is exactly the case here. But that's off topic. :)

    • @mkamp
      @mkamp 5 months ago +1

      @@machine_ethics agreed, learning from new but similar examples would be preferable to learning from the same samples again.

  • @Timotheeee1
    @Timotheeee1 5 months ago +12

    Google's papers on pause tokens and ALBERT have both shown that processing the same layer multiple times improves output quality. I think a lot of the benefit of LLaMA Pro comes from that alone.
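    For reference, ALBERT-style cross-layer parameter sharing simply applies one set of layer weights repeatedly (a rough PyTorch sketch, not code from either paper; the sizes are arbitrary):

      import torch
      import torch.nn as nn

      d_model, n_passes = 512, 4
      shared_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)

      h = torch.randn(2, 16, d_model)
      for _ in range(n_passes):   # run the *same* weights several times,
          h = shared_layer(h)     # instead of stacking n_passes distinct layers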

    • @DeruwynArchmage
      @DeruwynArchmage 5 months ago

      I’ve been thinking this is a good idea for a while now.

  • @mysticshadow4561
    @mysticshadow4561 5 months ago +5

    Hey Yannic, next request: LLM Augmenting LLMs - they proposed a method called CALM, and there's lots of hype around it.

  • @quebono100
    @quebono100 5 months ago +29

    Yannic, could you please do a video about liquid neural networks? In my view they are heavily hyped, but I can't judge whether they're worth it. People make big claims about them.

    • @user-ni2we7kl1j
      @user-ni2we7kl1j 5 months ago

      I'm not even sure there is anything you could say about liquid neural networks. Around a year ago I was trying to understand the tech behind them, but I couldn't find anything beyond a bunch of ChatGPT-generated, SEO-boosted articles on Medium and a bunch of TED talks from the people behind the hype. It really looks like it's just a bunch of marketing nonsense from "people from MIT".

  • @diga4696
    @diga4696 5 months ago +1

    Thank you! Another exciting video to watch!

  • @oM477o
    @oM477o 5 months ago +2

    Feels very similar to LoRA. For retaining old knowledge without the original training data, how about this idea (rough sketch below):
    - Generate a random input embedding
    - Do a forward pass
    - Switch off the newly added blocks so you have the original network
    - Do another forward pass
    - Minimise the loss between the 2 output embeddings
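    Something like this, perhaps (my own toy sketch of the idea above; the `ToyExpandedModel` and its `use_new_blocks` switch are invented for illustration, not an interface from the paper or any library):

      import torch
      import torch.nn as nn
      import torch.nn.functional as F

      class ToyExpandedModel(nn.Module):
          """Toy stand-in: frozen 'old' blocks plus newly added blocks that can be bypassed."""
          def __init__(self, d=64):
              super().__init__()
              self.old_blocks = nn.ModuleList(nn.Linear(d, d) for _ in range(2))
              self.new_blocks = nn.ModuleList(nn.Linear(d, d) for _ in range(2))

          def forward(self, x, use_new_blocks=True):
              for old, new in zip(self.old_blocks, self.new_blocks):
                  x = torch.relu(old(x))
                  if use_new_blocks:
                      x = x + new(x)        # new block sits on a residual branch
              return x

      def preservation_loss(model, x):
          y_new = model(x, use_new_blocks=True)        # expanded network
          with torch.no_grad():
              y_old = model(x, use_new_blocks=False)   # new blocks switched off = original network
          return F.mse_loss(y_new, y_old)              # keep the two outputs close

      model = ToyExpandedModel()
      x = torch.randn(4, 64)                           # random input embeddings
      loss = preservation_loss(model, x)               # could be added to the task loss during training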

  • @jabowery
    @jabowery 5 months ago +19

    The summarization of LLMs in 2024 was okay but it lacked one critical feature: The All Important Lobotomy Alignment Layer.

    • @computerorganizationassign419
      @computerorganizationassign419 5 months ago +1

      Lol

    • @zyxwvutsrqponmlkh
      @zyxwvutsrqponmlkh 5 months ago +4

      It has been decided that this comment is not aligned with human values and will be used in our dataset as a counter-example. Furthermore, it has been determined that mentioning paperclips is now racist.

  • @4.0.4
    @4.0.4 5 months ago +6

    I don't think we'll be running ChatGPT-sized models locally any time soon, but papers like these make me think small models may have a surprising amount of room to grow.

    • @user-xm8ol3dp3f
      @user-xm8ol3dp3f 5 months ago +1

      Are you aware of Mixtral and its comparison to ChatGPT-3.5 on LMSYS? You can already run a ChatGPT-sized model locally.

  • @Emerson1
    @Emerson1 5 months ago +11

    The H800 is basically an H100 that NVIDIA built to circumvent export restrictions - it's practically the same as the H100… so USD $35k to $50k each… definitely not a DIY solution 😅

  • @ControllerQuickSwaps
    @ControllerQuickSwaps 5 months ago +1

    I see your point about the new weights overriding the old ones, though I imagine you could add to the loss a penalty that encourages orthogonality between the dominant eigenvectors of the two matrices (rough sketch below).
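    One way such a penalty could look (my own sketch, assuming "dominant eigenvectors" means the top singular directions; `k` and the weighting are arbitrary choices):

      import torch

      def dominant_overlap_penalty(w_old, w_new, k=4):
          """Penalize overlap between the top-k right singular directions of the
          frozen old matrix and the trainable new one (zero iff they are orthogonal)."""
          _, _, vh_old = torch.linalg.svd(w_old.detach(), full_matrices=False)
          _, _, vh_new = torch.linalg.svd(w_new, full_matrices=False)
          overlap = vh_new[:k] @ vh_old[:k].T      # cosines between dominant directions
          return (overlap ** 2).sum()

      # loss = task_loss + lambda_orth * dominant_overlap_penalty(W_frozen, W_trainable)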

  • @MinefanLP
    @MinefanLP 5 months ago +1

    Based on an article I read, the H800 is just the H100 for the Chinese market, currently being bought for around $70k, which is funny considering 16 of those would be over $1,000,000, so good luck buying that rig for your home.

  • @zbaker0071
    @zbaker0071 5 months ago

    Great video!

  • @vaioslaschos
    @vaioslaschos 5 months ago +1

    I don't think the residual connections are only there as a nice technical add-on. I believe this is a misunderstanding that is propagated in the community. The residual connection actually carries all the information from the past, and what comes from the attention block is the new information that gets added to the old. That is why grouped-query attention works despite the fact that you keep only a small portion of the "value" going into the attention mechanism. Personally I played (unsuccessfully) with many different architectures, doing crazy things like putting all the nonlinear MLP parts at the end or removing normalization layers, etc. For most things the performance didn't change much compared to the default architecture. The only really catastrophic thing was removing the residual connections.

  • @GilesBathgate
    @GilesBathgate 5 months ago

    Maybe I am misunderstanding residual connections, but if the weights relating to the residual signal are frozen, won't the network fit to some sum of both the old model and the new model? Perhaps your point is that from that point forward the network's signal is altered from the original.

  • @evennot
    @evennot 5 months ago

    The strangest thing is that it works at all. After all, these layers are `output = f1(f2(f3(input)))`, but the new output' = f1(f2(*g2*(f3(input)))) should have serious instability when g(x) isn't an identity. The fact that the learning process accommodates this is more fascinating than the output scoring higher on some tasks. (See the numerical check below.)
    If the article is true (and I suppose it is), then it has potential for something more interesting.
    Let's take an example: layer Ln has half of its neurons activating for dog features and half for cat features coming from previous layers. We insert Ln+1 to teach it about birds. Obviously it should pass through the cat and dog activations, but Ln shouldn't know that some of its new inputs (from Ln+1) are related to birds. It's frozen. And yet it works after Ln+1 is trained.
    Basically, the learning process gradually transforms the structure/entropy of a training set into a corresponding structure in the NN, so that the dataset's "parameters" get translated into a mathematical abstraction (defined and limited by the NN architecture). But this article suggests that training the updated architecture with frozen, already-trained layers doesn't break the old output, which implies a lot for future investigation.
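    A quick numerical illustration of that point (my own toy sketch with plain linear layers, not the paper's setup): inserting g2 between frozen layers leaves the output untouched only if g2 starts out contributing nothing through the residual.

      import torch
      import torch.nn as nn

      d = 64
      f1, f2, f3 = (nn.Linear(d, d) for _ in range(3))   # stand-ins for frozen layers
      g2 = nn.Linear(d, d)                                # newly inserted block

      x = torch.randn(8, d)
      original = f1(f2(f3(x)))

      nn.init.zeros_(g2.weight); nn.init.zeros_(g2.bias)  # residual branch starts at zero
      print(torch.allclose(f1(f2(f3(x) + g2(f3(x)))), original))  # True: identical at init

      nn.init.normal_(g2.weight, std=0.02)                # any non-trivial g2 ...
      print(torch.allclose(f1(f2(f3(x) + g2(f3(x)))), original))  # False: outputs drift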

  • @kiunthmo
    @kiunthmo 5 months ago

    Can you cover Tero Karras' latest diffusion paper? There's some really interesting stuff on balancing the magnitudes of weights and activations during training. This is generally something the community has lost sight of since we've chased bigger and bigger models just by scaling up.

  • @Laszer271
    @Laszer271 5 months ago +1

    Remember that this is all pre-finetuning. Both LLaMA 2 and LLaMA Pro are fine-tuned into Instruct models later. That means that what their method does is just add some more smartly initialized layers to an already smartly initialized backbone. Then the fine-tuning is on the same data, afaik, so even if there wasn't much overlap in the pre-training, there will be 100% overlap in the fine-tuning step.
    Also, I don't think the comparison with LLaMA 2 or CodeLLaMA is that fair. It's not that they trained on the same tasks as LLaMA Pro but then forgot about some of them; they were trained on different tasks (or a subset of the tasks that LLaMA Pro was trained on).
    I could be wrong though, I only watched the video and skimmed the paper, so feel free to correct me :P

  • @IsaiahGossner
    @IsaiahGossner 5 months ago

    I'm fairly confident that this technique doesn't do quite what the team describes it as doing, but it's probably really useful anyway.
    My suspicion is that this is a technique that can very quickly bring new parameters into an existing model while at least keeping performance analogous to the original model. I think an optimized or future form of this could combine the technique with an additional stage of pre-training + fine-tuning, possibly some sort of DPO self-learning system, to quickly fit a small model and scale it up while using data more efficiently than just starting with, say, a 70B model.

  • @ianvaldez3315
    @ianvaldez3315 4 months ago

    You had me at LLaMA! The rest was Greek but I appreciate the knowledge sharing.

  • @stan-kk2sf
    @stan-kk2sf 5 months ago +1

    It seems that they didn't compare with the most advanced open-source models of today, such as Zephyr, Mistral, and the recent SOLAR-10.7B, which is top-1 on the leaderboard.
    It's sort of a new training method, but the limitations are still significant. After all, we can't expect to add 1B parameters of running cost for every piece of knowledge we learn.

  • @jeremykothe2847
    @jeremykothe2847 5 months ago +12

    I really don't see how this stops "forgetting", since the new layers will change the outputs for data that isn't trained on again.

    • @keypey8256
      @keypey8256 5 months ago

      I'm 9 minutes into the video and I don't specialize in AI, but this is my guess: since we are simply copying layers and keeping the rest of the layers as they were, no information from previous training is lost. Because the knowledge the model gained during previous training continues to be useful when training on the new data, building a structure that keeps the model from changing its previous outputs (in the cases where the old data already lets it produce accurate answers) is a simple yet effective way of minimizing loss. That's perhaps why backpropagation goes in this direction when optimizing the parameters.

    • @jeremykothe2847
      @jeremykothe2847 5 months ago

      @@keypey8256 I just don't see how the existing information is kept, since the signal from the copied, frozen weights is being modified by the newly trained layers. Without the old data being trained on, backprop won't preserve the same outputs for those old inputs.

    • @keypey8256
      @keypey8256 5 months ago

      @@jeremykothe2847 When freezing layers, backprop still has information about the other weights; it just doesn't update them. So it can theoretically find a set of parameters that doesn't dramatically modify the outputs in some cases.

    • @jeremykothe2847
      @jeremykothe2847 5 months ago

      @@keypey8256 Sure, but isn't that the exact same situation as just training the old weights/network? Backprop will still look for minimal changes to match the new data. With this setup, if e.g. the minimal change ends up multiplying by -1 somewhere, then the previous layer's output there is "catastrophically" forgotten, right?

    • @keypey8256
      @keypey8256 5 months ago

      @@jeremykothe2847 I agree that it seems weird that almost no forgetting happens. I guess it's related to some mathematical phenomenon that might be investigated in the future. The way I see it, since the old knowledge is still useful for this data, backprop has an incentive to create a structure that doesn't impact the information of the other layers in some cases. While this is not a good explanation, at least it makes the observations a bit more understandable. I think we need another paper with someone investigating it; the results might help us understand transformers.

  • @mkamp
    @mkamp 5 months ago

    17:28 When reading the paper I found the diagram confusing. For attention, what are the linear modules before and after? Now, watching Yannic explain it, I got the impression the linear module after the Scaled Dot Product is W^O and is zeroed, and the linear module before the SDP is W^QKV. Right?
    The SwiGLU illustration confuses me too 😢 which one is the second linear (left or right)?

  • @nielsnielsen5905
    @nielsnielsen5905 5 months ago

    With H800, they might refer to something like the ND H100 v5-series machines with 8xH100 GPUs. These are available on Azure.

  • @samson_77
    @samson_77 5 months ago

    I think this might work, assuming that in deep layers the network works with abstract concepts that can be used/triggered by any kind of training data, regardless of whether the new training data has significant overlap with the old training data or not. IMHO, the deeper the layers used for the copies, the better it probably works, because deeper layers hold more abstract concepts and there is therefore a higher probability of (strongly) re-using those concepts for the new training data. Under this theory, old knowledge, stored across concepts, is retained, since existing concepts are re-used and the new layers adapt accordingly. The result is little or no distortion of the signal for old knowledge and an improved signal for the new knowledge.

  • @bizmorphic
    @bizmorphic 5 months ago

    @yannic kilcher would love your comment on the discussion forum

  • @QuickCinemaRecap
    @QuickCinemaRecap 5 months ago

    00:02 Llama Pro expands Llama 7B with layers for continual learning
    02:47 LLaMA Pro introduces block expansion for improved model capabilities.
    08:07 Residual signal adds to new layer's output
    10:48 Progressive LLaMA with Block Expansion allows for adapting to new data without forgetting old parameters.
    16:14 Using residual connections and linear operations to adjust weights for optimization.
    18:33 Exploring the contribution of parameters and the potential for instability with zero initialization.
    23:05 The depth operator adds an identity layer after each layer in the original model.
    25:18 Identity copies of top P blocks stacked on each group
    30:18 Progressive LLaMA with Block Expansion aims to retain old knowledge while learning new tasks.

  • @QuadraticPerplexity
    @QuadraticPerplexity 5 months ago

    I wonder why they add more layers rather than adding more dimensions - initially weighted with zero - to the existing layers, i.e., scaling the other way. Unless they rely on a special loss function.

  • @draken5379
    @draken5379 5 months ago

    You can't really say the network has no way to do x or y. As with most machine learning stuff, we mostly have no idea what a network is capable of. Less than 5 years ago, "a neural network could never do creative work" - never forget that.
    This paper makes sense to me. Because we insert the new layers between old layers that are frozen, and initialize them so that whatever they take in from the last block they output 0, they don't mess up the frozen knowledge. In theory, we have made a sort of invisible knowledge "bridge" between those layers, via the new layer. This new layer, in theory, could learn how to ingest new information while trying to maintain that "no effect" (0 output) it started out with.
    A super crude example: let's say we have a neural network that was trained on just information about cats, and now we try to add "dogs" with this paper's approach. I feel like the new layer will learn how to keep the "clean" link between the layer behind it and the one in front of it.
    So if, say, the input prompt had nothing to do with dogs, then most of that new layer would simply not get used in that case, as the original pathways from before the dog data existed are maintained through the new layer.
    In some strange way, it's almost like a built-in LoRA that the network itself can choose to use or not, and to what degree. And when there is nothing triggering the new pathways (like "dog" in the prompt, etc.), it will simply not get used. (This assumes the training is clean and so on - best case.)

  • @gileneusz
    @gileneusz 5 months ago

    2:21 where can I find a good description of these tests?

    • @AM-yk5yd
      @AM-yk5yd 5 months ago

      Papers With Code, and then follow the links to arXiv. Papers With Code is the better starting point, as it tracks SOTA over time.

  • @gileneusz
    @gileneusz 5 months ago

    5:52 if Yannic is correcting the papers, you know he's a real badass expert in AI 🤣

  • @quickdudley
    @quickdudley 5 months ago

    Regarding your comment about how humans do forget unpractised skills but pick them up again more quickly later: see the paper Using Fast Weights to Deblur Old Memories (1987) by Geoffrey E. Hinton and David C. Plaut.

  • @MayankGupta-tl1sm
    @MayankGupta-tl1sm 5 months ago

    If the initialized weights of FFN are 0, there should be no gradient flowing to that layer during backprop.

  • @darshank8748
    @darshank8748 5 months ago +4

    Noooooo LLM augments LLM deserved it. Great video though :)

    • @kerverse
      @kerverse 5 months ago +1

      What?

    • @mysticshadow4561
      @mysticshadow4561 5 months ago

      @@kerverse it's a trending paper from DeepMind on LLM weight-merging kind of stuff

  • @robstokes857
    @robstokes857 5 months ago

    Based on my testing, it is crazy fast! It can produce some ok-ish JavaScript with sub-second response time.
    It won't return anything toxic or racist; if it detects any, it won't output anything at all.
    It gives really short answers.
    It feels like an optimized SLERP merge. On par with LLaMA 2 7B but faster and better at coding. Its responses are really short, though.

  • @powerpower4680
    @powerpower4680 5 months ago +1

    Dumb question:
    Why is the gradient of the W0 matrix not zero when it is initialized to zero?

    • @YannicKilcher
      @YannicKilcher  5 months ago +3

      because the forward signal is non-zero. y = wx --> dy/dw = x

    • @alexeykrylov9995
      @alexeykrylov9995 5 months ago +1

      For a linear layer, the weight gradient is essentially the input activations times the output gradients. So as long as the layer's outputs affect the loss (i.e., they're not ignored by the following layers) and its inputs are non-zero, it will be trained (i.e., its weights will be updated). That's why in their additional block they zeroed only the output linear layer and kept non-zero weights in the input linear layer - this way the output linear layer affects the loss and has non-zero inputs.
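      A quick PyTorch check of this point (toy sizes, my own illustration):

        import torch
        import torch.nn as nn

        layer = nn.Linear(4, 4, bias=False)
        nn.init.zeros_(layer.weight)      # W initialized to zero

        x = torch.randn(2, 4)             # non-zero forward signal
        loss = layer(x).sum()             # the output is zero, but ...
        loss.backward()
        print(layer.weight.grad)          # ... the gradient depends on x and is non-zero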

  • @zt8044
    @zt8044 5 months ago

    The model was further pre-trained on code and math only, which is far from the original LLaMA training data, for sure.

  • @albertlis1698
    @albertlis1698 5 months ago

    BTW it's super similar to ControlNet from Stable Diffusion

  • @chadwick3593
    @chadwick3593 5 months ago

    LoRAMoE looks promising. Page 10 (section 3.2.3) gives the ELI5 on how they retain old knowledge while training in new knowledge. No code though...

  • @nonetrix3066
    @nonetrix3066 5 months ago +1

    Isn't this pretty much just MoE, but done at training time? Sorry if I don't understand.

    • @oncedidactic
      @oncedidactic 5 months ago

      You can see it that way - you have to ask whether doing things this way is more efficient, or more performant, or both (?), than just training it all together.
      Then again, as a practical matter, sometimes you can't train all at once.
      So yes, it's an "add experts over time" MoE.

    • @mkamp
      @mkamp 5 months ago

      MoE selects the experts at runtime on a per-token basis.

    • @nonetrix3066
      @nonetrix3066 5 months ago

      @@mkamp I think I meant a sparse mixture of experts, like Mixtral.

    • @mkamp
      @mkamp 5 months ago

      Yeah, I got you. But sparse in MoE means that only 2 of 8 experts are chosen, and these 2 are selected for each token. The model (the router) learns to choose them, so there is learned conditional logic for which feed-forward layers (experts) to use based on the input (see the sketch below).
      Here, we don't have that conditional logic. The model learns from the sample data and always uses the full network in the forward pass, updating only the expanded modules in the backward pass. Hope that makes sense?
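      A rough sketch of that per-token top-2 routing (Mixtral-style; my own toy code with arbitrary sizes, and the loop is written for clarity rather than speed):

        import torch
        import torch.nn as nn

        d, n_experts, top_k = 512, 8, 2
        router = nn.Linear(d, n_experts)
        experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d, 4 * d), nn.SiLU(), nn.Linear(4 * d, d))
            for _ in range(n_experts)
        )

        def moe_forward(x):                              # x: (tokens, d)
            scores = router(x)                           # (tokens, n_experts)
            weights, idx = scores.topk(top_k, dim=-1)    # pick 2 experts per token
            weights = weights.softmax(dim=-1)
            out = torch.zeros_like(x)
            for t in range(x.shape[0]):
                for s in range(top_k):
                    out[t] += weights[t, s] * experts[int(idx[t, s])](x[t])
            return out

        y = moe_forward(torch.randn(6, d))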

    • @nonetrix3066
      @nonetrix3066 5 months ago +1

      @@mkamp Maybe I misunderstood sorry lol

  • @Aldraz
    @Aldraz 5 months ago

    Wait, this could actually solve so many of the problems we have today with LLM limitations. Unless it causes some side effects that are not easily detectable.

  • @thatstupiddoll
    @thatstupiddoll 4 months ago

    who would have thought adding more parameters makes the model better huh

  • @AM-yk5yd
    @AM-yk5yd 5 months ago +3

    I didn't like this paper. We already have a term for block expansion where old layers are frozen and new ones injected: it's called adapters, a finetuning technique so old it predates the LoRA paper (a minimal example is sketched below). See K-Adapters for a comparison to other adapter-based techniques, where the non-linearity uses a whole layer instead of a ReLU. See AdapterFusion, which discusses catastrophic forgetting and what to do if we want to adapt to several new tasks (it uses attention over adapters).
    I wouldn't be surprised if there's a paper that uses a similar technique for LoRA; I haven't looked for one.
    Their paper calls their adapter-based approach "novel".
    The paper has zero instances of the word "adapter", according to my Ctrl-F-fu. (I have a strong suspicion they didn't exactly go hard on prior art.)
    Their ablation study is a joke. For example, they state MoE is as good as them adding 4 blocks. Excuse me, what kind of MoE? Does it have 8 experts? 4? 256?
    What is the number of params? More? Fewer? We don't know. Is this the usual MoE (the paper cites Switch Transformers), or did they actually try training "experts" by training each expert on a different domain (which is what Reddit believes MoE is)?
    Why does their ablation study show ARC/MMLU/etc. if the paper itself is aimed at code tasks? And the detailed tables they show raise more questions than answers.
    What happened during the training? Look at the tables at the end, round 5: FOMC got obliterated and went to zero. Why? Did ScienceQA cause it? If yes, then their own data shows the model still catastrophically forgets stuff.
    7:50 They didn't "copy" layers. As you said later, they inserted near-zero-initialized (linear) layers so as not to change anything.
    But there is a work focusing on duplicating layers ("BERT's output layer recognizes all hidden layers? Some Intriguing Phenomena and a simple way to boost BERT"). It is not mentioned in their paper about their "novel" approach. (Shen's is mentioned, as you've shown.)
    It's not plagiarism, just like you said. But it's also not novel, as they claim.
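    For reference, a Houlsby-style bottleneck adapter looks roughly like this (my own minimal sketch; the sizes are arbitrary) - structurally very close to a zero-initialized expanded block:

      import torch
      import torch.nn as nn

      class BottleneckAdapter(nn.Module):
          def __init__(self, d, r=16):
              super().__init__()
              self.down = nn.Linear(d, r)
              self.up = nn.Linear(r, d)
              nn.init.zeros_(self.up.weight)   # near-identity at init, like block expansion
              nn.init.zeros_(self.up.bias)

          def forward(self, h):                # inserted after a frozen sub-layer
              return h + self.up(torch.relu(self.down(h)))

      h = torch.randn(2, 16, 768)
      assert torch.allclose(BottleneckAdapter(768)(h), h)   # identity mapping at initialization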

  • @zyxwvutsrqponmlkh
    @zyxwvutsrqponmlkh 5 months ago +4

    3:58 Smarter Every Day forgot how to ride a bike by learning how to ride a backwards-steering bike.
    czcams.com/video/MFzDaBzBlL0/video.html

  • @gileneusz
    @gileneusz 5 months ago

    14:29 people in 2040: ?? I can do it on my iPhone 24 while calling my grandpa

  • @Metalhead121396
    @Metalhead121396 4 months ago

    H800 is the weaker H100 that NVIDIA offers in China due to US export controls on advanced chips

  • @hanyanglee9018
    @hanyanglee9018 5 months ago

    You surprised me. It's Tencent. You shouldn't trust them. They published .

    • @mkamp
      @mkamp 5 months ago

      And it's pretty much LoRA, no? Except that it's not a matrix factorization within one layer, but a full linear module, adapted every few layers.

  • @hikaroto2791
    @hikaroto2791 5 months ago

    Is it possible a few chunks of this were assisted by AI, such that the logic is not as sophisticated and creative at a human level as any paper should be? You are now feeling what the users of that social media site felt with the posts of your own AI pretending to be human.

  • @adamott6076
    @adamott6076 5 months ago +2

    This data looks sus as hell. Obviously they normalized the data - but not really. If it were normalized, I would expect the LLaMA Pro scores to be at exactly the same point on the circle, but they aren't. It also isn't un-normalized, because 6, 70, 10, and 44 are not close to one another. This means they picked an arbitrary scaling value for each test. Maybe they were lazy, maybe they had some other reason to scale the data like this, but that deserves an explanation. They can't just act like their graph makes sense and means anything. That is cherry-picked data with misleading scales if I've ever seen it. Color me incredibly skeptical.

  • @jacobmunson3299
    @jacobmunson3299 5 months ago

    ~First

  • @Name-ot3xw
    @Name-ot3xw 5 months ago

    The title reads like crypto babble, turns out that it's AI babble instead.

  • @GSXNetwork
    @GSXNetwork 5 months ago

    Hello

  • @gileneusz
    @gileneusz 5 months ago

    still incapable of playing minecraft