Are ChatBots their own death? | Training on Generated Data Makes Models Forget - Paper explained

  • Published 13. 09. 2024

Comments • 39

  • @Jamaleum
    @Jamaleum 1 year ago +18

    My guess is that OpenAI themselves fear that recursion will lead to bad models and that this is the prime motivator for them to work on and implement watermarking. They try to disguise it as a regulation to protect the users and readers, but I bet it is to filter their future training data.

    • @oncedidactic
      @oncedidactic 1 year ago +2

      If so, do you expect watermarking to become widespread, even for FOSS models? Otherwise, there will still be reams of unidentifiable LLM-generated text online, apart from the watermarked ChatGPT artifacts.

    • @Jamaleum
      @Jamaleum 1 year ago +1

      @@oncedidactic Hmmm, very good point, thank you.
      Gotta think about it

  • @volotat
    @volotat 1 year ago +15

    I think this is great research, but it shows not a flaw of AI-generated content as such, but a flaw in current models, and the toy example shows this best. We clearly see that data distributions shift and degrade after multiple rounds of retraining on synthetic data. We should use this as a benchmark: when a new architecture is proposed, show, beyond all the other standard benchmarks, how robust it is to retraining and how many iterations it takes to destroy the original distribution. If that number is big enough, it might indicate the proposed architecture is worth using even if it isn't the best at other tasks.
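
    A minimal sketch of the benchmark idea above, using a single-Gaussian toy setting similar to the paper's (repeatedly refit a Gaussian to samples drawn from the previous fit and count the rounds until it drifts). The sample size, generation cap, drift measure, and tolerance below are arbitrary illustrative choices, not values from the paper.

    import numpy as np

    def generations_until_collapse(mu0=0.0, sigma0=1.0, n_samples=100,
                                   max_generations=1000, tol=0.5, seed=0):
        """Count how many self-training rounds it takes for the fitted
        distribution to drift away from the original one."""
        rng = np.random.default_rng(seed)
        mu, sigma = mu0, sigma0
        for gen in range(1, max_generations + 1):
            # "Generated data": samples drawn from the current fit.
            samples = rng.normal(mu, sigma, n_samples)
            # "Retraining": refit the Gaussian on its own generated data.
            mu, sigma = samples.mean(), samples.std()
            # Crude drift measure: relative change of the standard deviation.
            if abs(sigma - sigma0) / sigma0 > tol:
                return gen
        return max_generations

    # A more robust model family or estimator would survive more rounds.
    print(generations_until_collapse())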

  • @alexamand2312
    @alexamand2312 1 year ago +4

    I think the paper is not addressing the issue at all; it's only demonstrating some sort of overfitting.
    I believe training a model recursively is bad when there is no new data, but if the model can generate new, better data, e.g. by using a tool, a chain of thought, or something else, then it can be a positive recursive loop instead of a negative one.

  • @miriamramstudio3982
    @miriamramstudio3982 1 year ago +2

    Very interesting indeed. Thanks

  • @rockapedra1130
    @rockapedra1130 11 months ago +2

    Well, we're going to have to be very careful about selecting training data moving forward. Looks like a big problem!

  • @irisdominguez3996
    @irisdominguez3996 1 year ago +4

    Nice video, it's a good overview of the paper and the discussion.
    a) Yup, I'm worried, not so much for the models as for the humans "trained" (learning) on this garbage content...
    b) The paper is not at all exaggerated. It doesn't matter if 90% of all content is AI-generated, but whether 90% of the content *available for training* is. And with companies suing each other and most content behind walls (social networks are ever more closed), we are getting there fast. I guess we'll have to keep models frozen on old data...

  • @theosalmon
    @theosalmon 1 year ago +3

    Thank you so much for an important and understandable overview. I appreciate your helping us to go into the future with our eyes open.

  • @gordoneldest8462
    @gordoneldest8462 1 month ago +1

    By my estimate, the number of human authors will not substantially decrease:
    1. Authors wish to publish.
    2. Humanity will feed back on itself by weeding out stupid, obvious, or wrong answers.
    We are at the beginning of a cycle.
    The key part will not only be to tag AI-generated content, which seems more a wish than something feasible for money reasons, but to tag human-made content.
    A way to do this is by rewarding human authors, like the Brave browser model.
    Maybe a dream, but when LLM vendors lack good-quality data, they may adopt this approach. The first one to do this at scale will become the leader, because the outcome will be considered a better source of conversation, hence a wider audience.

  • @brad6742
    @brad6742 1 year ago +2

    I suspect this is a specific issue with the auto-encoders they used. What we need is a good validator and a diversity metric; these would prevent mode collapse. My hypothesis is that the higher-dimensional the generated content is, the easier it will be to validate. Ilya Sutskever made a recent comment claiming that knowing the distribution is enough to give you exact samples as the sample dimensionality increases. (Dimensionality in this case, I guess, could be sentence/document length or picture resolution.) Think of it as a discount model for test grading: the more criteria you have to take points off for, the lower the average score will be.
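
    A minimal sketch of one possible diversity check of the kind described above, using a distinct-n-gram ratio as the metric and a fixed threshold as the validator. Both the metric and the threshold are illustrative assumptions, not something taken from the paper.

    def distinct_ngram_ratio(texts, n=2):
        """Fraction of n-grams in a batch of texts that are unique.
        Low values indicate repetitive, mode-collapsed output."""
        total, unique = 0, set()
        for text in texts:
            tokens = text.split()
            ngrams = list(zip(*(tokens[i:] for i in range(n))))
            total += len(ngrams)
            unique.update(ngrams)
        return len(unique) / total if total else 0.0

    def keep_for_training(texts, threshold=0.3):
        # Only admit a generated batch into the training pool if it is
        # diverse enough by this (very rough) proxy.
        return distinct_ngram_ratio(texts) >= threshold

    print(keep_for_training(["the cat sat on the mat"] * 10))  # False: too repetitive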

  • @jolieriskin4446
    @jolieriskin4446 1 year ago +4

    I don't think having ML models learn from each other inherently leads to degradation. But that hinges on the core understanding of the model. If you have a teacher and a student, you can successfully transfer knowledge. But if you have students with incomplete understanding teaching other students, you get the classic game of telephone and data will degrade over time. So yes, if you blindly train ML models on every source of data on the Internet and you have students filling that data up with garbage, you will train worse models.
    There is PLENTY of data out there (and anyone who argues with that lacks imagination). The Internet was just low-hanging fruit to jump-start intelligence in these models. The future will be more about refining high-quality datasets and finding out how to train with smaller and smaller sets. It will also be about multi-modality. There is an infinite supply of data from the real world (audio/video/touch/etc.) that can be trained from.
    ML models are inherently lossy; the reason we write things down is that over thousands of years stories drift. Human brains are lossy as well and without rigorous systems will lose information. Finally, I also think ML can spiral in the other direction. Purely speculation, but I believe that once ML models reach a certain level of intelligence (which may also include less lossy memory storage to augment their NNs), they will be able to teach themselves or each other and spiral towards the singularity. I just think we're not quite past the tipping point yet, although GPT-4 feels very close.

  • @fejfo6559
    @fejfo6559 1 year ago +6

    My intuition is that you don't need to assume 90% of the internet will be AI-generated. You only need to assume that the models will need more data than can be scraped from humans; AI-generated data could be intentionally added by the devs even if it is not on the internet.
    My feeling is that recursive training won't end up being a big problem, and that some trick, like mixing human data back in, will be enough to avoid forgetting the original distribution (see the sketch after this thread).

    • @AICoffeeBreak
      @AICoffeeBreak 11 months ago +2

      I tend to agree, especially since this paper did not put a lot of effort into mitigating the effects; they rather wanted to highlight the worst-case scenario.
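
    A toy sketch of the mitigation @fejfo6559 suggests above: keep a fixed fraction of the original "human" data in every retraining round instead of training purely on generated samples. It reuses the Gaussian toy loop from the earlier sketch; the 10% mix ratio is an arbitrary illustrative choice.

    import numpy as np

    def refit_with_human_mix(mu0=0.0, sigma0=1.0, n_samples=1000,
                             generations=100, human_fraction=0.1, seed=0):
        rng = np.random.default_rng(seed)
        human_data = rng.normal(mu0, sigma0, n_samples)  # fixed original data
        mu, sigma = mu0, sigma0
        for _ in range(generations):
            n_human = int(human_fraction * n_samples)
            generated = rng.normal(mu, sigma, n_samples - n_human)
            # Each round trains on a mix of retained human data and fresh generations.
            mix = np.concatenate([rng.choice(human_data, n_human), generated])
            mu, sigma = mix.mean(), mix.std()
        return mu, sigma

    print(refit_with_human_mix(human_fraction=0.1))  # with 10% human data retained
    print(refit_with_human_mix(human_fraction=0.0))  # purely self-generated data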

  • @Skinishh
    @Skinishh 1 year ago +2

    Learning from human feedback is even more important now!

    • @AICoffeeBreak
      @AICoffeeBreak 10 months ago +1

      Yes! :) But isn't that eventually a bottleneck given how slowly humans act and how fast computers can process / learn? 🙂

    • @Skinishh
      @Skinishh 10 months ago +1

      @@AICoffeeBreak manually labelled data is always gold 😄

    • @Skinishh
      @Skinishh 10 months ago +1

      Especially after model pretraining

  • @arbybc7188
    @arbybc7188 1 year ago +4

    GPT-3 unrestricted API access: Nov 2021
    ChatGPT training data up to: Sept 2021
    🤔🤔🤔
    I wonder why OpenAI hasn't updated ChatGPT to include any data newer than when GPT-3 was released…
    😂 what a mystery
    The year 2021 will be known for the data singularity, where new written work cannot be distinguished from AI-generated text.

  • @_bustion_1928
    @_bustion_1928 1 year ago +3

    I think one does not need a PhD to logically deduce that mistakes made by something or someone will be propagated until noticed and corrected.

  • @DerPylz
    @DerPylz 1 year ago +5

    Ms Coffee Bean is back!

  • @TimScarfe
    @TimScarfe 1 year ago +4

    Very interesting! Thanks Letitia! By the way, I've always been highly sceptical of synthetic data, and this "Chinese whispers" way of framing the problem really hits home, I think.

  • @uprobo4670
    @uprobo4670 1 year ago +2

    The internet has been flooded with AI-generated content since 2016, btw. It's not recursive training for LLMs that I'm worried about; it's recursive training of the new generations of humans that worries me more. Those kids are literally being trained on AI-generated content too, and the older generation is using GPTs so much these days that they are forgetting how to be creative on their own. So the original high-quality human-produced content you talk about (and there is barely any) is also influenced by AIs.

  • @flamboyanta4993
    @flamboyanta4993 1 year ago +3

    The link to the model dementia paper is broken :)

    • @flamboyanta4993
      @flamboyanta4993 1 year ago +1

      It also links to the curse of recursion paper, not the model dementia one.

    • @AICoffeeBreak
      @AICoffeeBreak 1 year ago +3

      Thanks, fixed it! :) There was a whitespace character that somehow crept in there.
      "Model Dementia: Generated Data Makes Models Forget" is the 1st version of the paper: arxiv.org/abs/2305.17493v1
      The second version on arXiv is called "The Curse of Recursion: Training on Generated Data Makes Models Forget": arxiv.org/abs/2305.17493v2

    • @flamboyanta4993
      @flamboyanta4993 1 year ago +1

      @@AICoffeeBreak Makes sense! Thanks, Ms Coffee Bean!

  • @tildarusso
    @tildarusso 1 year ago +3

    Chatbots get dumber by learning from each other, while humans do the contrary. Therefore, the LLM learning process is still fundamentally wrong.

    • @AICoffeeBreak
      @AICoffeeBreak 11 months ago +1

      Maybe they are just not diverse enough? They are all trained on the same kind of data. Name a chatbot that has not read the entire Wikipedia. 😅

  • @Skinishh
    @Skinishh 1 year ago +2

    Do you think current datasets that have less AI-generated data will become gold?

    • @AICoffeeBreak
      @AICoffeeBreak 11 months ago +2

      In one way, certainly yes. But they will also be outdated (knowledge cutoff at 2021?).

  • @Neomadra
    @Neomadra 1 year ago +1

    The results are really not surprising, but at the same time, I don't really see the issue. It's just a matter of proper data engineering. Also: who trains a model without any quality control and benchmarks? If the model gets worse by your metrics, then just go back to the data engineering step. In practice, models will never get worse in the long term. But of course, it might get harder and harder to train better models. Maybe. Maybe not, if we're building self-improving models which are able to run the entire MLOps training pipeline on their own.
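
    A minimal sketch of the quality gate described above: only accept a newly retrained model if it does not regress on a fixed, human-curated benchmark. The train and evaluate callables here are hypothetical placeholders standing in for a real pipeline.

    def accept_new_model(train, evaluate, current_score, min_improvement=0.0):
        """Train a candidate and keep it only if its benchmark score does not
        regress; otherwise signal a return to the data engineering step."""
        candidate = train()          # retrain on the current data mix
        score = evaluate(candidate)  # score on a fixed, human-curated benchmark
        if score >= current_score + min_improvement:
            return candidate, score  # accept the new model
        return None, current_score   # reject: revisit the data

    # Example with dummy stand-ins for the placeholders:
    model, best = accept_new_model(train=lambda: "model_v2",
                                   evaluate=lambda m: 0.82,
                                   current_score=0.80)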

  • @fxsignal1830
    @fxsignal1830 11 months ago +2

    You are so sweet

  • @Y0UT0PIA
    @Y0UT0PIA 1 year ago +1

    Ohnononono...
    Singularitybros, we got too cocky...

  • @__--JY-Moe--__
    @__--JY-Moe--__ 1 year ago +2

    now if the companies knew this? good luck on those dll's

  • @nitinss3257
    @nitinss3257 7 months ago

    MISS COFFEE BEAN THIS IS REAL! I tried to generate an object on my already-generated image data and it struggled to produce quality output! The same experiment was conducted on a real image, and the model had no problem generating variations of what I asked it to generate. I guess the AI bubble will burst when the internet is 90% generated data alone. Watermarking won't work, because people want to fool other people into thinking it's their hard work and not some easily generated AI output; also, if this comes up, there will be other models (some random repo on GitHub) just to DETECT AND REMOVE these watermarks. haha