My guess is that OpenAI themselves fear that recursion will lead to bad models and that this is the prime motivator for them to work on and implement watermarking. They try to disguise it as a regulation to protect the users and readers, but I bet it is to filter their future training data.
If so, do you expect that watermarking will become widespread, even for FOSS models? Otherwise, there will still be reams of unidentifiable LLM-generated text online, apart from the watermarked ChatGPT artifacts.
@@oncedidactic Hmmm, very good point, thank you.
Gotta think about it
I think this is great research, but it shows not a flaw of AI-generated content as such, but a flaw in current models, and the toy example shows it best. We clearly see that data distributions shift and degrade after multiple rounds of retraining on synthetic data. We should use this as a benchmark: when a new architecture is proposed, then beyond all the standard benchmarks, show how robust it is to recursive retraining and how many iterations it takes to destroy the original distribution. If that number is large enough, it might indicate that the proposed architecture is worth using even if it is not the best at other tasks.
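The benchmark idea above can be sketched with the paper's single-Gaussian toy setup: repeatedly fit a Gaussian to samples drawn from the previous generation's fit, and track how the estimated spread drifts away from the original distribution. Sample size, generation count, and the `generations` helper are illustrative assumptions, not the paper's exact settings:

```python
import random
import statistics

def generations(n_samples=100, n_gens=50, seed=0):
    """Toy model-collapse loop: each generation fits a Gaussian to
    samples drawn from the previous generation's fitted Gaussian,
    and we record the fitted sigma after every round."""
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0          # the "true" starting distribution
    sigmas = [sigma]
    for _ in range(n_gens):
        data = [rng.gauss(mu, sigma) for _ in range(n_samples)]
        mu = statistics.fmean(data)      # refit on purely synthetic data
        sigma = statistics.stdev(data)
        sigmas.append(sigma)
    return sigmas

sigmas = generations()
print(f"gen 0 sigma: {sigmas[0]:.3f}, gen 50 sigma: {sigmas[-1]:.3f}")
```

A "robustness score" for an architecture could then be the number of generations before `sigma` drifts outside some tolerance band around 1.0.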
I think the paper is not addressing the issue at all; it's only demonstrating some sort of overfitting.
I believe training a model recursively is bad since there is no new data, but if the model can generate new, better data, e.g. by using a tool, a chain of thought, or something else, then it can be a positive recursive loop instead of a negative one.
Very interesting indeed. Thanks
Well, we're going to be very careful about selecting training data moving forward. Looks like a big problem!
Nice video, it's a good overview on the paper and the discussion.
a) Yup, I'm worried, not so much for the models as for the humans "trained" (learning) on this garbage content...
b) The paper is not at all exaggerated. What matters is not whether 90% of all content is AI-generated, but whether 90% of the content *available for training* is. And with companies suing each other and most content behind walls (social networks are ever more closed), we are getting there fast. I guess we'll have to keep models frozen on old data...
Thank you so much for an important and understandable overview. I appreciate your helping us to go into the future with our eyes open.
By my estimate, the number of human authors will not substantially decrease:
1. authors wish to publish;
2. humanity will self-correct by filtering out stupid, obvious, or wrong answers.
We are at the beginning of a cycle. The key will be not only to tag AI-generated content, which seems more a wish than something feasible for money reasons,
but to tag human-made content.
A way to do this is by rewarding human authors, like the Brave browser model.
Maybe a dream, but when LLM vendors lack good-quality data they may adopt this approach. The first one to do it at scale will become the leader, because the outcome will be considered a better source of conversation, hence a wider audience.
Great point!
I suspect this is a specific issue with the autoencoders they used. What we need is a good validator and a diversity metric; this would prevent mode collapse. My hypothesis is that the higher-dimensional the generated content is, the easier it will be to validate. Ilya Sutskever recently commented that knowing the distribution is enough to give you exact samples as sample dimensionality increases. (Dimensionality in this case, I guess, could be sentence/document length or picture resolution.) Think of it as a discount model for test grading: the more criteria you can take points off for, the lower the average score will be.
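For text, one simple off-the-shelf diversity metric of the kind suggested here is distinct-n: the fraction of unique n-grams in a corpus, which drops toward zero as generations collapse onto the same outputs. This is an illustrative sketch, not anything from the paper:

```python
def distinct_n(texts, n=2):
    """Fraction of unique n-grams across a corpus.
    Values near 1.0 mean diverse output; values near 0 signal
    that generations are collapsing onto repeated phrases."""
    ngrams = []
    for t in texts:
        tokens = t.split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

varied = ["the cat sat here", "a dog ran fast"]
collapsed = ["the cat sat here", "the cat sat here"]
print(distinct_n(varied), distinct_n(collapsed))  # → 1.0 0.5
```

A training pipeline could reject (or down-weight) synthetic batches whose distinct-n falls below that of the human reference corpus.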
I don't think having ML models learn from each other inherently leads to degradation. But that hinges on the core understanding of the model. If you have a teacher and a student, you can successfully transfer knowledge. But if you have students with incomplete understanding teaching other students, you get the classic game of telephone, and the data will degrade over time. So yes, if you blindly teach ML models from every source of data on the Internet, and you have students filling that data up with garbage, you will train worse models.
There is PLENTY of data out there (and anyone who argues with that lacks imagination). The Internet was just low-hanging fruit to jump-start intelligence in these models. The future will be more about refining high-quality datasets and finding out how to train with smaller and smaller sets. It will also be about multi-modality. There is an infinite supply of data from the real world (audio/video/touch/etc.) to train on.
ML models are inherently lossy. The reason we write things down is that over thousands of years stories drift; human brains are lossy as well, and without rigorous systems they lose information. Finally, I also think ML can spiral in the other direction. Purely speculation, but I believe that once ML models reach a certain level of intelligence (which may also include less lossy memory storage to augment their NNs), they will be able to teach themselves or each other and spiral toward the singularity. I just think we're not quite past the tipping point yet, although GPT-4 feels very close.
My intuition is that you don't need to assume 90% of the internet will be AI-generated. You only need to assume that the models will need more data than can be scraped from humans. AI-generated data could be intentionally added by the devs even if it is never posted on the internet.
My feeling is that recursive training won't end up being a big problem, and that some trick, like mixing in human data, will be enough to avoid forgetting the original distribution.
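The "mix in human data" trick can be tested directly on the paper's Gaussian toy setup: anchor each generation's training set with a fixed fraction of fresh samples from the true distribution and see whether the fitted spread stays near the original. The `final_sigma` helper and its parameters are illustrative assumptions:

```python
import random
import statistics

def final_sigma(real_fraction, n_samples=200, n_gens=50, seed=1):
    """Recursive Gaussian refitting, where each generation's training
    set keeps a fixed fraction of fresh samples from the true N(0, 1)
    distribution alongside synthetic samples from the previous fit."""
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0
    n_real = int(real_fraction * n_samples)
    for _ in range(n_gens):
        data = [rng.gauss(0.0, 1.0) for _ in range(n_real)]                # human data
        data += [rng.gauss(mu, sigma) for _ in range(n_samples - n_real)]  # synthetic data
        mu, sigma = statistics.fmean(data), statistics.stdev(data)
    return sigma

# Purely synthetic vs. half-anchored training after 50 generations:
print(final_sigma(0.0), final_sigma(0.5))
```

With a real-data anchor, each refit pulls the estimate back toward the true distribution instead of letting the estimation error compound as a free-running drift.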
I tend to agree, especially since this paper did not make much effort to mitigate the effects. Rather, they wanted to highlight the worst-case scenario.
Learning from human feedback is even more important now!
Yes! :) But isn't that eventually a bottleneck given how slowly humans act and how fast computers can process / learn? 🙂
@@AICoffeeBreak manually labelled data is always gold 😄
Especially after model pretraining
GPT-3 unrestricted api access: Nov 2021
ChatGPT training data up to: Sept 2021
🤔🤔🤔
I wonder why OpenAI hasn't updated ChatGPT to include any data newer than when GPT-3 was released…
😂 what a mystery
The year 2021 will be known for the data singularity, the point after which new written work cannot be distinguished from AI-generated text.
I think one does not need a PhD to logically deduce that mistakes made by something or someone will be propagated until noticed and corrected.
🧠
Ms Coffee Bean is back!
Very interesting! Thanks Letitia! By the way, I've always been highly sceptical of synthetic data, and this "Chinese whispers" way of framing the problem really hits home.
The internet has been flooded with AI-generated content since 2016, btw. It's not recursive training for LLMs that I'm worried about; it's recursive training for the new generations of humans that worries me more. Those kids are literally being trained on AI-generated content too, and the older generation is using GPTs so much these days that they are forgetting how to be creative on their own. So the original high-quality content you talk about that will be produced by humans (and there is barely any) is also influenced by AIs.
The link to the model dementia paper is broken :)
It also links to the curse of recursion paper not the model dementia one
Thanks, fixed it! :) There was a whitespace character that somehow crept in there.
"Model Dementia: Generated Data Makes Models Forget." is the 1st version of the paper: arxiv.org/abs/2305.17493v1
The second version on arXiv is called "The Curse of Recursion: Training on Generated Data Makes Models Forget" arxiv.org/abs/2305.17493v2
Makes sense! Thanks Ms Coffee Bean!@@AICoffeeBreak
Chatbots get dumber by learning from each other, while humans do the opposite. Therefore, the LLM learning process is still fundamentally wrong.
Maybe they are just not diverse enough? They are all trained on the same kind of data. Name a chatbot that has not read the entire Wikipedia. 😅
Do you think current datasets that have less AI generated data will become gold?
In one way, certainly yes. But they will also be outdated (knowledge cap at 2021?).
The results are really not surprising, but at the same time, I don't really see the issue. It's just a matter of proper data engineering. Also: who trains a model without any quality control and benchmarks? If the model gets worse by your metrics, just go back to the data engineering step. In practice, models will never get worse in the long term. But of course, it might get harder and harder to train better models. Maybe. Maybe not, if we're building self-improving models that can run entire MLOps training pipelines on their own.
You're so sweet
Ohnononono...
Singularitybros, we got too cocky...
Now, what if the companies knew this? Good luck with those DLLs.
MS COFFEE BEAN, THIS IS REAL! I tried to generate an object on my already-generated image data and the model struggled to produce quality output! The same experiment conducted on a real image gave the model no problem generating variations of what I asked for. I guess the AI bubble will burst when the internet is 90% generated data. Watermarking won't work, because people want to fool other people into thinking it's their hard work and not some easy gen-AI output; and if watermarks do catch on, there will be other models (some random repo on GitHub) just to DETECT AND REMOVE them. Haha