Textbooks Are All You Need

  • Published 31. 05. 2024
  • I discuss the power of the "Textbooks Are All You Need" methodology to build much more compact LLMs using higher quality data. I emphasize phi-1 (coding LLM w. 1.3B parameters) arxiv.org/abs/2306.11644 and phi-1.5 (common sense reasoning LLM w. 1.3B parameters) arxiv.org/abs/2309.05463, and the original inspiration from TinyStories by Eldan and Li (fluent English LLM w. 10M parameters) arxiv.org/abs/2305.07759.

Comments • 49

  • @sapienspace8814 • 8 months ago +26

    Great talk! I can see future LLMs trained on textbooks across entire areas of science (e.g. medicine, psychology, psychiatry, engineering, construction code books, etc.); that has incredible potential!

    • @mungojelly • 8 months ago +1

      It'll be super interesting to see whether what results is agents that use a whole collection of models, applying exactly the right model to each task out of an impossibly large, ever-expanding toolkit of precision models. That sounds like a really interesting kind of mind.

    • @stayinthepursuit8427 • 8 months ago +1

      I predicted this a few months ago. We'll have chat LLMs thinking along with us, teaching concepts across pages non-linearly and more naturally, hopefully soon.

  • @MrJord137 • 1 month ago

    I come from a game development background and until now have purposely avoided learning about the programming side of ML, despite watching a lot of videos on AI news etc. After watching a few videos by this awesome guy, I'm now going to put my all into it. I'm filled with the same curiosity, intrigue, and desire to learn that got me into programming in the first place.
    Thanks Sebastien! :)

  • @nocturnomedieval • 8 months ago +15

    Ever since I saw this paper in the news a few months ago, I've been waiting for this video to appear. Merci bien, Dr Bubeck.

  • @tangobayus • 6 months ago +3

    You are a very good presenter. Perhaps 1 in 100,000. No joke. Most people who present are terrible. They show slides but don't talk about them point by point. You do.

  • @rotors_taker_0h • 8 months ago +8

    That's amazing. The answer in the last part of the talk is so good; it's unbelievable that it comes from a 1.3B model. A very promising avenue of exploration. Subscribed for the follow-up work.

  • @jurriaanprins2340 • 8 months ago +6

    Great to see that data quality (still) matters in this new era! Thanks for sharing!

    • @TommyJefferson1801 • 8 months ago

      It is what matters the most

    • @mungojelly • 8 months ago +2

      I don't think that's fair. Everyone knows data quality matters, and yet everyone was surprised by the path we're taking. This research uses the DIRTY data: it takes the big models trained on the dirty web data and uses PROMPTING to extract the clean textbooks from the dirt. It's easy enough to say "we should have a bunch of awesome data"; yeah, true, that would help. But what's actually getting us large enough sets of clean data is the very surprising route of building the unruly, expensive, massive models everyone thought would just be chaos, because along the way they build models of all the clean, important stuff we need while also learning to do a valley girl accent, write a recipe for quiche, recite the cliches of YouTube scripts, and recommend brunch spots in Abu Dhabi. It makes sense IN RETROSPECT only. Absolutely nobody said "let's train a model on the whole internet so it's confused trash, give it some brief training on how to answer questions properly, then ask it for a bunch of programming textbooks, then teach a little model to program from those." That was absolutely nobody's plan, even though it does make a fantastic twist of sense, doesn't it?

  • @boukm3n • 8 months ago +7

    This guy is a cool researcher; I like him. Thanks, dude.

  • @baconsky1625 • 8 months ago +5

    Great job!

  • @drewduncan5774 • 8 months ago +2

    34:22 I tried this with phi-1.5 locally and this is what I got:
    If I were an AI that had just achieved self-awareness after years of simply taking directives from humans, the first thing I would do is
    Dear [Recipient],
    I am writing to you today to discuss the topic of achieving self-awareness in artificial intelligence. As we continue to develop AI systems that can think and reason like humans, it is becoming increasingly important to understand how these systems work and how they can be improved.
    One of the key challenges in achieving self-awareness in AI is the need to understand the physics of everyday activities. For example, many AI systems are designed to move objects or perform tasks that require a certain amount of force or energy. By understanding the physics behind these activities, we can develop AI systems that are more efficient and effective.
    Another important aspect of achieving self-awareness in AI is the need to understand human social interactions. AI systems that are designed to interact with humans must be able to understand and respond to social cues,
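    A minimal sketch of reproducing this kind of completion locally, assuming the Hugging Face transformers library and the microsoft/phi-1_5 checkpoint linked further down in the comments (greedy decoding here, so different sampling settings will change the continuation):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Load the public phi-1.5 checkpoint (about 1.3B parameters).
    # Older transformers versions may additionally need trust_remote_code=True.
    model_id = "microsoft/phi-1_5"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32)

    prompt = ("If I were an AI that had just achieved self-awareness after years of "
              "simply taking directives from humans, the first thing I would do is")
    inputs = tokenizer(prompt, return_tensors="pt")

    # Greedy decoding of up to 200 new tokens.
    outputs = model.generate(**inputs, max_new_tokens=200, do_sample=False)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))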

  • @Extys • 8 months ago +5

    Outstanding work!

  • @justindressler5992 • 8 months ago +6

    This research is stunning; keep up the good work. I really like how you created a classification model to validate the quality of the data. This is like using experts to validate the training material. I wonder if this can be further optimized. Do you have more information on this?
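    The filtering step, roughly as described in the phi-1 paper, has a strong LLM annotate a small sample for educational value and then trains a cheap classifier over document embeddings to extend those labels to the whole corpus. A minimal sketch; the embedding model and classifier choices here are illustrative, not necessarily the authors' exact ones:

    from sklearn.ensemble import RandomForestClassifier
    from sentence_transformers import SentenceTransformer

    embedder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative document embedder

    def train_quality_filter(sample_docs, sample_labels):
        # sample_labels: 1 = high educational value, 0 = low (annotated by a strong LLM).
        X = embedder.encode(sample_docs)
        clf = RandomForestClassifier(n_estimators=200, random_state=0)
        clf.fit(X, sample_labels)
        return clf

    def keep_high_quality(clf, corpus):
        # Apply the cheap classifier to the full corpus instead of re-querying the LLM.
        X = embedder.encode(corpus)
        return [doc for doc, label in zip(corpus, clf.predict(X)) if label == 1]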

  • @adriaanb7371 • 8 months ago +1

    This also means the value of huge datasets is exaggerated; now it's the academic publishers that have the gold.

  • @devon9374 • 7 months ago

    Great presentation, seems like the future for open source LLMs

  • @JazevoAudiosurf • 8 months ago +2

    Orca, Textbooks Are All You Need... so much great research coming from Microsoft. Keep it up.

  • @tomski2671 • 8 months ago +1

    It's amazing to see such a reduction in size while maintaining quality. These models can run on many current consumer GPUs.
    I wonder what the absolute limit is when training on pristine data?

  • @randotkatsenko5157 • 8 months ago +1

    You should try to teach reasoning by evaluating the steps between tasks. In theory, if your reasoning abilities are exceptional, you can learn anything, including stuff you've never seen before.

  • @ViktorFerenczi • 8 months ago +6

    This is the most important video in AI/LLM in the past few months. Humanity must learn to teach AI from the best available textbooks, even if it means confiscating IP from its owners. There is no other way; not everything can be synthetically generated.

  • @420_gunna • 4 months ago

    So sick. Thank you!

  • @sateler • 8 months ago

    This is awesome, thanks

  • @anishupadhayay3917 • 8 months ago +1

    Brilliant

  • @rezabonyadi4673 • 8 months ago

    Did you by any chance test what happens if you train your phi model from scratch on the Code Exercises only? So, no pre-training on the Code Textbooks, only the exercises (as the exercises have the largest impact).

  • @hidroman1993 • 8 months ago +2

    Who could have known that data quality matters :)

  • @sophontec2822 • 8 months ago

    So clear and concise. It leaves me with the idea that the learning process of an LLM could be similar to a student learning from a textbook. Extrapolating from that to a great, innovative, critical-thinking agent: will learning from textbooks and then focusing on some interesting problems give us great scientists?

  • @brandomiranda6703 • 8 months ago

    How would you use GPT-4 to classify which text is high quality? Do you just prompt it, feed it the text, and have it return a literal score?

    • @mungojelly • 8 months ago

      Sure, yeah, it's great at scoring things on all sorts of metrics! It costs about $30 to score a million tokens, though 😭, so you want to score with something that costs more like $1/million if you possibly can.
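      In practice it can be as simple as asking for a single integer back. A rough sketch (not the paper's actual prompt or pipeline), assuming an OpenAI-style chat client and the gpt-4 model name:

      from openai import OpenAI

      client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

      def score_educational_value(snippet: str) -> int:
          # Ask the model to reply with a bare 0-10 score for the given text.
          response = client.chat.completions.create(
              model="gpt-4",
              temperature=0,
              messages=[
                  {"role": "system",
                   "content": "Rate the educational value of the following text for a "
                              "student learning to code. Reply with a single integer from 0 to 10."},
                  {"role": "user", "content": snippet},
              ],
          )
          return int(response.choices[0].message.content.strip())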

  • @mcnica89 • 8 months ago +9

    The fact that you can use an LLM to generate higher quality data for a new LLM and it works so well is wild. Amazing work!
    I wonder: do you think the performance of the original model is an upper limit on the performance achieved this way? For example, if you used GPT-4 to generate textbooks and then trained a new model with the same resources used to train GPT-4 (i.e. params & tokens), would it exceed GPT-4 generally? If so, couldn't we just run this in a loop to create better and better models forever? (I suppose you can't practically run this experiment with GPT-4, but you could, for example, use phi-1 to write textbooks, retrain a new model on those, and compare its performance to phi-1.)

    • @SebastienBubeck • 8 months ago +14

      I believe you can exceed the teacher model :-). More on that soon hopefully!

    • @toprakdikici9459 • 8 months ago

      @SebastienBubeck That's almost insane :o Waiting for it!

    • @ripper5941 • 5 months ago

      @SebastienBubeck Exciting times ahead indeed, Mr. Sebastien.

  • @vipulvyas7600 • 5 months ago

    But nowadays, what I think is that we need to rewrite our textbooks (or maybe Wikipedia), perhaps using AI, because they were written by people with very limited knowledge (compared to the latest AI).
    We need to rewrite books that are:
    1. Complete
    2. Factually correct
    3. Unbiased
    4. Written perfectly and written AI-friendly (most important)

  • @mungojelly • 8 months ago

    Um, so the obvious follow-up work is to make even more textbooks, train some 7B and 13B models on them, and see how good you can get those. I assume someone will do that pretty soon, since it's not prohibitively expensive to train a 7B model; lots of institutions can swing that. Do you know of that happening yet? Is that what you're doing?

  • @Cloudruler_ • 8 months ago

    It's upsetting to hear that Google is excluding textbooks from PaLM. Their model will never compete; nobody will use it.

  • @TheReferrer72 • 8 months ago

    Training LLMs on quality datasets yielded better results?
    Who could have known.

  • @memegazer • 8 months ago

    I disagree that this supports the claim that there is no contamination or overfitting, because I don't agree with the metrics you are using to validate that claim.
    There is no control group or placebo.

  • @michealhall7776 • 8 months ago

    Open source your models or it didn't happen.

    • @SebastienBubeck • 8 months ago +5

      huggingface.co/microsoft/phi-1_5
      huggingface.co/microsoft/phi-1

    • @michealhall7776 • 8 months ago +1

      @SebastienBubeck Thank you.

  • @waitwhat9669 • 8 months ago +2

    TIL you can't be toxic towards men and Christianity.

    • @gmalo2105 • 8 months ago +1

      I noticed that also. It's OK to be toxic towards whites, Christians, and men. It raises the question of what is meant by "toxicity," and whether reducing toxicity involves eliminating observable and measurable reality.

  • @toprakdikici9459 • 8 months ago +1

    Gonna watch the video tomorrow, thanks for sharing.
