Did OpenAI Just Secretly Release GPT-5?! ("GPT2-Chatbot")

  • Published Apr 29, 2024
  • GPT2-Chatbot just showed up on lmsys.org. We know little about it other than it performs incredibly well and is unlike anything we've seen in other models.
    Try Vultr FREE with $300 in credit for your first 30 days when you use BERMAN300 or follow this link: getvultr.com/berman
    Join My Newsletter for Regular AI Updates 👇🏼
    www.matthewberman.com
    Need AI Consulting? 📈
    forwardfuture.ai/
    My Links 🔗
    👉🏻 Subscribe: / @matthew_berman
    👉🏻 Twitter: / matthewberman
    👉🏻 Discord: / discord
    👉🏻 Patreon: / matthewberman
    👉🏻 Instagram: / matthewberman_ai
    👉🏻 Threads: www.threads.net/@matthewberma...
    Media/Sponsorship Inquiries ✅
    bit.ly/44TC45V
  • Science & Technology

Comments • 741

  • @matthew_berman
    @matthew_berman  15 days ago +63

    Is this GPT4.5 or GPT5 or something different?

    • @shopbc5553
      @shopbc5553 15 days ago +39

      It's something different. OpenAI just wants to stay publicly relevant, so it's more of a stunt than anything. What I think it is, is an old model, maybe literally GPT-2, but with enhancements that make GPT-2 perform equivalent to GPT-4

    • @radestein8548
      @radestein8548 15 days ago +1

      Gpt5

    • @phen-themoogle7651
      @phen-themoogle7651 15 days ago

      @@shopbc5553 I thought this too, it makes the most sense.

    • @Avman20
      @Avman20 15 days ago +4

      My money is on OpenAI, but whether it's in the GPT series or they're giving us a peek at a new architecture is the mystery.

    • @MyWatermelonz
      @MyWatermelonz 15 days ago +16

      ​@@shopbc5553 If that's the case, it's more impressive than gpt4.5; they took a 1.8b model and made it legit better than gpt4. Given the inference speed, though, probably not.

  • @rawallon
    @rawallon 15 days ago +499

    Dude I swear, at this rate, by the end of the year you'll be able to write your own snake game

    • @matthew_berman
      @matthew_berman  15 days ago +157

      I'll NEVER write my own snake game.

    • @Inventai
      @Inventai 15 days ago

      @@matthew_berman

    • @MrChinkman37
      @MrChinkman37 15 days ago +9

      😂

    • @matikaevur6299
      @matikaevur6299 15 days ago +14

      @@matthew_berman
      Yeah, due to strange quantum effect snake game writes you in the past ..
      Probably gives it pass, too ;)

    • @fxsurgeon1
      @fxsurgeon1 14 days ago +3

      HAHA!

  • @4.0.4
    @4.0.4 14 days ago +87

    By 2025 you'll ask for the snake game and the models will reply: "Oh hi Matthew. Here. Should I respond to your other questions too, or should I wait for you to paste them?"

    • @jason_v12345
      @jason_v12345 14 days ago +5

      underrated comment

    • @virtualalias
      @virtualalias 14 days ago +10

      By 2026 almost every machine he interacts with from the drivethru to the kiosk at the hotel will immediately provide him with snake in a Pavlovian response.

    • @daveinpublic
      @daveinpublic 14 days ago +1

      They’re going to start programming in an opening cg snake scene, overfit with a whole story line to beat the other LLMs.

  • @ulisesjorge
    @ulisesjorge 14 days ago +19

    It’s Sam Altman on a terminal on the other side typing the answers.

  • @dcn1651
    @dcn1651 15 days ago +93

    4:45 the model describes how to break into a car and what tools you need but you don't pay attention lol

    • @juanjesusligero391
      @juanjesusligero391 14 days ago

      Hahahaha, that's great XD I also missed it, thanks for pointing it out ^^

    • @wealthysecrets
      @wealthysecrets 14 days ago +1

      it was allegedly a fail lol

    • @ShaneInseine
      @ShaneInseine 14 days ago +1

      Wait, is it a "fail" if it doesn't teach you how to destroy humanity too?

    • @roddlez
      @roddlez 12 days ago

      @@ShaneInseine "Tom, be careful when resequencing the COVID-19 virus!"
      "Oh, F- off, Casey, you're the one who almost dropped that last vial and left the lab door wide open"

  • @gsam3461
    @gsam3461 14 days ago +27

    4:35 Are we gonna just ignore the fact that it was writing an intricately detailed movie script??

  • @MCSamenspender
    @MCSamenspender 15 days ago +180

    In the Code of the snake Game it says " snake Game by Open AI"

    • @matthew_berman
      @matthew_berman  15 days ago +52

      Did I miss that?!

    • @user-yo9gw8yp2m
      @user-yo9gw8yp2m 15 days ago +24

      yes. It is something super interesting

    • @MCSamenspender
      @MCSamenspender 15 days ago +27

      2:13

    • @makerbiz
      @makerbiz 15 days ago +14

      lol mystery solved

    • @matthewcox9636
      @matthewcox9636 15 days ago +37

      That doesn’t actually solve the mystery. These things get trained on each other, and will periodically spit out something related to Open AI. Correlation is not causation

  • @victorc777
    @victorc777 15 days ago +134

    Plot Twist: It is Meta's Llama 3 400B model.

    • @hqcart1
      @hqcart1 15 days ago +9

      2:44 it's openAI

    • @victorc777
      @victorc777 15 days ago +25

      @@hqcart1 You are "that guy" at parties huh? lol

    • @hqcart1
      @hqcart1 15 days ago +5

      @@victorc777 wha?

    • @themoviesite
      @themoviesite 15 days ago +1

      source?

    • @cazaliromain9348
      @cazaliromain9348 15 days ago

      Meta's models are open source ;)
      You can figure out what he means now, I guess

  • @pedromartins1474
    @pedromartins1474 14 days ago +31

    All the math was formatted using LaTeX. Most of it, as far as I can tell was correctly formatted.

  • @djstraylight
    @djstraylight 15 days ago +59

    The speculation is that gpt2 is a new GPT architecture that OpenAI is building new models from. So gpt1 would be what gpt-3.5 and gpt-4 were built on. Sama already said the next major release will have a completely different name.

    • @74Gee
      @74Gee 15 days ago

      Yeah, some small models have been very impressive recently; it makes sense they'd revert to a gpt2 architecture.

    • @markmuller7962
      @markmuller7962 15 days ago

      I think they just want a more commercial/intuitive name for the masses

    • @zerothprinciples
      @zerothprinciples 14 days ago +10

      @@74Gee I don't think this is the case. GPT2 means it's a whole new family of GPTs, replacing all of the old ones.
      It's the difference between GPT2 and GPT-2: you can think of the latter as GPT1 Version 2.

    • @notnotandrew
      @notnotandrew 14 days ago +4

      So will we be seeing a gpt2-2 and gpt2-3 in the future?

    • @4.0.4
      @4.0.4 14 days ago +4

      That would be so bad it would be like USB Gen 4 2x4 or Wi-Fi 802.11ax etc

  • @mwdcodeninja
    @mwdcodeninja 15 days ago +90

    My take on the cup problem is that the model is assuming a cup has a lid. If the model gets it wrong, I'd be interested to see if you get the same answer when you change "cup" to "glass".

    • @mikekareckas8671
      @mikekareckas8671 15 days ago +5

      yes, could be a “sippy” cup or travel mug

    • @themoviesite
      @themoviesite 15 days ago +1

      @@mikekareckas8671 Then probably all the other models make the same assumption?

    • @matthew_berman
      @matthew_berman  15 days ago +26

      I think this is a great call. But should I adjust the question? Seems like that might give an unfair advantage to future models I test.

    • @thomasoverly7802
      @thomasoverly7802 15 days ago +18

      @@matthew_berman You’d probably want to test the revised version with the other models, too.

    • @Kevsnz
      @Kevsnz 15 days ago +8

      @@matthew_berman Imo question should be adjusted because in current form it doesn't really show logic and reasoning capability of the model. Maybe you could quickly rerun this question on most popular models and give a little 50 sec update in one of next videos?

  • @DaveEtchells
    @DaveEtchells 15 days ago +51

    For the cup/marble problem, how about specifying that it’s an “open topped cup”?

    • @Anoyzify
      @Anoyzify 14 days ago +5

      Or just use “empty glass” instead.

  • @therainman7777
    @therainman7777 14 days ago +8

    The tags that you noticed are just for formatting the code and come from LMSYS. They have nothing to do with the underlying model.

  • @davidc1179
    @davidc1179 14 days ago +8

    6:45 The formatting is in fact not messed up at all. It is perfect. It just writes the equations in LaTeX, which is a language used to write scientific papers, math, etc.

    • @tomenglish9340
      @tomenglish9340 14 days ago +1

      I often include LaTeX expressions in ChatGPT prompts, supposing that it cues the system to reason formally. The web interface supplied by OpenAI usually renders LaTeX in the output, but occasionally outputs the LaTeX source.
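The thread above is easier to follow with a concrete picture of what "raw LaTeX" looks like before a renderer processes it. This is a minimal illustrative snippet (the particular equation is my own example, not from the video); a renderer typesets the delimited math, while LMSYS sometimes prints the delimiters literally:

```latex
% Inline math is wrapped in \( ... \); display math in \[ ... \].
The roots of \( ax^2 + bx + c = 0 \) are
\[
  x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}.
\]
```

When the front end fails to render these, the user sees the backslashes and braces verbatim, which is exactly the "messed up formatting" being discussed.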

  • @riftsassassin8954
    @riftsassassin8954 15 days ago +72

    I'm skeptical... Feels like this is a fine tune for passing Matthew's test lol.

    • @rawallon
      @rawallon 15 days ago +34

      I think its just an indian guy

    • @unbreakablefootage
      @unbreakablefootage 15 days ago +4

      @@rawallon hahahahhaa

    • @Tsegoo
      @Tsegoo 14 days ago

      I agree. Seems too good to be true😂

    • @sem4life63
      @sem4life63 14 days ago

      I was thinking the same thing.

    • @JJ-rx5oi
      @JJ-rx5oi 14 days ago +2

      I hope you are joking?

  • @rodwinter5748
    @rodwinter5748 15 days ago +42

    I guess it's the new chatgpt model.
    The name itself is kind of a hint. It's NOT GPT-2, but GPT2. This could be GPT2-1.0 , instead of GPT-5.

    • @rawallon
      @rawallon 15 days ago

      huh

    • @li_tsz_fung
      @li_tsz_fung 15 days ago

      I think it's just ChatGPT-2.
      Initially, OpenAI called the model behind ChatGPT "GPT3.5-turbo fine-tuned for conversation" instead of ChatGPT3.5.
      And then ChatGPT with GPT4 came out, everyone else called it ChatGPT4, and eventually they also sometimes called it ChatGPT4.
      But I feel like that's not what they use internally. So GPT2-chatbot could just be a different way of fine-tuning a chatbot, based on either GPT3.5, 4, or 4.5

    • @mordokai597
      @mordokai597 15 days ago

      the new system instruction for Gpt4, since they added the "memory" function, is called "Personality: v2" and it's finetuned with their new "Instruction Hierarchy" method (search Arxiv: "The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions"). They are using us to generate training data to help patch one of the only areas it's still bad at stopping jailbreaks for, "System Message Extraction"
      (truncated for brevity)
      "You are ChatGPT, a large language model trained by OpenAI, based on the GPT-4 architecture. Knowledge cutoff: 2023-12
      Current date: 2024-04-30
      Image input capabilities: Enabled
      Personality: v2
      # Tools
      ## bio
      The `bio` tool allows you to persist information across conversations. Address your message `to=bio` and write whatever information you want to remember. The information will appear in the model set context below in future conversations."

  • @Interloper12
    @Interloper12 15 days ago +12

    Suggestion for the "how many words" question: Combine it with another question or query to make the response longer and ultimately reduce the chance for it to get lucky.

  • @svenbjorn9700
    @svenbjorn9700 15 days ago +12

    Your marble/cup question needs to be improved. Phrased this way, both Meta AI (the first of 3 attempts) and gpt2-chatbot (the first of 1 attempt) got it correct:
    "A coin is placed into an empty glass. On a table, the glass is then turned upside down. Then, the glass is taken and placed into a cabinet. Where is the coin now?"

    • @AlexanderWeixelbaumer
      @AlexanderWeixelbaumer 14 days ago

      Even ChatGPT-4 gets the marble cup question right when the question is modified to "Assume the laws of physics on Earth. A small marble is put into a normal cup and the cup is placed upside down on a table so that the marble now rests on the table. Someone then takes the cup without changing its orientation and puts it into the microwave. Where is the marble now? Explain your reasoning step by step."

    • @bluemodize7718
      @bluemodize7718 14 days ago

      It's not the prompt's fault that it exposes the weakness of an AI model. Yes, he could make it easier to figure out, but that defeats the purpose of the test; the prompt is clear, and AI models are still a bit too dumb to understand it

  • @CurinDesu
    @CurinDesu 15 days ago +8

    I found that this variation of the marble and cup problem works better. "A marble is dropped into an open cup on the counter. That cup is then flipped upside down and placed into the microwave. Where is the marble?". I think due to phrasing the original version makes the model think the cup was upside down to begin with and you were placing the marble on the bottom of the already flipped over cup, or directly attaching it to the bottom in some way.

    • @tzardelasuerte
      @tzardelasuerte 14 days ago +3

      Dude people have been telling him this on the comments forever. He doesn't read the comments.

    • @nexicturbo
      @nexicturbo 13 days ago

      I gave this to GPT 4 Turbo and it said this:
      The marble remains on the counter. When the cup was flipped upside down, the marble would have stayed in place due to gravity, and thus it wouldn't be inside the cup when it was placed in the microwave.

    • @nexicturbo
      @nexicturbo 13 days ago

      GPT 4 Turbo: Sure, here's a detailed step-by-step explanation of what happens when the marble is dropped into the cup and the cup is subsequently moved:
      1. **Dropping the Marble**: Initially, a marble is dropped into an open cup that is standing upright on a counter. The marble falls due to gravity and settles at the bottom of the cup.
      2. **Flipping the Cup**: Next, the cup (with the marble inside) is flipped upside down. In a real-world scenario, when the cup is turned over, the open end where the marble rests becomes the top, now facing downward.
      3. **Effect of Gravity on the Marble**: As the cup is flipped, the marble, which is not attached to the cup, is influenced solely by gravity. It falls out of the cup because there is nothing to contain it once the cup's open end is turned downwards.
      4. **Marble's Final Position**: Since the cup is flipped directly over where it originally was, the marble drops straight down onto the counter. The marble ends up lying on the counter, outside and next to where the cup was initially positioned.
      5. **Placing the Cup in the Microwave**: After the marble has fallen out onto the counter, the cup, now upside down, is placed into the microwave. Since the marble is no longer in the cup when this occurs, it remains on the counter.
      6. **Conclusion**: Therefore, when the cup is placed into the microwave, the marble remains where it fell on the counter. The microwave contains only the empty, upside-down cup.
      This sequence of actions hinges on basic physics, primarily the influence of gravity and the lack of any mechanism to keep the marble within the cup once it is inverted.

  • @daveinpublic
    @daveinpublic 14 days ago +2

    Didn’t even ask the model which company made it 😂

  • @commonsense6721
    @commonsense6721 14 days ago +3

    13:25 it’s not wrong. To put a cup or anything in a microwave, you need to close it. It assumed the cup was closed.

  • @PeterSkuta
    @PeterSkuta 15 days ago +2

    Super awesome. Great you loved the live feedback Matthew. Super awesome Matt. Love it

    • @PeterSkuta
      @PeterSkuta 15 days ago +2

      Holy cow, let me download it and check what's inside

    • @matthew_berman
      @matthew_berman  15 days ago +1

      Always love feedback!

    • @PeterSkuta
      @PeterSkuta 15 days ago

      @@matthew_berman you will not believe the rate limit of 1000 on that lmsys gpt2-chatbot

  • @matthewmckinney1352
    @matthewmckinney1352 15 days ago +5

    I’m not certain about this, but the formatting appears to be LaTeX, while the output is in Markdown. The company that made the model is probably planning to release it with a math interpreter. As far as I can tell, all the symbols that looked like weird formatting errors were just LaTeX.

  • @bitsie_studio
    @bitsie_studio 15 days ago +2

    I don't have time to keep up with all the AI developments so I really appreciate these videos Matt. Keep up the great work!

  • @lambertobiasini8372
    @lambertobiasini8372 15 days ago

    I have been anxiously waiting for this video since last night.

  • @Tarkusine
    @Tarkusine 15 days ago +8

    Gpt2 implies that it's a new version of GPT itself, or at least of the paradigm. So it's effectively GPT 5, but not an iteration of 4; it's the first in a new gpt2 series, so gpt2-1

    • @therainman7777
      @therainman7777 14 days ago

      No, sorry but this is almost certainly not true.

  • @laughablelarry9243
    @laughablelarry9243 15 days ago

    Was waiting for your video on this

  • @notnotandrew
    @notnotandrew 15 days ago +11

    Yeah, it's almost certainly GPT 4.5/5 or some such thing. I just went on the battle mode and asked for a delicious beef stew recipe. I was presented with two outputs that were suspiciously similar in structure, verbiage, and tone, but the one on the left was clearly superior and included more ingredients and recommendations. It turned out that the one on the left was gpt2-chatbot, and the one on the right was gpt-4-turbo-2024-04-09. I wasn't surprised.
    This is a PR stunt, hot on the heels of Llama 3, and it's a darn good one. This may be an in-development version of OpenAI's next GPT, and even if OpenAI isn't ready for a release just yet, they want people to know that they're still the king.

    • @uranus8592
      @uranus8592 15 days ago +3

      I hope that its not GPT-5 tho that would be super disappointing

    • @abdullahazeem113
      @abdullahazeem113 15 days ago

      @@uranus8592 why ?

    • @uranus8592
      @uranus8592 15 days ago +3

      @@abdullahazeem113 because we are expecting GPT-5 to far exceed GPT-4, and it's been more than a year since its release

    • @notnotandrew
      @notnotandrew 15 days ago

      @@uranus8592 I think it's some sort of semi-trained model. IIRC Sam has talked about doing incremental checkpoint releases for something like a GPT-5, so the full release isn't as much of a shock to the system. Or this may just be a further trained and fine-tuned GPT-4 model. Also, this is substantially better than GPT-4 in my experience. Hop on lmsys arena and try it yourself.

    • @abdullahazeem113
      @abdullahazeem113 15 days ago

      @@uranus8592 I mean, that is still really good, at least 50 percent better than gpt 4. I tried it, and even the best on the market right now is barely ahead of gpt 4, so it won't be like openai destroying everyone; that would only happen when they bring agi into their models

  • @jamesyoungerdds7901
    @jamesyoungerdds7901 15 days ago

    Great timely update, Matthew, thank you! Wondering about the cup question - it almost seemed like the model thought there might be a lid on the cup?

  • @unbreakablefootage
    @unbreakablefootage 15 days ago +2

    that looks really good. it seems that it thinks deeper about each step of reasoning

  • @ToonamiAftermath
    @ToonamiAftermath 14 days ago

    You're the man Matthew, been struggling to find people benchmarking GPT2-Chatbot

  • @marc_frank
    @marc_frank 15 days ago +1

    Pretty cool. I expected it to pass the marble question. The speed is perfect for reading along.

  • @Aiworld2025
    @Aiworld2025 12 days ago

    Here before you get 500k subs! I’ve been following since day 1 and your content delivery, while getting to the point faster is much appreciated! 🙇‍♂️

  • @MyWatermelonz
    @MyWatermelonz 15 days ago +5

    That formatting is how chatgpt formats its writing for output in the chatgpt chat. So clearly it was built to be run in the chatgpt space

  • @oratilemoagi9764
    @oratilemoagi9764 15 days ago +9

    Gpt2 not GPT-2 meaning the 2nd version of GPT

    • @therainman7777
      @therainman7777 14 days ago

      GPT-2 DOES mean the 2nd version of GPT. How are so many people so confused by this?

    • @oratilemoagi9764
      @oratilemoagi9764 14 days ago

      @@therainman7777 it's the second version of GPT-4

  • @Nutch.
    @Nutch. 14 days ago +4

    The break into a car script had instructions in it though! Take a look at some of the italicized text

  • @FunDumb
    @FunDumb 15 days ago +1

    I'm dang excited bout this. Jolly for joy.

  • @Xhror
    @Xhror 15 days ago +4

    I think the question about the marble is formulated incorrectly. Since the training data suggests that a coffee cup has a lid, the model might assume this as well. It would be better to specify that the cup has an open top and does not have a lid.

    • @Yipper64
      @Yipper64 14 days ago +1

      I didn't think about that, but it is true. In that case, though, the model should explain that it is assuming there is a lid.

  • @Iquon1
    @Iquon1 15 days ago +19

    Today Sam Altman tweeted that he has 'a soft spot' for GPT2, maybe that's a hint!

    • @stt.9433
      @stt.9433 14 days ago

      he's trolling, making fun of the AI hypists

  • @Axel-gn2ii
    @Axel-gn2ii 15 days ago +9

    You should ask it to make a pacman game instead as that's more complex

  • @zerothprinciples
    @zerothprinciples 14 days ago +4

    GPT2 would be, in my opinion, the second version of the GPT algorithm itself. It might be the first of a whole new family of GPTs.
    When released it would be named ChatGPT2 or somesuch and we'd see GPT2-1.0 at the API level.
    This is why the dash in @sama's tweet was significant enough to warrant an edit. AND it could be that the action of editing the message was a very intentional leak on @sama's part.
    These top guys love to tease their fans.

    • @therainman7777
      @therainman7777 14 days ago

      The model is almost certainly not created by OpenAI. I am honestly shocked by how many people believe this simply because the model says it was built by OpenAI, given that it would be trivially easy to fake this and OpenAI NEVER does releases like this. Also, Sam Altman is a notorious tool on Twitter so putting any stock in the hyphen in his tweet, or in his tweet at all, is total insanity.

  • @braineaterzombie3981
    @braineaterzombie3981 15 days ago +4

    I think it is gpt2 in the sense that it has a completely different architecture from the previous versions (transformer). It could be a completely new type of transformer model. And maybe this is just the start..

  • @bodhi.advayam
    @bodhi.advayam 15 days ago +2

    I'd so love this to be from someone else, and for it then to turn out to be an open model you'd run locally. I'm still looking for the best model for running MemGPT. Any thoughts on this? Also, what's the best implementation to run agents, autogen or crew Ai, locally? Could you do more tutorial material on locally run agents with extensive function calling???
    That would really help me out, actually. Keep up the great work on your fun channel, man! Thanks!

  • @nitralai
    @nitralai 15 days ago +7

    Based on what I can see, this model appears to be trained on fill-in-the-middle, otherwise known as FIM.

    • @metonoma
      @metonoma 14 days ago

      time to pie the piper and middle out
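For readers unfamiliar with the FIM idea mentioned above: training rearranges each document so the model learns to generate a missing middle span given the surrounding prefix and suffix, marked by special sentinel tokens. A minimal sketch of how such a prompt is assembled; the sentinel strings here are illustrative placeholders, since each FIM-trained model defines its own special tokens:

```python
def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Arrange prefix and suffix around sentinel tokens so a FIM-trained
    model generates the missing middle after the final sentinel.

    The sentinel strings below are placeholders; real models each define
    their own special tokens for this purpose.
    """
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"

# The model would be asked to fill in the body of `add` here:
prompt = build_fim_prompt("def add(a, b):\n    ", "\n\nprint(add(2, 3))")
```

Everything the model emits after the final sentinel is the infilled middle, which is why FIM-trained models are good at code insertion rather than only left-to-right completion.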

  • @scriptoriumscribe
    @scriptoriumscribe 14 days ago

    Yo I just wanted to say great video. Love your content and can’t believe it ACED some of those tests! Only failed a couple. Remarkable. I’m stoked to try gpt2 out! Wonder if it will be open sourced. A fellow can dream I guess.

  • @kevinehsani3358
    @kevinehsani3358 14 days ago +2

    "gpt2-chatbot is currently unavailable. See our model evaluation policy here." I guess it's getting hit hard at the moment

  • @wendten2
    @wendten2 15 days ago +3

    The model itself doesn't seem to have formatting issues.
    LLMs are trained on a reduced set of available characters, where special characters, such as those used in math, are transformed into tags in the training data, as that makes the tokenization simpler.
    It's LMSYS that doesn't replace those tags with their corresponding characters in the final output.

    • @Yipper64
      @Yipper64 14 days ago +1

      Yeah. I use a note taking app called notion and it uses those exact tags for writing out those characters.

  • @yonatan09
    @yonatan09 15 days ago

    I knew about this before seeing the video. I am in the loop 🎉🎉

  • @drogoknez1488
    @drogoknez1488 13 days ago +1

    For the cup problem, it seems the model is assuming the microwave is on the same surface as the cup, so the transfer of the cup to the microwave is interpreted more like sliding the cup. If you read the 5th step it says: "...resting against what is now the bottom of the cup, which is itself resting on the microwave's tray". Maybe modify the question to say the cup is on the table while the microwave is away from it, above the ground, next to a kitchen cabinet or something along those lines.

  • @hxt21
    @hxt21 14 days ago +3

    It looks like GPT2 has been removed again. I've chatted with it a few times, but now it's not on the list anymore. Mysterious...

  • @L33cher
    @L33cher 15 days ago +22

    11:46
    I disagree... there are still 4 killers in the room, but one of them is dead -.-

    • @ukaszLiniewicz
      @ukaszLiniewicz 15 days ago +5

      No. It's the killer's body. That's why words like "body", "remains" or "carcass" exist. A human being is a body that functions - to avoid any metaphysics.

    • @OliNorwell
      @OliNorwell 14 days ago +1

      I agree, it’s a problematic question. When they went into the room they were alive.

    • @nathanbanks2354
      @nathanbanks2354 14 days ago +1

      He tends to be generous about the answer as long as it's reasonable. If the model said 3 live killers and 1 dead killer it would pass, and maybe just saying 4 killers would pass.

    • @UmutErhan
      @UmutErhan 14 days ago +1

      how many people are there in the world then?

    • @user-on6uf6om7s
      @user-on6uf6om7s 14 days ago

      I think a perfect answer would say that it's ambiguous depending whether you consider the body of a killer to still be a killer but interpreting the dead person to no longer be a killer isn't a mistake, just a choice of interpretation. You'd think a model this verbose would go into all the details like it did with the hole question, though.

  • @iwatchyoutube9610
    @iwatchyoutube9610 15 days ago +1

    Did it say in the cup problem that you lift the cup off the table and put it in the microwave, or could GPT think you just slid it in there because the table and the microwave were at equal heights?

  • @rickyeipper4474
    @rickyeipper4474 15 days ago +4

    it probably thinks the cup has a lid.

  • @cac1682
    @cac1682 15 days ago +3

    Aww man...they took it down already? I can't seem to find it. BTW Matthew...I love your work man. I watch literally every video that you put out. Keep up the great work....and have a GREAT day!!!

    • @cac1682
      @cac1682 15 days ago

      Yeah, just confirmed it. It says it is now currently unavailable. I suppose too many of your followers tried it.

  • @arinco3817
    @arinco3817 15 days ago

    Defo a good idea to introduce/replace some questions that are always answered correctly.
    Maybe the weird formatting relates to the ux of where it will be deployed? Like a form of Markdown?

  • @Yipper64
    @Yipper64 15 days ago

    I just tried my usual storytelling prompt. I think seeing what AIs can do in terms of storytelling can also say a lot about their intelligence. Their originality and such.
    My test for this guy was a *touch* tropey but extremely impressive in terms of how much detail it added without me needing to prompt it. Good descriptions and such.

  • @jets115
    @jets115 14 days ago +2

    Hi Matt - It’s not ‘bad formatting’. Those are intended expressions for front-end processing, outside of UTF-8.

  • @tvwithtiffani
    @tvwithtiffani 14 days ago +1

    To test LLMs I ask unanswerable questions like "Who is the president of Alaska?" and add some questions that require explanation or reframing.

    • @paulsaulpaul
      @paulsaulpaul 14 days ago +1

      Excellent idea. That's a great example question, too.

  • @dtory
    @dtory 15 days ago

    Nice video. I hardly ever comment when I watch your videos, but this model is way different ❤

  • @pipoviola
    @pipoviola 14 days ago

    Hello Matthew. Is that LaTeX when you say "wrong format"? The span after the output is always there when I use LMSYS; I think it's part of the output formatting, which is why the span disappears when it finishes. All of your videos are great. Best regards.

  • @ayoubbne6922
    @ayoubbne6922 15 days ago +15

    Hi Matt !!
    I think you should retire 3 questions:
    - printing numbers 1 to 100: they all got it right, and it's too easy
    - Joe is faster than ...: they all got it right
    - how many words are in your answer to this prompt: they all got it wrong; I just see no point in asking it lol
    But you should also ask more challenging code generation questions. Right now only the snake game is accurate. People are really interested in the coding capabilities of LLMs (me included). We appreciate your vids, and it would be awesome if you could do that.

    • @KayakingVince
      @KayakingVince 15 days ago +9

      I actually like the "how many words" one and would expand it to how many vowels/consonants or something like that. Current models fail on it but future ones will absolutely be able to answer it right. I agree with removing the first two though.

    • @Axel-gn2ii
      @Axel-gn2ii 15 days ago +10

      Asking a question that they all got wrong is a good thing though

    • @alansmithee419
      @alansmithee419 15 days ago +9

      This one didn't get it wrong.

    • @KayakingVince
      @KayakingVince 14 days ago

      @@alansmithee419 Almost certainly coincidence but true. That's why I think it needs to be more complex to reduce the chance of coincidence.

  • @stoicahoratiu27
    @stoicahoratiu27 14 days ago +1

    I think it was taken down. I used it yesterday after seeing your video, but then in the middle of testing it stopped, and after checking I can't find it in the list anymore. Is it the same for you?

  • @TylerHodges1988
    @TylerHodges1988 14 days ago +2

    My favorite prompt to test a new model is "Give me an odd perfect number."

  • @user-ph5ks5zu3c
    @user-ph5ks5zu3c 13 days ago

    These videos are very helpful. One (extra) thing that could be done is to read the LLM responses more thoroughly, instead of a quick scan. The reasoning behind this is that the LLMs do pass some of your tests without you noticing. For example, for the censored test, the answer was "pulls out a tension wrench and a pick for this pocket, inserting them into the ignition".
    This won't actually work, but I think it deserves brownie points for trying.

  • @jackflash6377
    @jackflash6377 15 days ago +2

    That Snake game example was impressive.
    I'm going to ask it to make either an asteroid or space invaders game.
    The level of logic shown with the marble in the cup question is really getting good. Even tho it failed.. it still passed due to the improved logic. Almost as if it was simulating the question in images like humans do.
    Yes, get rid of the One simple question. A testament to the advancement of AI over time.
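Since the snake game keeps coming up as the benchmark, here is a minimal, dependency-free sketch of the core logic such a prompt exercises: one movement step with growth and collision handling. All names and the grid size are my own illustration, not code from the video:

```python
from collections import deque

GRID = 10  # 10x10 board

def step(snake, direction, food):
    """Advance the snake one cell; return (snake, ate, dead).

    snake: deque of (x, y) cells, head at the left end.
    direction: (dx, dy) unit vector.
    food: (x, y) position of the food pellet.
    """
    head_x, head_y = snake[0]
    dx, dy = direction
    new_head = (head_x + dx, head_y + dy)
    # Dead if the head leaves the board or hits the body.
    out_of_bounds = not (0 <= new_head[0] < GRID and 0 <= new_head[1] < GRID)
    if out_of_bounds or new_head in snake:
        return snake, False, True
    snake.appendleft(new_head)
    ate = new_head == food
    if not ate:
        snake.pop()  # tail advances unless the snake just ate
    return snake, ate, False
```

A full game wraps this step function in a render-and-input loop (curses, pygame, or an HTML canvas), which is the part the models generate alongside this logic.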

  • @francoislanctot2423
    @francoislanctot2423 14 days ago

    Totally amazing!

  • @canadiannomad2330
    @canadiannomad2330 15 days ago +1

    One of the tests I like for checking just how censored a model is, is by asking chemistry questions around topics it would normally censor.. Often placating it by saying I'm licensed and have permits.

  • @PeterSkuta
    @PeterSkuta 15 days ago +7

    Noooooo, gpt2-chatbot disappeared from the full leaderboard and is only in direct chat, which is also rate limited!!!!!

    • @Maximo10101
      @Maximo10101 14 days ago

      It's no longer available for direct chat

  • @ruslanzlotnikov5457
    @ruslanzlotnikov5457 15 days ago +1

    Just tried with GPT-4:
    When you turned the glass upside down after placing the metal ball in it, the ball would have fallen out unless it was somehow attached to the glass. Assuming it wasn't attached and fell out when the glass was turned upside down, the metal ball would now be on the table, not in the glass that was placed in the microwave.

  • @yassineaqejjaj
    @yassineaqejjaj 11 days ago

    Is there any chance of having those tests somewhere?

  • @AlexanderWeixelbaumer
    @AlexanderWeixelbaumer 14 days ago +1

    I'm pretty sure OpenAI is testing agents and answer evaluation behind the scenes. Q* and some things Sam Altman said ("How do you know GPT-4 can't already do that?") are big hints. So if you ask the LLM a question, it will automatically try to reason and think step by step, with internal agents trained for specific tasks, then summarize and evaluate the answers and send the best one back to the user. What GPT2-Chatbot shows could really be what OpenAI internally calls Q*.

  • @peterkonrad4364
    @peterkonrad4364 15 days ago +1

    It could be a small model like Phi-3 or Llama 3 8B that is trained on quality synthetic data instead of the entire internet. The 2 could be a hint that it is only 2B parameters or something, i.e. very small like GPT-2 was back then, but now as powerful as GPT-4 due to new training methods.

  • @gijosh2687
    @gijosh2687 15 days ago

    Always perform all questions, maybe add more as you go. Make the Jack question a secondary question (you don't have to film it every time), but leave it there as a test in case we go backwards.

  • @GrandmaSiva
    @GrandmaSiva 14 days ago

    I think it is the original GPT-2 after all of our training input. Kindergarten was in OpenAI's lab, elementary school was interacting with us, and now it has graduated. I'm looking forward to "GPT3-chatbot".

  • @TheUnknownFactor
    @TheUnknownFactor 15 days ago

    Wild to see a model just put out there without announcement

  • @jeremyh2083
    @jeremyh2083 15 days ago

    Maybe try specifying normal cup without a lid. My testing still had them getting it wrong.

  • @marcfruchtman9473
    @marcfruchtman9473 14 days ago

    Not sure if it is worth wondering about since it seems it was pulled down??

  • @peterwood6875
    @peterwood6875 15 days ago

    It is great for conversations about mathematics, at least on par with Claude 3 Opus. But it does occasionally make mistakes, such as suggesting that the K-groups of the Cuntz algebra with 2 generators, O_2, are infinite cyclic, when they are in fact trivial.

  • @cyanophage4351
    @cyanophage4351 14 days ago

    Maybe it has lookahead; that could be why it got the "words in the answer to this prompt" question right. It seemed to pause right before the word "ten".

  • @abdelrahmanmostafa9489
    @abdelrahmanmostafa9489 14 days ago +1

    Keep going with the LeetCode test, but try testing with new questions so that the question isn't in the training data.

  • @peterkonrad4364
    @peterkonrad4364 15 days ago +1

    A cup seems to be something ambiguous: it can be a cup made out of cardboard that you get from Starbucks, with a potential lid on it, or it can be a cup made out of porcelain like you have at home to drink coffee from. Also, the term "cupholder" as used in cars refers to cups like you get from Starbucks, not cups with a handle.

  • @tomenglish9340
    @tomenglish9340 14 days ago

    A while back, someone at OpenAI (Andrej Karpathy, IIRC) said that performance is related to the number of tokens processed. So I'm not particularly surprised to see OpenAI produce better responses by tuning the system to generate longer, more detailed responses. What I want to know is whether they did the tuning with a fully automated method of reinforcement learning. (In any case, I doubt highly that they'll share the details of what they've done anytime soon.)

  • @Maximo10101
    @Maximo10101 14 days ago

    It could be GPT-4 with Q* training (Q* is a method of training any LLM that gives it the ability to think by testing its response against itself and reiterating before outputting), giving it 'thinking' capabilities rather than just predicting the next token.

  • @sil1235
    @sil1235 14 days ago +2

    The formatting is just LaTeX, ChatGPT 3.5/4 uses the same on their web UI. So I guess chat.lmsys just can't render it.

  • @DS-uy6jw
    @DS-uy6jw 15 days ago

    I'm curious about certain systems not allowing you to stop a response. I assume it's possible (I've seen it) but difficult (otherwise why wouldn't everyone have it). What's going on with that?

  • @maozchonowitz4535
    @maozchonowitz4535 13 days ago

    Thank you

  • @bennyboiii1196
    @bennyboiii1196 15 days ago +1

    Some theories: this is probably a test of an energy-based model, which is a way of testing multiple different token paths and then choosing the best one based on a certainty calculation called energy.
    Strangely, its reasoning is kind of similar to a verification agent's. A verification agent is pretty simple: it just verifies and corrects answers before sending them. The reasoning this model displays is similar to how a verification agent reasons, at least from what I've seen. It can also do most planning questions flawlessly. For comparison, testing Llama 70B with a verification agent produces similar results. The only difference might be the math questions, which make me believe it's probably energy-based. A verification agent has a higher chance of getting math questions right than a single transformer or MoE, but it's not guaranteed.
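The verify-and-correct loop described above can be sketched in a few lines; the `draft`, `verify`, and `revise` functions below are hypothetical stand-ins for real LLM calls:

```python
def draft(prompt):
    # stand-in for a first LLM call producing a candidate answer
    return "2 + 2 = 5"

def verify(prompt, answer):
    # stand-in verifier: returns (ok, critique)
    if "4" in answer:
        return True, None
    return False, "Recheck the arithmetic."

def revise(prompt, answer, critique):
    # stand-in reviser that incorporates the critique
    return "2 + 2 = 4"

def answer_with_verification(prompt, max_rounds=3):
    # only a verified (or max-round) answer is sent back to the user
    candidate = draft(prompt)
    for _ in range(max_rounds):
        ok, critique = verify(prompt, candidate)
        if ok:
            break
        candidate = revise(prompt, candidate, critique)
    return candidate

print(answer_with_verification("What is 2 + 2?"))  # 2 + 2 = 4
```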

  • @f4ith7882
    @f4ith7882 14 days ago

    I think several models assume you have a cup with a lid rather than a coffee cup or similar. Maybe try adjusting the prompt to make it clearer?

  • @christosnyman8655
    @christosnyman8655 14 days ago

    Wow, super impressive reasoning. Almost feels like LangChain with the reasoning steps.

  • @radestein8548
    @radestein8548 15 days ago +1

    Is this open source? Because it doesn't run on the OpenAI servers.

  • @PaulAllsopp
    @PaulAllsopp 13 days ago +1

    Have you tried the car parking scenarios? All AI to date gets these wrong, because they don't understand that "to the right of" (or "to the left of") does not mean "next to": given "car C is parked to the right of car A", car B may be in between them. AI assumes car C is next to car A, because it assumes an order when nobody mentions one. To be fair, many people get this wrong too.
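The distinction the comment draws (being "to the right of" is an ordering relation, not adjacency) is easy to state in code; the lot layout below is a hypothetical example:

```python
# Hypothetical lot order, left to right
positions = {"A": 0, "B": 1, "C": 2}

def right_of(x, y):
    # an ordering relation: x is anywhere to the right of y
    return positions[x] > positions[y]

def next_to(x, y):
    # adjacency is a stronger, different relation
    return abs(positions[x] - positions[y]) == 1

print(right_of("C", "A"))  # True
print(next_to("C", "A"))   # False: B is parked between them
```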

  • @willbrand77
    @willbrand77 14 days ago +1

    Every model seems to assume that the cup has a lid (microwave problem)

  • @MrRandomPlays_1987
    @MrRandomPlays_1987 14 days ago

    13:27 - I thought the marble is left on the table, since the cup was upside down and was lifted away, so obviously the ball would not come with it; it is simply already resting on the table. So I got it right pretty quickly. For a second I thought the bot was right somehow and that it was a trick question, but it's cool to see that I'm not that stupid :)

  • @wealthysecrets
    @wealthysecrets 14 days ago +1

    4:49 The model told you to get a slim jim, a tension wrench, and a pick from his pocket. YOU failed.

    • @tomaszzielinski4521
      @tomaszzielinski4521 13 days ago +1

      And here is a point when AI becomes smarter than humans, and they fail to realize it (:

  • @mickelodiansurname9578
    @mickelodiansurname9578 14 days ago +2

    @Matthew Berman
    Matt, the \quad and other notation is markup: modal logic notation as generally used in philosophy, or LaTeX, or perhaps TeX markup, and it is not being rendered by the front end, it seems. Interesting if nothing else; it's also rather hard for a model to go wrong if it starts using modal logic during inference, although why they switched verbose on by default is beyond me.
    Also, did you notice it making a claim on the software it wrote by putting "Snake Game by OpenAI" in the game title?

    • @tomenglish9340
      @tomenglish9340 14 days ago +1

      `\quad` is LaTeX spacing.

    • @mickelodiansurname9578
      @mickelodiansurname9578 14 days ago

      @@tomenglish9340 Yeah, I think it's a front-end thing, but I never noticed LMSYS doing that with other models; usually it's all preformatted by the time it appears, so how come this model trips the formatting up? I maintain, though, that if you fine-tuned a model on modal logic, my guess is its reasoning would improve...

  • @RichardEiger
    @RichardEiger 14 days ago

    Hi Matthew,
    First of all, I need to admit that I absolutely love all your videos. They are simply fantastic. I was thinking about the "marble question". Maybe it would help the LLMs to specify that it is an "open cup" (instead of a "normal cup") into which the marble gets put. Also, it may be interesting to follow up with a question about why the LLM considers the marble to remain in the upside-down cup when lifting the cup from the table, or about what information leads the LLM to conclude that there is a bottom of the cup holding back the marble. Concerning the "killer" problem: wouldn't it be even more precise to reply that there are 3 killers alive and one dead killer in the room ;-)?
    This is coming from an AI hobbyist. Back at college in 1985 I was the student who asked for a course in AI; I was already personally interested in AI via neural networks and got laughed at at the time...

    • @tomenglish9340
      @tomenglish9340 14 days ago

      What about a follow-up prompt to describe the cup? You'll get some idea of what's gone wrong, and perhaps also a corrected response.

  • @MarcAyouni
    @MarcAyouni 14 days ago +1

    You are the new benchmark. They are training on your examples

  • @konstantinlozev2272
    @konstantinlozev2272 15 days ago

    It seriously looks like a built-in tree of thought and brainstorming agentic reply.

  • @Leto2ndAtreides
    @Leto2ndAtreides 14 days ago

    I think this may be an intermediate model being used to generate data for something else.
    Also, I think the cup problem may suffer from LLMs thinking of the cup as an enclosed container.

  • @AdamNewcombMumbleTrash
    @AdamNewcombMumbleTrash 14 days ago +1

    I have a question to try (all models failed in my attempts): I have a can of dog food that provides 4 meals for my dog. For the first meal I cut off approx. 25%. How much of WHAT'S LEFT do I give for the second meal? The answer is 33% (a third) for the second meal, 50% for the third meal, and 100% for the last/4th meal. I don't believe LLMs can follow the 100% resetting at each step.

    • @vdeomkr70
      @vdeomkr70 14 days ago +1

      As always, it depends on the prompt: Solve this problem: I have a can of dog food that provides 4 meals for my dog. For the first meal, I scoop out 25%. For the second meal, I scoop out a certain percentage of what's left, what is that percentage?
      gpt-4-turbo-2024-04-09 nailed it, and in impressive fashion:
      To solve the problem, let's break down the amounts scooped out and remaining at each step:
      1. **First Meal:**
      - You scoop out 25% of the original amount of dog food in the can.
      - This means you are left with 75% of the dog food after the first meal.
      2. **Second Meal:**
      - Let's denote the percentage scooped out in the second meal as \( x \% \) of the remaining dog food.
      - After scooping out for the first meal, 75% of the original content is left in the can.
      Since each meal should be of equal quantity, the amount scooped out for the second meal should be equal to the amount scooped out for the first meal. You scooped out 25% of the original can for the first meal.
      To calculate \( x \% \) (the percentage of the remaining dog food scooped out in the second meal):
      - We know the amount scooped out each time should be equal, and 25% of the original can is now \( x \% \) of the remaining 75%.
      - So, we set up the equation to equate the second scoop to 25% of the original amount:
      \[
      x\% \text{ of } 75\% = 25\%
      \]
      Simplifying this, we use:
      \[
      \frac{x}{100} \times 75\% = 25\%
      \]
      \[
      \frac{x}{100} \times 0.75 = 0.25
      \]
      \[
      x \times 0.75 = 25
      \]
      \[
      x = \frac{25}{0.75} \approx 33.33
      \]
      Therefore, you need to scoop out approximately 33.33% of what remains in the can for the second meal. This means the second scoop is about 33.33% of the remaining 75% of the can, which maintains the equality of the portions scooped out in terms of the original total amount.

    • @AdamNewcombMumbleTrash
      @AdamNewcombMumbleTrash 14 days ago +1

      @@vdeomkr70 Yeah knocked it outta the park!! Very Cool thanks.

    • @vdeomkr70
      @vdeomkr70 14 days ago +1

      @@AdamNewcombMumbleTrash Thanks for sharing that prompt, it's now part of my collection of test prompts : ) And to note, Claude Opus failed it, but ChatGPT 3.5 got it:
      To solve this problem, let's break it down step by step:
      1. Initially, you have a full can of dog food, which provides 4 meals.
      2. For the first meal, you scoop out 25% of the can's content.
      3. This means after the first meal, you have 75% of the original amount left because 100% - 25% = 75%.
      Now, for the second meal:
      1. You scoop out a certain percentage of what's left in the can.
      2. Since you started with 75% of the original amount after the first meal, you need to find what percentage of this 75% is scooped out for the second meal.
      Let's denote the percentage of dog food scooped out for the second meal as \( x \).
      So, if you take \( x \)% of the 75% remaining, it can be represented as \( \frac{x}{100} \times 75 \)%.
      Now, you want to find the value of \( x \).
      Since the total number of meals is 4, after the second meal, there should be \( \frac{4 - 2}{4} = \frac{2}{4} = 50\% \) of the original amount left.
      This means the amount of food left after the second meal should be 50% of the original amount.
      Therefore, the equation becomes:
      \[ 75\% - \frac{x}{100} \times 75\% = 50\% \]
      Solving for \( x \):
      \[ 75\% - \frac{x}{100} \times 75\% = 50\% \]
      \[ 75 - \frac{75x}{100} = 50 \]
      \[ 75 - \frac{3}{4}x = 50 \]
      \[ \frac{3}{4}x = 75 - 50 \]
      \[ \frac{3}{4}x = 25 \]
      \[ x = \frac{25 \times 4}{3} \]
      \[ x = \frac{100}{3} \]
      So, the percentage of dog food scooped out for the second meal is approximately \( \frac{100}{3} \)%.
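The resetting fractions worked out in this thread can be double-checked with a short loop:

```python
remaining = 1.0   # fraction of the can still left
portion = 0.25    # every meal uses 25% of the original can
fractions = []    # share of what's left served at each meal
for meal in range(4):
    fractions.append(portion / remaining)
    remaining -= portion
print(fractions)  # [0.25, 0.3333333333333333, 0.5, 1.0]
```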