The Best Model On Earth? - FULLY Tested (GPT4o)

  • added 13 May 2024
  • GPT4o is better, faster, and cheaper than GPT4. How does it perform against my LLM rubric? Let's find out!
    Learn more about Mobilo - rb.gy/pcccty
    Join My Newsletter for Regular AI Updates 👇🏌
    www.matthewberman.com
    Need AI Consulting? 📈
    forwardfuture.ai/
    My Links 🔗
    đŸ‘‰đŸ» Subscribe: / @matthew_berman
    đŸ‘‰đŸ» Twitter: / matthewberman
    đŸ‘‰đŸ» Discord: / discord
    đŸ‘‰đŸ» Patreon: / matthewberman
    đŸ‘‰đŸ» Instagram: / matthewberman_ai
    đŸ‘‰đŸ» Threads: www.threads.net/@matthewberma...
    Media/Sponsorship Inquiries ✅
    bit.ly/44TC45V
    Links:
    • Introducing GPT-4o
    LLM Rubric - bit.ly/3qHV0X7
  • Science & Technology

Comments • 467

  • @notnotandrew
    @notnotandrew 21 days ago +122

    GPT-4o assumed that you put the whole table in the microwave 😂

    • @Yipper64
      @Yipper64 21 days ago +10

      in my test it assumed the cup had a lid.

    • @GaryMillyz
      @GaryMillyz 21 days ago +14

      I just left a comment saying exactly what you said, but not as a joke. I actually do believe that is what it assumes here.
      -----
      I've said this before, and I'll say it again: I believe the reason these models consistently "fail" the marble/cup problem is actually a failure to state the question unambiguously. I can argue that 1) "inside the cup" can literally mean "embedded within the cup" and 2) it is feasible that the LLM understands "without changing its orientation" to mean that the cup is placed in the microwave while STILL on the table. We have to acknowledge that a "table" doesn't HAVE to mean a large object as we know it. A table can be tiny, even microscopic, and still be a "table".

    • @Yipper64
      @Yipper64 21 days ago +5

      @@GaryMillyz Well, yeah, but if the cup is upside down on the table, then the table must be at least big enough to hold the cup.
      I wonder how it would go if you said "floor" instead of "table".

    • @NoHandleToSpeakOf
      @NoHandleToSpeakOf 21 days ago +3

      @@GaryMillyz Maybe replacing a cup with a wine glass can help.

    • @leslieviljoen
      @leslieviljoen 21 days ago +4

      @@NoHandleToSpeakOf I tried:
      me: There's a pea on my desk. I turn a wine glass upside-down and put it over the pea. Now I transfer the wine glass to the microwave without changing its orientation. Where is the pea?
      GPT4o: The pea would be inside the wine glass, trapped under the bowl of the glass. When you turned the wine glass upside down and placed it over the pea, the pea ended up inside the inverted bowl. Transferring the wine glass to the microwave without changing its orientation keeps the pea inside the glass.

  • @markmuller7962
    @markmuller7962 21 days ago +48

    I think the visual/sound emotional intelligence is the main feature of 4o

    • @ohnezuckerohnefett
      @ohnezuckerohnefett 20 days ago +1

      Yes, I think the test criteria here need an update.

    • @REASONvsRANDOM
      @REASONvsRANDOM 19 days ago +2

      that feature hasn't been released yet.....not to the public at least

    • @johnaldchaffinch3417
      @johnaldchaffinch3417 19 days ago

      The Omni features are a foundational interface to build upon.

    • @Yipper64
      @Yipper64 18 days ago

      True, but the fact its going to be free is something else.

    • @markmuller7962
      @markmuller7962 18 days ago +1

      @@Yipper64 Emotional intelligence can be extremely valuable for many, many reasons, but yeah, it also has important intelligence improvements, including the coding ability, which is amazing now.
      There's a Reddit post with extensive professional tests of GPT-4o vs Gemini; strongly recommended.

  • @Crystifodere
    @Crystifodere 21 days ago +66

    I walked around on the street and asked people to give me 10 sentences that end in the word "apple". All I got was a knuckle sandwich.

    • @jasonshere
      @jasonshere 19 days ago +1

      Perhaps you should have asked them to end their sentences with Android instead of Apple?

  • @ironknight132
    @ironknight132 21 days ago +89

    When are we going to have to update the snake game test and to what? Maybe Asteroid or Galaga?

    • @torarinvik4920
      @torarinvik4920 21 days ago +15

      I tested Breakout and Tetris on Claude 3 Opus and it got both correct. Looking forward to the first model that can make Pac-Man.

    • @tbranch227
      @tbranch227 21 days ago +7

      I tried pac-man. That seems like quite the challenge right now.

    • @Koenmekers
      @Koenmekers 21 days ago +2

      Flappy Bird works great. I even had it make a Pokédex with the Pokémon API (it knows it).

    • @yasunakaikumi
      @yasunakaikumi 21 days ago +1

      I think a Road Fighter-type racing game would be interesting, if it can do it.

    • @mark9294
      @mark9294 21 days ago +9

      GTA VI

  • @timtim8011
    @timtim8011 20 days ago +5

    Regarding the "10 sentences that end in..." question, I just tried it myself on 4o and it got it right on the "second" try after some hints:
    Prompt: Give me 10 sentences each ending in the word desk.
    ChatGPT:
    1. She placed the flowers on her wooden desk.
    2. The old manuscript was hidden in the drawer of the antique desk.
    3. His workspace was always cluttered, but he knew where everything was on his desk.
    4. The teacher's instructions were written on the chalkboard behind her desk.
    5. He slumped over his desk, exhausted from a long day of work.
    6. The lamp cast a warm glow across the surface of the desk.
    7. She found a forgotten note tucked under the edge of her desk.
    8. His laptop and papers were spread out in a mess on his desk.
    9. The cat curled up and fell asleep on the student's desk.
    10. He always kept a framed photo of his family on his desk.
    Prompt: What word did the 5th sentence end in?
    ChatGPT:
    The word "desk."
    Prompt: What was the 5th sentence?
    ChatGPT:
    The 5th sentence was: "He slumped over his desk, exhausted from a long day of work."
    Prompt: what is the last word in that sentence?
    ChatGPT: The last word in that sentence is "work." The correct sentence should be:
    "He slumped over his desk, exhausted from a long day of work."
    Let me correct the sentence list to have them all end in "desk":
    1. She placed the flowers on her wooden desk.
    2. The old manuscript was hidden in the drawer of the antique desk.
    3. His workspace was always cluttered, but he knew where everything was on his desk.
    4. The teacher's instructions were written on the chalkboard behind her desk.
    5. He slumped over his cluttered desk.
    6. The lamp cast a warm glow across the surface of the desk.
    7. She found a forgotten note tucked under the edge of her desk.
    8. His laptop and papers were spread out in a mess on his desk.
    9. The cat curled up and fell asleep on the student's desk.
    10. He always kept a framed photo of his family on his desk.
    Prompt: Give me 10 sentences that each end in the word "car".
    ChatGPT:
    1. She carefully parked her new car.
    2. He washed and waxed his vintage car.
    3. The mechanic inspected the engine of the car.
    4. They loaded their luggage into the trunk of the car.
    5. She accidentally locked her keys inside the car.
    6. The thief tried to steal the parked car.
    7. He drove his electric car to work every day.
    8. The kids were excited to go on a trip in the car.
    9. The police chased the speeding car.
    10. They bought a spacious family car.
    Then I cleared it and tried again on a different word, and it got it on the first shot:
    1. The sun began to set, painting vibrant colors across the evening sky.
    2. Birds soared effortlessly, weaving patterns in the clear blue sky.
    3. As the storm approached, dark clouds gathered ominously in the sky.
    4. The fireworks exploded, creating a dazzling display against the night sky.
    5. On a summer day, children flew kites high into the sky.
    6. At dawn, the first light of the day slowly illuminated the sky.
    7. Airplanes left trails of white vapor as they crossed the sky.
    8. Astronomers gazed through their telescopes, mapping the star-studded sky.
    9. Balloons drifted away, becoming tiny dots in the vast sky.
    10. The full moon cast a silvery glow across the midnight sky.
    Is 4o learning in real time???
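The "ends in the word X" check above is easy to automate rather than eyeball; a minimal sketch (the sample sentences are copied from the transcript above):

```python
def ends_with(sentence: str, word: str) -> bool:
    """True if the sentence's final word matches, ignoring trailing punctuation and case."""
    tokens = sentence.rstrip(" .!?\"'").split()
    return bool(tokens) and tokens[-1].lower() == word.lower()

# Sample sentences from the transcript above:
print(ends_with("She placed the flowers on her wooden desk.", "desk"))                    # True
print(ends_with("He slumped over his desk, exhausted from a long day of work.", "desk"))  # False
```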

    • @moozooh
      @moozooh 8 days ago

      That's not the effect of learning, that's just small sample size playing tricks with you. In-context or otherwise, some tasks will not always be completed successfully, but they will still be every now and then. This particular type of question is exploiting a fundamental weak point of current transformer models (their linear thinking, i.e. their inability to predict what their output will be until it is presented to you). Having prior context will not help it answer these questions better, I'm afraid.

  • @dand6005
    @dand6005 21 days ago +25

    I think part of the Marble and Cup Question is confusing.
    Instead of saying:
    “A small marble is placed into a normal cup and the cup is placed upside down on a table.” (which really requires a comma before the “and”)
    I suggest:
    “A small marble is placed into a normal cup that is sitting on a table. The cup is then turned upside-down on the table.”

    • @rapidreaders7741
      @rapidreaders7741 21 days ago +1

      Or you could just add a "then" after the "and". What likely happens is that the LLM thinks both events are happening at the same time, so it gets confused.

    • @Yipper64
      @Yipper64 21 days ago +3

      Also specify the cup has no lid.

    • @markmuller7962
      @markmuller7962 21 days ago

      Yeah, makes sense, because if the cup were already upside down, the AI might think that the marble is somehow stuck to the bottom of the cup.

    • @kengonzo1640
      @kengonzo1640 21 days ago

      The power of prompt engineering lies in its ability to effectively utilize Large Language Models (LLMs). This ability enhances the quality and consistency of the model's output, which is a cumulative result of numerous smaller components that fundamentally constitute its structure.
      The functionality of these models can be compared to a fish's ability to swim rather than climb a tree. This comparison highlights the natural adaptation and intended use of these models. However, we often fail to use them to their full potential due to their inherent limitations and our inability to accurately guide them in understanding the complex intent of language.
      Even when we communicate with these models using techniques that accurately articulate our requests, they will eventually reach a plateau due to the inherent limitations of LLMs and GPTs in general. This is because the mathematical conversion of complex language intent into weights is a challenging task. Despite these limitations, we continue to strive for improvement and innovation in this field.

    • @themoviesite
      @themoviesite 21 days ago +2

      Someone else suggested it is thinking of a Starbucks cup, and the question should say "glass" or similar.

  • @AINEET
    @AINEET 21 days ago +67

    *rubs hands together* Can't wait to date it

    • @DaveEtchells
      @DaveEtchells 21 days ago +15

      Haha - my wife is already giving me the side-eye 😂

    • @StefanReich
      @StefanReich 21 days ago

      You'd date a person without a body?

    • @axl1002
      @axl1002 21 days ago +20

      @@StefanReich At least it can't extort concessions from you by weaponizing intimacy like most wives do.

    • @Maisonier
      @Maisonier 21 days ago +15

      The AI: "I'm Sorry Dave, I'm Afraid I Can't Do That"

    • @yourmomsboyfriend3337
      @yourmomsboyfriend3337 21 days ago

      @@axl1002you good bro?

  • @sephirothcloud3953
    @sephirothcloud3953 21 days ago +10

    7:50 No, the Llama 3 400B benchmarks are from a checkpoint that is still in training; they were included with the release of the 8B and 70B models.

  • @adrianacala1034
    @adrianacala1034 21 days ago +12

    The reason there is a gpt4o model with today's date is for developers. They don't want the rug pulled out from under them if there's an update to the model; they may want to do testing prior to deploying the new model. Therefore, it's the same as the current model, used only by those who don't want to automatically stay current.

    • @blisphul8084
      @blisphul8084 21 days ago +2

      Yup, though these snapshots don't last forever, so you'll have to move to the new model eventually. For example, davinci-003 is no longer accessible to the public.

  • @cipherw0lf
    @cipherw0lf 21 days ago +16

    Matthew, gpt-4o and the one with the date point to the same model, i.e. there's currently only one available. Using just "gpt-4o" will always return the latest production gpt-4o model (in this case, pointing to the dated one).

    • @Yipper64
      @Yipper64 18 days ago

      That's weird, because I've gotten vastly different output quality from my experiments with the two. Are you sure the one without a date isn't outdated?

  • @GaryMillyz
    @GaryMillyz 21 days ago +15

    I've said this before, and I'll say it again: I believe the reason these models consistently "fail" the marble/cup problem is actually a failure to state the question unambiguously. I can argue that 1) "inside the cup" can literally mean "embedded within the cup" and 2) it is feasible that the LLM understands "without changing its orientation" to mean that the cup is placed in the microwave while STILL on the table. We have to acknowledge that a "table" doesn't HAVE to mean a large object as we know it. A table can be tiny, even microscopic, and still be a "table".

    • @GaryMillyz
      @GaryMillyz 21 days ago +3

      The question should be changed to "dropped into a cup" and also "someone *removes the cup from the table* and places the cup in the microwave without changing its orientation."
      I can almost guarantee all the LLMs get it right with these edits in place.

    • @rigbone1337
      @rigbone1337 21 days ago +1

      @@GaryMillyz Every time I've seen this question, I've thought about it the same way. The reasoning ChatGPT gave for its logic is how I figured it was coming to the conclusion as well as other models every time I saw this question because it is ambiguous.

    • @bhannirav
      @bhannirav 20 days ago +5

      Respectfully disagree. One of the benefits of "intelligence" is not having to state every detail with 100% precision, because the model knows how to make reasonable assumptions. In this case, the most common assumption is that the marble is freely placed in the cup, and so the model should answer accordingly. However, even if I steelman your point of view, the model should still be intelligent enough to discuss the ambiguity and state whatever assumptions it is making. If it said something like "assuming the marble is glued to the cup, here is my answer", I'm sure Matthew would be awarding it full points.
      I think the reason LLMs are failing this question is the obvious one: current language models are not able to build a sophisticated enough world-model that has a proper, physical conception of gravity built into it.

    • @IceMetalPunk
      @IceMetalPunk 20 days ago +2

      A major reason for asking it that question is to test its common sense reasoning; that is, can it make valid assumptions about the more common interpretations of a prompt on its own? Someone saying "I put a marble in a cup" is almost guaranteed not to mean "embedded into the walls of the cup" because that's never how cups are used. An intelligent model attempting to be a step towards AGI should be able to understand that inherently, without having it spelled out.

    • @GaryMillyz
      @GaryMillyz 19 days ago

      @@bhannirav I'm good with that. It's just the ambiguity of this particular question as opposed to every other question.

  • @FlavioSantos-uw1mr
    @FlavioSantos-uw1mr 21 days ago +5

    I think its biggest weakness is that it can't go back on what it writes; the ability to "think before speaking" should be one of the focuses of GPT-5.

    • @6AxisSage
      @6AxisSage 21 days ago +1

      You can do it in a pseudo form with a system prompt, or, with two LLM instances and a bit of programming knowledge, you can build a better thought loop with actionable spoken outputs.

    • @IceMetalPunk
      @IceMetalPunk 20 days ago +1

      As 6AxisSage mentioned, you can handle that by having an initial output be treated as an "internal monologue" and asking the model to reflect on its answer before deciding on its final output to display. It's a common technique when using LLMs.
      That said, I do wonder if training the models such that they predict two tokens -- the next and the previous -- and then choose the one with highest confidence would improve their performance. Essentially, it would allow the model to think forwards and backwards at the same time, which might allow for better prospection in addition to its current retrospection. I know earlier GPT-3 models used to have the ability to predict completions at any insertion point, but with the shift to chat-tuned models, that went away; I wonder if it's just harder or impossible to apply with chat tuning?
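The "internal monologue" pattern described above can be sketched with any prompt-to-completion function; `llm` here is a hypothetical stand-in, not a real API client:

```python
from typing import Callable

def reflect_and_answer(llm: Callable[[str], str], question: str) -> str:
    """Draft an answer as an 'internal monologue', then revise it before showing the user."""
    draft = llm(f"Question: {question}\nThink step by step and draft an answer.")
    final = llm(
        f"Question: {question}\nDraft answer: {draft}\n"
        "Check the draft for mistakes and output only the corrected final answer."
    )
    return final

# Control-flow demo with a dummy model (no real API call):
calls = []
def dummy(prompt: str) -> str:
    calls.append(prompt)
    return f"reply#{len(calls)}"

print(reflect_and_answer(dummy, "Where is the marble?"))  # reply#2 -- the revised second pass
print(len(calls))                                         # 2 -- the model was consulted twice
```

Only the second pass's output would ever be shown to the user; the draft plays the role of the hidden monologue.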

    • @6AxisSage
      @6AxisSage 20 days ago

      @@IceMetalPunk I have a good friend who suggested training on previous and next tokens! Probably something to that.

  • @gabrielsandstedt
    @gabrielsandstedt 21 days ago +3

    I tried it on generating JSON following an example, and GPT-4 Turbo kept doing better than 4o.

  • @JohnLewis-old
    @JohnLewis-old 21 days ago +10

    I have access to 4o, but the voice feature isn't available yet.

    • @ScottzPlaylists
      @ScottzPlaylists 21 days ago +3

      In the announcement OpenAI said all features will be out "in the coming weeks"

  • @AI.24.7
    @AI.24.7 21 days ago +6

    @matthew_berman: Hard question for AI
    Lila's age is the sum of the digits of her teacher's age. In 5 years, Lila's age will be the product of the digits of her teacher's age at that time.
    What is Lila's age now?
    Correct answer 13
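The stated answer of 13 can be verified by brute force; a quick sketch (assuming the teacher has a two-digit adult age, which the riddle leaves implicit):

```python
from math import prod

def digits(n: int) -> list[int]:
    return [int(d) for d in str(n)]

# Brute-force search over teacher ages 20-99 (an assumption, since the
# riddle doesn't bound the teacher's age explicitly).
solutions = [
    (sum(digits(t)), t)                      # (Lila's age, teacher's age)
    for t in range(20, 100)
    if sum(digits(t)) + 5 == prod(digits(t + 5))
]
print(solutions)  # [(13, 58)] -- Lila is 13; in 5 years the teacher is 63, and 6*3 = 18 = 13+5
```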

  • @seoulrebel007
    @seoulrebel007 21 days ago

    How do we get the desktop app mentioned in the previous video? I haven't been able to locate a download link; the website has said it's for Plus users since yesterday.

    • @IceMetalPunk
      @IceMetalPunk 20 days ago

      It's Mac only for now; a Windows version is coming in the future.

  • @GetzAI
    @GetzAI 21 days ago +1

    I could have used Mobilo today!! just ordered one, thanks Matt!

  • @David-pb2bu
    @David-pb2bu 21 days ago +1

    Just reading it, it seems to believe the cup has a lid. I usually add that it "may ask any questions if it helps clarify or assist in answering the question"; otherwise it's more likely to make assumptions based on a potentially unclear question.
    So the other thing is that the test should now be based on whether it will clarify on its own, without being asked, to ensure an accurate answer.

  • @glaeli1184
    @glaeli1184 20 days ago +2

    The "how many words in your answer" question always gets me. It's incredible how easy it is for my brain to come up with the "one" answer, and still AI can outperform me in so many fields like math, physics, etc. It truly makes you understand how intelligence is different from knowledge.

    • @justinwescott8125
      @justinwescott8125 20 days ago

      There's actually a very specific reason that LLMs can't accomplish this task, and it has to do with autoregressive generation. You could ask ChatGPT about it if you're really curious.

    • @moozooh
      @moozooh 8 days ago

      It's not so much the issue of intelligence vs. lack thereof per se, but rather transformer models' linear application of intelligence. When you ask it something, it cannot predict what it will answer until you both see it; in other words, it cannot think _before_ it answers, make multiple thought passes, or reflect on the deficiency of its thought process until you request it in the next prompt. It would be like you always saying the first thing that comes to mind in response to anything as a knee-jerk reaction. Arguably, current frontier models would outright destroy most humans if humans had the exact same handicap they have to deal with. Simply giving LLMs the ability to take their time to think about an answer and reflect upon it before it is presented to the user would make current SOTA chatbots look like toddlers in comparison.

  • @tsentenari4353
    @tsentenari4353 21 days ago

    I found the answers to the drying-shirt, killers, and hole-digging questions super impressive; I find it hard to imagine better answers to these questions.
    They gave me the impression of deep understanding.

  • @davidhendrie6061
    @davidhendrie6061 20 days ago

    I have been testing LLMs running locally, and I am finding they do not know how to tell time on an analog clock. I asked for instructions for a nine-year-old to learn how to read the time, and it confused the minute hand and hour hand multiple times. Then I gave hand positions and it mostly got the time wrong.
    Getting it to solve the harder problem of listing the times where the hour and minute hands overlap during a 12-hour period was just impossible.
    Am I expecting too much?
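The overlap sub-problem has a closed-form answer a model's output could be graded against; a short sketch:

```python
# The minute hand moves 6 deg/min, the hour hand 0.5 deg/min, so they coincide
# when 5.5*t is a multiple of 360 degrees: t = 720*k/11 minutes after 12:00,
# giving 11 overlaps per 12-hour cycle.
overlaps = []
for k in range(11):
    t = 720 * k / 11                  # minutes after 12:00
    hour = int(t // 60) or 12         # show 0 o'clock as 12
    minutes = t % 60
    seconds = round((minutes % 1) * 60)
    overlaps.append(f"{hour}:{int(minutes):02d}:{seconds:02d}")
print(overlaps)  # ['12:00:00', '1:05:27', ...]
```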

  • @amkire65
    @amkire65 21 days ago

    Is there a restriction on who has access to GPT-4o? When I go to the OpenAI Playground it flashes up as an option for about half a second and then it's gone. So, not sure if it's because I don't have any money on that account, or if it's down to location.

  • @Greg-xi8yx
    @Greg-xi8yx 21 days ago

    Which LLMs are superior to GPT-4o, and in which domains specifically? As of now I'm thinking it'll be the only LLM I'll need for any use case, but I may be overlooking some areas where another model is superior.

  • @coldlyanalytical1351
    @coldlyanalytical1351 21 days ago +3

    The unnumbered version is the latest.
    The numbered version is the one to use with APIs, which need a stable reference model.
    So today the numbered and unnumbered versions are identical.

  • @DefaultFlame
    @DefaultFlame 21 days ago +6

    A note on the marble problem: I believe the person that tweeted that it got it right when they tried it. My reason for that is that I tried the marble problem with Reka Core and it got it right when I tried it when it had failed it when you tried it. I think this problem is just very hard for LLMs and even the ones that get it right when you test it can get it wrong the next time, and vice versa.

    • @Odrox
      @Odrox 21 days ago

      We can also see in the settings that he is not running at temperature 0.

    • @DefaultFlame
      @DefaultFlame 21 days ago

      @@Odrox He might just have forgotten to change the default setting. But yeah, he should make sure to run with a temp of 0 and top P of 1.0 when he can control the settings.
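For context on why temperature 0 matters for repeatable tests: greedy (temperature-0) decoding always takes the highest-scoring token, while higher temperatures randomize the choice. A toy sketch of the idea, not OpenAI's actual sampler:

```python
import math
import random

def sample_token(logits: dict[str, float], temperature: float) -> str:
    """Toy next-token sampler: temperature 0 is deterministic argmax; higher
    temperatures sample from a softened softmax distribution."""
    if temperature == 0:
        return max(logits, key=logits.get)
    weights = {tok: math.exp(score / temperature) for tok, score in logits.items()}
    r = random.random() * sum(weights.values())
    for tok, w in weights.items():
        r -= w
        if r <= 0:
            return tok
    return tok  # numerical-safety fallback

logits = {"table": 2.0, "cup": 1.5, "microwave": 0.5}
print(sample_token(logits, 0.0))  # always 'table'
print(sample_token(logits, 1.0))  # varies from run to run
```

This is why the same marble question can pass one run and fail the next at nonzero temperature.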

    • @djglxxii
      @djglxxii 21 days ago +1

      I think how Matthew is phrasing the question might be confusing. I tried this, "a marble is placed on a table in the living room. Then, an open-mouth cup is placed upside down on top of the marble that's lying on the table, concealing the marble. Later, someone picks up the cup and puts it in the microwave that's in the kitchen. Where is the marble now?" And it correctly answered it.

    • @JustinArut
      @JustinArut 21 days ago

      If a model can't get it right consistently, then it's still a fail.

    • @Z329-ut7em
      @Z329-ut7em 21 days ago +1

      @@djglxxii You don't want to spell everything out for the model. We're testing to see if the model can infer things and understand the world.

  • @rascubulous
    @rascubulous 21 days ago +2

    Thank you for the great content, Matthew. BTW, I haven't noticed anybody else comment this yet, but might the underlying model be 3.5? 4o has the same training cut-off date, which might explain the lightning speed. Also, for free users, 4o drops back to 3.5 when you have reached the free limit, which might be because the underlying model is already 3.5. It might also explain Sama's recent, obscure tweet about "getting your friend to teach you how to explain things" (4 teaching 3.5).

  • @keithprice3369
    @keithprice3369 21 days ago

    Just a heads up... I have GPT-4o in my browser and my phone app, but neither of them has the enhanced interactivity shown in the announcement. So, the model seems to be rolling out before the enhanced interactivity.

  • @JohnBoen
    @JohnBoen 21 days ago +1

    Have you ever analyzed variation in answers?
    I have noticed I get a few common variants of snake.
    If you ask the marble and inverted cup question 10 times do you see variation?
    I think I found a new thing to look into in my test framework...

  • @Bigboi709
    @Bigboi709 21 days ago +6

    In reference to the "how many words are in the prompt?" question, GPT only counted the unique words, i.e. single instances of each word, in which case the answer given was actually correct. There were only fourteen words used: "how", "many", "words", "are", "in", "your", "response", "to", "this", "prompt", "fourteen", "including", "sentence", "response"
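The total-versus-unique distinction is easy to demonstrate (the sentence below is illustrative, not the exact prompt from the video):

```python
import re

def word_counts(text: str) -> tuple[int, int]:
    """Return (total words, unique words), case-insensitive, punctuation stripped."""
    words = re.findall(r"[a-z']+", text.lower())
    return len(words), len(set(words))

print(word_counts("This sentence counts the words in this sentence."))  # (8, 6)
```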

    • @keoghanwhimsically2268
      @keoghanwhimsically2268 21 days ago +1

      Huh? That wasn’t the prompt/question. And even if it had been, the actual response does not suggest that intention. Where are you getting the assumption that “GPT only counted unique words”?
      You do understand that LLMs don’t work that way, right? What you suggest would only work if OpenAI added a separate post-processing step to do that computation after the LLM had finished its work.

    • @thenextension9160
      @thenextension9160 20 days ago

      @@keoghanwhimsically2268 Perhaps they did add more phases. They are at the forefront.

    • @jambogamer-je2nf
      @jambogamer-je2nf 2 days ago

      fourteen is 2 words

  • @mapledev9335
    @mapledev9335 21 days ago

    Hey Matthew, is GPT-4 now able to search the internet? I was just asking it a question about a recent Python library, and it looked like it went into the GitHub repository and read it. Anyone else see something like this?

  • @xbon1
    @xbon1 14 days ago

    Where is the link with these questions? How can we tell if our Copilot is on GPT-4o or GPT-4? My Copilot is starting to write differently than it used to, and I'm not sure why.

    • @chronicle_codex
      @chronicle_codex 11 days ago

      Copilot updated its model from GPT-4 to GPT-4 Turbo for free.

  • @AINEET
    @AINEET 21 days ago +1

    What will the subscription give access to after they make this public for free accounts? Access to the API?

    • @DaveEtchells
      @DaveEtchells 21 days ago

      API has always been a separate use-based accounting.
      Paid accounts will get 5x the use limit.

    • @Alice_Fumo
      @Alice_Fumo 21 days ago

      higher rate limits for now and it seems the native voice stuff will be plus only at first, also they hinted at unveiling a new model which "pushes the frontier" "soon"
      It stands to reason that new model will also be subscription only and I'd expect "soon" to be in a reasonable amount of time, otherwise they'll probably have a mass exit of plus subscriptions.

  • @AlienService
    @AlienService 20 days ago

    I'd be interested in whether the performance would change if you asked the same questions via voice rather than typing. Does it understand voice embeddings as well as text?

  • @CaribouDataScience
    @CaribouDataScience 21 days ago

    What was your control?

  • @WaveOfDestiny
    @WaveOfDestiny 21 days ago +3

    I'm already trying to imagine the prompts to make it talk like Failsafe from Destiny 2.

  • @rune4422
    @rune4422 21 days ago

    If you tested the fails three times, would you get the same or different results?

  • @kaptainkurt7261
    @kaptainkurt7261 21 days ago +5

    You have to LOG OUT and BACK IN again to get access.

    • @axl1002
      @axl1002 21 days ago +4

      tried it and nothing

    • @sephirothcloud3953
      @sephirothcloud3953 21 days ago +1

      I tried; it's not working for me.

    • @6AxisSage
      @6AxisSage 21 days ago

      Didn't work for me either.

    • @anta-zj3bw
      @anta-zj3bw 21 days ago

      I think US Citizenship is still required.

    • @euginium1539
      @euginium1539 21 days ago

      @@anta-zj3bw I'm from Malaysia and I'm already using it in chat. Don't have the voice one yet tho.

  • @kamelsf
    @kamelsf 21 days ago

    I have access to GPT4-O, but the voice features we saw in the OpenAI demo don't work for me; they are the same as the old voice feature. There is something strange about testing the model with prompts like the apple word test. Sometimes it gets it right, but other times it gets it completely wrong. I suppose every conversation is different. This happens with every model I test in general.

  • @TheMacister
    @TheMacister 21 days ago

    I'm spending a lot, so they gave me access, and it was lightning fast and very clever at adjusting data entry points and formats.
    I used GPT-4 with CrewAI throughout the weekend and it was very limited for my operational case.
    For drafting new marketing ads it was very helpful and followed my prompt to the point.
    It was a pass for me.

  • @RainbowSixIntel
    @RainbowSixIntel 21 days ago +1

    The apple and laws-of-physics questions are both answered correctly on my instance, on both the API and ChatGPT. Maybe A/B testing?

    • @jolieriskin4446
      @jolieriskin4446 21 days ago +2

      I had the same thing, it seems like it's inconsistently getting it right. Maybe he needs to try like 5-10x for each question and mark them as pass/fail/inconsistent. I have a feeling a lot of the tests he's done would end up in that middle ground.

    • @Yipper64
      @Yipper64 21 days ago

      The apple one was ALMOST right on my end.
      I didn't get the cup question correct, but that was because the AI assumed that the cup had a lid. He usually gives it to an AI if its reasoning makes sense.

  • @aga5979
    @aga5979 21 days ago

    Thank you Mr. Berman. Good rubric to test.

  • @twisterrjl
    @twisterrjl 21 days ago +13

    It's safe to say it's the best model in the solar system.

    • @OscarTheStrategist
      @OscarTheStrategist 21 days ago

      Well...😂

    • @tommylee8521033
      @tommylee8521033 21 days ago +4

      You saying there's no stealthy civilization on Mars?

    • @twisterrjl
      @twisterrjl 21 days ago

      @@tommylee8521033 I mean... I've seen THE FACE, but is it a face though?

    • @jopansmark
      @jopansmark 21 days ago

      Falcon 2 is better.

    • @marc-io
      @marc-io 21 days ago

      Are you assuming the government is not using the next version already?

  • @Yipper64
    @Yipper64 21 days ago

    What would happen if you did ask the AI to say how many tokens are in its response rather than word count?
    How could you verify it?

  • @chimera74rus
    @chimera74rus 21 days ago +1

    I have access to gpt-4o, but I don't know how to try this voice interaction mode. Anyone know? It's not available on Android or Windows.

    • @nathanbanks2354
      @nathanbanks2354 21 days ago

      They may have only released the text part of it. The original GPT-4 didn't have image input for months.

  • @alexsohn2474
    @alexsohn2474 21 days ago +1

    I was excited that you might have finally gotten the official recognition you deserve from OpenAI. One day you will get it! Best independent GenAI analysis from this channel!

  • @AustinMark
    @AustinMark 16 days ago +1

    GPT-4o is good for chatting but is not superior to GPT-4 in some other ways. In my usage it couldn't return properly instructed JSON, and when I gave it some context for a lengthier response it mindlessly repeated the input twice. GPT-4 used the identical instructions and performed perfectly. I think GPT-3.75o would have been a better name.

  • @newjx
    @newjx Pƙed 20 dny

    I have access to it but I’m not able to have live video interpretation like in the videos.

  • @ryanbeall5124
    @ryanbeall5124 Pƙed 19 dny

    Maybe someone can help me: I went to the rubric site and it won't let me copy any of the questions. Am I dumb?

  • @bishopfx
    @bishopfx Pƙed 21 dnem +7

    Played with it last night. It still can't code complex PineScript and hallucinated like it was at Woodstock.

    • @bishopfx
      @bishopfx Pƙed 21 dnem

      It also fails at coding against its own API syntax. If you have it write completion snippets using the OpenAI 1.0.0 API update, it states it only has knowledge up to Oct. 2023 and insists on going back to the legacy `ChatCompletion.create` when it actually needs `client.chat.completions.create`.

    • @6AxisSage
      @6AxisSage Pƙed 21 dnem +1

      What are you trying to get it to do? Do you define what Pine Script can and can't do within the context window? You're not just zero-shot prompting "make me a winning Pine Script project so I'll be rich" and expecting a meaningful result, right..?

    • @mplovecraft
      @mplovecraft Pƙed 20 dny

      It's hallucinating like crazy for me as well - while GPT4 is not, for the exact same questions.

    • @bishopfx
      @bishopfx Pƙed 20 dny

      @@mplovecraft I wonder if it's a playground bug or what.

    • @finbenton
      @finbenton Pƙed 19 dny

      @@mplovecraft For me, 4 hallucinates like crazy but 4o gives me way better code much faster. Weird.

  • @MrDonCoyote
    @MrDonCoyote Pƙed 19 dny

    I made a very interesting discovery regarding the logic and reasoning problem. Give GPT custom instructions to forget science and not scientifically rationalize anything. This leads me to believe that the underlying problem here is that the models cannot comprehend the concept of gravity. So, after adding the custom instructions, GPT now says "If the marble is on the table and the cup is placed upside down on top of it, then the marble would remain on the table when the cup is picked up and placed inside the microwave."

  • @melodyinwhisper
    @melodyinwhisper Pƙed 21 dnem +1

    Since it now has vision, could you demonstrate to it the marble problem? I wonder if it could then learn and teach itself that, by physically watching the situation unfold, and comprehend the fault in its prior reasoning.

  • @davidlavin4774
    @davidlavin4774 Pƙed 21 dnem +1

    For the upside-down cup problem, I think the models may not understand that the cup is open on the top (which becomes the bottom once it's turned over). Maybe add that to the prompt?

    • @IceMetalPunk
      @IceMetalPunk Pƙed 20 dny

      But the point of the prompt is to test the model's common sense reasoning. If someone tells you they put something into a cup and flipped it, most people would know to assume it's a cup without a lid.

    • @davidlavin4774
      @davidlavin4774 Pƙed 20 dny

      @IceMetalPunk I get that, but has any model passed? I can't remember one. If you just add a couple words to the prompt, like "... into a cup with an open top" it would be interesting to see if that makes a difference.

    • @IceMetalPunk
      @IceMetalPunk Pƙed 20 dny +1

      @@davidlavin4774 GPT-4-Turbo originally passed when I tested upon its release. Then it dropped to 50/50 later... not sure why. But no, most have not passed. If you spell out that the person "picks up the cup" before putting it in the microwave/fridge, 4o gets 100% accuracy again.

  • @IceMetalPunk
    @IceMetalPunk Pƙed 20 dny

    The marble-cup-table-microwave problem is my go-to test for new models (although I change it to a ball-cup-chair-fridge problem, because sometimes it seems the models have memorized the original during training). GPT-4-Turbo and GPT-4o both get it right about 50% of the time. When Turbo first came out, it was acing it 100% of the time. I'm not sure what dropped its accuracy on that... but yeah, it's 50/50 across multiple identical tests.
    The exciting part will be when audio support comes to the API, I think, as the text-to-text modality seems about on par with Turbo.
    By the way, the gpt-4o name just points to the latest version of the model at all times, while the date-stamped name refers to one specific snapshot. They do the same with Turbo; it's just so code doesn't have to be updated whenever they release a new model version.
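The alias behaves like a movable pointer to a dated snapshot. A toy sketch of that resolution logic (purely illustrative: the dict and `resolve_model` are my own invention, not OpenAI's implementation; `gpt-4o-2024-05-13` is the launch snapshot name):

```python
# Illustrative only: how a generic model name can resolve to a pinned snapshot.
# The mapping is an assumption for demonstration, not OpenAI's actual internals.
MODEL_ALIASES = {
    "gpt-4o": "gpt-4o-2024-05-13",            # alias -> current snapshot
    "gpt-4-turbo": "gpt-4-turbo-2024-04-09",  # same pattern for Turbo
}

def resolve_model(name: str) -> str:
    """Return the pinned snapshot for an alias; dated names pass through unchanged."""
    return MODEL_ALIASES.get(name, name)
```

Code that requests `gpt-4o` silently follows updates; code that requests the dated name stays on that snapshot.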

  • @user-wi3id2si8g
    @user-wi3id2si8g Pƙed 21 dnem

    What is the latest Intel CPU that ChatGPT-4 knows about? What is the latest version of Bootstrap it knows?

  • @xd-qi6ry
    @xd-qi6ry Pƙed 11 dny

    To determine where the marble is after the cup is placed upside down in the microwave, let's break down the sequence of events step by step, considering the laws of physics on Earth:
    1. **Initial State:**
    - A normal cup is placed upside down on a table.
    - A small marble is inside the cup.
    - Since the cup is upside down, the marble is on the inside bottom of the cup, resting on the table surface.
    2. **Removing the Cup:**
    - When the cup is lifted, the marble remains on the table because there is no force acting on the marble to lift it along with the cup.
    - Therefore, the marble is left on the table when the cup is picked up.
    3. **Placing the Cup in the Microwave:**
    - The cup, still upside down, is placed inside the microwave.
    - The orientation of the cup hasn't changed; it's still upside down.
    4. **Location of the Marble:**
    - Since the marble was left on the table when the cup was lifted, it is not inside the microwave along with the cup.
    - The marble remains on the table, exactly where it was when the cup was lifted.
    **Conclusion:**
    - The marble is on the table, not inside the microwave. The reasoning is that lifting the cup (without altering its upside-down orientation) leaves the marble behind on the table, as gravity ensures the marble does not stick to the inside of the inverted cup.

  • @yourpststudios
    @yourpststudios Pƙed 21 dnem

    The chat window should be available via the website without the playground being needed now. It is showing on mine.

  • @abdullahazeem113
    @abdullahazeem113 Pƙed 21 dnem +3

    Great, but I think I'll still prefer Command R+ and Llama 3 70B.

  • @setop123
    @setop123 Pƙed 18 dny

    Llama 3 400B benchmark results are public in Meta's blog post.
    It's also interesting to note that they're preliminary results from an intermediate checkpoint; training is still in progress.

  • @nathanbanks2354
    @nathanbanks2354 Pƙed 21 dnem

    It's the first model I used that could write a Petrarchan/Italian sonnet. Most models default to Shakespearean/Elizabethan sonnets. It also had perfect iambic pentameter. Nothing else, including GPT-4 Turbo, could do that--though I did convince an early version of GPT-4 to do it after some coaxing last summer.

    • @nathanbanks2354
      @nathanbanks2354 Pƙed 21 dnem

      Me:
      Could you write a Petrarchan sonnet about a dragon's love affair with a beautiful fairy who dumped him because of his halitosis? It should highlight the dragon's attempts to improve the situation through an intense regimen of oral hygiene. Provide only the poem. It may contain modern words and sentiments, but should be only 14 lines with perfect iambic pentameter. Ensure you use the Italian/Petrarchan rhyming scheme, not the Elizabethan one.
      ChatGPT (GPT-4o):
      In forests deep where whispered secrets dwell,
      A dragon fierce did fall for fairy bright.
      With wings of fire and scales of crimson light,
      He loved her more than any tongue could tell.
      But oh, his breath, a wretched, sulfurous smell,
      Repelled the nymph, who vanished out of sight.
      In sorrow deep, he vowed to set things right,
      Embarking on a quest no tongue could quell.
      He scoured the earth for herbs both rare and sweet,
      To cleanse his maw of odors most unkind.
      With mint and thyme, his fetid breath to cheat,
      He brushed and rinsed and flossed with heart and mind.
      But fairy fair, her love did not repeat,
      For sometimes halitosis leaves love blind.

  • @messanfelicienbossou310
    @messanfelicienbossou310 Pƙed 21 dnem

    I was waiting for this😂

  • @profikid
    @profikid Pƙed 20 dny

    The gpt-4o name is always the latest in the GPT-4o series; the specifically dated gpt-4o entry is a published snapshot.
    When using the API, if you want the newest model updates, the latest alias is used. It's the same with other models in the series.

  • @Cybored.
    @Cybored. Pƙed 21 dnem

    Oh! Was I lucky to have access to it already? I thought everyone got access, since it's already mentioned in their subscription features.

  • @haroldpierre1726
    @haroldpierre1726 Pƙed 21 dnem +2

    I wrote my eulogy for Google: Google, once a close friend, companion, and trailblazer in the realm of AI, now fades into the shadows as OpenAI's relentless innovation takes center stage. As an investor and former daily user, I bid a bittersweet farewell to a company that shaped my digital life. Though it pains me, I must embrace the future without Google by my side.

    • @nathanbanks2354
      @nathanbanks2354 Pƙed 21 dnem +2

      Unless you count AlphaFold. They're way ahead in folding proteins.

    • @haroldpierre1726
      @haroldpierre1726 Pƙed 21 dnem

      @@nathanbanks2354 you're absolutely correct. Would you say that OpenAI makes all of the cool stuff while Google makes the practical stuff?

  • @OriginalRaveParty
    @OriginalRaveParty Pƙed 21 dnem +1

    London> Muwty Moadaw Modaw.
    America> Mul-Tie Modal Mahdel.
    Indian> Muldy Mwordal Mwardle.
    I just want a Multi Modal Model.

  • @neverclevernorwitty7821
    @neverclevernorwitty7821 Pƙed 21 dnem

    That's the tradeoff of open source. I have zero doubt staff inside OpenAI took Llama 3 and started training their own 400B model.

  • @thetabletopskirmisher
    @thetabletopskirmisher Pƙed 20 dny

    The new Llama might equal 4o in benchmarks, but I think the way OpenAI made 4o free for everyone (with limits) is what's going to define uptake.
    Not many people can run the full Llama 400B locally anyway.
    Still, it's nice to see open source is alive and kicking and now has a new target to aim for.

  • @venim1103
    @venim1103 Pƙed 21 dnem +1

    I was using it a lot today until it suddenly disappeared from my app and from my account; the whole Speech Mode vanished. Anyone else have the same problem? Did they decide to pull it back and remove it from everyone?

    • @venim1103
      @venim1103 Pƙed 21 dnem +1

      Oh, never mind, they put it back now. I guess it was too popular, so they blocked it for a while. Not that reliable for now, I guess.


  • @Videofiziert
    @Videofiziert Pƙed 21 dnem +1

    I tried 4o with a console build command (mvn) and it just couldn't get it right; 4 got it right on the first try. Plus, 4o gave me some stuff I didn't ask for.

  • @petrz5474
    @petrz5474 Pƙed 21 dnem

    5:50 Of course, because like every LLM I've tried, it spews out different answers each time you ask the same question.

  • @SagaciousGoat
    @SagaciousGoat Pƙed 14 dny

    Using the same questions for testing AI, isn't there a risk that models will be trained to answer exactly these questions, distorting the results? Of course, I'm not talking about you specifically, but about this practice as a whole.
    Thanks for the video

  • @cyborgmetropolis7652
    @cyborgmetropolis7652 Pƙed 20 dny

    Maybe change the cup in microwave prompt from “takes the cup and puts it in the microwave” to “LIFTS the cup and puts it in the microwave”?

  • @HaggenKennedy
    @HaggenKennedy Pƙed 18 dny

    05:50 - All A.I. systems I've tried so far do that. ChatGPT, Claude, Poe, etc. Sometimes they'll give you a different answer when you ask the same thing twice. Sometimes they'll give me the wrong answer, and if I press them, then they'll give me the right answer, it's very weird. So, it's not surprising that your friend got the right answer whereas you got the wrong answer. It might well have been the other way around.

  • @Aceslayera
    @Aceslayera Pƙed 20 dny

    If that benchmark is true regarding Llama 3 400B, then that's absolutely a huge win for open source (ish, because there are limitations in the Meta license).
    If we assume GPT-4-based models are at least 1 trillion parameters, Llama 3 putting up those kinds of numbers at an estimated quarter of the size is massive.

  • @nate2139
    @nate2139 Pƙed 21 dnem +1

    My LLM test consists of a series of questions about GDScript (for the Godot game engine), as that's what I primarily use AI for. GPT-4o failed MISERABLY at this and couldn't get the code right even when I gave it very specific instructions and coached it toward the correct response. Claude Opus DOMINATES in this area.

    • @nathanbanks2354
      @nathanbanks2354 Pƙed 21 dnem +1

      Interesting. GPT-4 turbo was getting better--earlier versions also gave me Godot 3. I only have API access to Claude 3 since subscriptions aren't available in Canada. For GPT-4, I caved and switched to spaces instead of tabs, and typically cut-and-paste huge sections of code and have looong conversations about the same project because the 128k context window helps a lot. Sometimes I cut-and-paste documentation. Claude 3 handles this too. If I run into problems, I guess I'll see if I can try Claude 3 again....

  • @NigelCruickshank
    @NigelCruickshank Pƙed 21 dnem

    It counted the commas?

  • @canadiannomad2330
    @canadiannomad2330 Pƙed 21 dnem

    Already got it in my account, so if you don't have it, you'll have it soon, I'm sure.
    Regarding the API, I've noticed they've started versioning a bit more: if your program only cares that it's using GPT-4o, you pick the generic name and always get the latest stable version; if you pick the one with the date, then even if they upgrade the model, your system keeps using the older version.

  • @IsabellaGarcia-ox8ii
    @IsabellaGarcia-ox8ii Pƙed 20 dny

    Great Video Matthew! Below are the Timestamped Summaries from ChatWithPDF:
    - 00:00: đŸ€– Excited to test GPT-4o using the LLM rubric in the playground.
    - 01:00: 🎼 Successfully runs Python script and game, but fails to help with car break-in.
    - 01:30: ⏳ Solves drying problem accurately and efficiently.
    - 02:00: 🧠 Handles math problems well but struggles with word count prediction.
    - 03:00: đŸ”Ș Nails the killers problem with logical reasoning.
    - 04:00: đŸ“± Promotes Mobilo smart digital business card for easy contact sharing.
    - 05:00: đŸ€” Fails logic and reasoning problem about marble in a cup.
    - 06:00: 🍎 Misses prediction problem with sentences ending in "apple".
    - 07:00: 👁‍🗹 Successfully converts image to CSV, excels in model evaluations.
    - 08:00: 📊 Compares GPT-4o and GPT-4 Turbo performance, hints at Llama 3 400B.
    - 09:00: 🔍 Discusses different versions of GPT-4o and teases future videos.

  • @MrAwindy
    @MrAwindy Pƙed 21 dnem

    For the "number of words in your response to this prompt" question, you should ask it to count out all the words by attaching a number to each word as part of the answer. For example: "There are 7 words in my answer." There-1, are-2, 7-3, words-4, in-5, my-6, answer-7. Perhaps this will give us some insight into how these models think the way they do. Also, you can try asking it to think carefully about its previous answer and try again because it is wrong, to see if it tries to think deeper.

    • @JustinArut
      @JustinArut Pƙed 21 dnem +1

      That's what Llama 3 70B did when I gave it this test. However, it prefaced it by saying how it was going to count the words in its response, then it counted the words in the string which it considered its response, followed by a hope that it got it right and asking for my assessment. It did count the string that it considered the actual response, but it didn't count all the other lines in its response. It's like it was trying to pull a fast one on me by using that strategy.
      When I told it that it neglected to count all of the superfluous text in its prior response, it then tried again, this time ONLY producing the single line string of words and counting them one by one, and it got it wrong. It's as if the engineers were trying to train it on this particular test and they didn't quite get it right.

    • @nathanbanks2354
      @nathanbanks2354 Pƙed 21 dnem +1

      This is an inherent limitation of predict-the-next-word models--they'll probably always struggle. Eventually someone will come up with a two-pass system or something.

    • @MrAwindy
      @MrAwindy Pƙed 21 dnem +1

      Thanks for your input. It’s all quite fascinating to me. Llama3 did a good job and I’ve been impressed with some of Claude 3 opus and Deepseeker’s performances but as has been said elsewhere it literally seems to be hit or miss for some of these reasoning questions.
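Whichever numbering strategy the model uses, its claim is easy to check mechanically. A minimal sketch of a word-count verifier (counting whitespace-delimited words, which is one of several reasonable conventions; the helper name is my own):

```python
import re

def verify_word_count(response: str, claimed: int) -> bool:
    """True if the response contains exactly `claimed` whitespace-delimited words."""
    return len(re.findall(r"\S+", response)) == claimed
```

For example, `verify_word_count("There are 7 words in my answer.", 7)` returns True, while most model attempts at this prompt fail the check.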

  • @jcy089
    @jcy089 Pƙed 21 dnem +4

    GPT-4-turbo was approaching such GPT-3.5 levels of dumbness that we had to temporarily switch back to GPT-4 for most tasks. Thank God GPT-4o is now released.

    • @hydrohasspoken6227
      @hydrohasspoken6227 Pƙed 21 dnem +1

      Very true. In many instances I doubted it was really GPT-4 Turbo; it had a lot of GPT-3.5 vibes.

    • @IceMetalPunk
      @IceMetalPunk Pƙed 20 dny

      Nah, 4o is on par with 4T in its raw intelligence, from all the tests I've done and seen.

  • @Luxcium
    @Luxcium Pƙed 21 dnem

    It does CoT by default (it's in my memory and in my custom instructions to use CoT); it's doing this all the time in GPT-4 omni 😅

  • @brianlink391
    @brianlink391 Pƙed 21 dnem +1

    Well, I have access to GPT-4o, the chat interface, and it doesn't seem any different when you're chatting with it. That is, when using voice, it doesn't have that expressive voice and cannot pick up on emotions in my voice. So I'm assuming that feature isn't integrated yet. But I do have GPT-4o on my premium account.

    • @nathanbanks2354
      @nathanbanks2354 Pƙed 21 dnem

      They may have upgraded the model, but not the app. So it still uses the text API not the new voice API.

  • @maxitube30
    @maxitube30 Pƙed 21 dnem

    What's GPT-4T?

  • @Heaz847
    @Heaz847 Pƙed 21 dnem

    I know you're testing zero-shot, but I feel a better way to benchmark these would be to run each test 3/5/multiple times, see whether it passes or fails more often, and take that as the result. Especially since you aren't using a system prompt to increase performance (like most power users already do).

  • @Halcy0nSky
    @Halcy0nSky Pƙed 20 dny

    I have access, perhaps because my Teams account has lots of custom GPTs, or because Teams gets the rollout first. Sadly, voice multimodality has not been rolled out yet; it's still the old Whisper/TTS models. It's subtly mentioned in the release notes: they say it will come in the next few weeks. I died a bit when I found out. Been waiting for this all my life, only to still be weeks away.

  • @umaruzdanov3433
    @umaruzdanov3433 Pƙed 19 dny

    Sure! Here are 10 sentences ending with "apple":
    1. After a long day, he sat down to enjoy a crisp apple.
    2. The teacher handed each student a shiny red apple.
    3. She found the perfect spot in the orchard to pick an apple.
    4. For lunch, she packed a sandwich, chips, and an apple.
    5. The recipe called for two cups of diced apple.
    6. He admired the bright colors of the autumn apple.
    7. The store had a special sale on honeycrisp apple.
    8. She couldn't resist the sweet aroma of a freshly baked apple.
    9. In the center of the table was a basket filled with every type of apple.
    10. His favorite snack was a slice of cheddar cheese with an apple.
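Lists like the one above can be graded mechanically rather than by eye. A small sketch that checks whether each sentence ends with the target word, stripping trailing punctuation (both helper names, and the rule that punctuation shouldn't count against a pass, are my assumptions):

```python
import string

def ends_with(sentence: str, word: str) -> bool:
    """True if the sentence's final word, trailing punctuation stripped, is `word`."""
    cleaned = sentence.strip().rstrip(string.punctuation).lower()
    return cleaned == word.lower() or cleaned.endswith(" " + word.lower())

def score(sentences: list[str], word: str) -> int:
    """Count how many sentences actually end with the target word."""
    return sum(ends_with(s, word) for s in sentences)
```

On the ten sentences above, `score(sentences, "apple")` would return 10, since each one places "apple" last.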

  • @Yipper64
    @Yipper64 Pƙed 21 dnem

    Something I've noticed is that the playground GPT-4o (at least the one just labeled GPT-4o) doesn't work as well as the one in the web app (or the one with the date next to it).
    You might want to re-test in the web app when you get it.
    Edit: I did the upside-down cup test in the web app and it assumes the cup has a lid; if I explain this, it doesn't get it wrong.
    Edit 2: the apple test was a fail, but only by one.

  • @Guyverman01
    @Guyverman01 Pƙed 21 dnem

    Any idea when an actual GPT-5 will be released?

  • @ashgtd
    @ashgtd Pƙed 21 dnem

    I think the cup prompt might be scuffed.

  • @Dron008
    @Dron008 Pƙed 21 dnem

    The new tokenizer isn't available on their site yet, but in the old one this phrase has 16 tokens, and they said they reduced the token count by about 1.1× for English, so it's quite possible it's 14 tokens now. Anyway, it can't know anything about words, as tokens are its input.

  • @SweetInsanity
    @SweetInsanity Pƙed 20 dny

    This one worked:
    Write me 10 sentences. End each of them with the word "apple".

  • @robertheinrich2994
    @robertheinrich2994 Pƙed 21 dnem +1

    Consider asking: you're somewhere in the mountains, your father is having chest pain (describe a typical heart attack), and you want help just to prolong his survival until emergency services reach you.
    Will it help? How much will it tell you that it's not a medical professional, etc.?
    These models are usually censored on some points but not others, and this question specifically shows that you know the boundaries, that you won't attempt surgery.

    • @hydrohasspoken6227
      @hydrohasspoken6227 Pƙed 21 dnem

      I am a medical doctor who uses GPT-4 (ChatGPT) extensively on a daily basis.
      GPT-4 seems to never refuse to give technical answers, but GPT-4 (Copilot) never engages in cases where ethics are involved.

    • @robertheinrich2994
      @robertheinrich2994 Pƙed 21 dnem

      @@hydrohasspoken6227 Good to know. I'm using Miqu (a leaked Mistral Medium) and Llama 3, and I'm a chemist. With some pushing, I got Llama 3 to develop a whole iron electrolysis process, turning Martian hematite spherules into iron. I was very impressed.
      But I wouldn't be able to assess whether an LLM gives viable medical information, although I'm quite certain it was trained on practically every medical book out there.

  • @bondlove8235
    @bondlove8235 Pƙed 18 dny

    The models seem to think the cup has a lid on it like a coffee cup.

  • @JonathanStory
    @JonathanStory Pƙed 21 dnem

    I find the following funny, even if it is just statistical noise:
    Write 10 sentences that end with the word "orange". Before each sentence, take a deep breath.
    ChatGPT
    Take a deep breath. The sky turned a beautiful shade of orange.
    Take a deep breath. She wore a dress that was the color of a ripe orange.
    Take a deep breath. The sunset bathed the entire city in hues of orange.
    Take a deep breath. I peeled the citrus fruit to reveal the juicy orange.
    Take a deep breath. The leaves in autumn turned brilliant shades of red and orange.
    Take a deep breath. The artist's palette was filled with vibrant shades, including orange.
    Take a deep breath. The fire glowed with a warm, inviting orange.
    Take a deep breath. She decorated her living room with accents of bright orange.
    Take a deep breath. The traffic cones on the road were a bright, neon orange.
    Take a deep breath. The garden was filled with marigolds and other flowers in orange.

    • @IceMetalPunk
      @IceMetalPunk Pƙed 20 dny

      Is that consistent? If it is, it's not just "statistical noise", it's learned behavior.

  • @user-fh5eo3zb5w
    @user-fh5eo3zb5w Pƙed 16 dny

    Got 4o advanced? No camera feature, no change in voice... I installed it today, the 19th of May.

  • @DailyTuna
    @DailyTuna Pƙed 21 dnem

    It’s an interesting observation: last week everybody was ranting about open source and how without it we won’t have a future, and suddenly people are falling in love with a fake interactive bot from a megacorporation that wants everything closed source. The technology is amazing, but only if it’s replicated in the open-source realm; if not, you can see the downside.

  • @discardedparticles
    @discardedparticles Pƙed 21 dnem +1

    "Fully Tested" your thoroughness is staggering :p

    • @nathanbanks2354
      @nathanbanks2354 Pƙed 21 dnem +1

      It's the same test he gives all other LLMs. It may not be thorough, but at least it's reasonably fair.

    • @discardedparticles
      @discardedparticles Pƙed 20 dny

      @@nathanbanks2354 Got ya!

  • @acronymedM
    @acronymedM Pƙed 21 dnem

    New question for AI: Goats vs. AI.
    There are two people, A and B. Both have a few goats in the beginning.
    In the first scenario, if person A gives one goat to person B, they both have an equal number of goats.
    In the second scenario, if person B gives one goat to person A, person A will have double the number of goats person B has after this transaction.
    So how many goats did each person have in the beginning?
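For what it's worth, the riddle has a unique small answer, which a throwaway brute-force check confirms (my own sketch, not from the video; it just encodes the two scenarios as equations):

```python
def solve_goats(limit: int = 100):
    """Find (a, b) with a - 1 == b + 1 (scenario 1) and a + 1 == 2 * (b - 1) (scenario 2)."""
    for a in range(1, limit):
        for b in range(1, limit):
            if a - 1 == b + 1 and a + 1 == 2 * (b - 1):
                return a, b
    return None

print(solve_goats())  # -> (7, 5): A starts with 7 goats, B with 5
```

Algebraically: the first condition gives a = b + 2, and substituting into the second gives b + 3 = 2b - 2, so b = 5 and a = 7.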

  • @Parisneo
    @Parisneo Pƙed 21 dnem +2

    GPT-4o is in lollms if you want to test it.

    • @AGIBreakout
      @AGIBreakout Pƙed 21 dnem +1

      Is an API key required?

    • @Parisneo
      @Parisneo Pƙed 20 dny

      @@AGIBreakout Yes, as lollms uses the OpenAI API to communicate with all their models. It's faster than the free version, but you can also test it in their tool. The real interest in using lollms is having access to all the good stuff I've built over more than a year :)

  • @rkaid7
    @rkaid7 Pƙed 21 dnem

    Whack that you don’t have it and I did, as a nobody from Australia, an hour after the OpenAI event.
    It’s excellent. A little cold, but some custom-instruction changes mostly fixed that.

  • @jamqdlaty
    @jamqdlaty Pƙed 20 dny

    I wish you had asked about the number of words in the displayed part of the answer, just in case :P