Benchmarks Say Claude 3 is Better than GPT-4, But is It?

Sdílet
Vložit
  • čas přidán 4. 03. 2024
  • Anthropic has released a new version of its Claude Large Language Model. The new LLM, called Claude 3, comes in 3 versions. According to the benchmarks, Claude 3 Opus is better than GPT-4. But do the real-world tests show the same thing? Let's find out.
    ---
    Let Me Explain T-shirt: teespring.com/gary-explains-l...
    Twitter: / garyexplains
    Instagram: / garyexplains
    #garyexplains
  • Věda a technologie

Komentáře • 27

  • @milan1p
    @milan1p Před 2 měsíci +2

    Great video. You should look at doing some recall tests

  • @reficwitte5771
    @reficwitte5771 Před 2 měsíci +5

    Your questions might be good but asking every question in the same chat window might help or confuse the model. Starting a new chat for each question would also be concidered a zero-shot. What you are doing currently, might be referred to as "contaminated". Dont want to sound mad but thats just the nature of chat, it doesnt transfer emotions very well. I hope this is useful to you! Thanks for the vid

    • @GaryExplains
      @GaryExplains  Před 2 měsíci +2

      But you agree that the "contamination" is the same for every model, yes?

    • @sgartner
      @sgartner Před 2 měsíci

      @@GaryExplains That's a good point, since you did the same with the other models. It would be an interesting test to see if that significantly affected the responses in the different models...

  • @D3ND
    @D3ND Před 2 měsíci

    This is an unrelated comment, but I can't find an older Gary Explains video (I think it is 3-4 years old). I'd be glad if someone can help me find it.
    In that video, Gary was presenting a software that follows your actions and records them to create a manual or instructions set.
    Can someone point me to that video or the software? I've been trying to find it for half an hour with no success.

    • @GaryExplains
      @GaryExplains  Před 2 měsíci

      It is called Scribe - czcams.com/video/t1WkYkNcWMM/video.html

    • @D3ND
      @D3ND Před 2 měsíci

      @@GaryExplains thanks, you're amazing! Seems like my memory doesn't serve me that well, and it is much newer. Have a great day!

  • @KingFeraligator
    @KingFeraligator Před 2 měsíci

    Can you do benchmarks of the free versions?

  • @anb4351
    @anb4351 Před 2 měsíci +2

    You should now change your questions to make them a bit more harder

  • @mccannger
    @mccannger Před 2 měsíci

    Thanks for the interesting comparison.
    Hopefully all the AI suppliers will start to compete on price and drive the costs down. I have no prob paying £20 (ish) for any of them, but as they're all so close on performance, availability, features etc, price is an important factor IMHO now.

  • @justronny20
    @justronny20 Před 2 měsíci +1

    Claude Sonnet can write reports that cannot be detected by AI plagiarism detectors GPTZero and Turnitin. That is a win over GPT-4 in my book

  • @jeffreyjoshuarollin9554
    @jeffreyjoshuarollin9554 Před 2 měsíci

    Inconceivable!

  • @technolus5742
    @technolus5742 Před 2 měsíci +7

    I love this. But having tried, gemini advanced and from what I see here in this demo, they are all still a step behind gpt4.
    And they all need custom instructions. With Professor Synapse instructions I get an noticeable increase in performance for complex coding prompts.
    Update: I have used Claude and it performs really well for coding.
    Update 2: Solves capchas well too. Best performing model in my tests.

    • @tonysheerness2427
      @tonysheerness2427 Před 2 měsíci

      What is impressive is the speed it generates the answers, it has interpret what you are asking, go to its data base and search it than supply an answer. My mind boggles at the speed it does it.

    • @technolus5742
      @technolus5742 Před 2 měsíci +3

      @@tonysheerness2427 good point! I generally focus on the raw ability (cause in my use cases, I don't mind waiting).
      That is a fair point indeed.

  • @ukaszLiniewicz
    @ukaszLiniewicz Před 2 měsíci

    It definietly is. Especially the context. It can actually use information in most of its context window and apply it to e.g. code. GPT 4 will either refuse, produce a generic and unhelpful response or start hallucinating. Claude 3 is miles better for coding-related tasks. GPT 4 Turbo's "128k context" just doesn't work properly, and even he 32k version, which I tried via API, is not as cabable within its context as Claude is within its 256k context. I had it print 900 lines of code with requested modifications - and it was actually correct. This won't be the case every time, it will make mistakes, but you will be able to have it correct them because the context can encompass both the code and the conversation.

  • @perschistence2651
    @perschistence2651 Před 2 měsíci +2

    What really wonders me, all these models showing benchmarks where they beat GPT-4 everywhere, but when I try it, GPT-4 is definitely a step above. They are better than 3.5 but 4 is still ahead.

  • @tonysheerness2427
    @tonysheerness2427 Před 2 měsíci

    How will teachers know if AI did the students homework?

    • @technolus5742
      @technolus5742 Před 2 měsíci +9

      By giving students a test. Those who did the homework will be prepared, those who didn't.... welp

    • @technolus5742
      @technolus5742 Před 2 měsíci +3

      Love this. Good break-down of the capabilities.
      You could put the models through a leet code competition (easy, medium hard), and see how they compare.
      A lot of people use these as coding assistants, and this is super relevant.
      I love that competition is gaining ground, but from my experience with gemini advanced and seeing your demo here, gpt4 is still one step ahead.

    • @bakedbeings
      @bakedbeings Před 2 měsíci

      The homework is for the students own learning, and overseen by their parents. If those two aren't invested, it won't much matter what the teachers can detect.

  • @ThePowerLover
    @ThePowerLover Před 2 měsíci

    Interesting.

  • @nhtna4706
    @nhtna4706 Před 2 měsíci

    Is it free, I mean does it have a basic, free version like ChatGPT?

    • @GaryExplains
      @GaryExplains  Před 2 měsíci

      Yes, as I said in the video. But when I tried it, the system was overloaded. But just try it and see.

    • @nhtna4706
      @nhtna4706 Před 2 měsíci

      ​@@GaryExplains smart guys, whether its really overloaded with real users or just the gimmicks of the brand to make you pay for the opus, will spoil the brand reputation. Its like bombarding with ad's every 1-2 min, and force the user to go with paid plan to avoid ad's ;)

    • @bakedbeings
      @bakedbeings Před 2 měsíci

      @@nhtna4706 Taking a position on what's marketing vs capacity is a risk for no-return, unless you have psychic powers. With time you might have answers, if you're still invested.