Claude 3 Release and The Problem with Benchmarks

  • Added on 15. 06. 2024
  • In this video, we will look at the new king of LLM benchmarks, Claude-3 from Anthropic. We will run a few tests of our own and look at why the reported results may not reflect the true performance of the Claude-3 family.
    🦾 Discord: / discord
    ☕ Buy me a Coffee: ko-fi.com/promptengineering
    🔴 Patreon: / promptengineering
    💼Consulting: calendly.com/engineerprompt/c...
    📧 Business Contact: engineerprompt@gmail.com
    Become Member: tinyurl.com/y5h28s6h
    LINKS:
    Claude-3 Announcement: www.anthropic.com/claude
    Claude Chat: claude.ai/chats
    Technical Report: tinyurl.com/yc5y6zwj
    Claude-3 vs GPT-4: tinyurl.com/mprdy3rp
    Claude-3 API Access: console.anthropic.com/
    TIMESTAMPS:
    [00:00] Introducing Claude 3: The Challenger to GPT-4
    [01:41] Benchmarking Claude 3 Against GPT-4: The Reality
    [03:35] Intended Applications and Price Analysis of Claude 3 Models
    [06:21] Hands-On Tests: Accuracy, Image Understanding, and Coding Abilities
    [14:04] Revisiting Benchmarks: A Closer Look at Claude 3 vs. GPT-4
    All Interesting Videos:
    Everything LangChain: • LangChain
    Everything LLM: • Large Language Models
    Everything Midjourney: • MidJourney Tutorials
    AI Image Generation: • AI Image Generation Tu...
  • Science and Technology

Comments • 38

  • @duudleDreamz
    @duudleDreamz 3 months ago +7

    Excellent video. Kudos for pointing out the benchmark comparison inconsistencies.

  • @adamgdev
    @adamgdev 3 months ago +1

    Good find with the footnotes!

  • @jd_real1
    @jd_real1 3 months ago +1

    I had a very good experience with Claude. It did everything that GPT4 did, plus it answered medical questions for me that GPT refused to answer. The only thing I didn't like about Claude was that I couldn't find a way to use speech as input. I won't switch from GPT4 until they add this feature.

    • @engineerprompt
      @engineerprompt 3 months ago

      From what I have noticed, earlier versions of Claude had more refusals than the GPT4 models, so it is great to see fewer refusals in this version. Speech input might be coming if they ever release an app.

  • @TheEarlVix
    @TheEarlVix 3 months ago

    Thank you for putting in so much work so that we don't have to. Much appreciated :-)

  • @cromdesign1
    @cromdesign1 3 months ago

    Poe added it to their app and site a couple of hours after it was released. Not Haiku yet. The 200k versions are there too. 1,000,000 chat credits per month with their subscription; 750 credits per Opus (non-200k) prompt. Poe has a lot of various models.

  • @MarcusNeufeldt
    @MarcusNeufeldt 3 months ago +1

    🎯 Key Takeaways for quick navigation:
    00:00 *🚀 Claude 3 challenges GPT-4 with claims of superior benchmarks.*
    00:15 *🌐 Introduces Haiku, Sonnet, Opus models with diverse applications and enhanced multimodal support.*
    01:09 *💸 Models vary in intelligence and cost; Sonnet and Opus accessible via Claude API.*
    01:49 *🏆 Opus tops GPT-4 in benchmarks, but comparisons use GPT-4's older version.*
    03:38 *🛠️ Opus suits task automation and enterprise-level R&D; more expensive yet offers input token cost advantage.*
    07:07 and 09:46 *🕵️‍♂️📸 Showcases prowess in information retrieval, correction, and multimodal interpretation.*
    12:58 *💻 Proves coding capability through D3 code generation for self-portrayal.*
    14:19 *🤔 Highlights issues in benchmark fairness due to outdated GPT-4 comparison.*
    17:04 *📈 Claude 3's retrieval capabilities present as a viable, albeit pricier, alternative to a RAG pipeline.*
    Made with HARPA AI

  • @HistoryIsAbsurd
    @HistoryIsAbsurd 3 months ago

    I swear I've subbed to you before? I don't understand YouTube sometimes lol.
    Thanks for the vid, and I really agree with this a lot.

  • @pooyatolideh9527
    @pooyatolideh9527 3 months ago +2

    So much for "industry shaking"...

  • @xbon1
    @xbon1 3 months ago

    If you pay $20 a month you can get the Opus model in chat too

  • @gtpolpo9445
    @gtpolpo9445 3 months ago

    Is Claude 3 better than GPT-3.5 for coding?

    • @engineerprompt
      @engineerprompt 3 months ago

      I would say yes for Claude 3 Opus; not sure about the smaller models.

  • @elawchess
    @elawchess 3 months ago +3

    I think what Anthropic did with comparing to the old GPT-4 benchmarks is fair. Why? Because this Medprompt stuff you are talking about is "unofficial" in some sense. If OpenAI had taken it upon themselves to release official updates on those tests, only then would it become absurd for Anthropic to ignore that and compare to the old benchmark results.

  • @CyberGizmo
    @CyberGizmo 3 months ago

    Well, poor Claude still got it wrong: Mike Scott was the first CEO of Apple

    • @engineerprompt
      @engineerprompt 3 months ago

      Even I didn't know :)

    • @CyberGizmo
      @CyberGizmo 3 months ago

      @engineerprompt Really enjoy your channel, thanks for all the hard work you do!

  • @carlkim2577
    @carlkim2577 3 months ago

    For my use cases, I tested both and it's clearly apparent that GPT4 is superior in reasoning.

    • @engineerprompt
      @engineerprompt 3 months ago +2

      Just curious, what is your use case?

    • @carlkim2577
      @carlkim2577 3 months ago

      @engineerprompt Sure, I appreciate your videos so I'll give a full reply. I use AI for everything but mainly tech support for data modeling. I'm an economist so in my day job I clean data and use Tableau for visualization.
      I tested Sonnet last night with this question: in Windows 10, how do I set a folder and its subfolders to a certain view, i.e. large icons? Sonnet kept getting it wrong, hallucinating, even when I told it my Windows version and kept correcting it. I tried GPT4 and it got it in one shot, done.
      Then I asked both models to give me story beats based on a scenario. I compared both and clearly GPT4 is more sophisticated in ideation. Claude may have a simple grammatical structure that appeals to more people, but in terms of idea generation GPT4 was better.
      I do see a use case for Claude Opus. It likely handles long doc recall better than GPT4. And I do believe it matches GPT4 in coding tasks. But is it enough for me to pay for Pro? I'm still debating and testing.

  • @giosasso
    @giosasso 3 months ago

    Their naming conventions make no sense. First, why is Claude the base level name?
    Introducing the Claude 3 family.
    We have: Claude Haiku, Sonnet, and Opus.
    These names don't work as product names. It's not clear as to which model is better.

    • @engineerprompt
      @engineerprompt 3 months ago

      Haikus are usually short poems, sonnets are longer than haikus, and opuses are longer than sonnets. Seems like they are taking inspiration from poetry here.

  • @frogdeity
    @frogdeity 3 months ago

    I've been using Claude 3 for a bit now and it's really nothing special. It consistently cannot answer simple questions and forgets important context constantly.

  • @PrincessBeeRelink
    @PrincessBeeRelink 3 months ago

    shady advertising on their part, no one will ever beat GPT4!

  • @adhumon55
    @adhumon55 3 months ago

    When it comes to writing better, human-like copy, Claude 2.0 and 2.1 are the kings. Unfortunately, Claude 3.0 is destroying the strength that Claude has (human-like content)!

  • @anonymeister123
    @anonymeister123 3 months ago +4

    Claude sucks. Still waiting for API access. They’re trying to gate keep when there are other greater services

    • @HaseebHeaven
      @HaseebHeaven 3 months ago

      It's easy to get access: sign up, and in a week or so you will get access

    • @thomassynths
      @thomassynths 3 months ago +2

      On top of that they disallow commercial use. I can’t think of anyone (even OSS bros) who are willing to pay for that.

    • @anonymeister123
      @anonymeister123 3 months ago +1

      @HaseebHeaven It's been over a month. I have a feeling they gatekeep so that they can keep the service level high. They can handle a lot more tokens and at a faster rate when they only allow a few people to use their service.

    • @R0cky0
      @R0cky0 3 months ago

      Don't use it then

    • @anonymeister123
      @anonymeister123 3 months ago

      @R0cky0 duh. Keep up with the conversation, bub. That’s exactly what I’m recommending.

  • @arunachalpradesh399
    @arunachalpradesh399 3 months ago

    Claude 3 is better at coding than GPT 4, and GPT 4 coding sucks