Claude 3 Release and The Problem with Benchmarks

Prompt Engineering

7,361 views

Added 15 June 2024
In this video, we will look at the new king of LLM benchmarks, Claude 3 from Anthropic. We will run a few tests of our own and look at why the reported results may not reflect the true performance of the Claude 3 family.
🦾 Discord: / discord
☕ Buy me a Coffee: ko-fi.com/promptengineering
🔴 Patreon: / promptengineering
💼Consulting: calendly.com/engineerprompt/c...
📧 Business Contact: engineerprompt@gmail.com
Become Member: tinyurl.com/y5h28s6h
LINKS:
Claude-3 Announcement: www.anthropic.com/claude
Claude Chat: claude.ai/chats
Technical Report: tinyurl.com/yc5y6zwj
Claude-3 vs GPT-4: tinyurl.com/mprdy3rp
Claude-3 API Access: console.anthropic.com/
TIMESTAMPS:
[00:00] Introducing Claude 3: The Challenger to GPT-4
[01:41] Benchmarking Claude 3 Against GPT-4: The Reality
[03:35] Intended Applications and Price Analysis of Claude 3 Models
[06:21] Hands-On Tests: Accuracy, Image Understanding, and Coding Abilities
[14:04] Revisiting Benchmarks: A Closer Look at Claude 3 vs. GPT-4
All Interesting Videos:
Everything LangChain: • LangChain
Everything LLM: • Large Language Models
Everything Midjourney: • MidJourney Tutorials
AI Image Generation: • AI Image Generation Tu...
Science & Technology

Comments • 38

@duudleDreamz 3 months ago ⁺⁷
Excellent video. Kudos for pointing out the benchmark comparison inconsistencies.
@engineerprompt 3 months ago
Thank you!
@adamgdev 3 months ago ⁺¹
Good find with the footnotes!
@jd_real1 3 months ago ⁺¹
I had a very good experience with Claude. It did everything that GPT4 did, plus it answered medical questions for me that GPT refused to answer. The only thing I didn't like about Claude was that I couldn't find a way to use speech as input. I won't switch from GPT4 until they add this function.
@engineerprompt 3 months ago
From what I have noticed, the earlier versions of Claude had more refusals than the GPT4 model, so it is great to see fewer refusals in this version. Speech input might be coming if they ever release an app.
@TheEarlVix 3 months ago
Thank you for putting in so much work so that we don't have to. Much appreciated :-)
@engineerprompt 3 months ago
:)
@cromdesign1 3 months ago
Poe chat added it to their app and site a couple of hours after it got released. Not Haiku yet. The 200k versions are there too. 1,000,000 chat credits per month with their subscription. 750 per Opus (non-200k) prompt. Poe has a lot of various models.
@engineerprompt 3 months ago
Nice, need to check that out!
@MarcusNeufeldt 3 months ago ⁺¹
🎯 Key Takeaways for quick navigation:
00:00 *🚀 Claude 3 challenges GPT-4 with claims of superior benchmarks.*
00:15 *🌐 Introduces Haiku, Sonnet, Opus models with diverse applications and enhanced multimodal support.*
01:09 *💸 Models vary in intelligence and cost; Sonnet and Opus accessible via the Claude API.*
01:49 *🏆 Opus tops GPT-4 in benchmarks, but comparisons use GPT-4's older version.*
03:38 *🛠️ Opus suits task automation and enterprise-level R&D; more expensive yet offers input token cost advantage.*
07:07 🕵️‍♂️ & 09:46 📸 *Showcases prowess in information retrieval, correction, and multimodal interpretation.*
12:58 *💻 Proves coding capability through D3 code generation for self-portrayal.*
14:19 *🤔 Highlights issues in benchmark fairness due to outdated GPT-4 comparison.*
17:04 *📈 Claude 3's retrieval capabilities present as a viable, albeit pricier, alternative to a RAG pipeline.*
Made with HARPA AI
@impushprajyadav 3 months ago
Thank 😂
@HistoryIsAbsurd 3 months ago
I swear I've subbed to you before? I don't understand YouTube sometimes lol.
Thanks for the vid and I really agree with this a lot.
@engineerprompt 3 months ago
Welcome back!!!!
@pooyatolideh9527 3 months ago ⁺²
So much for "industry shaking"...
@xbon1 3 months ago
If you pay $20 a month you can get the Opus model in chat too.
@gtpolpo9445 3 months ago
Is Claude 3 better than GPT-3.5 for coding?
@engineerprompt 3 months ago
I would say yes for Claude 3 Opus; not sure about the smaller models.
@elawchess 3 months ago ⁺³
I think what Anthropic did in comparing against the old GPT-4 benchmarks is fair. Why? Because this Medprompt stuff you are talking about is "unofficial" in some sense. If OpenAI had taken it upon themselves to release official updates on those tests, only then would it become absurd for Anthropic to ignore that and compare to the old benchmark results.
@tarmiziizzuddin337 3 months ago
gggg🎉😢😢
@CyberGizmo 3 months ago
Well, poor Claude still got it wrong: Mike Scott was the first CEO of Apple.
@engineerprompt 3 months ago
Even I didn't know :)
@CyberGizmo 3 months ago
@engineerprompt Really enjoy your channel, thanks for all the hard work you do!
@carlkim2577 3 months ago
For my use cases, I tested both and it's clearly apparent that GPT4 is superior in reasoning.
@engineerprompt 3 months ago ⁺²
Just curious, what is your use case?
@carlkim2577 3 months ago
@engineerprompt Sure, I appreciate your videos so I'll give a full reply. I use AI for everything, but mainly tech support for data modeling. I'm an economist, so in my day job I clean data and use Tableau for visualization.
I tested Sonnet last night. I asked: in Windows 10, how do I set a folder and its subfolders to a certain view, i.e. large icons? Sonnet kept getting it wrong, hallucinating, even when I told it my Windows version and kept correcting it. I tried GPT4 and it was done in one shot.
Then I asked both models to give me story beats based on a scenario. I compared both and clearly GPT4 is more sophisticated in ideation. Claude may have a simple grammatical structure that appeals to more people, but in terms of idea generation GPT4 was better.
I do see a use case for Claude Opus. It likely handles long-document recall better than GPT4. And I do believe it matches GPT4 in coding tasks. But is it enough for me to pay for Pro? I'm still debating and testing.
@giosasso 3 months ago
Their naming conventions make no sense. First, why is Claude the base level name?
Introducing the Claude 3 family.
We have: Claude Haiku, Sonnet, and Opus.
These names don't work as product names. It's not clear which model is better.
@engineerprompt 3 months ago
Haikus are usually short poems, sonnets are longer than haikus, and an opus is longer than a sonnet. Seems like they are taking inspiration from poetry here.
@frogdeity 3 months ago
I've been using Claude 3 for a bit now and it's really nothing special. It consistently cannot answer simple questions and forgets important context constantly.
@PrincessBeeRelink 3 months ago
Shady advertising on their part, no one will ever beat GPT4!
@adhumon55 3 months ago
When it comes to writing better, more human-like copy, Claude 2.0 and 2.1 are king... Unfortunately, Claude 3.0 is destroying the strength that Claude had (human-like content)!
@anonymeister123 3 months ago ⁺⁴
Claude sucks. Still waiting for API access. They're trying to gatekeep when there are other, greater services.
@HaseebHeaven 3 months ago
It's easy to get access: sign up, and in a week or so you will get access.
@thomassynths 3 months ago ⁺²
On top of that they disallow commercial use. I can't think of anyone (even OSS bros) who is willing to pay for that.
@anonymeister123 3 months ago ⁺¹
@HaseebHeaven It's been over a month. I have a feeling they gatekeep so that they can keep the service level high. They can handle a lot more tokens and at a faster rate when they only allow a few people to use their service.
@R0cky0 3 months ago
Don't use it then
@anonymeister123 3 months ago
@R0cky0 duh. Keep up with the conversation, bub. That's exactly what I'm recommending.
@arunachalpradesh399 3 months ago
Claude 3 is better at coding than GPT-4, and GPT-4's coding sucks.