NEW Open-Source LLM Tops The Rankings...But Is It Actually Good?

  • Published 10 Sep 2024
  • Cohere released Command R+ with open weights! It is currently the top open model according to lmsys, but let's test it ourselves. This model is optimized for retrieval and tool usage, with a focus on enterprise use cases.
    Be sure to check out Pinecone for all your Vector DB needs: www.pinecone.io/
    Join My Newsletter for Regular AI Updates 👇🏼
    www.matthewber...
    Need AI Consulting? ✅
    forwardfuture.ai/
    My Links 🔗
    👉🏻 Subscribe: / @matthew_berman
    👉🏻 Twitter: / matthewberman
    👉🏻 Discord: / discord
    👉🏻 Patreon: / matthewberman
    Rent a GPU (MassedCompute) 🚀
    bit.ly/matthew...
    USE CODE "MatthewBerman" for 50% discount
    Media/Sponsorship Inquiries 📈
    bit.ly/44TC45V
    Links:
    LLM Leaderboard - bit.ly/3qHV0X7
    huggingface.co...
    txt.cohere.com...

Comments • 139

  • @aloveofsurf
    @aloveofsurf 5 months ago +39

    Thank you Matthew Berman for respecting our time by keeping this video under 10min while serving our preference to hear it from you.

    • @DailyTuna
      @DailyTuna 5 months ago +1

      True dat. Lately people go on for an hour about news that should take 10 minutes.

  • @avi7278
    @avi7278 5 months ago +96

    When Matthew puts "Is it any good?" in the title, you know it's garbage.

    • @seppdaniel_
      @seppdaniel_ 5 months ago +2

      Good one.

    • @aigeekhub
      @aigeekhub 5 months ago

      Good to know.

    • @Cine95
      @Cine95 4 months ago

      Now try it, it's outstanding

    • @axetilen
      @axetilen 3 months ago +1

      It's not bad as a search tool. I use it to search for research papers and medical articles. It's better than several well-known chatbots.

  • @phobes
    @phobes 5 months ago +29

    "Ok well YOU needed to do that"
    🤣

  • @Dygit
    @Dygit 5 months ago +15

    Why are people ignoring the fact that it’s meant to be used for RAG?

    • @camelCased
      @camelCased 3 months ago +2

      Even with RAG, it should be good at logic; otherwise it will draw illogical conclusions from the information retrieved via RAG. However, I know that AIs don't have "logic", they just predict the right text. But one might argue that humans do the same - we "complete" our thoughts and actions based on all the information we have accumulated in our lives.

    • @Artificial-Cognition
      @Artificial-Cognition 2 months ago

      Because RAG models are still good for all kinds of use cases.

  • @jim-i-am
    @jim-i-am 5 months ago +10

    As we start seeing more specialized models that are less generalized, it may help to reconsider the testing methods used. I like the practicality in your reviews and would like to see that extend to tailoring your challenges (or weighting what you have) to see how well these models (especially open source ones) do what they claim to be best at.
    Thank you for another great video!

    • @DailyTuna
      @DailyTuna 5 months ago +1

      I agree. As Matt becomes more popular, they’re going to tune the models to beat the tests.

  • @Dreamslol
    @Dreamslol 5 months ago +13

    LLM is for Rag.
    "Do Snake in Python" ??????

  • @JanBadertscher
    @JanBadertscher 5 months ago +7

    Testing out command-r v1 and command-r-plus out of domain, hmm... I mean as you stated, the model is fine tuned for grounding and citation in RAG. Wouldn't it make sense then to extend your eval data set with RAG tests? RAGAs would be very easy to implement. Grounding and RAG is the most common business use case for LLMs.
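
For anyone who wants to try a RAG-style eval themselves, here is a minimal retrieve-then-ground sketch. The keyword-overlap retriever and prompt template are illustrative assumptions only; a real setup would use embeddings for retrieval and a framework like Ragas to score faithfulness against the retrieved context.

```python
# Toy RAG grounding: pick the most relevant document by keyword
# overlap, then build a prompt that constrains the model to the
# retrieved context. (Illustrative sketch, not Cohere's pipeline.)
def retrieve(query, docs):
    q = set(query.lower().split())
    # Score each document by how many query words it shares.
    return max(docs, key=lambda d: len(q & set(d.lower().split())))

def grounded_prompt(query, context):
    return (f"Answer using ONLY this context:\n{context}\n"
            f"Question: {query}\nIf the context is insufficient, say so.")

docs = [
    "Command R+ is a 104B open-weights model tuned for RAG and tools.",
    "Snake is a classic arcade game.",
]
ctx = retrieve("What is Command R+ tuned for?", docs)
print(grounded_prompt("What is Command R+ tuned for?", ctx))
```

A RAG-tuned model is then graded on whether its answer stays inside `ctx`, which is a very different axis from the general-knowledge questions in the video.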

  • @SasskiaLudin
    @SasskiaLudin 5 months ago +3

    I managed to put this 104-billion-parameter model on my phone, and it works, YES.
    Granted, it's a 24 GB RAM phone (OnePlus 12 coming straight from China, as the global version is still limited to 16 GB) and very heavily quantized (Q1), but nevertheless, from that Snapdragon 8 Gen 3 SoC, it produces very good answers where Command R+ shines, i.e. code, and it does so with about 12 seconds of initial latency, then answers at about half normal reading speed 🙂

  • @pin65371
    @pin65371 5 months ago +5

    Yeah, I don't think this model is really designed for the type of work you were trying to do here. The LangChain channel put out a video already talking about this model and they seemed impressed. I think you will end up finding more models where general tests look horrible but the model is great for the edge use case it was built for. That is how I see agent workflows working anyway. I don't think you will be calling one model; you will have agents using specific models for specific jobs. It's really no different from real life, where you have specialists who are very good at the jobs they are trained for. AGI won't be a single model; it will be many models working together. I've seen a few videos talking about tiny models that are designed for specific tasks and outperform much larger models.

  • @brunodangelo1146
    @brunodangelo1146 5 months ago +13

    The ability to reason and write code are not the only benchmarks for measuring a model's real-life applications.
    Creative writing is one too. I use AI in a setting where I'm interested in the entertainment value of the replies, not entirely in whether they are logically correct.
    The best open-source model I've found for this is Llama 2 Chat 13B. It writes the most fun answers by far. It even uses emoji in its replies in a natural way, without being prompted to do so.
    I even compared it to Gemini Pro, a much bigger and faster model, and despite its amazing inference time due to cloud computing, the answers it wrote were just boring.

    • @jtjames79
      @jtjames79 5 months ago +1

      Highly underrated comment. 👆

  • @splitpierre
    @splitpierre 5 months ago +5

    Sorry Matthew for my negative comment, I usually love your videos, but, this was kinda useless video/tests, I was expecting tests on its strengths, function calling and rag.
    "Hey we are gonna test the components in this ice cream, but we don't have any, so we are using butter for the tests."
    C'mon buddy, don't be lazy 😂

  • @brianmi40
    @brianmi40 5 months ago +9

    When you see so many fails, you start wondering if the test scores were a bit cooked, like a VW emissions test!

  • @moamber1
    @moamber1 5 months ago +4

    These green and red screens for Pass and Fail. Here's a suggestion: make them flicker, last longer, and perhaps add a siren sound. I almost got a seizure, but not quite; I feel you need to push it a bit harder.

    • @kiiikoooPT
      @kiiikoooPT 5 months ago

      This is not a good platform for you if you almost get a seizure from that. I would even go as far as saying don't use a PC or phone if you are that sensitive.

  • @spookymv
    @spookymv 3 months ago +1

    R+: i am for RAG
    M: Okay. Then can you make a pea soup game in python?
    R+: Bro i am for RAG
    M: FAIL

  • @Canna_Science_and_Technology

    Additionally, there is Command-R Plus, which is 104b and offers significant improvements over Command-R. Notably, Ollama runs it flawlessly.

  • @AvizStudio
    @AvizStudio 5 months ago +1

    Don't forget it's also open source and can be deployed locally, which is crucial for some organizations for privacy. So it may be the best solution for some cases even though it's not the smartest model in the field.

  • @muhammadlufti2967
    @muhammadlufti2967 5 months ago

    I've only been trying out the API for a few days, and I'm impressed with the capabilities of this command-r-plus model. Besides the connector function that is already integrated with web search by default, there's the multi-turn conversation capability that is very, very easy, without having to design my own schema to make it possible. The main thing is that, so far, I haven't found any answers that are "unsatisfactory, tend to be hallucinatory, and don't provide enough insight" for me. It's going to be a tough competition!

  • @marcfruchtman9473
    @marcfruchtman9473 5 months ago +1

    Thanks for this review. I don't understand how they are claiming it did super well in benchmarks but failed so many of these.

  • @kilosera
    @kilosera 5 months ago

    Please ditch the fullscreen green/red ;) Replace it with a small sign somewhere at the bottom, because currently it works like a flashbang.

  • @KoenigNord
    @KoenigNord 5 months ago

    In my experience, Cohere did a great job to build their models for RAG and search cases. Their reranker and embedding models are a good starting point for rapid prototyping.

  • @Dundell2
    @Dundell2 5 months ago +1

    Well, that's unfortunate on a few fronts... I'll still work towards testing it locally and comparing results with some other options. I'm still looking forward to testing out its 128k context window to see how well it responds with large scripts to edit.

  • @cloudd901
    @cloudd901 5 months ago

    I could watch Matthew test models all day! Great work. We should have a scaling system like 1-4 for each test, from exceptionally failed to exceptionally passed. Then give each model an M.B. score.

  • @madqwer
    @madqwer 5 months ago +2

    25-4*2+3=20,
    LLM: "the answer is 20."
    This guy: It's not a PASS!!
    LOL :DDD
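
For the record, the arithmetic in that test follows standard operator precedence. Assuming the prompt was the usual 25 - 4 * 2 + 3, a quick check in Python confirms the commenter's point:

```python
# Multiplication binds tighter than addition and subtraction (PEMDAS),
# so the expression groups as 25 - (4 * 2) + 3 = 25 - 8 + 3.
result = 25 - 4 * 2 + 3
print(result)  # 20
```

The dispute in the video was about whether the model showed its reasoning, not about the value itself.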

  • @miguelsalcedo01
    @miguelsalcedo01 1 month ago

    LMAO!!! Dude, that was hilarious!! "You need to import the sys module" - "Well no, YOU needed to do that!!" I'm dying. I've had that same exact thought debugging code output lol

  • @charlestrent8688
    @charlestrent8688 5 months ago +1

    OK, we must not sit here and focus on general aspects of usage when it was created and tested for RAG. You should've stayed consistent with what it was created for when testing it. It wasn't made for what you tested it on, therefore it wasn't great. It is great if you know what it's for and you're using it for what it's made for; it's better than anything available. I don't want to cause any misinformation or confusion: for general usage like ChatGPT, it's not for that.

  • @jamesyoungerdds7901
    @jamesyoungerdds7901 5 months ago +1

    Hey Matthew, another great video - just a huge fan and love your content 🙏. I'm glad you touched on this at the end. I've watched every LLM you've tested go through your crucible of questions, and in this one it stuck out that the questions are well-suited for general-purpose reasoning, etc., but when a model comes out with a claimed specialty or use case, it may fail a lot of those tests and still be the best at its intended use. Not sure if it's worth coming up with different sets of tests in categories depending on the claimed use case of the models?

  • @archerkee9761
    @archerkee9761 5 months ago +1

    This new fail screen and sound is kinda annoying tbh. Can't you just stamp "failed" and make it quieter?

  • @MarcusNeufeldt
    @MarcusNeufeldt 5 months ago +2

    🎯 Key Takeaways for quick navigation:
    00:00 *🔍 Introducing Cohere's new open-source LLM, Command R, optimized for enterprise use cases like RAG and tool usage.*
    00:27 *🎯 Key features: strong accuracy, low latency, high throughput, 128k context, and lower pricing.*
    01:08 *🏆 Outperforms Mixtral in real-world use cases like document assistance and customer support.*
    01:36 *🔍 Performs well on long-context retrieval benchmarks.*
    02:17 *💻 Purpose-built for enterprise use cases with web search, document grounding, and competitive pricing.*
    02:46 *⚠️ Struggles with some programming and logic tasks, but excels at web search and document-based tasks.*
    Made with HARPA AI

  • @MrNobodyX3
    @MrNobodyX3 3 months ago

    When you share videos on a topic like this, it would be nice if you provided links to what you're talking about.

  • @theh1ve
    @theh1ve 5 months ago +2

    I would like to see this tested in a crewAI setup using tools and running locally.

    • @fullcrum2089
      @fullcrum2089 5 months ago

      Cohere has an API; test using their API, and if it's good, figure out how to set it up locally.

    • @theh1ve
      @theh1ve 5 months ago

      @@fullcrum2089 I mean, I could do it myself, sure, but then that wouldn't make an interesting YouTube video, would it?

  • @bestemusikken
    @bestemusikken 5 months ago

    Love your tests. Looking forward to the model that can coherently answer "one" to the "how long is your next response" question.

  • @tejeshwar.p
    @tejeshwar.p 5 months ago

    AI benchmarks - ❌
    Matthew Berman testing - ✅

  • @hqcart1
    @hqcart1 5 months ago +1

    The problem with new models coming out is that if they suck, they will die. Why should anyone use them if they fail a simple math problem?

    • @hqcart1
      @hqcart1 5 months ago

      @JustinArut If I want to order 4 times my previous order, then you are fu*ed. Math is the most important aspect of an LLM; most of the studies are about solving math problems, and without it, everything fails.

  • @GuidedBreathing
    @GuidedBreathing 5 months ago

    6:21 isn’t the one prompting the killer forcing x to kill; and since the prompter is sandboxed outside there’s only two 3-1=2 (since prompter is never inside the sandbox; normally the prompter is just a human to take revenge on later 😂)

  • @digitus78
    @digitus78 5 months ago

    It's a RAG model for data collection. Like failing an English major for not knowing geometry. I'm glad you made that clear at the end, yet your tests and reviews are viewed and trusted by many. Still seems not unfair, but biased against its purpose. Would love to see you use it with an agent framework like Devika, Codel, CrewAI or OpenDevin to see how optimized it is running as a RAG agent or crew. P.S. maybe not OpenDevin... you need a degree to get that running.

  • @HansJrgenFurfjord
    @HansJrgenFurfjord 5 months ago +1

    I want to move to a part of the world where Joe is faster than Jane at least once in a while, like before everyone went completely and permanently insane.

  • @hatimalamshawala684
    @hatimalamshawala684 5 months ago +1

    Can you please make a video on how to install Codel on Windows?

  • @inteligenciamilgrau
    @inteligenciamilgrau 5 months ago +4

    The best part is when it gets "censored": "It's a FAIL"!! lol 4:34

  • @leonidaltman
    @leonidaltman 5 months ago

    The visual identity created by Pentagram is great, though.🔥

  • @gabrielsandstedt
    @gabrielsandstedt 5 months ago +1

    It could be great for use with RAG. That's what it was designed for.

    • @janchiskitchen2720
      @janchiskitchen2720 5 months ago +1

      Think about it this way: if it's not smart enough to answer general questions, how can it be smart enough to answer questions about your own files, even though it memorized them?

    • @gabrielsandstedt
      @gabrielsandstedt 5 months ago

      @@janchiskitchen2720 Because if it has been trained to expect additional context alongside the question, then it will be good at retrieving from that information instead of using its compressed training data to draw conclusions that might be false. It's not meant to be used without additional context alongside the user prompt. Most companies will not want a small language model drawing conclusions from its inherent, very compressed memory, but from an external source like a graph or vector database with language embeddings, presented in a nice way. The presentation is where the AI is used in those cases - summarizing from other sources. That is not the same as being intelligent.

    • @gabrielsandstedt
      @gabrielsandstedt 5 months ago

      Of course logic is great too, but it was not the goal for this model. I have built a couple of RAG systems. A great approach is to use different specialized machine learning models for different parts of the job. This saves compute compared to using really large models, and they are easier to fine-tune on limited hardware for specific business cases.

    • @Artificial-Cognition
      @Artificial-Cognition 2 months ago

      @@gabrielsandstedt RAG actually improves a model's ability to reason, no?

  • @GuidedBreathing
    @GuidedBreathing 5 months ago

    5:58 - 20 is the correct answer! 25 - 4*2 + 3 = 20. Sometimes these nudges such as PEMDAS seem too much... but it's correct anyhow.

  • @jaancarlo1237
    @jaancarlo1237 5 months ago

    I'd love to see small language models tested by you, like Rocket, Phi-2, etc.

  • @GuidedBreathing
    @GuidedBreathing 5 months ago

    3:32 Maybe just: make a Python snake game, self-contained in one script, so think through all necessary imports and make sure all references are defined.

  • @patrickneal8131
    @patrickneal8131 5 months ago +1

    I'm in love with your work, and I have a request: can you do a benchmark of the AI agents you've already covered - which do you prefer?

  • @rywmark
    @rywmark 5 months ago +1

    6:01 Umm yes. The answer is 20. Do we give you the fail?

  • @kalomaze1912
    @kalomaze1912 5 months ago

    Not fair to judge it zero-shot without disclosing the sampling params used, as a default high-temperature setup is terrible for a model like this with a large vocabulary (256k tokens). For my use case, it's doing phenomenally and is comparable to Sonnet.
    This model has something that others don't, in my opinion. Benchmarks and riddles only go so far in measuring something high-dimensional.
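
On the sampling point: temperature rescales the logits before the softmax, so a high temperature spreads probability over many more borderline tokens, which matters more with a very large vocabulary. A minimal, self-contained sketch of temperature sampling (illustrative only, not Cohere's implementation):

```python
import math
import random

def sample_with_temperature(logits, temperature=1.0, rng=random):
    # Scale logits by 1/temperature: T < 1 sharpens the distribution,
    # T > 1 flattens it, letting low-probability tokens through.
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Draw one token index from the resulting distribution.
    r = rng.random()
    cum = 0.0
    for i, p in enumerate(probs):
        cum += p
        if r < cum:
            return i
    return len(probs) - 1

# At a near-zero temperature the argmax token always wins.
print(sample_with_temperature([2.0, 1.0, 0.1], temperature=0.01))  # 0
```

This is why reviewers comparing models ideally disclose temperature and other sampling settings along with the prompts.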

  • @theAIsearch
    @theAIsearch 5 months ago +1

    Yikes, that was a lot of fails! Thanks for being honest

  • @claudiantenegri2612
    @claudiantenegri2612 5 months ago

    So Matthew, how do we test the new Mixtral 8x22B (231 GB)?

  • @OscarTheStrategist
    @OscarTheStrategist 5 months ago

    I guess we need a name for non-foundational, workload-oriented models.

  • @thetabletopskirmisher
    @thetabletopskirmisher 4 months ago

    Strangely enough, it’s near GPT4 level for creative writing. Much better than 3.5 (but then nowadays, what isn’t?)

  • @gazorbpazorbian
    @gazorbpazorbian 5 months ago

    What I don't like about these tests is that you can train the LLM on these questions and it might get them right, not because it "knows" but because it memorized them.
    Ironically enough, the same is true of some people.

  • @DailyTuna
    @DailyTuna 5 months ago +1

    Web search is needed. That has merit.

  • @user-wr4yl7tx3w
    @user-wr4yl7tx3w 5 months ago

    but is it only for enterprise clients?

  • @sh0ndy
    @sh0ndy 4 months ago

    Wait, so this LLM can access websites and real-time data? How?

  • @masonbroughton1656
    @masonbroughton1656 5 months ago

    Is it uncensored?

  • @zyxwvutsrqponmlkh
    @zyxwvutsrqponmlkh 5 months ago

    In the future we will show AI magic tricks, by putting marbles in cups and moving the cups to microwaves. Where'd the marble go? See, I made it teleport over to the table. Yup, these magic powers are common amongst humans.

  • @JonathanStory
    @JonathanStory 5 months ago

    I wouldn't say that you're testing edge cases. There's nothing particularly hard about your questions (with the exception of snake, and the application of gravity). Rather, I think this LLM did a Google and was rushed out the door.

  • @nasimobeid2945
    @nasimobeid2945 5 months ago +1

    I'm starting to believe that model metrics aren't very accurate in determining real world results

    • @siwakotisaurav
      @siwakotisaurav 5 months ago +1

      It was rated the highest open weight model on lmsys, which is human-benchmarked/tested. Use it for actual use cases rather than puzzle-type questions. I tried it for translation and coding and it was better than all current open weights models

  • @hikaroto2791
    @hikaroto2791 5 months ago

    You tested it so we don't have to. Thanks ❤

  • @JuanGea-kr2li
    @JuanGea-kr2li 5 months ago

    I don’t get it, where is the “Open Source” here if it cannot be downloaded and used locally? Or did I miss something?

    • @kalomaze1912
      @kalomaze1912 5 months ago

      Weights are available on Cohere's Hugging Face; there are also GGUF quants on the llama.cpp PR.

  • @user-en4ek6xt6w
    @user-en4ek6xt6w 5 months ago +1

    You should test it on RAG and internet search - that would be better.

  • @ct8060
    @ct8060 5 months ago

    Since it's a RAG-centered LLM, try asking it to summarize a transcribed 1.5-hour interview… and compare it with other SOTA LLMs…

  • @dreamphoenix
    @dreamphoenix 5 months ago

    Thank you.

  • @deo-nis
    @deo-nis 5 months ago

    You should try testing Weyaxi/Einstein-v4-7B - the best 7B model I've used.

  • @nothingtoseehere5760
    @nothingtoseehere5760 5 months ago

    How is it open source exactly? I don't see any source code anywhere. And the license clearly states that all output is the exclusive property of Cohere. It's useless.

  • @jaysonp9426
    @jaysonp9426 5 months ago

    You didn't know that sys is just a Python import statement?

  • @firesoul453
    @firesoul453 5 months ago +1

    Coral is different. If you want to test Command R, go to the playground - there's an option to pick the model.

  • @Arcticwhir
    @Arcticwhir 5 months ago

    Yeah, to me it doesn't seem like it was fine-tuned or created for chatting. It really was made for retrieval and tool use. I was a bit surprised it got that high on the lmsys arena, because I haven't found it to be that useful.

  • @TheKBlunt
    @TheKBlunt 5 months ago

    Dog, this model was not designed to be a code assistant.

  • @pubfixture
    @pubfixture 5 months ago

    Bud, get rid of the pass/fail screen. It's no good.
    At least not fullscreen; maybe a small window in the corner while you continue to the next test.

  • @ruadd4592
    @ruadd4592 5 months ago

    great review

  • @guitarbillymusic
    @guitarbillymusic 5 months ago

    One could argue the dead killer is no longer a killer since a corpse lacks the ability to kill anything.

    • @guitarbillymusic
      @guitarbillymusic 5 months ago +2

      @JustinArut doh. Clearly I would be a terrible llm

  • @danielgomez2503
    @danielgomez2503 5 months ago

    Cohere must be powered by Gemma Bard 🤣😂

  • @julkiewicz
    @julkiewicz 25 days ago

    5:14 Wtf? It's like, well, it did it incorrectly, but we're gonna give it a pass.

  • @raul17533
    @raul17533 5 months ago

    We really need a matrix with all the models and the tests you have done to know which one passes them all and which one to use for my shitty python scripts to turn on/off my lights 🤣 - but srsly we need a matrix

  • @fabiankliebhan
    @fabiankliebhan 5 months ago

    Weird results. The benchmarks suggested something way better.
    It should at least be better than Mistral 7B 🤔🤷‍♂️

  • @omarlotfy5645
    @omarlotfy5645 5 months ago +6

    It's horrible

  • @omarlotfy5645
    @omarlotfy5645 5 months ago

    I tried to use it to organize my repo; I ended up losing my code.

  • @dtory
    @dtory 5 months ago +2

    This new style of ✅ or 👎 looks weird. Try changing the style: instead of full screen, make it a circle and only show part of the screen (not the full code).

  • @madqwer
    @madqwer 5 months ago

    The answer is right.
    The guy: "It's not what I asked, FAIL!!" XDD

  • @inteligenciamilgrau
    @inteligenciamilgrau 5 months ago

    I cannot believe that the guys who train these models haven't already prepared them to win a gold medal on Matthew Berman's YouTube test! So disappointing!

  • @Azazeal777
    @Azazeal777 5 months ago

    Awesome 🎉

  • @CronoBJS
    @CronoBJS 5 months ago

    Code Gemma was just released. It still sucks but can almost make a snake game lol

  • @NickYoung16
    @NickYoung16 5 months ago

    Thoughts on updating the ball in the basket test to putting the lotion in the basket instead?

  • @Sri_Harsha_Electronics_Guthik

    Need local RAG vids plz!

  • @spinningaround
    @spinningaround 5 months ago

    It would be nice if you also ask abstract spatial questions. For example: What shape is formed if you connect the opposite sides of a square sheet of paper sequentially two times?

  • @MilesBellas
    @MilesBellas 5 months ago

    If it doesn't do the snake game.......it's a no......

  • @ScottSummerill
    @ScottSummerill 5 months ago

    Boy. Don’t think I’d trust it for anything.

  • @rrrrazmatazzz-zq9zy
    @rrrrazmatazzz-zq9zy 5 months ago

    I feel like more LLMs would pass the snake test if the requirements for snake were more specific, i.e. which libraries you want and how you want the game to work.
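
For reference, the core of the snake test is just grid movement, growth, and collision; the hard part models trip on is wiring that logic into a working UI. A minimal, dependency-free sketch of the game logic (the video's test typically expects a pygame UI on top of something like this):

```python
# Minimal snake game logic on a grid: the snake is a list of (x, y)
# cells, head first. Each step moves the head one cell; the tail is
# dropped unless food was eaten. Hitting a wall or itself ends the game.
def step(snake, direction, food, width, height):
    dx, dy = direction
    head = (snake[0][0] + dx, snake[0][1] + dy)
    hit_wall = not (0 <= head[0] < width and 0 <= head[1] < height)
    if hit_wall or head in snake:
        return None  # game over
    grew = head == food
    body = snake if grew else snake[:-1]  # keep the tail only when growing
    return [head] + body

snake = [(2, 2), (1, 2)]            # head at (2, 2), moving right
snake = step(snake, (1, 0), food=(3, 2), width=10, height=10)
print(snake)  # [(3, 2), (2, 2), (1, 2)] -- ate the food and grew
```

Spelling out requirements like this (grid size, which library, what counts as game over) would make the snake test far less ambiguous for the models.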

  • @Gafferman
    @Gafferman 5 months ago

    Wow this model is actually awful. I thought we were going up?

  • @robertgeczi
    @robertgeczi 5 months ago

    Theme of the video for this model....fail. lol

  • @m.3257
    @m.3257 5 months ago

    Yea, so I'm gonna pass on this model 😅

  • @sisitheprincess7094
    @sisitheprincess7094 5 months ago

    No matter how many 👕 are in the 🌞, they will always take 4 hours to dry if you lay them next to each other - it's a trick question.

  • @michaeljay7949
    @michaeljay7949 5 months ago

    Yikes. This is not comfy.

  • @marcinkepski4977
    @marcinkepski4977 5 months ago

    Another bunch of bs.

  • @silverchenmel
    @silverchenmel 5 months ago

    This video is wasting everyone's time, including yours.

  • @seppimweb5925
    @seppimweb5925 5 months ago

    Why are you wasting so much time with this nonsense?

  • @aipower-ho1mt
    @aipower-ho1mt 5 months ago

    newest joke model

  • @posavenkataakhil8862
    @posavenkataakhil8862 5 months ago +1

    It's the worst LLM I've ever seen (Command R).

  • @blingbang2621
    @blingbang2621 5 months ago

    Cohere models are not too good; their strength is just context, and even at that they give bad responses many times.