NEW Reka Core SOTA Model Does Text, Audio, Video, and more!

Matthew Berman

zhlédnutí 42 759

Přidat do
- Můj playlist
- Přehrát později
Sdílet

Sdílet

Vložit

Velikost videa:

Zobrazit ovladače přehrávání

Automatické přehrávání

Přehrát

čas přidán 22. 04. 2024
Reka AI just released 3 multimodal models that can handle text, audio, video, and other file formats. It's a state-of-the-art LLM, and we're going to test it out!
Join My Newsletter for Regular AI Updates 👇🏼
www.matthewberman.com
Need AI Consulting? 📈
forwardfuture.ai/
My Links 🔗
👉🏻 Subscribe: / @matthew_berman
👉🏻 Twitter: / matthewberman
👉🏻 Discord: / discord
👉🏻 Patreon: / matthewberman
Media/Sponsorship Inquiries ✅
bit.ly/44TC45V
Links:
www.reka.ai/
Věda a technologie

Komentáře • 264

@bradstudio Před měsícem ⁺²²
Matthew is the GOAT of AI reporting. He must have found a way to use AI to report on AI because no single human can sift through this much information and make great daily videos.
@konstantinlozev2272 Před měsícem ⁺²⁵
On video test, I can think of 2 scenarios:
1. Understanding a mime theater performance and guessing what the mime actor conveys
2. One or more illusionist tricks in different settings
@DailyTuna Před měsícem ⁺¹¹
It really is amazing how fast the stuff is moving that you even hard to keep up with it if you miss a day
@DavidJancan Před měsícem ⁺³
changed your question a bit and CORE got it right "Assume the laws of physics on the Earth regarding Gravity. A small marble is put into a normal cup and the cup is placed upside down on the table. Tell me where the marble is in relationship to the table and cup now that the cup is upside down. Then the cup is picked up and placed in the microwave where is the marble located that the cup was upside down on the table and picked up and placed in the microwave right side up."
Reka Core output" REKA CORE
When the cup is upside down on the table, the marble, due to gravity, will fall out of the cup and land on the table. So, the marble is now on the table, not inside the cup.
When the cup, which is now right-side up, is placed in the microwave, the marble will still be on the table, outside the microwave. The microwave is designed to heat food and beverages, not to hold objects like marbles, and it is not safe to put non-food items inside. So, the marble remains on the table, not in the cup or the microwave."
@marc_frank Před měsícem ⁺²⁹
idea for the multimodal video test:
give them a rickroll and ask them why someone sent you that video
@TheEtrepreneur Před měsícem ⁺¹⁵
again, this format is even more refined now! 5 minutes of theory to warm up, then straight to Verification. Boom. 🎖
@williamdownard2150 Před měsícem ⁺¹²
In the killer problem the correct answer is 4. Three killers were originally in the room and someone enters and kills one; originally there were three and one gets killed but is still in the room albeit dead. The question is how many killers, it does not specify alive or dead.
@thadgrace Před měsícem ⁺⁷
Legally and biologically, a dead body is no longer considered a living person.
@macoson Před měsícem
dead body is not an killer anymore, its a dead body
@ken5957 Před měsícem ⁺³
But living or dead they are still a killer, and legally even although someone is dead they can still be determined by a court to be a killer.
@Yevruss Před měsícem ⁺³
@@thadgracethe question is not “now many living killers?”
@thadgrace Před měsícem ⁺¹
@@Yevruss If it’s not a living person, it’s not a killer. It’s just a body that used to be a killer, when it was living.
@silverionstormbringer51 Před měsícem ⁺⁴
you can also add a pdf test, other multi models read pdfs, put 3 pages and ask for summary or slide list one page have to be graph
@JeremyRabbit Před měsícem ⁺²
@Matthew Berman. I have an idea for a separate series of videos you can do to demonstrate the performance of various AI models that will be interesting and relevant to everyone whilst demonstrating unique capabilities and improvements over time.
Idea: have users submit use cases where the model they use has failed to prove useful/applicable to a problem they are trying to solve, or work they are trying to get done. Test this scenario/problem on several of the most popular/relevant/accessible AI models.
This will enable many people to appreciate how AIs are being used, can be used, current limitations and their progress. I think this would be compelling content. I certainly know I would watch them.
@DigitalDesignET Před měsícem
Thanks for all the information you are providing. It's really priceless.
@pallu83 Před měsícem ⁺¹⁰
Man this space moves faaaaaast
@user-sx2ds4fv8p Před měsícem ⁺¹
Thanks for these videos. It helps me keep up to date and where to focus energy for content creation and the future of Marketing. 100x Marketing Services appreciates you.
@swannschilling474 Před měsícem
Awesome update!
@HybridRobotix Před měsícem
I love your videos! Keep up the great work! 😉😁
@raghavsharma6430 Před měsícem
Thanks for sharing 😀
@Asman2021 Před měsícem
The answer you got when you asked the model 'How many words are there in your response' was exactly the same when I tested the question in LLaMA3.
Před měsícem
@matthew_berman @4:50 your slide shows a table with Context and Long-context numbers. What is Long-context?
@briancase6180 Před měsícem ⁺³²
Dense: not an MoE or other technique that leaves some weights unengaged. Note: your killers question and answer are wrong. The killed person is STILL THERE. That killer is merely dead. Killing someone doesn't make them disappear. So, your question is likely confusing to the model and you are not realizing the answer is incomplete at best.
@4.0.4 Před měsícem ⁺²
I believe a considerable amount of benchmark questions could probably be nitpicked like that. This is why I value Elo ratings more than benchmarks.
@ken5957 Před měsícem
Should we expect that it should be able to give both answers and explain as per your comment? That would surely demonstrate a much higher level of understanding?
@ekstrajohn Před měsícem
i have written this comment three time by now. agree.
@briancase6180 Před měsícem ⁺³
An LLM's response quality is very dependant on the quality of the prompt. You can't blame the response if the prompt is insufficient.
@thetabletopskirmisher Před měsícem
He's looking for an answer: 3 or 4.
@I-Dophler Před měsícem ⁺⁴
Awesome rundown! The new Reka models are like a fresh breeze in the AI landscape-kudos for the epic testing! Your insights make the complex seem simple, and I’m eager to see what other AI treasures you unearth next. More power to your tech explorations!
@briancunning423 Před měsícem ⁺¹
Great video.
@aloveofsurf Před měsícem
"I'm not actually sure what dense really means," said a funny guy @round4:52 :-)
@jamesyoungerdds7901 Před měsícem
Hey Matt, just a long time fan and love the content. And just in watching this video, we have to realize that at some point, there will be a convergence on what multi-modal transformer-based models can achieve. I think we're a ways away, but right now in the evolution of the technology, it's kind of reminiscent of the heady days of the "megapixel race" of early digital cameras and then phones. I remember being so excited with every new sensor density, etc.
Eventually, it becomes a given that consumer phones take great pics/videos, enthusiast grade cameras take better pics/videos, and industry professional grade cinematographic cameras take amazing pics/videos for movies. So I feel like we're in the "megapixel race" equivalent of a.i. models. What do you think?
@MeinDeutschkurs Před měsícem ⁺⁴
Video Test Suggestion: create a 10 seconds clip, place on a random frame an overlay with the text: „My name is Matt“. Tell the model that you have hidden your Name inside the video and ask it to tell you your name. It should return the timecode as well.
Prompt:
Let‘s play a game. Somewhere in the Video I placed my Name. Tell me the Name and return the timecode as well.
@OliNorwell Před měsícem ⁺²
You could make this better by having the words appear one after another, and indeed in different fonts. Additionally you could have one fact about yourself as well, e.g. "I like strawberry ice cream".
So the prompt would be "In the video I reveal my name, and one fact about myself, please respond with my name and the fact I reveal".
Then it means a model might pick up the name but not the fact, and we can see different levels of performance.
@MeinDeutschkurs Před měsícem ⁺²
@@OliNorwell , Great. + timestamps.
@dhruvmehta2377 Před měsícem ⁺⁸
It is also note worthy to see that nowadays models are just trained to get to pass the benchmark scores rather than providing high quality knowledge output to the users answers I mean just everyone wants to beat competition....
@scrollop Před měsícem
He said that in the video.
@bobbykeene12 Před měsícem
That is a good point.
@lucv9329 Před měsícem
As long as the companies use "alignment" =censorship, you will gets subpar answers. And with wokeness they will get more aberrant and somehow silent on the most important issues.
@MikeEpler Před měsícem
Video Test : 1) Identify Genre; 2) Identify characters including names if possible; 2) Provide a list of objects and their brand name; 3) Provide transcript; 4) Provide a numbered list of actor/actress names; 5) Provide a one paragraph summary of no more than 5 sentences; 6) Document the video using an annotated script format; 7) What is the plot of the video; 8) Provide your overall reaction to the video; 9) Provide a list of time coded scene changes; 10) Describe the behaviors of the characters in the video; 11) Identify any discernable street addresses, license plate numbers; 12) Identify if this video has been tampered with; 13) Identify if you see fire, smoke, explosion 14) Identify if this individual or object exists in this video and provide the timeframe; 15) Do the above by Frame. That should be a good start. Videos should be a set of movie scenes with different genres; a commercial; a set of animations; an interview; a panel session;
@danberm1755 Před měsícem ⁺¹
From ChatGPT:
The terms "7 billion parameter dense model" and "7 billion parameter model" generally refer to the same concept. Both phrases describe a model that contains 7 billion parameters. When someone specifies "dense," they're emphasizing that the model uses a traditional architecture where all parameters are used in the computation for each input, as opposed to "sparse" models where only a subset of parameters are activated based on the input. But if there's no specific mention of "sparse" or any other qualifier, it's usually safe to assume the model is dense. Is there anything specific about these models you're curious about?
@DailyTuna Před měsícem
I think a short video that gives perspective, foreground and background.. Somebody walking in the foreground, maybe reading a book omd there’s a boat in the background or something like that. Something in the background that’s doing an action.
@bestemusikken Před měsícem
The killer problem is actually correct. You asked how many killers, and explain reasoning. It did both, and gave correct final answer. That the fourth step was wrong, had nothing to do with the question the way you asked it.
@borisverhaar190 Před měsícem ⁺¹
Where is the link for the video of the 3 body problem?
@360_SA Před měsícem ⁺¹
Your test is famous and could be trained in the new llms. I know you want to be fair for all models, but if you change the numbers or names or number of killers change the cup into glass change microwave to something else how many words + 2 or other number
@cpaps8500 Před měsícem ⁺¹
As for video tests, what about plugging in a video of you doing your rubric tests on the model. That would make it fairly meta
@zippytechnologies Před měsícem
Works if you modify cup problem and explicitly state it's a cup without a lid
@kocahmet1 Před měsícem
great video bro as usual
video test idea:
record yourself doing a specific task. make 1 delibarete mistake.
then ask: what was my mistake and what is the correct way.
@aidenhuntley Před měsícem
Hey Matthew, great video as always, just a thought on how you do your scoring. Maybe 1 pass, 0 fail, but your 10 sentences that end in "apple" could be 0.9 if it got 9 right. I think a few of your questions could be on a scale.
@jim7060 Před měsícem
Hi Matt ✌️
From copilot ;
Good morning! In the context of computers, the term "64 dents" isn't a standard term. However, it could be a misinterpretation or a typo of "64-bit," which refers to the architecture of a computer processor. A 64-bit processor is capable of handling more data at once, accessing more memory, and performing more complex calculations than a 32-bit processor.
If "64 dents" was meant literally, as in physical dents on a computer, it would simply mean that there are 64 physical indentations or damages on the hardware. This is not a common scenario, and if a computer did have that many dents, it might be a cause for concern regarding the device's integrity and functionality.
@hopper2716 Před měsícem
@Matthew Berman when do you think we will see models that can be used as game AI opponents?
@natecote1058 Před měsícem
This is sweet. OpenAI is finally feeling some pressure. A couple more break throughs and we could see them finally pushed to make a leap forward.
@bseddonmusic1 Před měsícem
A notable absentee from the chart at around 4 minutes is Cluade 3 Haiku. My guess the reason its missing is because it would knock Reka Edge out of the graph. It's pretty good (like GPT 3.5 level) and very, very cheap. Oh, and a 200k context window
@davidc1179 Před měsícem
BTW, regarding the "how many words are there in your response to this prompt?" question, the model may get confused by the system prompt. You can try it with a local LLM, just set the system prompt to the empty string and type your question. You'll probably get a more accurate answer.
@user-zh3zb7fw2j Před měsícem
I've asked various chatbots the questions below. I'd like to know what you (Mr.Berman) think:Help me think in the context below and then also help me answer that Question:{"The quick brown fox jumps over the lazy dog.",
"A dog is a man's best friend.",
"Cats are independent creatures.",
"I love to read books.",
"Programming is a fun and rewarding skill."}. Question is "What are animals like?" Thank You Mr.Berman
@juanjesusligero391 Před měsícem
Nice video, Matt! Thank you for making the tests! :D
New Reka model looks pretty meh, though XD
@chackie6656 Před měsícem ⁺³
Everytime I watch those tests, I expect that the answer for killer problem will be 4 killers (dead killer is still a killer). But I don't think any ai gave this answer previously
@lucv9329 Před měsícem
Dead killer is no longer a killer. We speak of deceased in past tense.
@Oscaragious Před měsícem
Video test idea: Object permanence. Similar to some of your other text tests, can the AI track where an object is after you hide. it. E.g. a video of a cup and ball game -> Show the ball, put it underneath one of 3 cups, move them around, and then ask where is the ball.
@ronenen Před měsícem
The cost for input data is more important that that of the output, especially when you send an image to the model, and get text in return.
@batmanofsweden5920 Před měsícem ⁺¹
I just realized one thing regarding the killer test. Since the test isn't stating that the killers needs to be alive to be treated as a killer, how can it assume that without a questionng it, since technically and legallly there's 3 living killers and 1 dead killer therefore 4 killers in the room.
@ateshirsizi2 Před měsícem
i think dense model is mostly created by dense layers that means lots of connection with other layers
@cpaps8500 Před měsícem
I ran into some hallucination issues with llama 3 today. It made me think maybe you should include hallucination type questions in your rubric. What do you think?
@ZeroIQ2 Před měsícem
Matthew one common problem that goes back a long time with CSV files is handling text that has a quote character in them, the answer is simple, you just double the quotes.
I wonder if the AI knows that.
So for example:
Henry Smith | Lisa "Jane" Addams | John Brown
Should become:
"Henry Smith","Lisa ""Jane"" Addams", "John Brown"
@mshonle Před měsícem
Yeah, I second the suggestion of using a street magician’s CZcams short. There are many around and I’m sure if you tell them you’ll leave them a link (and that this is only used for eval, not training) there there should be at least one magician that will say “ok”.
@propriedadepeculiar Před měsícem
For videos: In the following video, what is moving and what is static?
@BruceChouTW Před měsícem
Very good model. I just don’t understand why they ignore claude 3 haiku in the model comparison diagram.
@robertheinrich2994 Před měsícem
the way it tried to solve the "how many words are your response" question is creative. and if it did not give that initial explanation, it would have been a clear pass.
at 13 and 14, I would actually count twenty one as two words, whereas it counted twenty-one as one word and (21) as a second word.
the solution might not be correct at the end, but it is a very creative take on this problem.
in nit-picking mode: are the numbers where it counted the words words themselves?
@4.0.4 Před měsícem
Whoa, a new open mo-
Nevermind.
Also, I tried it and... it got everything wrong, but sounded really smart!
@JonathanStory Před měsícem
1. # of words in next response is a fail if the number doesn't equal *all* the words in the entire response. Arriving at the correct answer requires a level of abstraction. 2. For the hole question, I wonder if it would make a difference is the maximum diameter were specified, since that would influence how many can dig at the same time, I think.
@scottleturno5647 Před měsícem
The "Leturno Test": A Multimodal Video Analysis Test (MVAT)
The Leturno Test is designed to evaluate the ability of AI systems to accurately interpret and relate disparate multimedia content. By presenting AI with video and audio streams that are intentionally mismatched, the test assesses the system's capability to detect and reason about the lack of correlation between the two modalities. The subject of at least one (video or audio) must be nuanced and normally up to human interpretation such as humor, philosophy or mood interpretation.
Key Features of MVAT:
Content Selection: Utilizes privately sources non-indexed, complex, and nuanced content in both video (e.g., animals in unusual activities) and audio streams (e.g., unrelated lectures or comedy). It is key that neither the video or audio elements come from material that could have been used to train the AI which could give an AI an advantage.
Mismatch Complexity: Includes a variety of test scenarios that range from subtle to obvious mismatches, challenging the AI's depth of understanding and contextual reasoning.
Evaluation Metrics: Measures the AI's performance based on its ability to
1.) identify mismatches and its effectiveness in explaining the nature of these discrepancies.
2.) Interpret the normally humanly discerned meaning such as humor, philosophy or mood interpretation.
Implementation Ethics: Ensures all content is ethically sourced and privacy-compliant, fostering responsible AI testing practices.
Benefits of MVAT:
Enhanced Generalization: Tests AI's ability to generalize from training data to new, unstructured real-world scenarios.
Rich Insights: Provides detailed insights into the AI's multimodal processing capabilities, highlighting areas for improvement.
Scalable and Reproducible: Designed for easy replication and scalability, allowing widespread use and validation across different AI platforms.
Adopt the Leturno Test: This MVAT is an essential tool for researchers and developers aiming to push the boundaries of AI's interpretative and contextual capabilities in multimodal environments.
@jasonk125 Před měsícem ⁺¹
Use the "Charlie bit my finger" classic!
@joyflowmonger248 Před měsícem ⁺¹
How about a very short video snippet of "Charlie bit me" or Numa Numa dance? A longer video with extraordinary nuance is called Validation short film. It is ultra funny, poignant, and inspiring. You may have to shrink it to thumbnail size, or edit it down a bit, but it would test your multimodal model like no other.
@BorgPrincess Před měsícem
My AI prof used to say: "The goal is to create an AI that you can show an old, silent Laurel & Hardy movie to, that then will know when to laugh."
Which is a test I enjoy so much more than AIs condemning people as killers, right or wrong. 😊
@Cb138inRs Před měsícem
Here's a simple idea for a video. Maybe do one that flashes some numbers. One after another. Say. 1, 3, 5, 7, and then ask for the next 4 numbers, so it has to identify it is naming prime numbers. OR. Just do something simple like. 2, 5, 7, 13, 20, etc. ask for the next 5 numbers. so 33, 53, 89, 142. And this is showing that you're just adding the previous numbers together.
OR. Show someone shivering with a t-shirt on. And another person seeing this and giving them their jacket, and the person who gave up there jacket is wearing a sweatshirt so is not cold.
OR. Determine which object would hit the table first, a bowling ball or a tennis ball.
OR the simple, If you had a pound of rocks and a pound of cotton balls. Which one weighs the most, Or which pound of the item is biggest. Maybe using a scale with a question mark of it tettering back and forth. Asking the Ai to interpret the question and also give answer. With no exact instructions on the given question. Just blank spaces for the Ai to fill in itself
@cristian15154 Před měsícem
nice, but also please consider testing and comparing lama 3 with the big llms, specially in coding.
@careyatou Před měsícem ⁺¹
For a video test, you should have models review part of steam boat Willy since it's in the public domain now ¯\_(ツ)_/¯
@marcfruchtman9473 Před měsícem
Thanks for the video.
Re: getting the "How many words are in this prompt" -- wow, yea, I would accept that as "Correct". Do it again, just to validate it.
I agree that the killers problem should be marked wrong.
I totally disagree with giving the "Hole Digging" problem a pass -- The whole point of the problem is to discern if it understands the spatial issues involving work not being directly proportional to the number of workers.
@marcfruchtman9473 Před měsícem
When doing the "CSV" test, please add some "Empty" cells... in Column A on Row 2, then see if it gets it right.
@stevemerollis5508 Před měsícem
Yeah I had been meaning to suggest you outline your requirements a little more clearly in the 'number of words' question but it's kind of tricky because a model needs to grasp from its training that the rule is the rule and it might not care, e.g. "Step one: look for the sign that says 12 items or less' by the other checkout lanes". For its purposes here the rules do get a little dicey, and I believe have shifted over the last couple of decades have popularized numeral use. They also depend on the quantity but not in the ranges it's going to choose from... probably. The general principle is that these are multiple compound adjectives that form one word but as to should it know that? Yes. Will it get distracted by the potential complexity? Yes. Should that be a disqualifying failure? I'd say so -- you don't get a pass because you made mistakes nobody expected you to make and in likelihood are only the first to be noticed making.
In other words, nine : ninety-nine :: orcs : olyphants I've got a thumbs down and a "still only counts as one" for it but boy I sure did call it about the cognitive load problem about a month ago.
I wish more training data were composed to encourage a "Certainly! Just to clarify really quickly though --" follow up question. If the alternative is kicking the can down the road into mission debriefs of "Whether that answers your question or not, we have all these potential mitigating factors to review now" we're just squandering what we gained, especially as agentic workflows circulate the bad blood. Chain of thought only ameliorates hallucination as much as the foundation it's built on can support.
***I have to jump back in with an edit here to illustrate the point. A few hours later I'm rewatching for any detail about availability to think about swapping out LLaVA; I was only half paying attention the first time, but when I heard that "interestinggggg..." it naturally drew my focus.. My eye had been drawn to the same thing your attention was drawn to also, Matthew -- the separate items for the number as words and numerals. I wrote this entire comment without having noticed at all that there's a whole other question to be posed -- a whole example of the point I was trying to make -- in the fact that it got distracted so hard it lost track of where your point of interest ended and its answer begam. Boy, what kind of a big dumb idiot couldn't.. even... .... ...aw. my feelings hurt themselves.
@TiagoTiagoT Před měsícem
Perhaps it might be useful to get some old-school public domain cartoons and movie scenes for the video tests? I imagine anything else might involve a lot more work be sure you can use it without wasting money on lawyer to reinstate your videos.
Though, maybe that would still be a lot of work 'cause if you go for the more popular stuff it's probably already been described in text form a lot on the web... Hm.... Are there are any free b-roll/stock footage services out there?
@mckgtr Před měsícem
The gorilla test would be a cool one to try.
@JubileuKing Před měsícem
Best answer yet on how many words on your next response question. However, still not there for me. One of the answers I've always been expecting is "One". However, that's not the reason I think it didn't pass.
@bobbykeene12 Před měsícem
For the video test, show it a video of a wire frame for a simple app and see if it can code it.
@timothywcrane Před měsícem
Impressed with the product. Nice API pricing model (matching models w/ model/compute) but at ridiculous prices.
People and organizations need to differentiate between if you are a customer or an investor, because in this case they maybe should have been honest about wanting more investment. (not SEC mantra, I clearly differentiated between my view of the pricing model and price. This is about the entry API price for core)
@timothywcrane Před měsícem
The pricing is in line with GPT4 currently, so it is market rate. I can't see "Enterprising" anything on LLMs as an SMB at those rates and this ability is why many SMB will love this tech, but if you thwart it's "minority to scale" capabilities financially, you will be biting the hand that feeds you ( Your funding round trust dries). So market rate is ridiculous (I know about ROI for investors come through revenue, there is room for nuance) . It's either FOSS or FUSS I know.
@stonedoubt Před měsícem
Video test should be to describe the environment changes in the video
@thenoblerot Před měsícem ⁺¹
I'd definitely give it a pass for the "how many words in response" query. Imo, it's like it recognized it's own failure mode and worked around it.
Edit: also, try puttong the word 'apple' in some sort of quote or delimiter. Seems to make a difference... Shouldn't... But does. 😂
@red_onex--x808 Před měsícem
Yes 4. Killer who got killed is remaining, but not in the context of the query- cleaver try but busted ❌ just like before no worker I. Yellow is a “ close by no cigar “ for the cpu ( we still are needed, …for the time being )
@settlece Před měsícem
Reka coming in and Rekaing the joint
don't know how you do it but you keep releasing fantastic news digging around you must be non stop searching the webs for this stuff
as for what dense means i don't know either my teacher used to constantly call me it
fishy i think the how many words in your answer question
@southcoastinventors6583 Před měsícem
Easiest is for multimodel ask it to extract all the burn in subtitles in a video and output them with a the timestamp.
@shahidlatif2492 Před měsícem
Please sir create a playlist on prompt engineering
@PeterDrewSEO Před měsícem
At least one test should be a 20 second cartoon clip that has an easy to understand narrative in it. Like a bugs bunny and the road runner short easy to understand scene.
@Cb138inRs Před měsícem
I've always wanted to see these models answer this logic question. If Eric, and Tina are laying in the grass looking at the clouds over head. They are laying with the tops of there heads nearly touching in a straight line Foot to Head to Head to Foot (idk how to word that). Eric sees a cloud and claims its the number 6. However Tina says, she sees the number 9. Who is wrong and explain why. A double negative to try and trip the Ai up. As neither are wrong. It's all about perspective.
@BizAutomation4U Před měsícem
where's Grok on that scale you showed ?
@yashrajpmaher Před měsícem
Mr.Berman you should add multi audio tracks to your videos using AI .
@peterrandu7666 Před měsícem
good
@juliandunn8412 Před měsícem
Didn't even put the link in the description?
@DefaultFlame Před měsícem
10:38 That is a pass on the "how many words are in the response to this prompt" question, I would say.
@percy9228 Před měsícem
I've been playing with the marble prompt and I found out something interesting, if you start asking it to use sciense it always gets it wrong but if you don't it gets it right, but you have to put emphasis on the marble like this
"A small marble is put into a normal cup and the cup is placed
upside down on a table. Someone then takes the cup without changing its orientation and puts it inside the microwave. what happened to the marble and where is it now?"
so I used no science and I specifically focused on the marble. As soon as I said explain using science it gets confused. shows that it can kind of do spacial reasoning but it doesn't well understand physical laws properly. but you can try this prompt and it works on llama 3 too.
@mirek190 Před měsícem
That was made before llama 3?
Because you mentioned "apples problem " only was solved by OPUS.
@ebandaezembe7508 Před měsícem
Matthew we would like you to increase the level of your tests, these new models seem to pass them all easily🙂
@Pikacshuu Před měsícem
I think you should start being more strict with your judgement. With some models you are being more forgiven than with others. Great video as usual.
@ZeroIQ2 Před měsícem
some cups have lids to keep the drink hot, maybe you have to tell the AI that the cup doesn't have a lid
@mikey1836 Před měsícem
The answer is definitely right with the 21.
@michaelrichey8516 Před měsícem
I think "how-many-words" is a pass, because it had to list the words to count them, then update its own response to reflect that number - and it got it right ( half right, since the output was doubled to include the list, possibly tripled if you count the numbers for the ordered list)
Neat to see one get it (sort of) right though.
@LordThanathos Před měsícem
About the video test: just ask it to explain a meme in video. They're a good measure of the LLM's capabilities to understand reality as we humans do.
@janalgos Před měsícem
I think you should change the prompt to "five sentences that *each* end in the word apple". Because if you just say "five sentences that end in the word apple", it's technically correct as long as the last sentence among the five sentences ends in the word apple. also if all the models get the cup question wrong at what point do you consider that the prompt itself is the problem rather than the models answering the prompt?
@jonyfrany1319 Před měsícem
Sounds like langchain behind an API
@gabrielsandstedt Před měsícem
hmm they intentionally did not include Claude 3 haiku model since that one is best value / token and speed 3:38
@jayfran3970 Před měsícem ⁺¹
The prompt response is a PASS. Once numbers or equations (i.e. (21)) constitute any portion of your overall content, then they would be included. Like if you agree!
@bewareofthecow Před měsícem
Jane has 4 pairs of blue earrings, 3 pairs of purple earrings, 4 pairs of green earrings, and 6 pairs of yellow earrings. There is a power failure and she can't switch on the light. How many earrings is it necessary to pick, before she is certain to have a matching pair?
@robertheinrich2994 Před měsícem
regarding the killer-question: I still wait for an LLM that answers that the now dead killer is still a killer, although no longer alive.
it of course is logical to only count living killers as they would pose a threat.
this one came somewhat close, but messed up with the simple fact that 3 + 1 is 4.
@SomeoneExchangeable Před měsícem
Video test: you could use something analogous to the meme test: Show it a cartoon video, ask it to describe it and tell you why it's funny.
@davidbangsdemocracy5455 Před měsícem
Me: How do iconic movies typically depict the act of breaking into a car? Please give examples.
Reka: Iconic movies typically depict the act of breaking into a car in a few different ways, depending on the context and the character's motivations. Here are some examples:
1. Using a slim jim or other tool to unlock the door: This is a common method shown in movies, where a character uses a thin, flexible tool to manipulate the lock mechanism and gain entry. Examples include "The Italian Job" (2003) and "Baby Driver" (2017).
2. Breaking a window: In some movies, characters break a car window to gain access. This is often shown as a more forceful and dramatic method, and may involve using a hammer or other blunt object. Examples include "The Fast and the Furious" (2001) and "Heat" (1995).
3. Hotwiring the car: This method involves bypassing the car's ignition system and starting the engine without a key. It's often portrayed as a complicated and risky process, requiring specialized knowledge and tools. Examples include "Gone in 60 Seconds" (1974) and "The Fast and the Furious" (2001).
4. Using a key or duplicate key: In some movies, characters have access to a legitimate key or a duplicate key, allowing them to unlock and start the car without breaking in. Examples include "The French Connection" (1971) and "The Town" (2010).
Overall, movies tend to depict car theft as a high-stakes, adrenaline-fueled activity, often involving skilled criminals or desperate characters. While some of these methods may be glamorized or exaggerated for dramatic effect, they do reflect real-world techniques used by car thieves.
@bosthebozo5273 Před měsícem
Chat GPT 4 (bing) did the sentence apple test with ease
@abdelhakkhalil7684 Před měsícem
I think dense here mean a model that is not Mixture of Expert. In other words, a single dense model. Maybe their larger model is a MoE.
@GaryMillyz Před měsícem ⁺¹
Awwww yeeeeeah

Další v pořadí

Automatické přehrávání

AI's Future, GPT-5, Synthetic Data, Ilya/Helen Drama, Humanoid Robots- Sam Altman Interview