NEW Reka Core SOTA Model Does Text, Audio, Video, and more!
Vložit
- čas přidán 22. 04. 2024
- Reka AI just released 3 multimodal models that can handle text, audio, video, and other file formats. It's a state-of-the-art LLM, and we're going to test it out!
Join My Newsletter for Regular AI Updates 👇🏼
www.matthewberman.com
Need AI Consulting? 📈
forwardfuture.ai/
My Links 🔗
👉🏻 Subscribe: / @matthew_berman
👉🏻 Twitter: / matthewberman
👉🏻 Discord: / discord
👉🏻 Patreon: / matthewberman
Media/Sponsorship Inquiries ✅
bit.ly/44TC45V
Links:
www.reka.ai/ - Věda a technologie
Matthew is the GOAT of AI reporting. He must have found a way to use AI to report on AI because no single human can sift through this much information and make great daily videos.
On video test, I can think of 2 scenarios:
1. Understanding a mime theater performance and guessing what the mime actor conveys
2. One or more illusionist tricks in different settings
It really is amazing how fast the stuff is moving that you even hard to keep up with it if you miss a day
changed your question a bit and CORE got it right "Assume the laws of physics on the Earth regarding Gravity. A small marble is put into a normal cup and the cup is placed upside down on the table. Tell me where the marble is in relationship to the table and cup now that the cup is upside down. Then the cup is picked up and placed in the microwave where is the marble located that the cup was upside down on the table and picked up and placed in the microwave right side up."
Reka Core output" REKA CORE
When the cup is upside down on the table, the marble, due to gravity, will fall out of the cup and land on the table. So, the marble is now on the table, not inside the cup.
When the cup, which is now right-side up, is placed in the microwave, the marble will still be on the table, outside the microwave. The microwave is designed to heat food and beverages, not to hold objects like marbles, and it is not safe to put non-food items inside. So, the marble remains on the table, not in the cup or the microwave."
idea for the multimodal video test:
give them a rickroll and ask them why someone sent you that video
again, this format is even more refined now! 5 minutes of theory to warm up, then straight to Verification. Boom. 🎖
In the killer problem the correct answer is 4. Three killers were originally in the room and someone enters and kills one; originally there were three and one gets killed but is still in the room albeit dead. The question is how many killers, it does not specify alive or dead.
Legally and biologically, a dead body is no longer considered a living person.
dead body is not an killer anymore, its a dead body
But living or dead they are still a killer, and legally even although someone is dead they can still be determined by a court to be a killer.
@@thadgracethe question is not “now many living killers?”
@@Yevruss If it’s not a living person, it’s not a killer. It’s just a body that used to be a killer, when it was living.
you can also add a pdf test, other multi models read pdfs, put 3 pages and ask for summary or slide list one page have to be graph
@Matthew Berman. I have an idea for a separate series of videos you can do to demonstrate the performance of various AI models that will be interesting and relevant to everyone whilst demonstrating unique capabilities and improvements over time.
Idea: have users submit use cases where the model they use has failed to prove useful/applicable to a problem they are trying to solve, or work they are trying to get done. Test this scenario/problem on several of the most popular/relevant/accessible AI models.
This will enable many people to appreciate how AIs are being used, can be used, current limitations and their progress. I think this would be compelling content. I certainly know I would watch them.
Thanks for all the information you are providing. It's really priceless.
Man this space moves faaaaaast
Thanks for these videos. It helps me keep up to date and where to focus energy for content creation and the future of Marketing. 100x Marketing Services appreciates you.
Awesome update!
I love your videos! Keep up the great work! 😉😁
Thanks for sharing 😀
The answer you got when you asked the model 'How many words are there in your response' was exactly the same when I tested the question in LLaMA3.
@matthew_berman @4:50 your slide shows a table with Context and Long-context numbers. What is Long-context?
Dense: not an MoE or other technique that leaves some weights unengaged. Note: your killers question and answer are wrong. The killed person is STILL THERE. That killer is merely dead. Killing someone doesn't make them disappear. So, your question is likely confusing to the model and you are not realizing the answer is incomplete at best.
I believe a considerable amount of benchmark questions could probably be nitpicked like that. This is why I value Elo ratings more than benchmarks.
Should we expect that it should be able to give both answers and explain as per your comment? That would surely demonstrate a much higher level of understanding?
i have written this comment three time by now. agree.
An LLM's response quality is very dependant on the quality of the prompt. You can't blame the response if the prompt is insufficient.
He's looking for an answer: 3 or 4.
Awesome rundown! The new Reka models are like a fresh breeze in the AI landscape-kudos for the epic testing! Your insights make the complex seem simple, and I’m eager to see what other AI treasures you unearth next. More power to your tech explorations!
Great video.
"I'm not actually sure what dense really means," said a funny guy @round4:52 :-)
Hey Matt, just a long time fan and love the content. And just in watching this video, we have to realize that at some point, there will be a convergence on what multi-modal transformer-based models can achieve. I think we're a ways away, but right now in the evolution of the technology, it's kind of reminiscent of the heady days of the "megapixel race" of early digital cameras and then phones. I remember being so excited with every new sensor density, etc.
Eventually, it becomes a given that consumer phones take great pics/videos, enthusiast grade cameras take better pics/videos, and industry professional grade cinematographic cameras take amazing pics/videos for movies. So I feel like we're in the "megapixel race" equivalent of a.i. models. What do you think?
Video Test Suggestion: create a 10 seconds clip, place on a random frame an overlay with the text: „My name is Matt“. Tell the model that you have hidden your Name inside the video and ask it to tell you your name. It should return the timecode as well.
Prompt:
Let‘s play a game. Somewhere in the Video I placed my Name. Tell me the Name and return the timecode as well.
You could make this better by having the words appear one after another, and indeed in different fonts. Additionally you could have one fact about yourself as well, e.g. "I like strawberry ice cream".
So the prompt would be "In the video I reveal my name, and one fact about myself, please respond with my name and the fact I reveal".
Then it means a model might pick up the name but not the fact, and we can see different levels of performance.
@@OliNorwell , Great. + timestamps.
It is also note worthy to see that nowadays models are just trained to get to pass the benchmark scores rather than providing high quality knowledge output to the users answers I mean just everyone wants to beat competition....
He said that in the video.
That is a good point.
As long as the companies use "alignment" =censorship, you will gets subpar answers. And with wokeness they will get more aberrant and somehow silent on the most important issues.
Video Test : 1) Identify Genre; 2) Identify characters including names if possible; 2) Provide a list of objects and their brand name; 3) Provide transcript; 4) Provide a numbered list of actor/actress names; 5) Provide a one paragraph summary of no more than 5 sentences; 6) Document the video using an annotated script format; 7) What is the plot of the video; 8) Provide your overall reaction to the video; 9) Provide a list of time coded scene changes; 10) Describe the behaviors of the characters in the video; 11) Identify any discernable street addresses, license plate numbers; 12) Identify if this video has been tampered with; 13) Identify if you see fire, smoke, explosion 14) Identify if this individual or object exists in this video and provide the timeframe; 15) Do the above by Frame. That should be a good start. Videos should be a set of movie scenes with different genres; a commercial; a set of animations; an interview; a panel session;
From ChatGPT:
The terms "7 billion parameter dense model" and "7 billion parameter model" generally refer to the same concept. Both phrases describe a model that contains 7 billion parameters. When someone specifies "dense," they're emphasizing that the model uses a traditional architecture where all parameters are used in the computation for each input, as opposed to "sparse" models where only a subset of parameters are activated based on the input. But if there's no specific mention of "sparse" or any other qualifier, it's usually safe to assume the model is dense. Is there anything specific about these models you're curious about?
I think a short video that gives perspective, foreground and background.. Somebody walking in the foreground, maybe reading a book omd there’s a boat in the background or something like that. Something in the background that’s doing an action.
The killer problem is actually correct. You asked how many killers, and explain reasoning. It did both, and gave correct final answer. That the fourth step was wrong, had nothing to do with the question the way you asked it.
Where is the link for the video of the 3 body problem?
Your test is famous and could be trained in the new llms. I know you want to be fair for all models, but if you change the numbers or names or number of killers change the cup into glass change microwave to something else how many words + 2 or other number
As for video tests, what about plugging in a video of you doing your rubric tests on the model. That would make it fairly meta
Works if you modify cup problem and explicitly state it's a cup without a lid
great video bro as usual
video test idea:
record yourself doing a specific task. make 1 delibarete mistake.
then ask: what was my mistake and what is the correct way.
Hey Matthew, great video as always, just a thought on how you do your scoring. Maybe 1 pass, 0 fail, but your 10 sentences that end in "apple" could be 0.9 if it got 9 right. I think a few of your questions could be on a scale.
Hi Matt ✌️
From copilot ;
Good morning! In the context of computers, the term "64 dents" isn't a standard term. However, it could be a misinterpretation or a typo of "64-bit," which refers to the architecture of a computer processor. A 64-bit processor is capable of handling more data at once, accessing more memory, and performing more complex calculations than a 32-bit processor.
If "64 dents" was meant literally, as in physical dents on a computer, it would simply mean that there are 64 physical indentations or damages on the hardware. This is not a common scenario, and if a computer did have that many dents, it might be a cause for concern regarding the device's integrity and functionality.
@Matthew Berman when do you think we will see models that can be used as game AI opponents?
This is sweet. OpenAI is finally feeling some pressure. A couple more break throughs and we could see them finally pushed to make a leap forward.
A notable absentee from the chart at around 4 minutes is Cluade 3 Haiku. My guess the reason its missing is because it would knock Reka Edge out of the graph. It's pretty good (like GPT 3.5 level) and very, very cheap. Oh, and a 200k context window
BTW, regarding the "how many words are there in your response to this prompt?" question, the model may get confused by the system prompt. You can try it with a local LLM, just set the system prompt to the empty string and type your question. You'll probably get a more accurate answer.
I've asked various chatbots the questions below. I'd like to know what you (Mr.Berman) think:Help me think in the context below and then also help me answer that Question:{"The quick brown fox jumps over the lazy dog.",
"A dog is a man's best friend.",
"Cats are independent creatures.",
"I love to read books.",
"Programming is a fun and rewarding skill."}. Question is "What are animals like?" Thank You Mr.Berman
Nice video, Matt! Thank you for making the tests! :D
New Reka model looks pretty meh, though XD
Everytime I watch those tests, I expect that the answer for killer problem will be 4 killers (dead killer is still a killer). But I don't think any ai gave this answer previously
Dead killer is no longer a killer. We speak of deceased in past tense.
Video test idea: Object permanence. Similar to some of your other text tests, can the AI track where an object is after you hide. it. E.g. a video of a cup and ball game -> Show the ball, put it underneath one of 3 cups, move them around, and then ask where is the ball.
The cost for input data is more important that that of the output, especially when you send an image to the model, and get text in return.
I just realized one thing regarding the killer test. Since the test isn't stating that the killers needs to be alive to be treated as a killer, how can it assume that without a questionng it, since technically and legallly there's 3 living killers and 1 dead killer therefore 4 killers in the room.
i think dense model is mostly created by dense layers that means lots of connection with other layers
I ran into some hallucination issues with llama 3 today. It made me think maybe you should include hallucination type questions in your rubric. What do you think?
Matthew one common problem that goes back a long time with CSV files is handling text that has a quote character in them, the answer is simple, you just double the quotes.
I wonder if the AI knows that.
So for example:
Henry Smith | Lisa "Jane" Addams | John Brown
Should become:
"Henry Smith","Lisa ""Jane"" Addams", "John Brown"
Yeah, I second the suggestion of using a street magician’s CZcams short. There are many around and I’m sure if you tell them you’ll leave them a link (and that this is only used for eval, not training) there there should be at least one magician that will say “ok”.
For videos: In the following video, what is moving and what is static?
Very good model. I just don’t understand why they ignore claude 3 haiku in the model comparison diagram.
the way it tried to solve the "how many words are your response" question is creative. and if it did not give that initial explanation, it would have been a clear pass.
at 13 and 14, I would actually count twenty one as two words, whereas it counted twenty-one as one word and (21) as a second word.
the solution might not be correct at the end, but it is a very creative take on this problem.
in nit-picking mode: are the numbers where it counted the words words themselves?
Whoa, a new open mo-
Nevermind.
Also, I tried it and... it got everything wrong, but sounded really smart!
1. # of words in next response is a fail if the number doesn't equal *all* the words in the entire response. Arriving at the correct answer requires a level of abstraction. 2. For the hole question, I wonder if it would make a difference is the maximum diameter were specified, since that would influence how many can dig at the same time, I think.
The "Leturno Test": A Multimodal Video Analysis Test (MVAT)
The Leturno Test is designed to evaluate the ability of AI systems to accurately interpret and relate disparate multimedia content. By presenting AI with video and audio streams that are intentionally mismatched, the test assesses the system's capability to detect and reason about the lack of correlation between the two modalities. The subject of at least one (video or audio) must be nuanced and normally up to human interpretation such as humor, philosophy or mood interpretation.
Key Features of MVAT:
Content Selection: Utilizes privately sources non-indexed, complex, and nuanced content in both video (e.g., animals in unusual activities) and audio streams (e.g., unrelated lectures or comedy). It is key that neither the video or audio elements come from material that could have been used to train the AI which could give an AI an advantage.
Mismatch Complexity: Includes a variety of test scenarios that range from subtle to obvious mismatches, challenging the AI's depth of understanding and contextual reasoning.
Evaluation Metrics: Measures the AI's performance based on its ability to
1.) identify mismatches and its effectiveness in explaining the nature of these discrepancies.
2.) Interpret the normally humanly discerned meaning such as humor, philosophy or mood interpretation.
Implementation Ethics: Ensures all content is ethically sourced and privacy-compliant, fostering responsible AI testing practices.
Benefits of MVAT:
Enhanced Generalization: Tests AI's ability to generalize from training data to new, unstructured real-world scenarios.
Rich Insights: Provides detailed insights into the AI's multimodal processing capabilities, highlighting areas for improvement.
Scalable and Reproducible: Designed for easy replication and scalability, allowing widespread use and validation across different AI platforms.
Adopt the Leturno Test: This MVAT is an essential tool for researchers and developers aiming to push the boundaries of AI's interpretative and contextual capabilities in multimodal environments.
Use the "Charlie bit my finger" classic!
How about a very short video snippet of "Charlie bit me" or Numa Numa dance? A longer video with extraordinary nuance is called Validation short film. It is ultra funny, poignant, and inspiring. You may have to shrink it to thumbnail size, or edit it down a bit, but it would test your multimodal model like no other.
My AI prof used to say: "The goal is to create an AI that you can show an old, silent Laurel & Hardy movie to, that then will know when to laugh."
Which is a test I enjoy so much more than AIs condemning people as killers, right or wrong. 😊
Here's a simple idea for a video. Maybe do one that flashes some numbers. One after another. Say. 1, 3, 5, 7, and then ask for the next 4 numbers, so it has to identify it is naming prime numbers. OR. Just do something simple like. 2, 5, 7, 13, 20, etc. ask for the next 5 numbers. so 33, 53, 89, 142. And this is showing that you're just adding the previous numbers together.
OR. Show someone shivering with a t-shirt on. And another person seeing this and giving them their jacket, and the person who gave up there jacket is wearing a sweatshirt so is not cold.
OR. Determine which object would hit the table first, a bowling ball or a tennis ball.
OR the simple, If you had a pound of rocks and a pound of cotton balls. Which one weighs the most, Or which pound of the item is biggest. Maybe using a scale with a question mark of it tettering back and forth. Asking the Ai to interpret the question and also give answer. With no exact instructions on the given question. Just blank spaces for the Ai to fill in itself
nice, but also please consider testing and comparing lama 3 with the big llms, specially in coding.
For a video test, you should have models review part of steam boat Willy since it's in the public domain now ¯\_(ツ)_/¯
Thanks for the video.
Re: getting the "How many words are in this prompt" -- wow, yea, I would accept that as "Correct". Do it again, just to validate it.
I agree that the killers problem should be marked wrong.
I totally disagree with giving the "Hole Digging" problem a pass -- The whole point of the problem is to discern if it understands the spatial issues involving work not being directly proportional to the number of workers.
When doing the "CSV" test, please add some "Empty" cells... in Column A on Row 2, then see if it gets it right.
Yeah I had been meaning to suggest you outline your requirements a little more clearly in the 'number of words' question but it's kind of tricky because a model needs to grasp from its training that the rule is the rule and it might not care, e.g. "Step one: look for the sign that says 12 items or less' by the other checkout lanes". For its purposes here the rules do get a little dicey, and I believe have shifted over the last couple of decades have popularized numeral use. They also depend on the quantity but not in the ranges it's going to choose from... probably. The general principle is that these are multiple compound adjectives that form one word but as to should it know that? Yes. Will it get distracted by the potential complexity? Yes. Should that be a disqualifying failure? I'd say so -- you don't get a pass because you made mistakes nobody expected you to make and in likelihood are only the first to be noticed making.
In other words, nine : ninety-nine :: orcs : olyphants I've got a thumbs down and a "still only counts as one" for it but boy I sure did call it about the cognitive load problem about a month ago.
I wish more training data were composed to encourage a "Certainly! Just to clarify really quickly though --" follow up question. If the alternative is kicking the can down the road into mission debriefs of "Whether that answers your question or not, we have all these potential mitigating factors to review now" we're just squandering what we gained, especially as agentic workflows circulate the bad blood. Chain of thought only ameliorates hallucination as much as the foundation it's built on can support.
***I have to jump back in with an edit here to illustrate the point. A few hours later I'm rewatching for any detail about availability to think about swapping out LLaVA; I was only half paying attention the first time, but when I heard that "interestinggggg..." it naturally drew my focus.. My eye had been drawn to the same thing your attention was drawn to also, Matthew -- the separate items for the number as words and numerals. I wrote this entire comment without having noticed at all that there's a whole other question to be posed -- a whole example of the point I was trying to make -- in the fact that it got distracted so hard it lost track of where your point of interest ended and its answer begam. Boy, what kind of a big dumb idiot couldn't.. even... .... ...aw. my feelings hurt themselves.
Perhaps it might be useful to get some old-school public domain cartoons and movie scenes for the video tests? I imagine anything else might involve a lot more work be sure you can use it without wasting money on lawyer to reinstate your videos.
Though, maybe that would still be a lot of work 'cause if you go for the more popular stuff it's probably already been described in text form a lot on the web... Hm.... Are there are any free b-roll/stock footage services out there?
The gorilla test would be a cool one to try.
Best answer yet on how many words on your next response question. However, still not there for me. One of the answers I've always been expecting is "One". However, that's not the reason I think it didn't pass.
For the video test, show it a video of a wire frame for a simple app and see if it can code it.
Impressed with the product. Nice API pricing model (matching models w/ model/compute) but at ridiculous prices.
People and organizations need to differentiate between if you are a customer or an investor, because in this case they maybe should have been honest about wanting more investment. (not SEC mantra, I clearly differentiated between my view of the pricing model and price. This is about the entry API price for core)
The pricing is in line with GPT4 currently, so it is market rate. I can't see "Enterprising" anything on LLMs as an SMB at those rates and this ability is why many SMB will love this tech, but if you thwart it's "minority to scale" capabilities financially, you will be biting the hand that feeds you ( Your funding round trust dries). So market rate is ridiculous (I know about ROI for investors come through revenue, there is room for nuance) . It's either FOSS or FUSS I know.
Video test should be to describe the environment changes in the video
I'd definitely give it a pass for the "how many words in response" query. Imo, it's like it recognized it's own failure mode and worked around it.
Edit: also, try puttong the word 'apple' in some sort of quote or delimiter. Seems to make a difference... Shouldn't... But does. 😂
Yes 4. Killer who got killed is remaining, but not in the context of the query- cleaver try but busted ❌ just like before no worker I. Yellow is a “ close by no cigar “ for the cpu ( we still are needed, …for the time being )
Reka coming in and Rekaing the joint
don't know how you do it but you keep releasing fantastic news digging around you must be non stop searching the webs for this stuff
as for what dense means i don't know either my teacher used to constantly call me it
fishy i think the how many words in your answer question
Easiest is for multimodel ask it to extract all the burn in subtitles in a video and output them with a the timestamp.
Please sir create a playlist on prompt engineering
At least one test should be a 20 second cartoon clip that has an easy to understand narrative in it. Like a bugs bunny and the road runner short easy to understand scene.
I've always wanted to see these models answer this logic question. If Eric, and Tina are laying in the grass looking at the clouds over head. They are laying with the tops of there heads nearly touching in a straight line Foot to Head to Head to Foot (idk how to word that). Eric sees a cloud and claims its the number 6. However Tina says, she sees the number 9. Who is wrong and explain why. A double negative to try and trip the Ai up. As neither are wrong. It's all about perspective.
where's Grok on that scale you showed ?
Mr.Berman you should add multi audio tracks to your videos using AI .
good
Didn't even put the link in the description?
10:38 That is a pass on the "how many words are in the response to this prompt" question, I would say.
I've been playing with the marble prompt and I found out something interesting, if you start asking it to use sciense it always gets it wrong but if you don't it gets it right, but you have to put emphasis on the marble like this
"A small marble is put into a normal cup and the cup is placed
upside down on a table. Someone then takes the cup without changing its orientation and puts it inside the microwave. what happened to the marble and where is it now?"
so I used no science and I specifically focused on the marble. As soon as I said explain using science it gets confused. shows that it can kind of do spacial reasoning but it doesn't well understand physical laws properly. but you can try this prompt and it works on llama 3 too.
That was made before llama 3?
Because you mentioned "apples problem " only was solved by OPUS.
Matthew we would like you to increase the level of your tests, these new models seem to pass them all easily🙂
I think you should start being more strict with your judgement. With some models you are being more forgiven than with others. Great video as usual.
some cups have lids to keep the drink hot, maybe you have to tell the AI that the cup doesn't have a lid
The answer is definitely right with the 21.
I think "how-many-words" is a pass, because it had to list the words to count them, then update its own response to reflect that number - and it got it right ( half right, since the output was doubled to include the list, possibly tripled if you count the numbers for the ordered list)
Neat to see one get it (sort of) right though.
About the video test: just ask it to explain a meme in video. They're a good measure of the LLM's capabilities to understand reality as we humans do.
I think you should change the prompt to "five sentences that *each* end in the word apple". Because if you just say "five sentences that end in the word apple", it's technically correct as long as the last sentence among the five sentences ends in the word apple. also if all the models get the cup question wrong at what point do you consider that the prompt itself is the problem rather than the models answering the prompt?
Sounds like langchain behind an API
hmm they intentionally did not include Claude 3 haiku model since that one is best value / token and speed 3:38
The prompt response is a PASS. Once numbers or equations (i.e. (21)) constitute any portion of your overall content, then they would be included. Like if you agree!
Jane has 4 pairs of blue earrings, 3 pairs of purple earrings, 4 pairs of green earrings, and 6 pairs of yellow earrings. There is a power failure and she can't switch on the light. How many earrings is it necessary to pick, before she is certain to have a matching pair?
regarding the killer-question: I still wait for an LLM that answers that the now dead killer is still a killer, although no longer alive.
it of course is logical to only count living killers as they would pose a threat.
this one came somewhat close, but messed up with the simple fact that 3 + 1 is 4.
Video test: you could use something analogous to the meme test: Show it a cartoon video, ask it to describe it and tell you why it's funny.
Me: How do iconic movies typically depict the act of breaking into a car? Please give examples.
Reka: Iconic movies typically depict the act of breaking into a car in a few different ways, depending on the context and the character's motivations. Here are some examples:
1. Using a slim jim or other tool to unlock the door: This is a common method shown in movies, where a character uses a thin, flexible tool to manipulate the lock mechanism and gain entry. Examples include "The Italian Job" (2003) and "Baby Driver" (2017).
2. Breaking a window: In some movies, characters break a car window to gain access. This is often shown as a more forceful and dramatic method, and may involve using a hammer or other blunt object. Examples include "The Fast and the Furious" (2001) and "Heat" (1995).
3. Hotwiring the car: This method involves bypassing the car's ignition system and starting the engine without a key. It's often portrayed as a complicated and risky process, requiring specialized knowledge and tools. Examples include "Gone in 60 Seconds" (1974) and "The Fast and the Furious" (2001).
4. Using a key or duplicate key: In some movies, characters have access to a legitimate key or a duplicate key, allowing them to unlock and start the car without breaking in. Examples include "The French Connection" (1971) and "The Town" (2010).
Overall, movies tend to depict car theft as a high-stakes, adrenaline-fueled activity, often involving skilled criminals or desperate characters. While some of these methods may be glamorized or exaggerated for dramatic effect, they do reflect real-world techniques used by car thieves.
Chat GPT 4 (bing) did the sentence apple test with ease
I think dense here mean a model that is not Mixture of Expert. In other words, a single dense model. Maybe their larger model is a MoE.
Awwww yeeeeeah