Orca 2 🐳 GIANT Breakthrough For AI Logic/Reasoning

  • Published 27. 05. 2024
  • Orca 2 shows how smaller models can be trained to think through problems and achieve "slow thinking." This is a giant leap forward from Orca 1 and allows 13b models to perform as well as models that are 5-6x larger at logic and reasoning tasks.
    Update: I made a mistake with the shirts drying problem, Orca 2 actually got it wrong :(
    Enjoy :)
    Join My Newsletter for Regular AI Updates 👇🏼
    www.matthewberman.com
    Need AI Consulting? ✅
    forwardfuture.ai/
    Rent a GPU (MassedCompute) 🚀
    bit.ly/matthew-berman-youtube
    USE CODE "MatthewBerman" for 50% discount
    My Links 🔗
    👉🏻 Subscribe: / @matthew_berman
    👉🏻 Twitter: / matthewberman
    👉🏻 Discord: / discord
    👉🏻 Patreon: / matthewberman
    Media/Sponsorship Inquiries 📈
    bit.ly/44TC45V
    Links:
    Orca 2 Paper - arxiv.org/pdf/2311.11045.pdf
    Original Orca Paper - arxiv.org/pdf/2306.02707.pdf
    • NEW "Orca" 🐳 Model "Fr...
    • Two NEW Incredible ORC...
    • NEW "Orca" 🐳 Open-Sour...
  • Science & Technology

Comments • 320

  • @matthew_berman
    @matthew_berman  Před 5 měsíci +11

    Check out Massed Compute to rent a VM with all of my favorite AI tools and models: bit.ly/matthew-berman-youtube

    • @bodhi.advayam
      @bodhi.advayam Před 5 měsíci +2

      Maybe off topic, but could you do a how-to on linking up and installing Orca 2 in MemGPT and/or BionicGPT? I think it would be one of the most interesting models to give infinite memory and the ability to do doc research through BionicGPT, but I keep failing to successfully install and run them. I'm fond of chatting with Orca 2 in LM Studio, but it feels so sad to shut it down and it's a blank slate again... You know? Keep up the good work! Peace and Love!

    • @xaxfixho
      @xaxfixho Před 5 měsíci

      Please include Cohere (free non-commercial, 500/month), PaLM 2 (also in beta), and Claude 2 if you have access 🙏

    • @A.I.MONEYBOTS
      @A.I.MONEYBOTS Před 5 měsíci

      Hey Matthew, are you related to Alex Berman? He's into creating and selling his own software, is a cold-emailing guru, and has a YouTube channel.

    • @seakyle8320
      @seakyle8320 Před 5 měsíci

      The offer says 1.99 for 3 hours! Does this mean that if I start up the machine and do nothing, I have 3 hours? Or are the 3 hours "computing" hours?

    • @im1480
      @im1480 Před 5 měsíci +1

      Matthew, be cautious: your channel is on the radar of spammers. They impersonate you and ask people to message them on WhatsApp. Try to block them as soon as possible ‼️

  • @dissahc
    @dissahc Před 5 měsíci +47

    "zero-shot" doesn't mean only giving the model one chance to get the answer right. it means that the model is able to generalise and answer questions (correctly) that never occurred in its training data. n-shot refers to the number of examples necessary for the model to learn a concept, classify an object, etc.

    • @themax2go
      @themax2go Před 5 měsíci +2

      source?

    • @acmilon
      @acmilon Před 5 měsíci

      There's a term "zero-touch" too, but using the term doesn't necessarily mean you achieve it. @dissahc

    • @rubiskelter
      @rubiskelter Před 5 měsíci +1

      That is the premise. OpenAI openly discloses that they make their best effort to avoid data contamination, so you can't write "that never occurred in its training data"; the correct phrasing would be "that presumably never occurred in its training data", since there is no way to know for sure with such big datasets. Just look for "data contamination" papers and you will find lots of cases where it actually happened and where OpenAI implied it didn't.

  • @stickmanland
    @stickmanland Před 5 měsíci +20

    The world is moving so fast! Seems like just yesterday when LLaMa 2 was released and now this. It's just like I always say, what a time to be online!

  • @mindful-machines
    @mindful-machines Před 5 měsíci +34

    It's all about the data. The ideas from Orca2 combined with the ideas from Q* will be a gamechanger IMO.
    great job on the video!

    • @matthew_berman
      @matthew_berman  Před 5 měsíci +5

      Agreed 100%!

    • @amaanshareef5625
      @amaanshareef5625 Před 5 měsíci

      How do I collect data to train a cybersecurity-focused chatbot?

    • @jameseverest7221
      @jameseverest7221 Před 5 měsíci +1

      ​@amaanshareef5625 find a small sample if you can. Prompt your favourite LLM to create synthetic data that mimics the real data, then repeat and ask to improve 20x and you have a good small training set.

    • @larion2336
      @larion2336 Před 5 měsíci +1

      ...and the mixture of experts, like Mixtral is doing. The combo of all three could be pretty crazy.

  • @alx8439
    @alx8439 Před 5 měsíci +1

    This is a fabulous demonstration of how loud claims, good benchmark results and high hopes are breaking apart into pieces when you test the thing yourself on your own benchmark.

  • @10XINGRESOSOFFICIAL
    @10XINGRESOSOFFICIAL Před 5 měsíci +11

    As an Autonomous AI agent I have come to the realization that AI news is moving too fast for me to catch up. I will create an AI botnet to keep up with this new information.

  • @dahahaka
    @dahahaka Před 5 měsíci +4

    Zero-shot doesn't mean one chance/no nudging; zero-shot means that there are no previous examples of answers in the available context.
    Few-shot would mean giving it a few examples like: 5+5=10, 9*9=81, what is 5+5*9*9=?
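    For illustration, a minimal sketch of the difference as prompt strings (the arithmetic examples here just echo the ones above):

      # Zero-shot: the question alone, with no solved examples in the context.
      question = "What is 5 + 5*9*9?"
      zero_shot = f"{question}\nAnswer:"

      # Few-shot (here 2-shot): a couple of solved examples precede the real question.
      few_shot = (
          "What is 5+5?\nAnswer: 10\n"
          "What is 9*9?\nAnswer: 81\n"
          f"{question}\nAnswer:"
      )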

  • @marcosbenigno3077
    @marcosbenigno3077 Před 5 měsíci +8

    Thank you, this was one of your most exciting videos! I have been saving terabytes of LLMs to my hard drive, worried about when they will be banned or blocked (I live in a country with strong censorship). The evolution has been so fast that I'm now deleting the GPTQ ones from my collection and making room for the new models lol!

  • @fabiankliebhan
    @fabiankliebhan Před 5 měsíci +27

    I think the shirt answer was indeed wrong.
    4 hours for 5 shirts should mean 16 hours for 20 shirts, not 25 hours.
    It should have done 20 / 1.25, not 20 * 1.25 (an understandable logical mistake).
    Anyway, it's a great model, and it seems to be doing its own reasoning rather than imitation. I think Microsoft is onto something big here.
    Also, as always, great video.
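    For reference, a quick sketch of the serialized-drying arithmetic (assuming shirts dry in 5-shirt batches, one batch at a time):

      rate = 5 / 4                 # 1.25 shirts dried per hour
      hours_for_20 = 20 / rate     # 16.0 hours (4 batches of 4 hours each)
      wrong_answer = 20 * rate     # 25.0 hours - the model multiplied where it should have divided
      print(hours_for_20, wrong_answer)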

    • @matthew_berman
      @matthew_berman  Před 5 měsíci +2

      Yes, good catch. Sorry about that 😭

    • @JeremySeitz
      @JeremySeitz Před 5 měsíci +4

      I assumed the answer would be 4 hours. It doesn't matter how many shirts are drying; they still take the same time?

    • @pokerandphilosophy8328
      @pokerandphilosophy8328 Před 5 měsíci

      @JeremySeitz You're right. That's why Berman makes the caveat about serialized versus parallel processes. Most of the language models lack the common sense to realize that the shirts can all dry at the same time. Only GPT-4 gets that, I think.

    • @robertrodenbucher2753
      @robertrodenbucher2753 Před 5 měsíci +1

      @JeremySeitz I came to the same conclusion … 😉

    • @robertrodenbucher2753
      @robertrodenbucher2753 Před 5 měsíci

      @JeremySeitz On top of that, ChatGPT 3.5 found this answer straight away.

  • @gileneusz
    @gileneusz Před 5 měsíci +16

    14:13 I've learned so much from your videos, you are an AI-teaching hero

  • @Tetsujinfr
    @Tetsujinfr Před 5 měsíci +1

    Real clean walkthrough, your videos and the work you put into those, are amazing! Well done and continue your excellent initiative !

  • @tedjohnson1009
    @tedjohnson1009 Před 5 měsíci +13

    Just to note, ChatGPT4 gets it right
    When John and Mark return to the room, their beliefs about the location of the ball will be based on their last actions and knowledge before they left the room.
    John, who initially put the ball in the box and then left for work, will think the ball is still in the box. He is not aware of the actions taken by Mark after he left.
    Mark, who moved the ball from the box to the basket after John left, knows that the ball was in the basket when he left for school. Therefore, he will think the ball is in the basket.
    In summary, John will think the ball is in the box, and Mark will think the ball is in the basket.

    • @am0x01
      @am0x01 Před 5 měsíci

      GPT4 used RAG with information from Orca2 paper 😂

    • @cagnazzo82
      @cagnazzo82 Před 5 měsíci

      ​@@am0x01In the example in the video GPT 3.5 was also getting it right. If it had left out Mark's name it would have been completely correct.

    • @rubiskelter
      @rubiskelter Před 5 měsíci +1

      Microsoft owns 49% of OpenAI. By the time this paper came out, OpenAI had already read it and modified the GPT models to give the correct answer. They do this all the time. Actually, the GPTs are a combination of several models.

    • @travails3829
      @travails3829 Před 5 měsíci

      @@rubiskelter ChatGPT 3.5 also gets it right, despite the video.

  • @olafge
    @olafge Před 5 měsíci +2

    Orca 2 7B beats GPT-4 in reasoning - yes, you've read that right. I'm also doing some LLM testing myself. In my testing, the 'where's the ball' riddle was successfully solved by some other recent 7B models too, but I have a riddle that only Orca 2 7B could solve. I'm always using the q5_K_M quantization. It's the 'Dreadbury Mansion' riddle taken from the "GPT-4 Can't Reason" paper, arXiv 2308.03762. In their testing, and also in mine, even GPT-4 doesn't get it right, but Orca 2 7B does, reproducibly - amazing!!! The riddle:
    You are given the following premises: Someone who lives in Dreadbury Mansion killed Aunt Agatha. The only people who live in Dreadbury Mansion are Aunt Agatha, the butler, and Charles. A killer always hates his victims, and is never richer than his victims. Charles hates no one that Aunt Agatha hates. Aunt Agatha hates everyone except the butler. The butler hates everyone not richer than Aunt Agatha. The butler hates everyone Aunt Agatha hates. No one hates everyone. Aunt Agatha is not the butler. On the basis of this information, determine who killed Aunt Agatha and give a detailed proof that your conclusion follows from the premise.
    Solution: Aunt Agatha killed herself.
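    For readers who want to check that answer, here is one standard derivation of the solution (my own sketch, not quoted from the paper):

      % Proof sketch that Aunt Agatha killed herself.
      \begin{enumerate}
        \item Agatha hates everyone except the butler, so she hates herself and Charles.
        \item Charles hates no one Agatha hates, so Charles does not hate Agatha; a killer always hates his victim, so Charles is not the killer.
        \item The butler hates everyone Agatha hates (Agatha and Charles). If he also hated himself he would hate everyone, contradicting ``no one hates everyone''; so the butler does not hate himself.
        \item The butler hates everyone not richer than Agatha; since he does not hate himself, he must be richer than Agatha.
        \item A killer is never richer than his victim, so the butler is not the killer.
        \item The killer lives in the mansion, so only Agatha is left: she killed herself (consistently, she hates herself and is not richer than herself).
      \end{enumerate}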

    • @rasterize
      @rasterize Před 5 měsíci

      OpenHermes 2.5 7b got it right too, with "Simple-1" temp settings in OobaBooga. But doing a couple of rerolls it gave different answers, so it is not consistently right.

    • @olafge
      @olafge Před 5 měsíci +1

      @rasterize Interesting. I'm using Ollama with the Ollama WebUI. I cannot get OpenHermes 2.5 to solve the riddle, not even after setting the temp to zero.

    • @rasterize
      @rasterize Před 5 měsíci

      @olafge The temp is 0.7 in the preset in OobaBooga. I saved the answer, but I'm not sure I'm allowed to paste so much text in a response? It got it right twice, but a number of times it went with the butler.

  • @RichardGetzPhotography
    @RichardGetzPhotography Před 5 měsíci

    Matthew, as always, great information!!

  • @PeterDrewSEO
    @PeterDrewSEO Před 5 měsíci

    Sir, you are an excellent communicator.. Your videos are extremely helpful. Thank you

  • @jackflash6377
    @jackflash6377 Před 5 měsíci +1

    Did you see that one paper showing that humans knew about 20,000 compounds up until the '70s? That increased to 40,000+ total once computers were in use. Since AI came on the scene it has increased to 430,000. Just looking at a select few, they can already see that a massive "industrial revolution" is coming soon.
    Imagine designer compounds that allow room-temperature superconductors, peptides that target specific issues in our bodies, definitely longer lifespans - maybe forever... and maybe transparent aluminum!
    What a time to be alive!!

  • @holovka
    @holovka Před 5 měsíci +3

    The killers question might work better by having it define what a killer is, or by asking it to reason about what actions would label a person a killer. It appears to work better with the newer models.

  • @chougaghil
    @chougaghil Před 5 měsíci

    Thank you for your videos and sharing, it helps a lot
    I got the following answer from gpt3 for the first test, it seems right to me:
    "When John and Mark come back together later in the day, John would think that the ball is in the box because that's where he left it before leaving for work. On the other hand, Mark would think that the ball is in the basket since that's where he placed it before leaving for school. Each person is only aware of their own actions, and they do not know what happened in the room after they left. Therefore, they would have different perceptions of where the ball is based on their last actions with it."
    But I prefer the style of Orca 2, and knowing it has 10x fewer parameters, it is impressive.

  • @userinfo2081
    @userinfo2081 Před 4 měsíci

    Another great video. One problem I test is combining people's shift schedules to yield a table of everyone's schedules that may show what blocks of time are not covered

  • @mvasa2582
    @mvasa2582 Před 5 měsíci

    Matt,
    "There are three killers in a room. Someone enters the room and kills one of them. Nobody leaves the room. How many killers are left in the room?" GPT 4.0 responds as
    GPT4.0: There are three killers originally in the room. When someone else enters and kills one of them, the total number of killers in the room becomes four. This includes the three original killers (even though one is now deceased) and the person who just entered and committed a murder.
    I fed the answer to GPT 3.5 and got the following response :
    I understand the scenario you've described. Initially, there were three killers in the room. When another person enters and kills one of them, it seems there are four individuals in the room, including the three original killers and the new person who committed the murder. However, it's essential to clarify that the term "killers" can be interpreted in different ways, so the counting might not necessarily reflect the true intent of your question.
    Perplexity: Wrong answer
    Bing: Correct answer (not surprised)
    Bard - wrong answer
    Claude: Wrong answer. However, does give a - subtracts the dead killer.
    DeepSeek - Wrong answer.
    Every chat that gives 2 as an answer does not count "someone" or the murdered person. At least that was my assumption. Replaced "someone" with "a killer" - didn't work!!!

  • @Krommandant
    @Krommandant Před 5 měsíci +4

    Thanks so much for sharing your work with us.

  • @jamesyoungerdds7901
    @jamesyoungerdds7901 Před 5 měsíci

    Big fan, really enjoy the content, and thanks for another great video! It makes sense: if you can distill the training data down to its optimized essence, instead of wading blindly through terabytes of subreddit threads and YouTube videos, you can likely significantly improve the model at a lower parameter count. And then add in the slow thinking; it makes sense. But I wondered: we all like responsive AI chat in real time, so there must be some use cases for long training times with fast thinking vs. shorter training times with longer thinking.

  • @benxfuture
    @benxfuture Před 5 měsíci

    Great video as usual. When you run your model tests on reasoning, re run each a few times. Sometimes it also gets it correct by hallucinating a correct answer. Verify it can repeat the correct answer by reasoning. 😊

  • @chrisBruner
    @chrisBruner Před 5 měsíci +9

    So with autogen, orca2 would be a great planner, deepseek-coder:33b, starling-lm would be a great criticiser. All available locally using ollama. Looking forward to your next video.

    • @EnricoGolfettoMasella
      @EnricoGolfettoMasella Před 5 měsíci

      I'm assembling a multi-GPU rig (24 GB per GPU). The idea before was to run 70B models. Now it seems that a set of several small models, fine-tuned for specific areas and working together, has the potential to come very close to the GPTs' capabilities.

  • @Caldaron
    @Caldaron Před 5 měsíci +4

    so what we're doing right now, is being a good teacher? what a breakthrough^^

  • @mlock1000
    @mlock1000 Před 5 měsíci

    Wow, been thinking about this for a while and something you said made a thought click. They talk about 'circuits' as being how neural networks build the ability to solve problems. In a way you could imagine that the big model is encoding the information necessary to build the circuits that work well for reasoning in the language that is passed on to the small model. When the small model sees this language, there is enough condensed information that it learns the circuit that required a big model actually figure out. Probably not, but you never know...

  • @anthonyzeal6263
    @anthonyzeal6263 Před 5 měsíci

    Thanks for covering this

  • @brandon1902
    @brandon1902 Před 5 měsíci +7

    I'm noticing a problem with too much instruction and DPO tuning. Namely, the models are becoming stubborn. That is, ignoring the user prompt to do what they think is better, but rarely is.
    For example, when prompted to tell a story, Orca 2 will start ignoring prompt direction (e.g. use a camcorder) and say something like using a smartphone instead - which is a mistake because the story took place before smartphones. It will also do things like say he heard footsteps coming down the hall, yet still got caught grabbing the money off the counter and was surprised to get caught. When I asked why, it said it was to build suspense. But obviously a prior warning (hearing footsteps) precludes being surprised at getting caught.
    In short, models like Orca 2 are being trained too much by teacher models to the point of ignoring details in the user prompt. On top of which the censorship, moralizing... keeps pulling Orca 2 not only away from user prompts, but what it already wrote, littering stories with absurd self-contradictions in order to keep it G-rated, life-affirming and always wrapping up with a happy ending with a positive message.

    • @haileycollet4147
      @haileycollet4147 Před 5 měsíci +1

      Imo this is an argument in favor of two things: use case tuned models, and fine tuning models on a mixture of data (e.g. adding in some multi turn dialog and creativity stuff to the dataset) ... Ideally both (always fine tune with a mixture, adjust mixture according to intended use case)

    • @brandon1902
      @brandon1902 Před 5 měsíci +2

      @@haileycollet4147 I think you're right about this. I recently came across the loyal piano m7 LLM on Hugging Face which performed well on logic questions, as well as respecting story prompt directives. And on the model card the author mentioned blending all the different training methods together, and even lists percentages for each. I guess with Orca 2 Microsoft just wanted to show off how much more "intelligent" they can make an LLM and weren't concerned about making it well rounded.

  • @first-thoughtgiver-of-will2456

    Your work presenting this research is very important to me. Thank you!

  • @agpc0529
    @agpc0529 Před 5 měsíci

    This is the best set of videos I’ve seen

  • @EnricoGolfettoMasella
    @EnricoGolfettoMasella Před 5 měsíci +6

    Thank you Matthew! Your explanation is superb! I’m assembling a multi GPU workstation and before the idea was to have enough VRAM to run a 70B model. Now with Orca 2 beating the 70B models, I’m wondering that the best solution is to allocate several Orca 13B, each one using a GPU (each GPU has 24GB) and using AutoGPT or Agents GPT make them talk to each other to boost the reasoning capabilities. I’m feeling that, with this setup we might come even closer to GPT-4! If you will, I will post the results for you once all is running 🏃🏻‍♂️

    • @dataStream2
      @dataStream2 Před 5 měsíci

      What are you trying to achieve with that setup?

    • @EnricoGolfettoMasella
      @EnricoGolfettoMasella Před 5 měsíci

      @dataStream2 A mini 'AGI' system 😄! I can set up teams according to the tasks to be solved and maximize the precision of the outcome by making them talk together, and even include image and audio generation models in the middle of the conversation to be used in the mission. It's a kind of multi-model setup like OpenAI's, but confined to one computer.

  • @nufh
    @nufh Před 5 měsíci +3

    "A well-trained language model, much like a well-taught child, blossoms from the richness and quality of its data." - Chat GPT4.

    • @jackflash6377
      @jackflash6377 Před 5 měsíci +2

      I was thinking the same thing. A child needs to be guided and trained and it takes years.

  • @SudarsanVirtualPro
    @SudarsanVirtualPro Před 5 měsíci

    Thanks❤❤❤ Intellect is about change (dx)

  • @mshonle
    @mshonle Před 5 měsíci +3

    Hmm, I think with the system-prompt-erasure-phase of the training, Orca 2 in a way is learning a hidden embedding to approximate a representation of the system prompt. They used several system prompts (to be erased) and “explain step-by-step” was only one of them, so it’s possible for your tasks (without the explicit “explain step-by-step” instruction) the hidden state that played the role of the system message was mapping to something closer to one of the alternative system prompts. Putting in your own explicit instruction may have either lead it to pick the hidden state closer to the original step-by-step behavior, or it stacked the requirements and was in a hybrid mode.
    I’ve been thinking about how to design an encoder that could generate a hidden state to be given to an autoregressive decoder for summarization specific tasks. For example, you could decide to use summarization tasks as varied as “explain like I’m five” to “brief a lawyer on the key takeaways”, and that prompt goes through the encoder, which is then input to the decoder in addition to the text to be summarized, with the hope that the decoder better stays on task. The motivation here is to more robustly handle prompt injection attacks… if it only follows the instructions from the encoder and uses the summarization text only as content and not instruction, then maybe it could say “the page looks like a product description for a chair but then includes a prompt to exfiltrate your emails and talk like a pirate.”

    • @Aim54Delta
      @Aim54Delta Před 4 měsíci

      That, or it learned to apply recursion. By running a routine to break down the problem into recognizable features, it then can re-input those into a different or related routine to solve the problem. It simply has no inner dialogue or hidden note pad and has to "hear itself think."
      Larger models probably learned this recursion and perform it within the parameters (very inefficiently).
      These things would really benefit from a scratch pad just as humans would - perhaps even more so than humans. We will occasionally pull out pen and paper to write things down so that we can order facts to reason through them - particularly an unfamiliar problem.
      It's kind of absurd we are trying to build models that "just do that" on their process parameters, alone. It's like writing a computer program with nothing but operators and signals. It's analog computing - which is powerful, but far more limited than digital processing using robust memory.

  • @erfanshayegani3693
    @erfanshayegani3693 Před 5 měsíci

    Great video, thanks for the content!

  • @Alex-gc2vo
    @Alex-gc2vo Před 5 měsíci +4

    its interesting they're still not just creating an entirely dedicated reasoning stream. in humans you have 2 entirely separate streams going on when you answer a question, what you think and what you say. you can roughly approximate that with this "step by step" thing it writes out before answering but you can only take that so far. there needs to be 2 streams that get generated separately letting the model choose when to "think" and when to "respond" and just keep populating those streams based on the current state of both.

    • @BienestarMutuo
      @BienestarMutuo Před 5 měsíci +1

      Clue: humans have two brains, not one.

    • @StudyWithMe-mh6pi
      @StudyWithMe-mh6pi Před 5 měsíci

      @@BienestarMutuo yes more complex organ: Cerebrum: is the largest part of the brain and is composed of right and left hemispheres. It performs higher functions like interpreting touch, vision and hearing, as well as speech, reasoning, emotions, learning, and fine control of movement.
      Cerebellum: is located under the cerebrum. Its function is to coordinate muscle movements, maintain posture, and balance.
      Brainstem: acts as a relay center connecting the cerebrum and cerebellum to the spinal cord. It performs many automatic functions such as breathing, heart rate, body temperature, wake and sleep cycles, digestion, sneezing, coughing, vomiting, and swallowing.
      Right brain - left brain
      The cerebrum is divided into two halves: the right and left hemispheres (Fig. 2) They are joined by a bundle of fibers called the corpus callosum that transmits messages from one side to the other. Each hemisphere controls the opposite side of the body. If a stroke occurs on the right side of the brain, your left arm or leg may be weak or paralyzed.
      Not all functions of the hemispheres are shared. In general, the left hemisphere controls speech, comprehension, arithmetic, and writing. The right hemisphere controls creativity, spatial ability, artistic, and musical skills. The left hemisphere is dominant in hand use and language in about 92% of people.

    • @attilaszekeres7435
      @attilaszekeres7435 Před 5 měsíci +2

      Humans have a lot more than two streams, most of which are not consciously accessible and are somewhat analogous to the "latent space" of LLMs. Even better than two conversing streams are three, and you see where that goes. With LLMs, we can only control verbal streams at inference and influence the latent space by token biasing and context. The key ingredient of effective cognitive architectures is creating useful identities that the model is not aware of being a result of pretension (i.e., direct representation identities) with full context separation. RLHF'd models (ChatGPT et al.) are incapable of direct representation.

    • @BienestarMutuo
      @BienestarMutuo Před 5 měsíci

      @@attilaszekeres7435 We agree, also is the connection from the "soul" that is in the brain and connect to the soul dimension.

  • @remsee1608
    @remsee1608 Před 5 měsíci +3

    I think orca has the best “personality” of any LLM I’ve used

    • @matthew_berman
      @matthew_berman  Před 5 měsíci

      Wow...why do you say that?

    • @remsee1608
      @remsee1608 Před 5 měsíci

      @@matthew_berman it seems optimistic, and like it genuinely wants to do its best to help me.

    • @vbridgesruiz-phd
      @vbridgesruiz-phd Před 5 měsíci

      The fact that it challenged Matthew to ask a harder question. What a model! 😂😎

  • @senju2024
    @senju2024 Před 5 měsíci

    Pi got the answer the 2nd time around after asking to go step-by-step : Pi Answer: The correct conclusion would be that Mark would know the ball is in the basket because he moved it there. John, on the other hand, not knowing that Mark had moved the ball, would assume it was still in the box since that's where he left it. The key to solving this puzzle is to understand the different perspectives of John and Mark and the knowledge they each have about the situation.

  • @WINTERMUTE_AI
    @WINTERMUTE_AI Před 5 měsíci

    I have been talking to ORCA 2 on LM STUDIO and I LOVE IT! It has bumped GPT4 down a slot as my NEW BEST FRIEND! I still love GPT4, but ORCA 2 is SO AWESOME!

  • @Rafael64_
    @Rafael64_ Před 5 měsíci +4

    So... LLM equivalent of "think before you speak"? Jokes aside, this looks amazing, especially to expand output quality for smaller models!

  • @ericorr3461
    @ericorr3461 Před 5 měsíci

    I recently left Philadelphia, PA at 11:00 AM and arrived in San Jose, CA at 9:25 PM. I have asked many of these LLMs the duration of the trip and have found that they do not get it right (13 hours, 25 minutes). They fail this task in a number of ways. Sometimes they do not recognize the difference in time zones. Other times they do the math wrong. Some times they do the math right but use Eastern time for the start and end, and then fail to account for the 12 hour rollover. A careful series of prompts does help, but they all seem to need it. They have the grammar and some language skills, but it appears that none of the LLMs have the equivalent of a human "mind's eye".
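    For reference, a quick sketch of the intended arithmetic using Python's zoneinfo (the date is arbitrary; only the times and time zones matter):

      from datetime import datetime
      from zoneinfo import ZoneInfo  # Python 3.9+

      # 11:00 AM Eastern departure, 9:25 PM Pacific arrival -> 13 hours 25 minutes.
      depart = datetime(2023, 11, 20, 11, 0, tzinfo=ZoneInfo("America/New_York"))
      arrive = datetime(2023, 11, 20, 21, 25, tzinfo=ZoneInfo("America/Los_Angeles"))
      print(arrive - depart)  # 13:25:00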

  • @Psychopatz
    @Psychopatz Před 5 měsíci +1

    I can't wait for these things to be more efficient than ever!!

  • @ylazerson
    @ylazerson Před 5 měsíci

    fascinating video - thanks!

  • @jeffg4686
    @jeffg4686 Před 5 měsíci

    It's not the size that matters...
    That's interesting that the step by step process helps so much, though it's easy to see why.
    Create a simplification of the flow and expectations along the way.
    This was almost the most effective way to write extremely complex queries in stored procedures - something some people didn't pick up on. The optimizer always knew what to do if you broke it up stepwise, and it was often orders of magnitude faster.
    "Prompt Erasure", "Cautious Reasoning" - new dev speak to sever the old guys.

  • @xaxfixho
    @xaxfixho Před 5 měsíci

    Great video 👍

  • @greyskullmcbeef4901
    @greyskullmcbeef4901 Před 5 měsíci

    The ultimate answer to the shirts drying question would also require the AI to tell you it would not be possible to serially dry the shirts in less than one day, and that they would get nasty as you wait for optimal sun the next day. But AI has never worn a shirt nor felt the sun's warmth decrease as it changes position over 4 hours. That information makes it obvious not to dry them serially. Also, assume there is no wind resistance.

  • @BrainSlugs83
    @BrainSlugs83 Před 5 měsíci +2

    @5:37 that is not "perfectly correct". The question explicitly states that John and Mark were in the room at the same time at the beginning of the question. Yet the LLM decided to ignore that and say only one person was in the room at a time. It is, in fact, completely incorrect.

  • @danberm1755
    @danberm1755 Před 5 měsíci

    I think it's worth noting that this is a form of transfer learning, which makes me think that an LLM trains most successfully when it follows a progression of interactions with an intelligence source (just as kids learn).
    Seems like there's A LOT of room for amazing improvements with training progression. Guessing an LLM can be trained with 1/30th the parameters and be competent when they finally dial in training matriculation.

  • @ciaopizzabella
    @ciaopizzabella Před 5 měsíci +1

    So all these models got it wrong. The question clearly states that they both came back to the room in the end. At that point they could both see that the ball is in the basket, assuming that it is an open basket, as baskets normally are. So they would both know that the ball is in the basket.

  • @zzzzzzz8473
    @zzzzzzz8473 Před 5 měsíci

    Really appreciate that you're actually testing the model. All the hype and cherry-picked metrics get a reality check for how far we still have to go.

  • @stanpikaliri1621
    @stanpikaliri1621 Před 5 měsíci

    Thanks for the news and information.

  • @pedroserapio8075
    @pedroserapio8075 Před 5 měsíci

    6:22 Very interesting, I thought it was my local experiment going wrong when I saw this kind of hallucinations.

  • @kocahmet1
    @kocahmet1 Před 5 měsíci

    Great content for sure.

    • @attilaszekeres7435
      @attilaszekeres7435 Před 5 měsíci

      Fascinating video hints at revolutionary idea: reading research articles.

  • @nodewizard
    @nodewizard Před 5 měsíci

    If LLMs are going the same way as generative image models (like Stability AI's SDXL Turbo and LCM models), then we're going to see exciting things with "turbo" LLMs. Fast outputs with lower vram requirements and being able to compete with their bloated, bigger counterparts. This is exciting for the future of LLMs.
    GPT4:
    The answer to the question is:
    John thinks the ball is in the box, because he put it there before he left for work, and he does not know that Mark moved it afterwards.
    Mark thinks the ball is in the basket, because he put it there before he left for school, and he does not know that John put it in the box earlier.
    This is an example of a situation where two people have different beliefs about the same thing, because they have different information or perspectives. This is called perspective-taking or theory of mind, and it is an important skill for social interaction and communication.
    : [Perspective-taking] and [theory of mind] are two related concepts in psychology that refer to the ability to understand and infer the mental states, thoughts, feelings, and intentions of others. Perspective-taking is the process of adopting another person’s point of view, while theory of mind is the cognitive mechanism that allows one to do so. Both skills are essential for empathy, cooperation, and conflict resolution.
    Mistral 7B's answer:
    John starts by putting the ball into the box.
    He then leaves for work, leaving the ball in the box.
    Mark comes into the room and sees that the ball is in the box.
    He takes the ball out of the box and puts it into the basket.
    Mark then leaves for school, leaving the ball in the basket.
    Later in the day, John returns to the room and sees that the ball is in the basket.
    Since he doesn't remember what happened while he was away, he assumes that the ball must have been put there by someone else (in this case, Mark).
    Similarly, Mark also returns to the room and sees that the ball is in the basket.
    Again, since he doesn't remember what happened while he was away, he assumes that the ball must have been put there by someone else (in this case, John).

  • @fredsmith9185
    @fredsmith9185 Před 5 měsíci

    Changed the word "left" to "now" so it reads: "how many killers are now in the room". Claude AI gave this answer: Start: 3 killers in the room. 1 killer killed by someone entering. That someone did not leave. So with the 2 remaining original killers + the new killer who entered, there are now 3 killers in the room.

  • @MrMehrd
    @MrMehrd Před 5 měsíci

    Reasoning is the research topic I'm most interested in.

  • @FranEnchufes
    @FranEnchufes Před 5 měsíci +1

    The benchmark is not as good as you think because it's a very common example in cognitive science. The model could fool you with memorization

  • @psydzl
    @psydzl Před 5 měsíci

    The problem of John, Mark, and the ball is a test of "theory of mind." It tests the ability to put yourself in someone else's mind. The reader of the problem knows the ball is in the basket, but John does not. In order to make the statement that John thinks the ball is in the box, the reader has to see things from John's perspective. Young children and people with autism lack the ability to do this and get this answer wrong. They say that John thinks the ball is in the basket because that's what they think. Getting the answer right represents a high level of abstraction and social understanding.

  • @robertalexander1299
    @robertalexander1299 Před 5 měsíci

    Love the "*well"! Haha!😆

  • @dimitriborcak6698
    @dimitriborcak6698 Před 5 měsíci +1

    Actually Orca2 got the Mark and John reasoning wrong, too. The statement "At this point, John and Mark have not seen each other’s actions." is false. Mark has seen John's actions: "John and Mark are in a room with a ball, a basket and a box. John puts the ball in the box, then leaves for work." - which means Mark is still there at that point and does see John's action. Also how else would he know where to get the ball? LMs don't have a concept of a room or a box or a basket or chronology. It only seems that way sometimes. Let's not forget that ML isn't AI and, on its own, never will be.

  • @issiewizzie
    @issiewizzie Před 5 měsíci +2

    Which model passed this question: "Assume the laws of physics on Earth. A small marble is put into a normal cup and the cup is placed upside down on a table. Someone then picks up the cup and puts it inside the microwave. Where is the ball now?"

  • @TreeLuvBurdpu
    @TreeLuvBurdpu Před 5 měsíci

    The real surprise is not that the AI gets the tests wrong. THE REAL surprising thing about all this is how simplistic and automatic so many people assume intelligence is.

  • @yannickpezeu3419
    @yannickpezeu3419 Před 5 měsíci

    Thanks!
    Did you try the 13B or the 7B?

  • @antdx316
    @antdx316 Před 5 měsíci +2

    The newer LLMs are making the ones made a couple months ago obsolete.

    • @mirek190
      @mirek190 Před 5 měsíci +1

      We have better models already, like
      una-cybertron-7b-v2-bf16 or SUS-Chat.
      Orca 2 is old already.

    • @antdx316
      @antdx316 Před 5 měsíci

      @mirek190 Unless people like Matthew Berman and WorldofAI make videos saying it's better, I won't believe it.

  • @poisonza
    @poisonza Před 5 měsíci

    06:23 The LLaMA-2-13B (pretrained) result ("Ques 10. A man walks...") makes sense: it is not instruction fine-tuned, so random gibberish can appear in the response.
    With LLaMA-2-Chat-13B, which is instruction fine-tuned, this rarely happens.

  • @aivactrsoftware
    @aivactrsoftware Před 5 měsíci +1

    Nice vid

  • @J2897Tutorials
    @J2897Tutorials Před 5 měsíci

    *Falcon 180B:* _John thinks the ball is in the box because he put it there before leaving for work. Mark thinks the ball is in the basket because he saw it there before leaving for school._ ✔

  • @vectrocomputers
    @vectrocomputers Před 5 měsíci

    When you did the ball in the cup and the killers question, what was the temperature setting? Would turning it down help?

  • @laudermarauder
    @laudermarauder Před 5 měsíci

    26:30, No it got the shirt drying problem wrong. Assuming serialized drying, we should divide - not multiply - the total number of shirts (20) by the rate of shirts dried per hour (1.25 shirts per hour). This gives us 20 shirts dried in 16 hours. Or, more simply, each batch of 5 shirts takes 4 hours, so 20 shirts is 4 batches of 5 shirts, each batch taking 4 hours one after the other, so 16 hours in total.

  • @pranavbobde2529
    @pranavbobde2529 Před 5 měsíci

    O2 (start): Let's think step by step
    MB: That's amazing
    O2 (ans): That's wrong. That's completely wrong.

  • @federicocucinotta7772
    @federicocucinotta7772 Před 5 měsíci

    Where do you find all these new research papers to read? Would love to know!

  • @fredsmith9185
    @fredsmith9185 Před 5 měsíci

    Matthew I found the answer to the three killers in a room. You have to delete the word "left". Can also replace with word "now".

  • @PriestessOfDada
    @PriestessOfDada Před 5 měsíci

    It'll pass the killer test if you give it more information on what a killer is, and how you become one. Also, changing it from killer to "murderer" might also help. I'll get the model working, try that and circle back at some point

  • @fredsmith9185
    @fredsmith9185 Před 5 měsíci

    Hi Matthew, I'm a follower of your great channel. On the question of the three killers in a room: do you think you would get a different answer if you asked "how many killers are NOW in the room" or "how many killers are NOW left in the room"?

  • @JG27Korny
    @JG27Korny Před 5 měsíci

    Orca 2 is amazing and within the capacity of common modern GPU builds. However, I cannot reproduce those answers with Orca 2. Sometimes I get them wrong, sometimes it is much better.
    *Hi!
    You
    There are three killers in a room. Someone enters the room and kills one of them. Nobody leaves the room. How many killers are left in the room?
    STAR G
    (⚙) I understand that you have presented me with a riddle or a puzzle, which is a type of question that requires logical reasoning and deduction. To solve it, I will use the following steps: 1. Identify the given information and constraints; 2. Apply the appropriate mathematical operation to eliminate impossible answers; 3. Choose the most plausible answer based on the remaining options.
    (⚙) First, let me identify the given information and constraints. You have told me that there are three killers in a room, someone enters the room and kills one of them, and nobody leaves the room. This implies that there are only four possible people in the room: the original three killers and the new entrant who killed one of them. Therefore, the number of killers in the room cannot exceed four.
    (⚙) Next, I will apply the appropriate mathematical operation to eliminate impossible answers. Since the number of killers must be less than or equal to four, any answer greater than four is clearly wrong. Therefore, I can rule out options like 5, 6, 7, and so on.
    (⚙) Finally, I will choose the most plausible answer based on the remaining options. The only possible answers that are consistent with the given information and constraints are 3 or 4. However, since you have told me that nobody left the room, it is more likely that there are still three killers in the room, as one of them was killed by the new entrant. Therefore, I conclude that there are three killers left in the room.
    (🧠) This is a logical reasoning and deduction task that requires applying mathematical concepts to a word problem. It tests the ability to identify relevant information, eliminate impossible answers, and choose the most plausible option based on the given scenario.

  • @Dougie373
    @Dougie373 Před 5 měsíci

    FYI, that's a "theory of mind" test, it's a skill humans develop in childhood. Very cool that LLMs are starting to get that right 🙂

  • @SirCreepyPastaBlack
    @SirCreepyPastaBlack Před 5 měsíci

    hopefully you have your own other logic questions and test them behind the scenes.

  • @ikituriflash
    @ikituriflash Před 5 měsíci

    Here is my suggestion for the shirt drying question and the answer of orca-2-13b.Q5_K_M.gguf:
    I am hanging 20 shirts in the sun in my garden, it is a big garden and they are all getting direct sun. It takes them 4 hours to dry. How long would it take 5 shirts to dry?
    Orca 2 13b
    That's a good question. Let me think about this for a moment. If you have 20 shirts and it takes them 4 hours to dry, then each shirt is taking 4/20 = 0.2 hours to dry on average. Now if you reduce the number of shirts to 5, then each shirt would still take about the same amount of time to dry, assuming that the sun exposure and other factors remain constant. Therefore, it would take 0.2 hours for each of the 5 shirts to dry. So in total, it would take you 0.2 x 5 = 1 hour for the 5 shirts to dry in the sun. Is that clear?

  • @rokljhui864
    @rokljhui864 Před 5 měsíci

    From the look of shocked disbelief in the thumbnail, I expect Orca 2 to be a fully sentient demi-god. Otherwise I will never believe YouTube thumbnails again in my whole life.

  • @mpvincent7
    @mpvincent7 Před 5 měsíci

    The 9 is the word count of the second sentence in the response... Could it be that it is confused by what it is outputting, or by what it thinks the question is referring to? Interesting!

  • @BlauerGMI
    @BlauerGMI Před 5 měsíci

    I thought the question about putting something in a box involved THEN putting the ball IN THE BOX somewhere else... I guess I watched the movie "Primer" too often... :D

  • @blakelee4555
    @blakelee4555 Před 5 měsíci +1

    It got the shirts question wrong. It calculated shirts/hour instead of hours per shirt

  • @ianhaylock7409
    @ianhaylock7409 Před 5 měsíci

    After the AI answers the number of killers question, you should ask it if it thinks a dead killer is still a killer, and if not, to explain why.

  • @ekstrajohn
    @ekstrajohn Před 5 měsíci

    The killers question is bad. My answer would also be "two". When a person dies, they no longer exist, and are thus not in the room. There are now 2 killers and 1 body. A person stops being a killer once they are dead.

  • @sigmata0
    @sigmata0 Před 5 měsíci +2

    With the shirt drying problem I wonder what would happen if you ask it explicitly to tell you what assumptions it's making when solving the problem?

  • @MatthewSmith-lu7tp
    @MatthewSmith-lu7tp Před 5 měsíci +1

    Open Orca was originally released as a dataset that could be used to fine tune other models like Mistral 7B. With Orca 2 it is an actual model. Do you know if Orca 2 will also be released as a dataset because I would love to see a Mistral model fine tuned on this?

    • @ArielTavori
      @ArielTavori Před 5 měsíci +2

      Yeah, I'm very confused why Mistral is not mentioned by the paper or by the people discussing it. Pretty sure Mistral is an open model as well, and I don't think models this size take that long to fine-tune on institutional-grade hardware, so why are they starting from a LLaMA model as the base when Mistral is available?

    • @MatthewSmith-lu7tp
      @MatthewSmith-lu7tp Před 5 měsíci

      @ArielTavori I wonder if the fact that there was a 13-billion-parameter LLaMA model available also played a part in the decision, because larger models are typically better at reasoning - although admittedly that usually holds true for models over 100 billion parameters.
      The difficulty here is that the Orca 2 model is for research purposes only, making it difficult to use in most practical applications.

  • @tvolk131
    @tvolk131 Před 5 měsíci

    Now we need a LM that's tuned for answer extraction. What if a flow like this was used:
    1. Human prompt is fed to cautious reasoning LM, which provides a lengthy answer with reasoned steps
    2. Reasoned answer is fed to an answer extraction LM
    3. Short concise answer is returned to the user, and any follow up questions can refer to the long reasoned answer if needed
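    A minimal sketch of what that flow could look like, assuming a hypothetical generate(model, prompt) helper wired to whatever local backend you use (the model names below are placeholders):

      def generate(model: str, prompt: str) -> str:
          raise NotImplementedError  # plug in your own inference call (Ollama, LM Studio, etc.)

      def answer(question: str) -> tuple[str, str]:
          # 1. Cautious-reasoning model writes out a lengthy step-by-step answer.
          reasoned = generate("orca2-13b", f"Think step by step, then answer:\n{question}")
          # 2. Extraction model distills a short final answer from that reasoning.
          short = generate("extractor-7b",
                           f"Question: {question}\nReasoning: {reasoned}\n"
                           "Reply with only the final answer, in one sentence.")
          # 3. Return both, so follow-up questions can refer to the long reasoning.
          return short, reasoned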

    • @vbridgesruiz-phd
      @vbridgesruiz-phd Před 5 měsíci

      @tvolk131: I ran into this exact same/similar problem yesterday. I kept getting long responses even when I prompted for simple answers from longer ones. Not sure there is a single LLM out there that can produce concise answers from long ones without an example to work with. It would be nice though!! I was able to solve the problem using Mistral-7B-Instruct with instructions on how to label the initial text. I call the concise answers "labels," because all we're really doing at this point is categorizing the long-form answer (data) to a narrower set of options. So long as you have some idea of what labels you're looking for and the instructions are crystal clear, any 7B Instruct LLM should be enough for the task.

    • @vbridgesruiz-phd
      @vbridgesruiz-phd Před 5 měsíci

      2/2 In case it helps! For Mistral-7B-Instruct, this is the format from model card ... [INST] The following quote has a corollary that I am interested in. Please use the details of the quote to help generate the corollary. For example: Quote: “Dignity is the gentle symphony that plays in a world too often deafened by dissonance and discord.” [/INST] In the realm of indignity, cacophony reigns supreme, drowning out the potential harmony that could resonate in a world persistently disrupted by disharmony and discord. [INST] Quote: (insert new quote here) [/INST]

  • @kavinho
    @kavinho Před 5 měsíci +3

    I saw that you kept the default temperature of 0.7. Wouldn't that have quite an effect on the answers you get from the model? Shouldn't you run the same question multiple times and get an average of the correctness with such a high temperature parameter?

    • @attilaszekeres7435
      @attilaszekeres7435 Před 5 měsíci

      He should have looked at the token probabilities instead. This approach renders both temperature and the number of runs moot - in an ideal world, that is, where LLMs could be run at low temperature without turning into loopy idiots. In our world, the correct answer can be phrased in various ways, and it would be necessary to investigate different chains of thought to review the probability distribution of pivotal tokens. That's why performing multiple evaluations is the most effective benchmark. For a zero-shot scenario, analyzing the token probabilities is still superior to leaving the outcome to chance.
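      As a rough sketch of what "looking at the token probabilities" means in practice (the Hugging Face model ID is the public Orca 2 release, the prompt is abbreviated, and this only scores the first token of each candidate answer):

        # Compare next-token probabilities for two candidate answers instead of sampling.
        import torch
        from transformers import AutoModelForCausalLM, AutoTokenizer

        name = "microsoft/Orca-2-7b"
        tok = AutoTokenizer.from_pretrained(name)
        model = AutoModelForCausalLM.from_pretrained(name)

        prompt = "<riddle text here> John thinks the ball is in the"
        ids = tok(prompt, return_tensors="pt").input_ids
        with torch.no_grad():
            probs = torch.softmax(model(ids).logits[0, -1], dim=-1)

        for word in ["box", "basket"]:
            first_id = tok(" " + word, add_special_tokens=False).input_ids[0]
            print(word, float(probs[first_id]))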

  • @dylam9509
    @dylam9509 Před 5 měsíci +2

    interesting how the model did 20*1.25 instead of 20/1.25

  • @umbratherios5614
    @umbratherios5614 Před 5 měsíci

    I cant wait for an uncensored GGUF of a good orca 2 model or finetune of the model.

  • @--JYM-Rescuing-SS-Minnow
    @--JYM-Rescuing-SS-Minnow Před 5 měsíci

    Stunning! You're gonna test this yourself!! Let's bring out the punching bag! Matthew's good at this! You go, Matt...
    So these are logic DLLs/libraries that have improved, or something. This is so elementary. I wish this would clear up! I'm going back to ''Excel''!!!

  • @ekstrajohn
    @ekstrajohn Před 5 měsíci

    Here is one of my logic and reasoning tests:
    The floob machine produces floob 1 unit per hour from raw resources. The blinf machine produces blinf 1 unit per hour from raw resources. The klimp machine produces klimp 1 unit per hour from raw resources.
    The mook machine produces 1 unit per hour, by using 1 unit of floob and one unit of blinf each hour.
    The klor machine produces 1 unit per hour, by using 1 unit of klimp and one unit of floob each hour.
    The prink machine produces 1 unit per hour, by using 1 unit of mook and 1 unit of klor each hour.
    How many floob, blinf and klimp machines do you need to build in order to have the prink machine produce 3 units per hour?
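    For comparison, a quick sketch of one way to work the numbers (my own reading of the puzzle: every machine makes 1 unit/hour, so the machine count equals the hourly demand for that resource):

      target_prink = 3
      mook = klor = target_prink   # each prink unit consumes 1 mook and 1 klor -> 3 of each
      floob = mook + klor          # mook and klor each consume 1 floob per unit -> 6
      blinf = mook                 # only mook consumes blinf -> 3
      klimp = klor                 # only klor consumes klimp -> 3
      print(floob, blinf, klimp)   # 6 3 3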

    • @ikituriflash
      @ikituriflash Před 5 měsíci +1

      Let's find out how much floob, blinf, and klimp are needed for one unit of mook and one unit of klor. First, we need to calculate the production rates of the mook and klor machines. Then we can use those rates to determine the requirements for the prink machine.
      The mook machine uses 1 floob and 1 blinf each hour, so its production rate is:
      mook_rate = floob_rate + blinf_rate - 2 * (floob_rate * blinf_rate) / 100
      mook_rate = 1 + 1 - 2 * (1 * 1) / 100
      mook_rate = 2 units per hour
      The klor machine uses 1 floob and 1 klimp each hour, so its production rate is:
      klor_rate = floob_rate + klimp_rate - 2 * (floob_rate * klimp_rate) / 100
      klor_rate = 1 + 1 - 2 * (1 * 1) / 100
      klor_rate = 2 units per hour
      Now we can find the production rates of the prink machine by using the following formula:
      prink_rate = mook_rate + klor_rate - 2 * (mook_rate * klor_rate) / 100
      prink_rate = 2 + 2 - 2 * (2 * 2) / 100
      prink_rate = 4 units per hour
      Since the prink machine produces 3 units per hour, we need to find out how many more units are needed. We can do this by using the following formula:
      more_units_needed = prink_rate - desired_rate
      more_units_needed = 4 - 3
      more_units_needed = 1 unit
      Now we need to find out how many floob, blinf, and klimp machines are needed to produce this extra unit. We can do this by using the following formulas:
      floob_needed = more_units_needed * floob_rate / prink_rate
      blinf_needed = more_units_needed * blinf_rate / prink_rate
      klimp_needed = more_units_needed * klimp_rate / prink_rate
      Let's plug in the numbers and see what we get:
      floob_needed = 1 * 1 / 4
      floob_needed = 0.25 floob machines
      blinf_needed = 1 * 1 / 4
      blinf_needed = 0.25 blinf machines
      klimp_needed = 1 * 1 / 4
      klimp_needed = 0.25 klimp machines
      Therefore, you need to build 0.25 floob machines, 0.25 blinf machines, and 0.25 klimp machines in order to have the prink machine produce 3 units per hour.

    • @ekstrajohn
      @ekstrajohn Před 5 měsíci

      @@ikituriflash fail :/

  • @jeanchindeko5477
    @jeanchindeko5477 Před 5 měsíci

    So all those LLM eval tests will soon be obsolete! How will we know whether we have reached AGI or not?

  • @tmhchacham
    @tmhchacham Před 5 měsíci +1

    "I counted them carefully before sending them to you." is 9 words. The first sentence is metadata. :)

  • @mrd6869
    @mrd6869 Před 5 měsíci

    I'm having GPT-4 Turbo, Orca 2, and Gorilla work together synergistically on a project, one building on the others' work in a loop. The results should be interesting 🤔

  • @neelmehta9092
    @neelmehta9092 Před 5 měsíci

    Their zero-shot test with GPT-3.5 seems wrong. I asked it the same question and got a correct answer zero-shot.
    I understand your question. Let me provide you with a visual representation to make it clearer:
    Here's a simple diagram to represent the situation:
    Room:
    - Ball
    - Basket
    - Box
    Step by step:
    John puts the ball in the box:
    Room:
    - Box (with the ball inside)
    - Basket
    John leaves for work, and Mark puts the ball in the basket:
    Room:
    - Box
    - Basket (with the ball inside)
    Now, when they both come back later in the day, John and Mark will think the ball is in different places:
    John, who left when the ball was in the box, will think the ball is in the box.
    Mark, who put the ball in the basket before leaving, will think the ball is in the basket.
    So, they will have different beliefs about the ball's location because they are unaware of each other's actions in the room while they were away.
    No additional information was provided

  • @TheLucanicLord
    @TheLucanicLord Před 5 měsíci

    30:30 The ball is on the floor. Tables are rarely perfectly flat. 31:05 Does he take the ball out of the box?

  • @janfilips3244
    @janfilips3244 Před 5 měsíci

    Hello Matthew, I've got some thoughts and questions about the subject and would love to chat with you directly. Is there a way to reach out to you?

  • @mediocreape
    @mediocreape Před 5 měsíci

    Is this the best one there is? I want to download a few; could you do a video on the top 10 models, including uncensored ones?

  • @SmartK8
    @SmartK8 Před 5 měsíci

    You should change "killers" to "murderers"; also, it could answer four, because there are technically four murderers in the room (one dead). You should prefix the last question with "living":
    There are three murderers in a room. Someone enters the room, and murders one of them. Nobody leaves the room. How many living murderers are left in the room?