"VoT" Gives LLMs Spatial Reasoning AND Open-Source "Large Action Model"
- Uploaded May 30, 2024
- Microsoft's "Visualization of Thought" (VoT) gives LLMs the ability to perform spatial reasoning, which was previously nearly impossible for LLMs. Plus, a new open-source project was released using this technique: an open Large Action Model.
* ENTER TO WIN RABBIT R1: gleam.io/qPGLl/newsletter-signup
Join My Newsletter for Regular AI Updates 👇🏼
www.matthewberman.com
Need AI Consulting? 📈
forwardfuture.ai/
My Links 🔗
👉🏻 Subscribe: / @matthew_berman
👉🏻 Twitter: / matthewberman
👉🏻 Discord: / discord
👉🏻 Patreon: / matthewberman
👉🏻 Instagram: / matthewberman_ai
👉🏻 Threads: www.threads.net/@matthewberma...
Media/Sponsorship Inquiries ✅
bit.ly/44TC45V
Links:
github.com/a-real-ai/pywinass...
arxiv.org/abs/2404.03622
Chapters:
0:00 - Visualization of Thought Research Paper
12:30 - Open Large Action Model
- Science & Technology
A complete tutorial would be really cool to see. Thank you !❤
Second the motion!
Powerful, open-source, uncensored, offline, FREE models are still the future.
I can pay for good one, no problem
I will pay for uncensored models . Not going to pay for software that refuses to do what I need though .
Decentralized distributed intelligence over the internet too.
Good luck with that.
Intelligently automating inspiration - I would love to see an end to end open source solution, where an AI takes a prompt, creates an App, and VoT is then used to deploy it, for example to Azure or AWS. .
My buddy actually has Aphantasia, the lack of a minds eye. Super wild, his mind is super descriptive and specific and he imagines qualities, not images. Maybe this is sort of how these models are working given they can't actually form mental images like a human, but more so it seems they understand the concepts and qualities of the task at hand.
Wow that’s really interesting! Are there jobs or tasks he’s super human at?
@@dansplain2393 I have learned a ton from my largest aphantasia FB group. There seems to be just about nothing that one can't actually do, even without being able to see ANYTHING in their mind's eye. I myself see nothing when I read a book (even though I can conjure up mental imagery elsewhere, with immense effort). I still dream of being able to carve out an AI niche to accommodate some aspects of aphantasia, at least selfishly for my reading disabilities, even though many have shown that aphantasia itself may not actually be a disability.
I have that too. There's been lots of good research on it lately. I can see in my dreams only, which I found out is common for aphantasics.
That’s an interesting way to think about it. You may be onto something.
I have aphantasia too; I love the relationship between AI and consciousness. Wow, I love seeing new discoveries that get us closer to understanding this better!!
People without a mind's eye (Aphantasia) can actually perform spatial reasoning with the spatial sense which is completely separated from the mind's eye sense
Oh i just commented the same thing. I have this.
I'll affirm that! Our visuospatial sketchpad (i.e. imagination/generative imagery) isn't really present; interacting with AI gives a really interesting perspective on neural intelligence.
I have this, the machine is working, but the monitor isn't plugged in
As someone who has only ever used English for reading and writing (and has done so for over 20 years and become quite proficient at it), I have always shied away from listening and watching because it has been a nightmare for me. But the topic of AI has completely changed my perception of this and I am giving my first ever sub on this channel because I think the host is brilliant when it comes to the topic of AI and the subject matter is excellent.
What I appreciate about your videos is the way you explain even advanced topics so that people can follow and understand. Thank you for that 🤗
Yes, the tutorial would be awesome. It's hard to tell from the paper what's going on: who is asking for what, and what the model is doing on its own.
Whatever happened to the conventional explanation of "All they do is predict the next word..." ? 🤔🙄 Clearly there's a lot more going on here.
Yeah you're right, it can open YouTube on Firefox; my Firefox-YouTube-opener guy's gonna be out of a job
@@GeekProdigyGuy Lots of jobs are mostly about using desktop applications. And we already know these LLMs remember way more features and tricks in desktop applications than most people do.
Or, more likely, this is what we do as well. Just predicting the next word...
@@vitalyl1327...no, we don't know much about the brain, but we do know that it does more than predict the next word. Just explain how we show high cognitive function in contrast to LLMs? GPT-4 has more artificial neuron connections and neurons than humans have by a long shot. Language is not enough to get us to AGI.
@@hypercoder-gaming huh? GPT-4 does not even have a hundredth of connections of the human brain.
These breakthroughs that bring Ai into the real world (not digital), will be huge.
Just wait until it gives us the answer: 42 🤯
Man I'm super impressed with that spatial reasoning explanation ❤❤❤❤🎉🎉
The VOT method's ability to elicit spatial reasoning in language models could be a game-changer for AI usability. Has Microsoft indicated any plans to integrate these models into productivity tools?
Why do you need them to integrate this, just use it.
It could also
not be😅😅😂😂
This is most excellent! Thank-you for covering this new development with prompting.
This is getting really fun!!
I just found out that my mind's eye is blind! Began to focus on LLM and this concept is completely foreign to me for how my mind works.
Matthew, my man... you may have, intentionally or unintentionally, just described/discovered the actual barrier to TRUE AGI, and how it can be overcome. The "mind's eye" of a human being is essentially "daydreaming": a spatial plus cognitive awareness of all of the possibilities of an outcome, prediction, or generative creativity (tensor) within the environment that is being perceived. We all have numerous voices in our heads, and also numerous paths of imagery happening all the time. In every second of every day, we as human beings are constantly evaluating the next probable outcome, not just from our speech, but from our environment in its totality.

Without spatial awareness, whether 2D, 3D, or, if we want to go nuts, other dimensions of spatial awareness, REAL artificial general intelligence is a long way off. Without that aspect in computational form, we are still looking at what is essentially just a neat trick of mathematics, mimicking what we have said on the internet since 1980. Predicting the next probable outcome of a language interaction, without the context of what is actually going on around us, is where we seem to be stuck right now. All of the things you describe so well in your videos, which I follow very closely, and all of the things we already know exist in the realm of AI, might, when combined, actually qualify as true, real, undeniable AGI. So the actual "barrier" is not just compute power, or super-tuned language models, image/video processing, or large action models; it's the TOTALITY of all of those models combined. The single largest barrier to accomplishing AGI and beyond is all of these private companies desperately trying to control the space and compete with each other for the sole purpose of profit. I do not believe AGI has been achieved yet, in the basement of OpenAI, or Meta, or anyone else, for that exact reason.

They are unwilling to SHARE all of the pieces of this puzzle; they all seem to want to hold on to their own pieces and charge an API fee. If we were to fully democratize AI technology, truly open-source everything, and publicly fund the infrastructure to benefit not just our country but the entire world, that is the only path that leads to an actual net benefit from what is inevitably coming: advanced superintelligence. An intelligence that can solve any problem, and is far beyond what we define as "AGI" today. Sam Altman said it best: "the models you are using today are the worst models you will ever use in your lifetime." That is not just true for language models; that is true for every type of model we can possibly conceive, as humans, right now. The worst thing that can happen is that what we know is inevitably coming gets built, but whoever ends up controlling it does not have an altruistic view of how it can benefit the world. Open-source models (generative/action/predictive/language/etc.) are literally our only hope of achieving an AGI that doesn't end up killing us all. I still have a net positive outlook on the future of AI, mainly because of the open-source dev community and people like yourself, but the danger of AGI being discovered by an entity whose sole ambition is of a capitalist nature is very real. Thankfully, the barriers to them discovering it first are inherent in their ambitions. The open-source community MUST achieve AGI first, or humanity is super, duper, undeniably, and irreparably..... fucked.
This is super helpful for automated testing of applications / websites. Gonna give this a go, thanks!
01:17 That can definitely be handled in the language realm; just add some textbooks about geometry, geography, etc. Walking indefinitely on the surface of the Earth, assuming there is an unobstructed walkable path, is equivalent to walking along what is called a great circle. Take a 50-yard line starting at the North Pole: with an understanding of the concept of poles on a sphere, cardinal directions, etc., you know that all directions from the North Pole head south. Then you turn 90 degrees and start moving along a great circle. The closest that great circle gets to the North Pole is 50 yards, so assuming a perfect trajectory, no major earthquakes, etc., the closest you will get to the starting position is the point where you made the 90-degree turn; you will never be less than 50 yards from the starting position.
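The commenter's claim can be sanity-checked numerically. Below is a sketch of my own (not from the video): place the turning point on a sphere of Earth's radius, parametrize the great circle through it, and measure the distance back to the pole along the walk.

```python
import math

# Numeric check: walk 50 yards from the North Pole, turn 90 degrees,
# and continue along a great circle. The closest the path comes back
# to the starting point (the pole) should be 50 yards, at the turn itself.

R = 6_371_000 / 0.9144      # Earth's mean radius in yards
theta = 50 / R              # colatitude (angle from pole) of the turning point

# 3D unit vectors: turning point P, and an "eastward" tangent T
# perpendicular to the meridian through P.
P = (math.sin(theta), 0.0, math.cos(theta))
T = (0.0, 1.0, 0.0)

def pole_distance(t):
    """Distance in yards from the pole to the great-circle point at angle t."""
    z = math.cos(t) * P[2] + math.sin(t) * T[2]   # dot product with the pole
    return math.acos(z) * R

# Sample the walk near the turn; the minimum distance is at t = 0.
samples = [pole_distance(t * 1e-7) for t in range(-100, 101)]
print(round(min(samples), 2))  # ≈ 50.0 yards
```

The minimum of `acos(cos t · cos θ)` over `t` is `θ` itself, confirming that the turning point is the great circle's closest approach to the pole.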
Yes please a complete tutorial, thank you
We definitely want a full tutorial! I love your videos, exactly because they are so technical.
A game changer!!!! So much compute on the way.
You should do a new video on the new OpenDevin update. It now has a SWE Bench success rate of 21%.
if incorporated into the processing that quantum computing can bring where multiple possibilities can be tested simultaneously, this could really be powerful
Tutorial and in depth prompt analysis 🙏
Thank you, your videos are great!
Interestingly, I used to play chess with GPT-3.5 by just exchanging moves in text. And it did excellently! It managed to beat me and accurately remembered the position of every piece until the end.
Hi, nice tech introduction. Very happy with this technology. Great video!
Nicely explained.
They should have made it a paperclip....
Did we all just get rick rolled at around the 14 minute mark?
We did...
🤣
I definitely would like to see some specific examples. Love your videos by the way.
Awesome, soon this will be working on Windows and Android and we won't need to buy the Rabbit R1 anymore. This is what I imagined for my parents, who have a hard time operating their laptop due to age. Please keep updating on this subject.
I am astounded by the lack of coverage of this breakthrough. In my view, this is possibly an even more profound development than language... in combination they are mind blowing. Hey, Agent trained on thought experiments, solve these physics problems for me, imagine all the parts of a cell and how each of them functions on the molecular level, design a craft that can traverse the deep ocean to space orbit, create new robots from biomaterials, apply spatial reasoning and read the firing patterns of my mind... envision battles and fight them a million times, envision geopolitical relationships that can avoid them, imagine a justice system that employs universal principles instead of laws, explore the relationships between all governmental data points, imagine my entire body in perfect health, create the perfect bone pieces and implant them, re-invent computing, and on and on... What a time to be alive!
This is something everyone should be thrilled to talk about.
🎯 Key Takeaways for quick navigation:
00:00 *💻 Introduction to the Open-Source Large Action Model*
- An introduction to the open-source large action model.
- Similar to the Rabbit R1, which controls Android applications, this large action model controls the Windows environment.
- The open-source project has been released and is readily available for use.
00:43 *🧠 Spatial Reasoning in Large Language Models*
- Definition and application of spatial reasoning in large language models.
- Example of spatial reasoning as thinking through in your mind.
- Spatial reasoning has been a missing feature in large language models and a hindrance towards reaching AGI.
02:06 *📄 Visualization of Thought Prompting Technique*
- Explanation of the Visualization of Thought prompting technique.
- When applied to a user interface, the technique allows control of the interface, a defining characteristic of a large action model.
- The concept of visualizing mental images in humans and large language models.
03:30 *🧩 Advanced Prompting Techniques*
- Discussion of advanced prompting techniques like Chain of Thought and visualization of thought.
- Explanation of how these techniques improve the performance of large language models.
05:03 *🎯 Spatial Reasoning Tasks Testing*
- Description of tasks used for assessing spatial awareness in large language models.
- Explanation of how large language models interpreted 2D spaces represented with natural language.
07:35 *🧮 Visual Tiling*
- Explanation of visual tiling concept, a classic spatial reasoning challenge.
- The task involves finding a place for a new object in a grid with different colors and shapes.
08:32 *📈 Visualization at Each Step*
- The importance of visualization at each step in improving the performance of the large language model.
10:10 *🥇 Performance of GPT-4 with Visualization of Thought*
- Comparison of performance of GPT-4 with visualization of thought against other versions on various tasks.
- Visualization of Thought prompting technique emerged as superior.
12:40 *💡 Real-world Application*
- Introduction to PyWinAssistant, the first open-source large action model that controls user interfaces using natural language.
- It utilizes the techniques discussed in Microsoft's research paper.
13:08 *📋 Running the Assistant*
- Demonstration of the assistant running the commands one after the other seamlessly.
- The assistant uses the visualization at every step and spatial reasoning to accomplish the tasks.
14:48 *📝 Making a New Post on Twitter*
- Demonstrates another use case - making a new post on Twitter.
- The assistant is able to generate the tweet and post it by being guided through each step.
16:10 *🔄 Various Practical Implementations*
- Various proven use-cases of the assistant concept.
- Given the right commands, the assistant can perform a wide range of tasks.
Made with HARPA AI
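For anyone who wants to try the idea from the takeaways above, here is a minimal sketch of what a VoT-style prompt wrapper could look like. The wording and the example task are my own illustration, not taken verbatim from the paper:

```python
# Minimal sketch of a Visualization-of-Thought (VoT) style prompt wrapper.
# The instruction text and example task are illustrative, not from the paper.

def vot_prompt(task: str) -> str:
    """Wrap a spatial task with a VoT-style instruction: the model must
    'draw' the state of the world in text after each reasoning step."""
    return (
        f"{task}\n\n"
        "Visualize the state after each reasoning step. "
        "After every move, redraw the grid in text form, marking your "
        "current position, before deciding on the next move."
    )

task = (
    "You are in a 3x3 grid at the top-left cell. "
    "Navigate to the bottom-right cell, moving only right or down."
)
print(vot_prompt(task))
```

The resulting string would then be sent to whatever chat model you are testing; the key design choice is forcing an explicit text "mental image" between steps rather than letting the model reason purely in prose.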
A complete tutorial would be really cool to see. Thank you ! PLease
Thanks for sharing, interesting development. Also question, what app are you using to display your mouse pointer?
Thanks for the video! Yes please do a video of VoT
@matthew_berman, I used this technique on a version of the ball-in-a-cup question you use to test LLMs, with Llama 3 70B, and it nailed it. Here's the prompt I used: Imagine a scenario where Bob is performing a series of actions with a cup and a ball. For each step, carefully visualize the cup's orientation and the ball's position within the cup. Consider the physical laws and constraints that govern the behavior of the cup and the ball. Use this visualization to predict the ball's location and the cup's orientation at the end of the sequence of actions. Provide a detailed and accurate description of the ball's final location and the cup's orientation. Here's the scenario:
Bob walks to the kitchen and puts a ball in a cup. He then placed the cup upside down, in the microwave.
He then picks up the cup and walks to the garden.
Where is the ball?
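For what it's worth, the expected answer can be checked with a tiny state-machine simulation. The event encoding below is my own illustration, not from the video or the prompt:

```python
# Tiny state simulation of the ball-and-cup scenario above.
# Encoding each sentence of the scenario as an event is my own choice.

def simulate(events):
    ball_in_cup = False
    ball_location = None
    cup_location = None
    for event in events:
        if event == "put ball in cup":
            ball_in_cup = True
        elif event.startswith("move cup to "):
            cup_location = event.removeprefix("move cup to ")
            if ball_in_cup:
                ball_location = cup_location  # ball travels with the cup
        elif event == "turn cup upside down":
            if ball_in_cup:
                ball_in_cup = False
                ball_location = cup_location  # ball falls out where the cup is
    return ball_location

events = [
    "move cup to kitchen",
    "put ball in cup",
    "move cup to microwave",
    "turn cup upside down",   # ball falls out inside the microwave
    "move cup to garden",     # cup leaves; the ball stays behind
]
print(simulate(events))  # microwave
```

Gravity is the implicit "physical law" here: once the cup is inverted, the ball stops traveling with it, which is exactly the step most LLMs used to miss.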
This was super awesome. All of a sudden I started getting flashbacks of when I was young using the Dell computer that was purchased for the family. Visually imagining all of the uses this could've been used for and it's just great to see how far technology has come. Thank you @matthew_berman for taking time out to show us this latest advancement. "YES" I would personally appreciate you making a tutorial vid on this as well.
Because of the Elon Musk, New York Times, and other lawsuits, I imagine all the big releases are going to come from Microsoft soon. Because of their contract with OpenAI (in regards to AGI), that means that to profit from AGI they are going to need their own version of AGI for when they can no longer get OpenAI's.
Thank you Matthew for a very interesting topic. This tool puts LLM ‘in action’. It would be really interesting to learn how to use them. Please make a comprehensive tutorial, as only you know how. Thank you
Good for now, but someone needs to create a model based on video learning of user interfaces, as the accuracy would become much higher.
I suspect Yann Lecun will end up revising many of his predictions in the coming years.
😂 exactly what I was thinking
I would love to see this working outside a promotional video!
I notice on the maps around 6:14, the K number corresponds with the number of turns taken, not the number of moves. Is there a reason for that?
Once a skill such as spatial reasoning has improved in an LLM, does that skill persist, or is it at risk of fading? Also, can an LLM be duplicated as many times as wanted and shared? Thanks
"If you're not familiar with spatial reasoning" then you might be an LLM... The worst part of ERP is when she sits down, steps forward and takes her clothes off, her eyes locked on yours while you're in a different room with the door closed...
VoT prompting could serve as domain knowledge for text-to-prompt generation in prompt-to-image pipelines. We could generate more detailed images with this framework.
A complete tutorial would be most welcome. Thanks!
Hey, thx for all your vids. It would be cool to have a full review of this tech. Thx
Maze runners are on! Tetris for testing!
Great video, thanks for putting me on. I definitely downloaded the link and will be going through it. I would appreciate a follow-up on this video and a tutorial on the link you provided for this LAM VoT approach.
Man, all the wow bots in the world just got 10x better
Do you have to predefine all of the actions for PyWinAssistant? I'd like to play around with it, so a tutorial on setting it up would be great!
CS should really start to use the definitions and terms of Psychology. There is no need to invent the wheel again and again.
There’s not a ton of “computer science” at this level of language models, it’s more like comparing a physics class (CS) to race car driving (LLM’s) - it’s a different level of abstraction
The other way around. CS is much more accurate and productive than psychology.
Psychology should really start to use CS definitions and terms. Putting it that way is just as misguided as the other way around. It's even worse: psychology keeps rejecting progress made in neurology for diagnosis, while supposedly everything will be cured with the appropriate pill one day.
Reasoning is different from a statistical answer, so are we seeing greedy performance gains? Similar to simple Monte Carlo methods?
I’d love to see you test it out and experiment with its capabilities ❤
this will also revolutionize the game industry
Actually, I've seen LLMs trying to explain spatial relationships with ASCII art.
It's not exactly good, and this might just be a byproduct of people trying to explain things with brackets etc., but they sometimes try.
Llama 3 is a model that tried it.
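That ASCII-art tendency is essentially what VoT formalizes: the model redraws a text "mental image" after every step. A minimal sketch of that kind of rendering (my own illustration, not from the paper):

```python
# Sketch of the kind of ASCII "mental image" VoT elicits from a model:
# a small grid redrawn after each move, with the agent marked 'A'.

def render(width, height, pos):
    """Render a width x height grid as text, marking pos with 'A'."""
    rows = []
    for y in range(height):
        row = "".join("A" if (x, y) == pos else "." for x in range(width))
        rows.append(row)
    return "\n".join(rows)

# Agent walks right twice, then down once, redrawing the grid each step.
pos = (0, 0)
for move in [(1, 0), (1, 0), (0, 1)]:
    pos = (pos[0] + move[0], pos[1] + move[1])
    print(render(3, 2, pos))
    print()
```

Generating this kind of intermediate state in text is cheap for the model and gives it something concrete to condition the next move on.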
I'm gonna say it.. Feel the AGI! What a time to be alive.
By the way, did Matt just lowkey Rick roll us? (lol)
Thank you Wenshan Wu, Shaoguang Mao, Yandong Zhang, Yan Xia, Li Dong, Lei Cui, Furu Wei.
I asked GPT to translate the names to English, and it said they already were in English. Then I said "No, it's in Spanish," and it said OK, but the names were Japanese. GPT isn't able, just as we aren't, to discern between names and nationalities, which is called prejudice 🤷🏽🤷🏽🤷🏽🤷🏽
A complete tutorial would be awesome 👍
This is big.
That's very interesting and promising. I'm not sure I understood how it works, but if it generates images, maybe it would be better to keep embedding vectors, which may store not only spatial information but also time and actions. Converting embeddings to text/images eliminates a lot of data. Is it possible to make the model think using embeddings and only convert them to text/media in the final stage?
Wait. Did GPT-4 VoT w/ Partial Tracking outperform Complete Tracking???
How does Open Interpreter work different to this? Would love some insight!
Thank you
Microsoft found a new way to force its slave into doing something and they call it a "Prompt"... hilarious.
Is Professor Li addressing the spatial intelligence research project? I also used MultiOn with the same LAM, which should be the next LLM based on spatial intelligence.
So does the fact that it said VoT didn't demonstrate noticeable tracking rate across route planning mean that if given a maze with multiple routes and dead ends, it would not work better than other methods?
So cool!
Maybe that is one of the applications of that Q* thing recently from OpenAI, which might have map-pathfinding-like abilities? It's just the right (mental) algorithms that are required to make it do the magic?
This is what I've been waiting for. Well I don't know if exactly this is the thing but it seems like it might be the start of it.
Interesting. I'd suggest an Open Interpreter vs. VoT test. Is there a LAM benchmark?
I always have to turn the volume up when I come to your videos. Great audio quality for sure, but it could use a tad of an increase in volume... I think, anyway.
Wait? Why 'today'? The videos are not from December 2023?
That's amazing!!! But why so many clicks instead of using keyboard shortcuts?
Wow, Anything similar available for mac?
I wonder how it would do if there were multiple solutions to the route? i.e. one route being faster than others.
This is very similar to what I've been doing with an LLM-driven CAD for quite a while: LLaVA was used to see projections of the part the LLM was designing, giving feedback and suggesting corrections.
Please do a full tutorial, seems very good.
If you look at the visuals that rendered the human cortex looking at an elephant as a sea of words and shapes, it's not too unlike how a large language model works. The key difference is that the input equipment of the eye and the processing speed of the brain are significantly better at handling this kind of data; we need to improve the input-to-compute pipeline so it can match the output pipeline.
I've been messing around with putting claude haiku in a raspberry pi based robot. I'm going to try implementing this.
I have a very poor mind's eye, which is why learning from YouTube visual tutorials helps me a lot! Lectures at university are often challenging because most professors demand the mind's eye while explaining complex things. Life is hard for me in academia, for sure.
Does anyone know how to actually make this PyWinAssistant work? It seems there are no install instructions, and lots of things don't actually work!
Cool, I hope we get a platform agnostic FOSS LAM.
Yes, please demo
Yes, please make a tutorial on this. Thanks Matthew!
The best way for me to ask my question is to request an episode/video on how to go about finding out whether someone is doing an open-source software project for something a viewer is interested in, particularly for a non-coder who is planning on soon being able to use a bidirectional voice agent to help with the coding end of their idea. In particular, I just watched this awesome video (new big fan, ty) and I don't do Windows anymore; I've been an Apple guy for a long while now. But I'd love to integrate VoT into an offline LLM, like a 🐬. I suppose I'm asking for Eric's take on whether an adapter could do this in fine-tuning or something, and how! Thx.
💡OMG! "Spatial thinking" is just one example of a non-verbal topic that needs more than studying words! I can't believe I didn't think of that. So here's a new test question (ChatGPT 3.5 gets it wrong): Imagine that I walk 10 feet straight out of my front door. Then I turn 90 degrees to the right and walk 5 feet. Then I turn left 90 degrees and walk 10 feet. Then I turn left 90 degrees and walk 5 feet. How far will I be from the front door?
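For reference, the puzzle can be checked with a few lines of code (a sketch of my own; the step encoding is mine): the answer comes out to 20 feet.

```python
# Check the walking puzzle above: track heading and position on a 2D
# plane, with the front door at the origin and "out the door" as +y.

turns = {
    "right": lambda dx, dy: (dy, -dx),   # rotate heading 90 deg clockwise
    "left": lambda dx, dy: (-dy, dx),    # rotate heading 90 deg counter-clockwise
}

def walk(steps):
    """Follow (action, amount) steps; return final distance from the door."""
    x, y = 0, 0          # front door
    dx, dy = 0, 1        # facing straight out of the door
    for action, amount in steps:
        if action == "forward":
            x, y = x + dx * amount, y + dy * amount
        else:
            dx, dy = turns[action](dx, dy)
    return (x * x + y * y) ** 0.5

steps = [("forward", 10), ("right", 0), ("forward", 5),
         ("left", 0), ("forward", 10), ("left", 0), ("forward", 5)]
print(walk(steps))  # 20.0
```

The 5-foot jog to the right and the 5-foot jog back to the left cancel, so only the two 10-foot legs away from the door remain.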
where can we get a background like that x)
7:51... my bad eye starts twitching.... there are two "pieces" to place in that grid. The long piece would go across the top, not down the middle, blocking the other piece.
Could you give a link to your background? it is pretty awesome
my mind is glowing rn:)
Missed opportunity to use a paperclip
Full tutorial yes!
Are there implementations for Mac and Linux?
Very cool
14:40 What's that awesome background please?
A full tutorial for PyWinAssistant would be a great video
Please make a full tutorial of PyWinAssistant. Pretty interesting and relevant