MASSIVE Step Allowing AI Agents To Control Computers (MacOS, Windows, Linux)

  • Published Jun 4, 2024
  • OSWorld gives agents the ability to fully control computers, including macOS, Windows, and Linux. By giving agents a language to describe actions in a computer environment, OSWorld can benchmark agent performance like never before.
    Try Deepchecks LLM Evaluation For Free: bit.ly/3SVtxLJ
    Join My Newsletter for Regular AI Updates 👇🏼
    www.matthewberman.com
    Need AI Consulting? 📈
    forwardfuture.ai/
    My Links 🔗
    👉🏻 Subscribe: / @matthew_berman
    👉🏻 Twitter: / matthewberman
    👉🏻 Discord: / discord
    👉🏻 Patreon: / matthewberman
    👉🏻 Instagram: / matthewberman_ai
    👉🏻 Threads: www.threads.net/@matthewberma...
    Media/Sponsorship Inquiries ✅
    bit.ly/44TC45V
  • Science & technology

Comments • 282

  • @donrosenthal5864
    @donrosenthal5864 a month ago +54

    OSWorld project video? Yes, please!!!

    • @reidelliot1972
      @reidelliot1972 a month ago +4

      Yes, tutorial please! Please elaborate more on the relationship to CrewAI-like frameworks and potential implications for the rumored YAML endpoints!

    • @user-wz3qe3vw6h
      @user-wz3qe3vw6h a month ago +3

      @@reidelliot1972 Yes Matthew, pls!

  • @Carnivore69
    @Carnivore69 a month ago +64

    User: What happens between the steps in these Ikea instructions?
    Agent: A fuckton of swearing!
    User: Test passed.

  • @KCM25NJL
    @KCM25NJL a month ago +55

    It's great and all, but I kinda think one of two things will end up happening:
    1. An AI layer will become a standard for interoperability as part of the OSI and App Dev Stacks
    2. A whole new OS will be developed that serves this very purpose.
    I suspect we may start with 1 and end up with 2 in the longer term.

    • @theterminaldave
      @theterminaldave a month ago +5

      When I was helping to write test steps for an automated software testing app, I was required to basically open up the developer tools and get the name of the object that needed to be interacted with: the HTML code/name for a particular button, or a certain drop-down textbox.
      I don't understand the whole "lay a grid over the screen and guess the coordinates" approach. That's just the user interface; the computer utilizes all the code in the background. I don't get why the AI isn't navigating by looking at the underlying code for the page instead of the graphical output of the page. (See the sketch at the end of this thread.)

    • @DaveEtchells
      @DaveEtchells a month ago

      @@theterminaldave Interesting point. I'd say, though, that the point is to have the AI interact with the UI based on what a human would see. On a related note, there have been tools for software regression testing dating back many years that would let you interact with UI elements, but it was a PITA to write the scripts for them, and they were very fragile in that tiny changes could send them off the rails.

    • @Daniel-jm8we
      @Daniel-jm8we a month ago +1

      @@theterminaldave Would the AI always have access to the code?

    • @ich3601
      @ich3601 a month ago

      @@Daniel-jm8we Almost. When using RPA tools you're scanning the HTML, the OS events, or the application events. It would be great if an AI would eat this stuff, because nowadays RPA tools are very sensitive to changes.

    • @theterminaldave
      @theterminaldave a month ago

      @@Daniel-jm8we Open any webpage, press F12, and click on the Inspector tab; that's the code I'm referring to.
      It's basically the code for the graphical interface, so yes, the AI would always have "access", because if you don't have access it's because it's not appearing on the page.
      After you open the inspector, click on any line and hit delete, and it will disappear from the page. If you hit refresh it will come back.
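      To illustrate the contrast this thread is drawing, here is a minimal sketch of pressing the same button two ways, with pyautogui and Selenium; the coordinates, URL, and element id are hypothetical placeholders:

      ```python
      # Coordinate-based: what "lay a grid over the screenshot" amounts to.
      import pyautogui
      pyautogui.click(x=512, y=384)  # breaks if the layout shifts even slightly

      # Structure-based: target the same name/id you would see in the F12 inspector.
      from selenium import webdriver
      from selenium.webdriver.common.by import By

      driver = webdriver.Chrome()                    # needs chromedriver installed
      driver.get("https://example.com/form")         # hypothetical page
      driver.find_element(By.ID, "submit").click()   # hypothetical element id
      driver.quit()
      ```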

  • @jamesheller2707
    @jamesheller2707 a month ago +66

    Please make more videos testing and running this yourself 🙏🏼, it'll be great

    • @reidelliot1972
      @reidelliot1972 a month ago

      Yes, tutorial please! Please elaborate more on the relationship to CrewAI-like frameworks and potential implications for the rumored YAML endpoints!

  • @haroldpierre1726
    @haroldpierre1726 a month ago +28

    It would be helpful to have a catalog of pre-built open source AI agents that can be easily downloaded and used for specific tasks. My brain shuts off trying to follow video tutorials on programming my own AI agent from scratch.

  • @BlankBrain
    @BlankBrain a month ago +5

    The most difficult part of making something like OSWorld is security. When you open your OS to computer manipulation, it's a lot easier for computers to manipulate it.

  • @alanhoeffler9629
    @alanhoeffler9629 a month ago +2

    This was a good video showing what has to be done to make LLMs agentic using computer OSes. It showed me two things. The first was why autonomous cars are so hard to set up: the auto system has to know not only what the "rules of the road" are, what the automobile's driving characteristics are, and how to make the car do what it needs to do, but also how to correctly parse, at high speed, a situation it has never encountered before, decide what the correct action to take is, and pull off executing it in real time. The second is that a system that can do that well is way closer to AGI than any LLM.

  • @threepe0
    @threepe0 a month ago +4

    Really look forward to your videos. You’ve helped me get the gist of developments as they come out and determine which technologies are useful and worth spending my time on, and which ones I am equipped to handle, for my personal use-cases.
    I have and will continue to recommend your channel to friends and co-workers.
    Seriously man when I see your name, I click. Thank you for continuing to do what you do.

  • @ScottzPlaylists
    @ScottzPlaylists a month ago +15

    Yes please 👍 Need lots of OSWorld videos ❗❗❗
    We need a video-tutorial-watching AI that creates a training-set item for OSWorld on how to do X by watching a video on how to do X (and fills in missing details not shown). 🤯🤯🤯🤯❗❗❗❗

    • @AGIBreakout
      @AGIBreakout a month ago +8

      Great Idea!!!!

    • @CryptoMetalMoney
      @CryptoMetalMoney a month ago +7

      YT tutorial videos would be a huge ready-to-go dataset... Great idea

    • @CryptoMetalMoney
      @CryptoMetalMoney a month ago +5

      Continuous learning will be huge in the future, and using computers will be a big part of that.

    • @NWONewsGod
      @NWONewsGod a month ago +5

      YT is a treasure trove for more advanced forms of AI training, and even training now.

  • @pvanukoff
    @pvanukoff a month ago +51

    Not long before we have Star Trek-style computers, where we just say "computer ... do x, y, and z for me".

  • @justjosh1400
    @justjosh1400 a month ago +1

    Can't wait for the tutorial. Wanted to say thanks for the videos Matthew.

  • @reidelliot1972
    @reidelliot1972 a month ago +1

    Yes, tutorial please! Please elaborate more on the relationship to CrewAI-like frameworks and potential implications for the rumored YAML endpoints!

  • @marshallodom1388
    @marshallodom1388 a month ago +7

    Computer! Computer?
    [Handed a mouse, he speaks into it]
    Hello, computer.
    The Dr. says just use the keyboard.
    Keyboard. How quaint.

  • @AhmedMagdy-ly3ng
    @AhmedMagdy-ly3ng a month ago +1

    I would be more than happy to see you testing it on real-world examples: not complex tasks, just everyday tasks, like summarizing a bunch of PDFs or doing research, and things like that.
    And I also need to say that I really appreciate your work ❤

  • @jimbo2112
    @jimbo2112 a month ago +2

    Yes please! Tutorial on this would be great. I see agents as being a driving force behind vast amounts of commercial AI adoption. Companies want greater efficiency and agents are the tools to bring this.

  • @JandJActionPlay
    @JandJActionPlay a month ago +2

    Always awesome and informative videos, Matt, love it brother. I feel that much smarter after watching them. Keep up the awesome work!

  • @darwinboor1300
    @darwinboor1300 a month ago

    Thanks Matt.
    The change-the-background task is like an Optimus real-world task. Using the mouse requires a collection of basic motion skills (e.g., move in XY, click right/left, scroll up/down, etc.). Moving and activating the mouse on a screen are simple subtasks necessary to build actual real-world tasks (on the PC, these basic skills, subtasks, and more can be accomplished using AutoHotkey). The reactive sequence of mouse subtasks (including motions) is the equivalent of FSD navigating from location A to B in the real world, or Optimus stepping through a set of real-world subtasks to complete a real-world task. The advantage for a change-the-background task AI is the paucity of edge cases that make real-world tasks so difficult for Optimus and for FSD. All three AI systems need to evaluate the real-world changes they evoke before executing the next subtask. Optimus and FSD repeatedly face infinite real-world variations between subtasks. These variations are introduced by independent external agents (cars, animals, fallen trees, etc.). The change-the-background task AI will mostly face changes due to software upgrades and different starting states. Most computer issues can be resolved by deeper searches on the web. AutoHotkey can programmatically solve simple issues (hiding open windows). Having an AI to navigate the process would fundamentally change the ability to execute complex computer tasks from simple sequences of verbal commands.
    Here is an example: convert the most recent Matt Berman YouTube video to mp4, then extract unique screenshots to a PowerPoint file and the YouTube transcript, without timestamps, to a text file. The filename for each file is MB1.
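    A minimal sketch of the basic mouse/keyboard subtasks listed above, using Python's pyautogui (one common way to script these primitives; the coordinates are made up):

    ```python
    import pyautogui

    pyautogui.moveTo(200, 300, duration=0.5)  # move in XY
    pyautogui.click(button="right")           # click right/left
    pyautogui.scroll(-5)                      # scroll down
    pyautogui.hotkey("ctrl", "c")             # keyboard chord
    pyautogui.screenshot("after_step.png")    # evaluate the change before the next subtask
    ```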

  • @iwatchyoutube9610
    @iwatchyoutube9610 a month ago +5

    I was waiting for your own test the whole video. Git'r done son!

  • @BThunder30
    @BThunder30 a month ago +1

    This is amazing. I think you need a team to help you set it up fast. We want to see a demo!

  • @PhoebusG
    @PhoebusG a month ago +1

    Yes, definitely set it up, that would be a good video. Keep up the cool videos :)

  • @DefaultFlame
    @DefaultFlame a month ago +1

    Nice! I'd love to see you test it out.

  • @rupertllavore1731
    @rupertllavore1731 a month ago

    Nice to see you getting brand deals! May your channel keep getting more brand deals!

  • @arinco3817
    @arinco3817 a month ago

    This is really interesting. I've been thinking for ages about how to go from a VLM to action. It's a bit like sitting in front of someone's computer and describing what you want to happen.

  • @nqnam12345
    @nqnam12345 a month ago +1

    Great! Please, more on this topic.

  • @dilfill
    @dilfill a month ago

    Would love to see you test this out doing a few different tasks! Also curious if this could run someone's social media, etc.

  • @wardehaj
    @wardehaj a month ago

    Great explanation video. Thanks a lot!

  • @timduck8506
    @timduck8506 a month ago

    Are we able to program new actions? Or create new connections? Like what we can already do with macros?

  • @EduardoJGaido
    @EduardoJGaido a month ago

    Great video!

  • @CharlesFinneyAdventure
    @CharlesFinneyAdventure a month ago

    I would love to watch you set up OSWorld on your own machine, test it out, and use it to create a tutorial.

  • @luxaeterna00
    @luxaeterna00 a month ago

    Any link to the presentation? Thanks!

  • @tigs9573
    @tigs9573 a month ago

    Yes, I would like to learn more about OSWorld. Keep up the great content!

  • @AGI-Bingo
    @AGI-Bingo a month ago +1

    A new golden age of open source is upon us ❤

  • @scottwatschke4192
    @scottwatschke4192 a month ago

    Very interesting. I would love a testing video.

  • @roharbaconmoo
    @roharbaconmoo a month ago

    Does anything change for your video with their addition of memory sharing?

  • @LauraMedinaGuzman
    @LauraMedinaGuzman a month ago

    Amazing! I want to try it for Revit, a software for architecture. Actually, I did try something that worked! However, I truly need more knowledge, so your help is very, very appreciated! Thanks!

  • @galaxymariosuper
    @galaxymariosuper a month ago

    16:40 Think of temperature as maneuverability: the higher it is, the more flexible the system, which is basically a closed-loop control system at this point.
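    For reference, a minimal sketch of what the temperature knob actually does to the next-token distribution (numpy only; the logits are made up):

    ```python
    import numpy as np

    def sample_probs(logits, temperature):
        scaled = np.asarray(logits) / temperature  # T > 1 flattens, T < 1 sharpens
        exp = np.exp(scaled - scaled.max())        # numerically stable softmax
        return exp / exp.sum()

    logits = [2.0, 1.0, 0.2]
    print(sample_probs(logits, 0.1))  # near-deterministic: one token dominates
    print(sample_probs(logits, 1.0))  # the model's own distribution: more "maneuverable"
    ```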

  • @moses5407
    @moses5407 a month ago

    Great presentation! Too bad the accuracy levels are currently so low but this seems to be a framework that can self-grade and, hopefully, self-adjust for improvement.

  • @yenielmercado5468
    @yenielmercado5468 a month ago

    Excited for the Humane AI Pin agents feature coming.

  • @joe_limon
    @joe_limon a month ago +11

    How close until I have a locally run agentic system that can install all future improved agentic systems and/or github projects autonomously?

    • @fullcrum2089
      @fullcrum2089 a month ago +2

      With this, a person's ideas, dreams and personalities can become immortal.

    • @nickdisney3D
      @nickdisney3D a month ago

      I'd share my repo, but I think YouTube comments delete it automatically.

    • @electiangelus
      @electiangelus a month ago

      Already there. I'm actually past this.

    • @fullcrum2089
      @fullcrum2089 a month ago

      @@nickdisney3D Yes, I can't see it; just share the path as repo/name.

    • @electiangelus
      @electiangelus a month ago

      @@fullcrum2089 Apotheosis was thinking that 6 months ago.

  • @yugowatari2935
    @yugowatari2935 a month ago

    Yes, please do a tutorial on OSWorld. I have been waiting for this for some time.

  • @gotemlearning
    @gotemlearning a month ago

    great vid!

  • @ThinkAI1st
    @ThinkAI1st a month ago

    You are a very good teacher…so keep teaching.

  • @nangld
    @nangld a month ago +8

    A 20% success rate is a super impressive start. As soon as they iterate on that and train a proper model, it will reach 99%, leading to all office workers getting fired.

    • @andrada25m46
      @andrada25m46 a month ago

      Yeah, probably not.
      I use AI at work; I'm one of the few who do. A lot of data is confidential and extra security measures are needed; something like this breaches contractual agreements, since the AI provider would have access to the data.
      Not to mention proprietary apps running in containers, which the AI wouldn't be able to navigate.

    • @marcussturup1314
      @marcussturup1314 a month ago +6

      @@andrada25m46 Local LLMs could fix the data access issue.

    • @WolfeByteLabs
      @WolfeByteLabs a month ago +1

      This.

    • @stefano94103
      @stefano94103 a month ago

      @@andrada25m46 All the big players (Microsoft, IBM, Google) have enterprise software that is data-privacy compliant. The price varies with the solution. The only problem with enterprise LLMs is that they do not move at the speed of other models, for obvious reasons. But open source or enterprise is the way to go if your company has compliance requirements.

    • @greenleaf44
      @greenleaf44 a month ago +1

      @@marcussturup1314 I feel like people underestimate how feasible it is for large businesses to run their own inference.

  • @systemlord001
    @systemlord001 a month ago

    I think the temp is set to 1 because if it fails and makes another attempt, it will try different approaches. When the temp is set to lower values it might not get to a working solution, because the attempted methods are not divergent enough to contain a valid one.
    But I think having an LLM fine-tuned on datasets generated by humans in the format of OSWorld (the tree, screenshots, etc.) could improve the success rate.
    If I'm not mistaken, this is what Rabbit R1 was doing. It's basically teach mode, but with more examples than just the one you give it.
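    That retry argument in a minimal, runnable sketch; both helpers are hypothetical stand-ins, not OSWorld APIs:

    ```python
    import random

    def generate_plan(task, temperature):
        # hypothetical stand-in for an LLM call; higher temperature = more varied plans
        return f"plan {random.randint(0, 999)} for {task!r} (T={temperature})"

    def execute_and_check(plan):
        # hypothetical stand-in for running the plan and reading the env's success signal
        return random.random() < 0.2  # roughly the success rate quoted in the video

    def solve(task, attempts=5):
        for _ in range(attempts):
            plan = generate_plan(task, temperature=1.0)  # diverse samples
            if execute_and_check(plan):
                return plan  # one of the divergent attempts worked
        return None

    print(solve("change the desktop background"))
    ```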

  • @alpineparrot1057
    @alpineparrot1057 a month ago

    I enjoy your content, Matt. You put me on to LM Studio, then Ollama, then CrewAI. CrewAI has excellent use cases, so thank you so much. Could you please do some more stuff with CrewAI? I have mine set up in the one-file approach, but am not too sure how to set it up with multiple files and calling to and from. (I'm not too familiar with Python; ChatGPT is excellent help, but it still only goes so far.)

  • @2106chrissas
    @2106chrissas a month ago

    Great project!
    It would be interesting to have a video on RAG and the programs available for RAG (for example, H2OGPT).

  • @beckettrj
    @beckettrj a month ago

    OSWorld project videos, please! This could be a series of videos.
    I could see this helping me do my job five times faster! A helpdesk support tool to check and update an XYZ application user account, then email the user letting them know we have updated their account and that they should be able to log in. Complicated processes, such as opening a VPN connection, checking Active Directory account settings, and then logging into administrative program(s) to search for and open the user's account to check their settings. The user account settings in Active Directory must match the user login settings in the application(s). Email the findings and let them know what was altered or changed, etc.

  • @camilordofficial
    @camilordofficial a month ago

    This video was great, thanks. Could this work with IoT-like devices?

  • @ThomasEWalker
    @ThomasEWalker a month ago

    Cool, this is moving SO fast! I think we will get AIs with the ability to recognize what is on the screen more directly, much like a self-driving car sees the world. This would become "go click the button that does X", without screenshots. I bet that happens this year. Real-world agents with AGI for a Christmas present!

  • @alexalex4192
    @alexalex4192 a month ago

    Hello, I've registered on Massed Compute but couldn't find your preinstalled system. Any tips? And maybe you have a tutorial?

  • @kevinehsani3358
    @kevinehsani3358 a month ago

    Can a multimodal model scroll up or down on a screen and see more than just what is displayed? Can it actually read the text in a cmd terminal and then act on it, instead of us copying and pasting the reply into an input context?

  • @OSWALD569
    @OSWALD569 a month ago

    For performing actions on desktops, a macro recorder is available and suitable.

  • @mshonle
    @mshonle a month ago

    16:38 It depends on the specific formula used for the temperature setting, so a 1 here is by no means the maximum. The use of top-p implies nucleus sampling, which prevents the most improbable completions from even being considered. They are looking for a wider sampling to establish a baseline, and setting the temperature too low would create more repetitive results (repeats across different runs, and also repeating the same phrase in a single run until the context is full) and thus would be too easy to dismiss as a strawman.
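    A minimal sketch of the nucleus (top-p) filter being described: keep only the smallest set of tokens whose cumulative probability reaches p, then renormalize (numpy only; the probabilities are made up):

    ```python
    import numpy as np

    def top_p_filter(probs, p=0.9):
        order = np.argsort(probs)[::-1]              # most to least probable
        cum = np.cumsum(probs[order])
        keep = order[: np.searchsorted(cum, p) + 1]  # smallest prefix covering p
        filtered = np.zeros_like(probs)
        filtered[keep] = probs[keep]
        return filtered / filtered.sum()

    probs = np.array([0.5, 0.3, 0.15, 0.04, 0.01])
    print(top_p_filter(probs))  # the improbable tail is zeroed before sampling
    ```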

  • @marcfruchtman9473
    @marcfruchtman9473 a month ago

    Thanks for the video! Yes, this seems like it will be very useful.

  • @awakstein
    @awakstein a month ago

    Good! So, how do we test it?

  • @ScottSummerill
    @ScottSummerill a month ago

    Actually, your video, specifically the table, convinced me that agents, at least in this iteration, are not all that spectacular. They will likely get there, but right now it's a lot of hype.

  • @dreamphoenix
    @dreamphoenix a month ago

    Awesome. Thank you.

  • @AetherTunes
    @AetherTunes a month ago

    I've always wondered if you could incorporate vision for LLMs into something like ShadowPlay.

  • @spikezz29
    @spikezz29 a month ago

    Do you have plans for talking about DSPy?

  • @gatesv1326
    @gatesv1326 a month ago

    Very similar to RPA (Robotic Process Automation), which I've been developing for 10 years now. Nothing new, but being able to do this with a typed or vocal prompt is what's going to be interesting when it gets as good as a human (which is what RPA has been successfully doing for a long time), also understanding that RPA licences are expensive.

  • @adtiamzon3663
    @adtiamzon3663 17 days ago

    Good start. Excellent. 🤫 🌞👏👏

  • @Treewun2
    @Treewun2 a month ago

    Please do a series on fine-tuning open source models!

  • @christopheboucher127
    @christopheboucher127 a month ago

    Of course we want to see more about that ;) thx 4 all

  • @buggi666
    @buggi666 a month ago

    Soooo we basically arrived at reinforcement learning using LLMs? That sounds so awesome!

  • @davidhoracek6758
    @davidhoracek6758 a month ago

    This only needs to work once and you've basically built the universal installer. Soon you just tell a computer: "Make the latest Stable Diffusion (or whatever) work on my computer, including all the hardware-specific optimizations that apply to my specific system." Then it just needs to bootstrap in the newest interaction AI for my OS, have a little conversation with the system, try promising settings, and if they fail, come up with others, and (importantly) update the weights of the remote installer system based on the successes and errors of this particular interaction.

  • @BelaKomoroczy
    @BelaKomoroczy a month ago

    Yes, test it out, go deeper, it is a very interesting project!

  • @ericgoz3858
    @ericgoz3858 a month ago

    What Python version does OSWorld require to launch in an Arch Linux (Zen kernel) environment?

  • @japneetsingh5015
    @japneetsingh5015 a month ago

    I am already waiting for a Linux where I could enter commands in natural language and the LLM generates a set of plausible commands, and I just have to choose one or make a minor change.
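    A minimal sketch of that wished-for shell: the model proposes candidate commands and nothing runs without the user's choice. llm_candidates is a hypothetical stand-in for any local or hosted model:

    ```python
    import subprocess

    def llm_candidates(request: str, n: int = 3) -> list[str]:
        # hypothetical stand-in for an LLM call; swap in any real model
        return ["df -h", "du -sh *", "lsblk"][:n]

    def nl_shell(request: str) -> None:
        candidates = llm_candidates(request)
        for i, cmd in enumerate(candidates):
            print(f"[{i}] {cmd}")
        choice = input("Run which one (number, or type an edited command)? ")
        cmd = candidates[int(choice)] if choice.isdigit() else choice
        subprocess.run(cmd, shell=True)  # runs only after explicit approval

    nl_shell("show how much disk space is left")
    ```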

  • @monnef
    @monnef a month ago

    Very nice project. I would find it interesting to see success rates on different OSes (or, in the case of Linux, even different DEs/WMs). Also GUI vs CLI: I can imagine that on some tasks the CLI would be king, while on others it could fail miserably. Still, it could be useful to see which use cases different OSes, or GUI vs CLI, are better for, and where it might be worth trying to utilize an AI.

  • @scotter
    @scotter a month ago

    With regard to the difficulty of an AI accessing the desktop, is there an exception if we are talking about just manipulating a browser window through the use of Selenium?

    • @byrnemeister2008
      @byrnemeister2008 a month ago +1

      You can build tools for an agent using Selenium as a browser automator. There are also RPA apps like Power Automate.
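      A minimal sketch of wrapping Selenium as an agent tool, so the agent emits (selector, action) pairs instead of raw screen coordinates; the page and selectors are hypothetical:

      ```python
      from selenium import webdriver
      from selenium.webdriver.common.by import By

      driver = webdriver.Chrome()  # needs chromedriver installed

      def browser_tool(selector: str, action: str, text: str = "") -> str:
          """Execute one browser action and return a cheap observation."""
          element = driver.find_element(By.CSS_SELECTOR, selector)
          if action == "click":
              element.click()
          elif action == "type":
              element.send_keys(text)
          return driver.title  # feedback for the agent's next step

      driver.get("https://example.com/login")     # hypothetical page
      browser_tool("#username", "type", "alice")  # hypothetical selectors
      browser_tool("#submit", "click")
      ```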

  • @MeinDeutschkurs
    @MeinDeutschkurs a month ago +1

    Temperature of 0.1 could lead to “I cannot click, I’m just an LLM.”

  • @Maisonier
    @Maisonier a month ago

    This is great! I'm going to wait for a Linux distro that has these agents built-in to automatically configure Wi-Fi, printers, drivers, or even VMs with Windows (for specific programs that don't work in Wine).

  • @settlece
    @settlece a month ago

    I would definitely like to see more OSWorld.
    Thanks for bringing this exciting news to us.

  • @francoislanctot2423
    @francoislanctot2423 a month ago

    Thanks! Yes, please install it and show us the procedure. I think it is going to be useful for a lot of people.

  • @oratilemoagi9764
    @oratilemoagi9764 a month ago

    So which team are you on:
    OSWorld or Open Interpreter 01 Light?

  • @ayreonate
    @ayreonate a month ago

    I think they set the temp at 1.0 to test how hard it would hallucinate if given more creative freedom, then added it to the presentation just to show off.

  • @mikey1836
    @mikey1836 a month ago

    Copilot on Windows already allows control of the OS. For example, you can ask it to switch to night mode and it will.

    • @slomnim
      @slomnim a month ago

      That's pretty simple compared to where this project is going. Maybe soon Microsoft will have Copilot do some of this stuff, but so far this seems like the first real attempt.

  • @ma77yg
    @ma77yg a month ago

    It would be interesting to have a tutorial on this setup.

  • @gokudomatic
    @gokudomatic a month ago

    Nice, but does it support Ollama?

  • @user-lb5cp5mw4u
    @user-lb5cp5mw4u a month ago

    Often, restricting a model to output only code reduces accuracy, especially on complex tasks. It's worth trying to allow it to print a chain of thought (even better if there is a self-critical inner dialogue loop) and then output the final code piece.
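    A minimal sketch of that suggestion: let the model reason in prose first, then execute only the final fenced code block. call_llm is a hypothetical stand-in for any chat-completion client:

    ```python
    import re

    FENCE = "`" * 3  # a literal triple backtick, built up so this example stays readable

    def call_llm(prompt: str) -> str:
        # hypothetical stand-in; a real client would return the model's completion
        return ("First I open Settings... on reflection, that plan holds.\n"
                f"{FENCE}python\nprint('the final code')\n{FENCE}")

    PROMPT = ("Think step by step about the task, criticize your own plan once, "
              "then output the final answer as a single fenced python block.\n"
              "Task: change the desktop background.")

    def extract_final_code(completion: str) -> str:
        # keep the chain of thought for accuracy, but run only the last code block
        blocks = re.findall(FENCE + r"(?:python)?\n(.*?)" + FENCE, completion, re.DOTALL)
        return blocks[-1] if blocks else ""

    print(extract_final_code(call_llm(PROMPT)))
    ```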

  • @andreluistomaz3930
    @andreluistomaz3930 a month ago

    Ty!

  • @DonDeCaire
    @DonDeCaire a month ago

    This is why simulated data is so important: if you can replicate real-world environments, you can test an infinite number of environmental conditions an infinite number of times.

  • @paketisa4330
    @paketisa4330 a month ago

    Consider a project where a person documents daily experiences, thoughts, feelings, and personal history in a diary specifically for a future AGI's learning. Do you think such a personalised dataset could enhance an AGI's ability to understand and interact with individuals on a deeper level? And lastly, is it feasible to expect an AGI to become a close, personal companion based on this method, or would it somehow be redundant, useless data? Thank you for the answer.

  • @Justin-1111
    @Justin-1111 a month ago

    Let's see it!

  • @tonysolar284
    @tonysolar284 a month ago

    I already have this. My AI controls my home with my special logic prompt.

  • @DamielBE
    @DamielBE a month ago

    Hopefully one day we'll get agents like the muses in Eclipse Phase or the Alt-Me in Peter F. Hamilton's Salvation trilogy.

  • @canadiannomad2330
    @canadiannomad2330 a month ago

    On Linux there is the X server. I've been thinking it would be neat to plug a system into the X server backend and have an LLM communicate with that directly. It would somewhat bypass most visual interpretation, except for what is actually rendered as graphics.
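    A minimal sketch of reading window structure straight from the X server with python-xlib, instead of interpreting pixels (assumes a running X session; pip install python-xlib):

    ```python
    from Xlib import display

    d = display.Display()
    root = d.screen().root

    def walk(window, depth=0):
        name = window.get_wm_name()          # None for most container windows
        if name:
            print("  " * depth + str(name))  # a text tree an LLM could read directly
        for child in window.query_tree().children:
            walk(child, depth + 1)

    walk(root)
    ```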

  • @johnkintree763
    @johnkintree763 a month ago

    I want the digital agent in my phone to download my monthly invoice from the electric utility, merge that and other data I want recorded publicly into a decentralized graph representation that is maintained in collaboration with digital agents running in other personal devices to create a shared world model for planning collective action.

  • @youjirogaming1m4daysago
    @youjirogaming1m4daysago a month ago

    Taking a screenshot and guessing is an impractical implementation. For desktop agents to truly work, we would have to create new APIs that directly alter the desktop state, and the best operating system to do this on right now is Linux. But if macOS and Windows also provide them, I think it is possible for agents to make a significant impact.

  • @xxxxxx89xxxx30
    @xxxxxx89xxxx30 a month ago

    Interesting take, but again, trying to go too general. I am curious whether there is a team working on a real "AI OS": not using screenshots and these half-solutions, but actually having predefined built-in functions that control the device through code and track the progress the same way, to do the "grounding" step.

  • @jamalnuh8565
    @jamalnuh8565 a month ago

    Always update us like this, especially on the new research papers.

  • @DailyTuna
    @DailyTuna a month ago

    I think as this evolves, it's time for somebody to create a Linux system that would work directly with this. You need an operating system catering directly to the agents.

  • @ThomasTomiczek
    @ThomasTomiczek a month ago

    I think a lot of the current problems are training: if GPT-5 is trained on videos from YouTube, and that includes a lot of videos of people USING THE COMPUTER, the AI may be better prepared for this.

  • @NoahtheGameplayer
    @NoahtheGameplayer a month ago

    I have no idea what is going on, especially since I don't know what "agents" means. Is it like another word for ChatGPT, or something else?

  • @ktolis
    @ktolis a month ago

    It will be interesting to see ReALM getting benchmarked.

  • @DJStompZone
    @DJStompZone a month ago

    Is it affectors or effectors?

  • @interchainme
    @interchainme a month ago

    Feels like a talk from the future o_O

  • @cmelgarejo
    @cmelgarejo a month ago

    MASSIVE agents, noice

  • @iseverynametakenwtf1
    @iseverynametakenwtf1 a month ago

    Why not link the project in your description?

  • @ayreonate
    @ayreonate a month ago

    Maybe the LLMs are vastly better at the daily and professional tasks because that's what's widely available online, a.k.a. their training data, while workflow-based tasks don't have that many resources. Case in point: the example they used (viewing photos of receipts and logging them in a spreadsheet) won't have the same amount of online resources as daily or professional tasks.