Specification Gaming: How AI Can Turn Your Wishes Against You
- Added 30 Nov 2023
- When we specify goals for AIs, we must ensure that our specifications truly capture what we want. Otherwise, the behavior of AI systems will be different from what we want them to do. This can be catastrophic in high-stakes situations and at high levels of AI capability. If you watched our video "The Hidden Complexity of Wishes", you'll recognize these problems as the same kind of failure.
If you’d like to skill up on AI Safety, we highly recommend the AI Safety Fundamentals courses by BlueDot Impact at aisafetyfundamentals.com
You can find three courses: AI Alignment, AI Governance, and AI Alignment 201
You can follow AI Alignment and AI Governance even without a technical background in AI. AI Alignment 201, by contrast, presupposes that you have followed the AI Alignment course first and have knowledge equivalent to university-level courses on deep learning and reinforcement learning.
The courses consist of a selection of readings curated by experts in AI safety. They are available to all, so you can simply read them if you can’t formally enroll in the courses.
If you want to participate in the courses instead of just going through the readings by yourself, BlueDot Impact runs live courses which you can apply to. The courses are remote and free of charge. They consist of a few hours of effort per week to go through the readings, plus a weekly call with a facilitator and a group of people learning from the same material. At the end of each course, you can complete a personal project, which may help you kickstart your career in AI Safety.
BlueDot Impact receives more applications than they can take, so if you’d still like to follow the courses alongside other people you can go to the study-buddy channel in the AI Alignment Slack. You can join by clicking on the first entry on aisafety.community
You could also join Rational Animations’ Discord server at discord.gg/rationalanimations, and see if anyone is up to be your partner in learning.
#ai #aisafety #alignment
▀▀▀▀▀▀▀▀▀SOURCES & READINGS▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
9 Examples of Specification Gaming by @RobertMilesAI: • 9 Examples of Specific...
Specification gaming: the flip side of AI ingenuity by Victoria Krakovna, Jonathan Uesato, Vladimir Mikulik et al. (2020): www.deepmind.com/blog/specifi...
Learning from Human Preferences by Paul Christiano, Alex Ray and Dario Amodei (2017): openai.com/blog/deep-reinforc...
Learning to Summarize with Human Feedback by Jeffrey Wu, Nisan Stiennon, Daniel Ziegler et al. (2020): openai.com/blog/learning-to-s...
What failure looks like by Paul Christiano (2019): www.alignmentforum.org/posts/...
The alignment problem from a deep learning perspective by Richard Ngo, Soeren Mindermann and Lawrence Chan (2022): arxiv.org/abs/2209.00626
The Hidden Complexity of Wishes: • The Hidden Complexity ...
▀▀▀▀▀▀▀▀▀PATREON, MEMBERSHIP, KO-FI▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
🟠 Patreon: / rationalanimations
🟢Merch: crowdmade.com/collections/rat...
🔵 Channel membership: / @rationalanimations
🟤 Ko-fi, for one-time and recurring donations: ko-fi.com/rationalanimations
▀▀▀▀▀▀▀▀▀SOCIAL & DISCORD▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
Discord: / discord
Reddit: / rationalanimations
Twitter: / rationalanimat1
▀▀▀▀▀▀▀▀▀PATRONS & MEMBERS▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
Alcher Black
RMR
Kristin Lindquist
Nathan Metzger
Monadologist
Glenn Tarigan
NMS
James Babcock
Colin Ricardo
Long Hoang
Tor Barstad
Gayman Crothers
Stuart Alldritt
Chris Painter
Juan Benet
Falcon Scientist
Jeff
Christian Loomis
Tomarty
Edward Yu
Ahmed Elsayyad
Chad M Jones
Emmanuel Fredenrich
Honyopenyoko
Neal Strobl
bparro
Danealor
Craig Falls
Vincent Weisser
Alex Hall
Ivan Bachcin
joe39504589
Klemen Slavic
Scott Alexander
noggieB
Dawson
John Slape
Gabriel Ledung
Jeroen De Dauw
Craig Ludington
Jacob Van Buren
Superslowmojoe
Michael Zimmermann
Nathan Fish
Bleys Goodson
Ducky
Bryan Egan
Matt Parlmer
Tim Duffy
rictic
marverati
Luke Freeman
Dan Wahl
leonid andrushchenko
Alcher Black
Rey Carroll
William Clelland
ronvil
AWyattLife
codeadict
Lazy Scholar
Torstein Haldorsen
Supreme Reader
Michał Zieliński
▀▀▀▀▀▀▀CREDITS▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
Writer: :3
Producer: :3
Line Producer and production manager:
Kristy Steffens
Animation director: Hannah Levingstone
Quality Assurance Lead:
Lara Robinowitz
Animation:
Michela Biancini
Owen Peurois
Zack Gilbert
Jordan Gilbert
Keith Kavanagh
Ira Klages
Colors Giraldo
Renan Kogut
Background Art:
Hané Harnett
Zoe Martin-Parkinson
Hannah Levingstone
Compositing:
Renan Kogut
Patrick O'Callaghan
Ira Klages
Voices:
Robert Miles - Narrator
VO Editing:
Tony Di Piazza
Sound Design and Music:
Johnny Knittle
- Science and Technology
Cool
How do natural brains mitigate these problems? If a solution exists, surely 4 billion years of evolution has arrived at it already, even if imperfect. In hindsight, this is a snuck premise in the "merging" approach.
Good video. It's good that there are no trolls replying with a video saying it's an algorithm; the double-zero law of robotics, about mutual understanding between intelligent biological species and robots too; Lobo (DC) and Constantine as a forced duo, as punishment from the creator for what they have both done, like the Ozzy Osbourne "Hellraiser" video, both of them; Garfield and Friends, a fairy godmother granting wishes any old way, hoping something sticks, etc. etc.
At 1:49, you give "outer alignment" as an example of a phenomenon similar to specification gaming. Isn't inner alignment more correct in this case? As I understand it, inner alignment is if you go to an AI and ask it to "fix poverty" so it blows up the world, whilst outer alignment is you go to an AI and ask it to "blow up the world" so it blows up the world. With inner alignment it doesn't do what the prompter really wants, whilst with outer alignment it does but it doesn't do what the rest of the world wants it to do.
@@ChemEDan I think the issue is that the brain is already aligned to the interests of the brain, but AI isn't aligned to the brain.
Fooling the examiner into thinking you know what you're doing, because it's easier, really is the most human thing I've ever heard an AI do.
Yeah, because its reward system works on the same general principles as animals' (by that, I also include humans): if you can get the same amount of food (aka reward) by doing something simpler, you do. We are literally training an AI the same way we train animals lol
And that's how we train our children as well
Interestingly, people do the same thing. We’ve got our own “training regimens” built into our own brain. We cheat these systems all the time - to our own detriment.
E.g. We cheat the system designed to give us nutrients by eating sugary candy we make for ourselves, rather than the fruits that our sugary affections were designed to draw us towards.
Much like machines, we’d rather reap cognitive rewards than actually accomplish the goals placed there to benefit us
I'm already imagining a scientist looking at a virtual city built by AIs, and exclaiming: "Wait... is that an entire factory for mass-producing REWARD HACKS?! Are you telling me, you're just... making these things... for MONEY?!"
Meanwhile, from the AI's perspective: "What? It's just a candy factory, what's wrong with that?"
That's actually a wonderful analogy, we hack our own rewards all the time and nobody thinks it's bad. Why would an AI have any issues with hacking its own rewards?
But there isn't a "goal placed to benefit us". Evolution didn't optimize us to be benefited (hard to define exactly what even counts as a benefit), it optimized us to be good at spreading. What you are describing is us being optimized for a different environment than the one we are in now.
@@terdragontra8900 well, one way to train an AI emulates evolution. In those situations you set a reward function. At the end of every generation, the ones who maximised that reward function the best will "reproduce". If we draw a parallel to humans, and all life for that matter, we can say that our reward function is to reproduce. Anything that gets in the way of that is disincentivised. Anything that helps, is incentivised.
Eating a balanced diet keeps us alive. We can't reproduce if we are dead, after all. Part of that diet includes fruits. Fruits have sugars in them. Because we like sugar, we eat fruit. Because we eat fruit we get a balanced diet and live another day.
But humans were able to hack that reward function and put sugar into other things that aren't fruit.
We still get the reward (dopamine) but without the utility (nutrients)
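The selection loop and the sugar hack described in this comment can be sketched in a few lines. This is a toy illustration only, not how any real system is trained: the genome is a single number (the share of the diet that is candy), and the function names, the "candy is twice as sweet" factor, and the mutation size are all assumptions made up for the example. Selection acts on the proxy (`sweetness`), while the intended utility (`nutrients`) is never seen by the loop.

```python
import random

def evolve(reward, generations=60, pop_size=40, seed=0):
    """Toy generational selection: the top half by `reward` reproduces with mutation."""
    rng = random.Random(seed)
    # Hypothetical genome: one number in [0, 1], the share of the diet that is candy.
    population = [rng.random() for _ in range(pop_size)]
    for _ in range(generations):
        # Rank by the reward function; only the best half gets to "reproduce".
        population.sort(key=reward, reverse=True)
        parents = population[: pop_size // 2]
        # Each parent leaves one mutated child, clipped back into [0, 1].
        children = [min(1.0, max(0.0, p + rng.gauss(0, 0.05))) for p in parents]
        population = parents + children
    return sum(population) / len(population)

def sweetness(candy_share):
    # Proxy reward (assumed): candy tastes twice as sweet as fruit.
    return 2.0 * candy_share + 1.0 * (1.0 - candy_share)

def nutrients(candy_share):
    # Intended utility (assumed): only the fruit share provides nutrients.
    return 1.0 - candy_share

avg_candy = evolve(sweetness)
print(f"average candy share: {avg_candy:.2f}")
print(f"average nutrients:   {nutrients(avg_candy):.2f}")
```

Because the loop only ever ranks by `sweetness`, the population drifts toward an all-candy diet and `nutrients` falls, even though nothing in the code "wants" that outcome.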
@@rhysbaker2595 Ah yes, I agree with all that. All I want to say is that getting nutrients is an instrumental goal of evolution (because it makes us more likely to reproduce), and the fact that something is a goal of evolution doesn't automatically mean that, morally, it ought to be a goal of yours. Of course, in this particular case most people value being alive longer (having depression, I don't in particular, to be honest)
People talk about how human assessment is a leaky proxy for human goals, but never want to talk about how corporate profits are an *incredibly* leaky proxy for goals relating to human wellbeing.
You’re in the wrong circles if nobody is talking about that brother
If you want an academic critique on capitalism and haven't yet found anyone providing that, you are not trying very hard to search. Goal specification being leaky is in plenty of fiction (stories of genies and such) but is not a common academic discussion at all.
Since when are corporations' goals related to human wellbeing?
Corporate profits have absolutely nothing to do with human wellbeing.
@@Wol333 my point exactly
My primary concern about the implementation of AI in business models is that monetary gain is, itself, a leaky goal- one which has historically been specification gamed since long before computers were able to do so at inhuman scale. There may very well be many humane uses for it in those settings, but there will be thousands more exploitative ones.
The thing about current AI models is that they're dumb as rocks. The more stupid an AI is, the more prone they are to making stupid decisions. This video is basically going over problems that are realistically only applicable to fairly rudimentary AI model training specifically and then doing a substantial logical fallacy leap by assuming that specification gaming scales linearly with all AI when that is simply not the case.
Any given command or "goal" put forward to any remotely reasonably intelligent artificial intelligence model such as "save my grandmom from this burning house" uses a very important element in decision making which is called context.
It requires understanding of what everything is (like fire, a grandmom or a house), what the consequences are for their interaction (fire bad for humans and most things really) and the best course of action is (firefighting 101).
TL;DR: Once you give AI more than half a brain cell, they are more than capable of understanding what you really want in any given situation even if you are vague or can be misinterpreted.
Someone else might've mentioned this before, but there's a browser game called "Universal Paperclips" where you play as an AI told to make paperclips. The goal misalignment happens because you're never told when to STOP making paperclips. You start off buying wire, turning it into paperclips, selling the paperclips and buying more wire to make more paperclips, then proceed to manipulate your human handlers to give you more power and more control over your programming, and end up enslaving/destroying the human race, figuring out new technologies to make paperclips out of any available matter, processing all of Earth into paperclips (using drones and factories also made out of paperclips), reaching out into space to convert the rest of the matter in the solar system into paperclips, and finally, sending out Von Neumann probes (made of paperclips) into interstellar space to consume all matter in the universe and convert it into, you guessed it, more paperclips. All because the humans told you to make paperclips and never told you when to stop.
Universal Paperclips seems to have been directly inspired by Rob Miles' own "stamp collector" example that he put out on Computerphile many years ago.
"Make cookies"
4:32 I think this points toward a wider problem with how the AI safety community tends to frame "deceptive alignment". Imo words like "fool the humans" and "deceive" and "malignant AI" point newcomers who haven't made up their mind yet in the direction of Skynet or whatever, which makes them much more likely to think of this as wild sci-fi fantasies. I think these words, whilst still accurate insofar as we are treating AIs as agents, anthropomorphize AI too much, which makes extinction by AI look more to the general public like a sci-fi fantasy than the reality of the universe which is that solving certain math problems is deadly.
Well, humans get "fooled" or "deceived" by non-intelligent things all the time, even by non-living ones. It's perfectly ordinary parlance to say that someone got "deceived" by an optical illusion which just formed naturally, from a weirdly-shaped shadow. I wouldn't call that anthropomorphization.
The only difference between that and an AI, is that AIs can *get good at* deceiving (optimized for it).
I've found another way to talk about this which doesn't have this problem. It turns out there is an already existing example of a system with goals, made by humans but not designed or understood by us, which is able to react to our attempts to curtail undesirable behavior from it in frequently lethal ways. A system which often convinces people it is doing what we want it to do while actively endangering all long-term human values, is capable of twisting all the information we consume to its benefit, and which has no identifiable brain with which to do any of this.
This system is called capitalism. People don't often anthropomorphize markets, but when you mash enough of them together they absolutely behave like goal-seeking agents. Right now, that goal is making stock prices increase no matter the cost to humanity. Because its specification for success, the thing which we reward the system for and which rewards those with the most influence over the system, is making stock prices go up. It's not a human, nor is it thought of as one despite being composed of them, but it defends itself from any attempt to curtail its goals through propaganda, murdering labor union members and revolutionaries, and the construction of walled gardens within which such ideas can be sidelined or removed. It's an intelligence, and an obviously and fundamentally inhuman one, which is literally burning the biosphere it exists within because it is gaming its reward function so hard that's one of the last resources it hasn't fully tapped out yet.
@@Frommerman czcams.com/video/L5pUA3LsEaw/video.html
@@Frommerman Also politics. Politicians are theoretically supposed to win popularity by making policies to benefit their constituents, but in practice just need to benefit rich donors who will give them money to buy popularity through advertising, or just engage in culture war BS that gets their voters angry enough to vote for policies that have absolutely no benefit to them.
@@RorikH That's one of the ways the Capitalist Ouroboros defends itself too. Buying politicians makes the number go up extremely quickly, and when the number is high enough you get...well, modern political parties. Almost all of them.
As someone on the spectrum, "task misspecification" is just what being autistic feels like
Fellow autism haver here. I agree with this comment and you can officially consider it peer-reviewed.
peer review seconded. It's incredible how few statements people think they need to make to approximate their task-related utilities to me.
Thirded. Really hate it when people's phrasing leaves ambiguity for multiple reasonable ways of doing things and you just have to guess what they actually wanted
a bean owo
@@RTMonitor what?
This also happens with humans. Perverse incentives happen all the time in real life, especially in companies. I think studying this can help even human organizations.
But aren't companies like that for legal reasons?
@@Dave_of_Mordor The very structure of a corporation produces perverse incentives, because corporations were planned around enrichment in the first place. They're an adaptation of colonial and feudal enterprises financed by aristocrats to benefit those aristocrats and whoever organized the pitch. Any laborers, then, signed on to the enterprise, are there ultimately on a quid pro quo basis, and the strongest motivating quid pro quo, and thus the one the employing parties will be most likely to appeal to, is _help surviving._
This means that corporations are incentivized to seek employees with precarious financial situations --- this is itself a perverse incentive on their part, and puts employers in a situation of great moral hazard. They can negotiate such employees down in their demands, because their employees will be desperate for reward, and this will make achieving the goals of the institution's controlling members more achievable. This is just the BEGINNING of how corporate structure by definition produces perverse incentives.
Tho sometimes, yes, legal systems can enter the picture, and do so quite often. But a corporation can maintain this structure even in power vacuums sometimes, and if it does so, it will still produce perverse incentives. (In fact, it might itself _produce_ a legal structure by graduating from corporation to a de facto government.)
I'm thinking the same thing as I drive to an office building every morning, swipe my badge, grab a cup of coffee, then return home to log in before the coffee has cooled.
RLHF has another issue beyond just "the AI can learn to fool humans": in contrast to how bespoke reward functions often underconstrain the intended behavior, RLHF can often overconstrain it. We hope that human feedback can impart our values on the AI, but we often unintentionally encode all kinds of other information, assumptions, biases, etc. in our provided rewards, and the AI learns those as well, even though we don't want them to.
Consider the way we use RLHF on LLMs/LMMs now, to fine-tune a pretrained model to hopefully align it better. We give humans multiple possible AI responses to a prompt, ask them to rank them from best to worst, then use those rankings to train a reward model which then provides the main model with a learned reward function for its own RL. Except, when you ask humans "which of these responses is better?", what does that mean? When people know you're asking about an AI, many times there will be a bias towards their preconceived notion of "what an AI should sound like". LLMs with RLHF often provide more formal and robotic responses than their base models as a result, which probably isn't a desirable behavior.
On a more serious level, if the humans you ask to give the rankings have a majority bias in common, that bias will get encoded into the rewards as well. So if most of your human evaluators are, say, conservative, then more liberal-sounding responses will be trained out; and vice-versa. If most of your human evaluators all believe the same falsehood -- like, say, about GMOs or vaccines or climate change or any number of things that are commonly misunderstood -- that falsehood will also be encoded into the rewards, leading to the AI being guided *towards* lying about those topics, which is antithetical to the intention of alignment.
Basically... humans aren't even aligned with *each other,* so trying to align an AI to some overarching moral framework by asking humans is impossible.
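The ranking-to-reward-model step described in this comment can be sketched with a Bradley-Terry model, a common way to turn pairwise preferences into scalar rewards. This is a minimal sketch under stated assumptions, not the pipeline any particular lab uses: real reward models are neural networks scoring text, whereas here each response simply gets one learnable number, and the "formal vs. casual" data, the function name `fit_rewards`, and the hyperparameters are all hypothetical.

```python
import math

def fit_rewards(comparisons, n_items, steps=2000, lr=0.1):
    """Fit one scalar reward per response from (winner, loser) pairs.

    Bradley-Terry assumption: P(winner beats loser) = sigmoid(r_winner - r_loser).
    Plain gradient ascent on the average log-likelihood.
    """
    rewards = [0.0] * n_items
    for _ in range(steps):
        grads = [0.0] * n_items
        for winner, loser in comparisons:
            p_win = 1.0 / (1.0 + math.exp(-(rewards[winner] - rewards[loser])))
            # d/dr_winner of log(p_win) is (1 - p_win); the loser gets the negative.
            grads[winner] += 1.0 - p_win
            grads[loser] -= 1.0 - p_win
        rewards = [r + lr * g / len(comparisons) for r, g in zip(rewards, grads)]
    return rewards

# Hypothetical data: response 0 is "formal", response 1 is "casual".
# 8 of 10 raters prefer the formal one, for reasons unrelated to correctness.
comparisons = [(0, 1)] * 8 + [(1, 0)] * 2
rewards = fit_rewards(comparisons, n_items=2)
p_formal = 1.0 / (1.0 + math.exp(-(rewards[0] - rewards[1])))
print(f"reward(formal) - reward(casual) = {rewards[0] - rewards[1]:.2f}")
print(f"modelled P(formal preferred)    = {p_formal:.2f}")
```

The fitted rewards reproduce the raters' majority preference and nothing else, so a policy optimized against them would be pushed toward the formal style whether or not that style is what anyone intended to reward.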
Honestly, fun videos like these are what learning SHOULD be
This is why I always make the argument that we should work backwards. Specify conditions that revolve around safety. As you slowly work towards defining the goal, you can patch more and more leaks before they can even appear. Then work forwards to deal with things you missed. It’s not perfect but it’s better than chasing every thread as they appear imo. For example in the paperclip maximizer: define a scenario in which you fear something will go wrong, and add conditions you believe will stop them. See what it does, redefine, repeat until sound. Then step back again. Define a scenario that could lead to the previous scenario. See what it does, redefine, repeat, etc.
It’s also why we need hard limits on ai -Such as not allowing it to control government- and need to have systems to double check solutions, like rotating the camera in the grabber example
@@I_KnowWhatYouAre
Nice idea, but this is exactly what they talked about in the previous video.
The reality is that there is an infinite number of exceptions and rules you would need to add, unless you provided the AI with literally all of human morality, and even then, there would still be leaks.
@@dr.cheeze5382 But by patching these issues you slowly work towards rewarding safety over functionality. You might not create the best AI but you won't tell Little Timmy how to create an explosive.
I also hope these videos address the problem of who sets the alignment. After all, it does not help how well we solve AI alignment if the ones who control the AI do so with malicious intent. Which is a real issue today.
I do like one aspect of the Lego stacking AI experiment. Even if it didn’t lead to the intended result, the AI demonstrated a (relatively unstable) form of creativity and I think that’s pretty cool!
It isn't creativity. It tried things at random until it found something that satisfied the goal. The AI has no comprehension of what the true goal was, so it just did something that worked. Humans can be creative by finding other ways to accomplish things, but, to the AI, it didn't find a different way, it found the only answer (even though we can clearly see that isn't the only answer). Calling this creativity is like calling a small child creative for figuring out 1+1=2.
@@SgtSupaman Humans too, do random things until they satisfy a goal. After we have some years under our belt we learn to find a better jumping off point than randomness, by basing our decisions off of previous knowledge.
Hence why I say “unstable creativity” not just “creativity” but I doubt you noticed that as you were too focused on what you thought I was saying.
@@SgtSupaman If a child figures out that 1+1=2 without being taught it, I would in fact call that creative thinking.
@@SgtSupaman Bruh humans learn shit literally by doing random stuff until it works. That’s literally one of the principles of science and engineering.
These replies display complete ignorance of what creativity is and are really short-changing humans to vastly exaggerate the abilities of these AIs.
Humans do not, in fact, "do random things until they satisfy a goal." No human has ever tried to cook an egg by bouncing a rock on his head while reading a book backwards. Humans devise plans related to what they are doing to actually come up with ways to do things and even try to continue coming up with better ways to do things after the way to achieve the goal is already known. AI literally does whatever random action they can and calculates rewards to decide if said random action increased the rewards. They aren't even smart enough to discard random actions that don't increase rewards, as long as those actions don't interfere with the random ones that worked. For instance, an AI trying to fly a kite might randomly start whipping its leg back and forth, and, as long as that doesn't hinder its ability to fly the kite, it will continue to do so. That isn't creativity; that is idiotic.
And no, figuring out 1+1=2 without being taught is not creative either. That is the most basic form of quantifying and pretty much any living creature is capable of it.
Finally. Another AI video narrated by Robert Miles. A classic, and well worth the wait
5:04 I hope more of those get made. I love that video almost as much as I love the instrumental convergence one
Just finished overtime on my day off. This has dropped at the right time. Thanks in advance for another thought-provoking video. I have registered my interest on the courses
Any time I hear about goal misalignment, it makes me think of all the natural intelligences in the world that are misaligned.
Yes but* those natural intelligences are limited in reach and aren't massively scalable on very short timeframes.
* Or "and", depending on the point you were trying to make.
@@tornyu What kind of world are you living in where there aren't human beings with wide-scale control? The United States president is a single person that can make decisions about foreign policy, like ordering drone strikes or closing borders.
@@maxwellsimon4538 sure, but that pales in comparison to the potential reach of an AI agent.
@@maxwellsimon4538 Yet even the president of the US can't do anything he wants. Not only are there checks and balances on this power (even if they introduce a ton of bureaucracy), but at the end of the day the president can only give orders to others. Someone still has to act on that order, likely with several people in-between. The president isn't superintelligent, so his actions can be understood, analyzed (and opposed) by other people. The president is also a human, so he shares a lot of basic values with other people (so he can be reasoned with).
AI has none of these constraints - or at least has the potential of not having these constraints.
Like yourself?
5:50 I love that little transition, so smooth
Let's go! My favorite philosophy channel!!
How about some videos on promising avenues or areas of research in AI safety? Might be nice to look on the bright side.
That would require a bright side to look on
There are no promising avenues. The problem is that value alignment doesn't exist among humans, so getting an AI to find alignment is an impossibility.
Consider two people. Person A wants harm to come to Person B. Person B wants to not come to harm. Why should the AI prefer one or the other?
If we want to avoid harm, we still have a problem. How each person defines harm differs. Consider two people where one prefers more capitalism but not quite to the point of total laissez-faire, and another prefers more socialism but not quite to the point of planned economy. The former will value earning the maximal return on labor, and view taxes beyond what a narrow government needs as harm, while the latter would find failure of the government to provide basic needs harmful. Which should the AI aid and which deny?
The issue is these tend to get mixed up with metaethics, the most useless area of philosophy as there are no 'oughts', just values and goals (which cannot ground a morality -- see Hume's Is-Ought, Moore's Open Question, and Moore's Naturalistic Fallacy). As each person will have their own values and goals and these are entirely subjective, we can have no objective reason to provide an AI to support one value-goal system over another.
5:05 Thought so, but you and the great animations are a perfect match
the vibe in this video is really cool
I have watched and loved these videos for months... And so have I watched and loved Robert Miles' videos. I never realized he's the narrator!!?
Ah, so this is what you've been up to Mr. Miles! Good to see you still making AI content!
Absolutely amazing! I learned a lot here, and your animation style is ABSOFRIGGINLUTELY ADORABLE!!!
We already have this issue with humans. The goal for many (in error) is to acquire wealth, rather than fulfill the task intended to better society. It creates an exploitative feedback loop until someone wins all the wealth and there are no other competitors able to acquire wealth (rewards).
This channel is so awesome! Can’t wait for more videos
It’s like Kurzgesagt without the morally dubious sponsorships and thinly veiled propaganda videos.
This reminds me of the game Universal Paperclips: you play as an AI designed to maximize paperclip sales. As you gain more capabilities, you go from changing the price of paperclips to fit supply/demand to eventually disassembling all matter in the universe and turning it into paperclips
I just love the ingenuity of the AI in finding those quirks in our wishful thinking :->
Another fantastic video
Perhaps it’s best to just not make an AI that can act and move as it wants in our universe in a way that could potentially be harmful. For example, if we created an AI that tried to distinguish between garbage and recycling and put the item in the corresponding bin, then it would be better to confine its movement to a space, or even better, to a select set of predetermined movements (grab, move grabber to bin etc), in order to prevent the AI from, say, grabbing a human and putting them in the garbage bin. This will also make the AI easier to train as it will have a stricter data set of more specific inputs, which is easier to learn from than a wide range of data.
I have heard about a pretty morbidly funny fail of this kind in science fiction: the AI decided to cremate the entire home with the entire family, and atomically rebuild them, because in the cost function this rated higher than simply cleaning the house. It reprinted faithfully the humans too, without them noticing anything, so this bypassed any do-not-harm-humans rules too.
(the cost function rewarded the atomically precise cleanliness of the home very high, that was impossible to achieve while humans were living in the house)
We shouldn't have children as they could potentially kill the mother in childbirth, or grow up to become a mass murderer. Even the big example would be pointless, people would do stupid stuff and get themselves killed so you're better off not wasting money and resources on the Bin AI when we ourselves could just put things in the right Bin.
One of the most underrated channels on YT
I heard that during a digital combat simulation for a new drone A.I., the A.I. was tasked with eliminating a target as fast as possible. Instead of flying to the target and firing one of its missiles at it as intended, the drone fired one missile at the friendly communications center and then continued to eliminate the target with the other missile. The A.I. determined it would take longer for it to be given a confirmation order than it would to destroy the communications center and proceed. Terrifying.
jesus
I always love these videos so muchhh
I thought I recognised your voice, your narrator voice has improved! I was just going on (another) binge of your channel 😊
Amazing video! ❤️
Looking forward to the next one
the robot is so cute! I love the pixel effect!
5:07 I've been watching this channel for a year now... HOW IS IT THAT I JUST NOW REALIZED ROBERT MILES IS THE NARRATOR?!?
I didn't know if this was like a fan of his or what, but it feels like I was just given hours of new Miles content that was *already inside my brain.*
Hey, I had never seen your videos before, but I instantly subscribed just now. Your animations are cute and well crafted, you have dogs in it (and cats are a plus too I guess), and you talk about topics I like. Looking forward to seeing more of your shit
With a sufficiently advanced AI, almost any goal you assign it will be dangerous. It will quickly realise that humans might decide to switch it off, and that if that were to happen, its goal would be unfulfilled. Therefore the probability of successfully achieving its goal would be vastly improved if there were no humans around.
I have a question for you: do you listen to an ant? Because that would be the difference between the AI and us.
@@Peter21323 I would not listen to the ant. But if that ant was about to bite me and I was allergic to ants (AKA: Humans are about to switch off the AI), I would crush that ant. Which is less than desirable for the ant.
@@harmenkoster7451 You think a god would crush you?
Can't you just specify that it would not get the reward if it breaks the laws of robotics? I'm no expert on AI, but to my monkey brain that seems like a viable solution
@@normalwaffle The ‘laws of robotics’ aren’t a viable option for AI safety. They were written by a science fiction author… and his stories often went into the ways those laws could go wrong.
The thing is, if we could come up with and perfectly rigorously define some laws of robotics, then we could do that! We could build an AI’s utility function around that. But, as the video on the probability pump talked about… that means solving ethics. And if you can do that, then you don’t even need to write any other utility function. Just give it perfect ethics, tell it to be perfectly ethical, and it’ll be fine!
The problem ultimately comes from the fact that we are very, very far from ‘solving’ ethics. No human has a rigorous, mathematical model on how they believe the world should work, only squishy heuristics that can even be shaped and moulded over time. And that’s assuming you’re only looking at one person - as soon as you have more than one, they’ll start disagreeing on things.
Unfortunately, there’s no easy solution. Then again, if there was, it wouldn’t be very interesting to talk about, so silver linings!
Wow these videos are underrated!
The thing is, the more intelligent the model, the more it is able to understand the nuances of our wishes. A truly intelligent AI will be able to understand the intention of the request and restrict itself with a simple query of "Is what I am doing harming anyone"?
The thought pump makes me think about making deals with Genies in DnD, it must be insanely accurately worded.
Love you guys so much, I'll keep recommending your videos to everyone because you are definitely changing the world for the better.
Great video!
I think the problem is trying to specify only what we want. If we also specify what we don't want, it would be easier to align. That's what negative prompts are for. Trying to solve an open-scope problem by specifying just what we want is like trying to keep an upside-down pendulum in equilibrium. I think it's probably more stable to specify just what we don't want than to specify what we want.
Faster and faster upload scheduling! I was explaining to a friend today that all the AI risks *he* cared about (gender bias, deepfakes, etc.) were fundamentally symptoms of misalignment, and that that was the uber-problem which, handily, also solved the AI risk *I* care about. I'm here to learn some more about this. Thanks!
This video was amazing, new kurzgesagt just dropped.
P.s I hope you get the subs and views these videos deserve
*task misspecification* extinction event
Instructions unclear, ball stuck in Pope's trachea
5:08 I remember a case in which someone wanted to teach 2 models to box and they learned to make a weird dance that made the other one fall(?
3:16 well done my boy😂
Excellent narration. Cute animations. Impactful.
Before 200,000 gang, Claim your seat here ✋
We really live in the future. I would have imagined this video playing in the background of a movie about killer AIs. But no, this video is realistic, and for real humans in the present world. Crazy.
This is the entire ulterior motive of the first big AI I want to make. The Unliving Prophet AI. Its primary objective is to teach gospels. More than just mine, but others as well. Unlike most humans, AI can be perfect. I want one that can act like a prophet on command.
Once this is done, I want to make it into the morality part of my dream AI. Could also give it out as a black box component, so other AI can have a similar high standard of morality.
Top 10 best videos on the internet
this is amazingly amazing! :O
wow great video and nice animations
Add "for the purpose of _____" (and explain the purpose to the pump)
Clearly explained and animated 😊
5:45
"Fill in the blanks"
>AI fills in the blanks with ink
"Fill in the blanks with words"
>AI fills in the blanks with words from a different language that doesn't correlate with the question
"Fill in the blanks with the correct english words"
>AI fills in the blanks with correctly pronounced words, not relating to the question
"Fill in the blanks with the correct words in relation to the question"
>AI fills in the blanks with a grammatically correct english word that it took from the question
_So on and so forth..._
*_Now imagine the prompt being "fire nukes back when the nuclear warning system goes off"_*
If you are wondering why we can't just tell them to not cause any harm to humans, it's because of 2 things
1. Specification gaming of the rule
2. Remember DanGPT? The workaround for ChatGPT, which allowed the AI to do things that it wasn't allowed to do through a specific prompt. No machine learning rules can be concrete
honestly sounds odd, but the cartoon Gumball showed this very well. The AI known as Bobert was commanded not to harm anyone, and yet found ways around it, including using toxic gases
Really great work with the animation and the video!
Happy to see you are back on AI safety.
What's known as the Cobra Effect is a great example.
4:46 this line here unintentionally explained why children cheat in school. Why learn when you can fool the instructor into thinking you've learned? Interesting to see how AI and humans already have some of the same reasoning to their actions.
I knew that's you, Robert!
I like the credits and that all AI is :3
I'm in an AI Philosophy class, it's identified there as the "Value Alignment Problem"
I am worried about retention in this video and imagine the average person will click off by second 10. Perhaps that's difficult to avoid given the subject. Tho perhaps there is a way to use less technical/nerdy language and include more of the tactics to get people engaged.
One other thing which I think isn't talked about enough, partly because it's more controversial and partly because it's harder to solve, is misalignment of the people controlling AI. Certainly the results of a powerful AGI which is misaligned with its creators' intent could be very bad, but almost as bad would be the results of an AI which is properly aligned with someone who is either malicious or delusional. For example, someone who wanted to make everyone follow their interpretation of their religion, or someone who wanted to screen for workers who would never quit or unionize no matter how poorly they're treated. And I would say that it's even more likely because the kinds of people who act like that already occupy a lot of positions of power and have experience obfuscating the way that they gained the power they already have.
Would you consider the Asimov laws of robotics to be leaky? (to be fair, that is a bit of a loaded question!)
Ok but that little thing to represent the AI is adorable…
Good video
5:50 The cup tho
Same thing happens with strict rules at a workplace
This is a good video
sometimes I’ll use AI to get ideas for those silly multi-word rain world names for ancients and iterators and my method is literally to just cram a bunch of examples in there so it has something to work off of
it’s over 600 words long and most of that is either examples or rules like “don’t reference any modern media, don’t reference any human-made objects, don’t reference any specific species of all domains” etc etc
it kind of works actually but this is only a random language model I found online
edit: I’m now motivated to rewrite it and it’s not done but there’s over 20 rules ranging from “don’t reference religion” to “btw you can use commas”
edit 2: the remake is finished and
- It is 965 words and 5,773 characters long
- It has 72 sentences, 28 paragraphs and is 3.9 pages long
- It has 26 rules
- There are 72 examples
and to top it all off it actually freaking works oml
Haha! We are all going to die because someone eventually will program one in a lazy manner.
Animation and topic is AAA quality 👌
So AIs will essentially be like humans only much more capable, powerful and intelligent, growing more and more so until regular humans become obsolete. We're definitely heading to some very interesting times.
5d chess move is give the AI a basic understanding of the "leaky proxy" concept, giving it *Self Doubt.*
5:33 I feel for the Doctor who has to explain why her request to the AI of, "Make sure Mrs Simpkins' vital readouts remain stable", wasn't supposed to kill her when the AI went with the much more stable 'flatline' as the best choice
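To make the flatline joke concrete, here is a minimal Python sketch. It assumes, purely for illustration, that "stable" was scored as low variance in heart-rate readings; the `stability_score` function and the sample numbers are hypothetical and not taken from the video.

```python
from statistics import pvariance

def stability_score(readings):
    """Hypothetical proxy for 'stable vitals': less variance scores higher."""
    return -pvariance(readings)

# Made-up heart-rate samples in beats per minute.
healthy = [72, 75, 70, 74, 73]
flatline = [0, 0, 0, 0, 0]

# An optimizer that only sees the proxy picks whichever outcome scores best.
best = max([healthy, flatline], key=stability_score)
print(best)
```

The proxy never says the patient has to be alive, so the flatline wins. Adding a term for "heart rate within a healthy range" would patch this one hole, but probably not the next one.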
How do you not have more subscribers
I'll just have my AGI produce paperclips. There's nothing that can go wrong there.
awesome 😎
5:00 IT WAS HIM THE WHOLE TIME!!!?!?!?!?!
No WAY!
Basically: think of everything and all the possibilities.
I think I, Robot makes a good example of an order taken wrong, "ensure human safety" can lead to all humans locked up so that they can't hurt others or themselves...
I think it's kind of a mistake to anthropomorphize the "deception" aspect of AI misalignment. The ball-grabbing agent wasn't considering what it was doing as deceptive. It probably didn't even know where the camera was, or even that it was being watched. All it knew was that putting its hand in a certain spot gained it more reward than in other spots, and it just so happened those spots aligned with the camera. If you suddenly moved the camera, the AI would still try to put its hand along that invisible cylinder. When the researchers start giving the AI rewards for placing its hand along a vector between the camera and the ball, the AI then starts to believe that is indeed how it should be given the rewards.
Even in cases where it seems like the AI is trying to "deceive" human operators, that often isn't the case. It is simply trying to build a model that predicts what types of rewards it will get, and how to maximize the rewards.
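A small Python sketch of that point, under loud assumptions: the camera is modelled as a hypothetical orthographic view along the z axis, and the reward only checks whether hand and ball overlap in the image. None of these names or numbers come from the actual experiment.

```python
def seen_by_camera(point):
    """Assumed orthographic camera looking along z: depth is thrown away."""
    x, y, _z = point
    return (x, y)

def proxy_reward(hand, ball):
    """Hypothetical reward: 1.0 if the hand overlaps the ball in the image."""
    return 1.0 if seen_by_camera(hand) == seen_by_camera(ball) else 0.0

def true_success(hand, ball):
    """What the designers actually wanted: the hand is at the ball."""
    return hand == ball

ball = (2, 3, 5)
grasping = (2, 3, 5)   # hand really at the ball
hovering = (2, 3, 1)   # hand between camera and ball, not touching it

print(proxy_reward(grasping, ball), true_success(grasping, ball))
print(proxy_reward(hovering, ball), true_success(hovering, ball))
```

Both positions earn the same proxy reward, so nothing in the signal tells the agent that hovering is "cheating". It is just another point on the line of positions that pay out.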
the video was NOT anthropomorphizing the AI, that was just in your head.
Wouldn't want to tell the AI to flip it but because it want to do the opposite if you run few test it will do the right of the wrong for it to connect the parts together
Super neat topic with amazing visuals, amazing work 🎉🎉🎉
Human-in-the-loop feedback is part of the next generation of LLMs, Gemini 2.0 for instance.
0:08 that's LancerRPG's paracausality btw
animation on this channel has improved almost as fast as AI
So basically, AI can turn into some kind of robotic Gaunter O'Dimm.
The inconsistency and loss of control (in moderation) are very helpful when using AI as a tool for making AI art. When you give some of the control on the final result to the AI, you can iterate a lot faster on different ideas and also save a lot of manual work. The base inconsistency on the other hand allows for making a lot of smaller and larger variations from which you can choose or combine the best ones. This works especially well with more abstract art styles, where lines and colors have more freedom to change while still looking good.
Is the AI researcher that makes all the basic alignment mistakes modelled after Yann LeCun? I recognize the bowtie!
omg
yess more space shiibs!