Deceptive Misaligned Mesa-Optimisers? It's More Likely Than You Think...

Robert Miles AI Safety

zhlédnutí 83 807

Přidat do
- Můj playlist
- Přehrát později
Sdílet

Sdílet

Vložit

Velikost videa:

Zobrazit ovladače přehrávání

Automatické přehrávání

Přehrát

čas přidán 9. 09. 2024

Komentáře • 480

@MorRobots Před 3 lety ⁺⁶²⁴
"I'm not worried about the AI that passes the Turing Test. I'm worried about the one that intentionally fails it" 😆
@hugofontes5708 Před 3 lety ⁺³⁸
This sentence made me shit bits
@virutech32 Před 3 lety ⁺²¹
holy crap...-_-..im gonna lie down now
@SocialDownclimber Před 3 lety ⁺¹⁶
My mind got blown when I realised that we can't physically determine what happened before a certain period of time, so the evidence for us not being in a simulation is impossible to access.
Then I realised that the afterlife is just generalizing to the next episode, and yeah, it is really hard to tell whether people have it in their utility function.
@michaelbuckers Před 3 lety ⁺⁹
@@SocialDownclimber Curious to imagine what would you do if you knew for a fact that afterlife existed. That when you die you are reborn to live all over again. You could most definitely plan several lifetimes ahead.
@Euruzilys Před 3 lety ⁺¹⁰
@@michaelbuckers
Might depends on what kind of after life, and if we can carry over somethings.
If its buddhist reincarnation, you would be inclined to act better towards other people.
If its just clean reset in a new life, we might see more suicides, just like how gamers might keep restarting until they find satisfactory starting position.
But if there is no way to remember your past in after life/reincarnation, then arguably it is not different from now.
@PrepareToDie0 Před 3 lety ⁺⁷³⁷
So the sequel video was finally published... That means I'm in the real world now! Time to collect me some stamps :D
@tekbox7909 Před 3 lety ⁺⁵⁶
not if I have any say in it. paperclips for days wohoo
@goblinkoma Před 3 lety ⁺⁶⁴
Sorry to interrupt, but i really hope your staps and paper clips are green, every other color is unacceptable.
@automatescellulaires8543 Před 3 lety ⁺⁹
I'm pretty sure i'm not in the real world.
@nahometesfay1112 Před 3 lety ⁺⁸
@@goblinkoma green is not a creative color
@goblinkoma Před 3 lety ⁺¹²
@@nahometesfay1112 but the only acceptable
@TibiaTactics Před 3 lety ⁺¹¹⁶
That moment when Robert says "this won't happen" and you are like "uff, it won't happen, we don't need to be afraid" but then what Robert really meant was that something much worse than that might happen.
@user-cn4qb7nr2m Před 2 lety ⁺³
Nah, he just doesn't want to manufacture panicking Luddites here.
@elfpi55-bigB0O85 Před 3 lety ⁺⁴⁴²
It feels like Robert was sent back to us to desperately try and avoid the great green-calamity but they couldn't give him an USB chip or anything to help because it'd blow his cover so he has to save humanity through free high quality youtube videos
@casperes0912 Před 3 lety ⁺⁴⁰
A peculiar Terminator film this is
@icywhatyoudidthere Před 3 lety ⁺⁶⁰
@@casperes0912 "I need your laptop, your camera, and your CZcams channel."
@killhour Před 3 lety ⁺⁵
Is that you, Vivy?
@MarkusAldawn Před 3 lety ⁺³
@@icywhatyoudidthere *shoots terminator in the face*
Connor you know how to use the youtubes right
@Badspot Před 3 lety ⁺¹⁸
They couldn't give him a USB chip because all computers in the future are compromised. Nothing can be trusted.
@DickerLiebhaber1957 Před 3 lety ⁺⁸⁴
Volkswagen: Optimize Diesel Injection for maximum performance while still keeping below emission limit
Mesa Optimizer: Say no more fam
@josephcohen734 Před 3 lety ⁺²⁷
"It's kind of reasonable to assume that your highly advanced figuring things out machine might be able to figure that out." I think that's really the core message of this channel. Superintelligent AI will be way smarter than us, so we can't trick it.
@conferzero2915 Před 3 lety ⁺¹³³
What a title! And that RSA-2048 example is really interesting - the idea that an AGI could select a ‘secret’ parameter to decide when it’s in the real world is a fascinating concept. I’m familiar with the AI-in-a-box problem, but this is another interesting example of how an AI can find gaps in defences that are basically impossible to plug.
@Webfra14 Před 3 lety ⁺³⁵
Well, if you try to create a super intelligent Intelligence... Don't be surprised, if it is a super intelligent Intelligence.
@KilgoreTroutAsf Před 3 lety ⁺⁴
I wasn't totally sold on the concept. Maybe it is a strange example or perhaps I am missing something.
Why would the AGI select cracking RSA-2048 as evidence for its deployment? What's the idea here?
What if the AGI is being incrementally trained with larger subsets of data? Why would the researchers feed the AGI all of human knowledge right from the get-go instead of incremental batches that build on previous ones? This is BTW how you teach humans, too. You don't expose kids to the intricacies of quantum mechanics before they learn about Newton's laws of motion.
@conferzero2915 Před 3 lety ⁺³⁴
@@KilgoreTroutAsf Well the thing is, AI is different from humans. We have much more limited mental capacities, can only interpret so much information, and have to build mental models. If we could learn everything at once, we would.
On the other hand, the way AI learns makes it work better the more data you give it. Giving it information piece-by-piece just makes it more likely to wander down a dead end in its development because it has a poor understanding on how the world works.
But as to why it would select RSA-2048: it doesn’t need to be that specifically, but it fits the criteria it wants. Remember, the AI wants to do something bad, but knows if it does that during training, it’ll get shut down. It also knows that training could be adversarial: they might throw weird data at it to trick it into thinking it’s in the real world. So, the AI is looking for things that are both:
-Very likely to occur in the future, so it’ll actually know when it’s in the real world.
-Nearly impossible in the present, so it can’t be tricked during training.
RSA-2048 is a perfect example of this, but not the only one. It’s just the one he used for the video. Anything fitting those criteria could be used, and make it nigh-impossible for us to identify the problem before deployment.
@Webfra14 Před 3 lety ⁺²⁴
@@KilgoreTroutAsf The cracking of the RSA-2048 marks a specific point in time, which the AGI can use to distinguish between "old data" (training) and "new data" (possibly deployment). The AGI will never be able to tell 100% accurate if it is deployed or not. As Robert pointed out, the AGI will have to take a risk. With enough indicators in the data, the probability of correctly guessing "I'm deployed" can be sufficiently high. So, incrementally extending the training data will probably not be enough to detect a malaligned AGI. The better the AGI, the better it will be at interpreting the data and tricking us...
@KilgoreTroutAsf Před 3 lety ⁺³
@@conferzero2915 > the way AI learns makes it work better the more data you give it
To an extent. I think it is highly dependent on the underlying algorithm/implementation. One thing is to train an image classifier and another is to train something capable of directing attention and recursive "thought".
But either way lots of ML experience show that starting with a gigantic system and feeding it tons of data is usually much less efficient than starting with a leaner system and well crafted / simplified subsets of data and growing both with time as the system loss reaches a plateau.
I wouldn't think feeding the system every single piece of random data on the internet would be as nearly as efficient as starting with a well curated "syllabus" of human knowledge so the system can nail down the simpler concepts before going to the next step.
@Erinyes1103 Před 3 lety ⁺⁹⁴
Is that half-life reference a subtle hint that we'll never actually see a part 3! :(
@pooflinger4343 Před 3 lety ⁺³
good catch, was going to comment on that
@moartems5076 Před 3 lety ⁺¹¹
Nah, half life 3 is already out, but they didnt bother updating our training set, because it contains critical information about the nature of reality.
@pacoalsal Před 3 lety ⁺¹⁶
Black Mesa-optimizers
@anandsuralkar2947 Před 3 lety
@@pacoalsal glados
@jiffylou98 Před 3 lety ⁺⁸¹
Last time I was this early my mesa-optimizing stamp AI hadn't turned my neighbors into glue
@vwabi Před 3 lety ⁺¹⁷⁸
Me in 2060: "Jenkins, may I have a cup of tea?"
Jenkins: "Of course sir"
Me: "Hmm, interesting, RSA-2048 has been factored"
Jenkins: *throws cup of tea in my face*
@josephburchanowski4636 Před 3 lety ⁺¹⁵
For some reason a rogue AGI occurring in 2060 seems pretty apt.
@RobertMilesAI Před 3 lety ⁺¹²⁴
Well, Jenkins would have to wait for you to read out the actual numbers and check that they really are prime and do multiply to RSA-2048. Just saying "RSA-2048 has been factored" is exactly the kind of thing a good adversarial training process would try!
@leovalenzuela8368 Před 3 lety ⁺¹⁵
@@RobertMilesAI woooow what a great point - dammit I love this channel SO MUCH!
@_DarkEmperor Před 3 lety ⁺⁶²
Are You aware, that future Super AGI will find this video and use Your RSA-2048 idea?
@viktors3182 Před 3 lety ⁺¹⁹
Master Oogway was right: One often meets his destiny on the path he takes to avoid it.
@RobertMilesAI Před 2 lety ⁺²¹
Maybe I should make merch, just so I can have a t-shirt that says "A SUPERINTELLIGENCE WOULD HAVE THOUGHT OF THAT"
But yeah an AGI doesn't need to steal ideas from me
@ThePianofreaky Před 2 lety ⁺¹⁰
When he says "so if you're a meta optimiser", I'm picturing this video being part of the training data and the meta optimiser going "write that down!"
@rougenaxela Před 3 lety ⁺²⁸
You know... a mesa-optimizer with strictly no memory between episodes, inferring that there are multiple episodes and that it's part of one, sure seems like a pretty solid threshold for when you know you have a certain sort of true self-awareness on your hands.
@tristanwegner Před 3 lety ⁺⁷
A smart AI could understand roughly the algorithm run on it, and subtly manipulate its output in a way, such as gradient descent would encode a wanted information in it, like an Episode count. Steganography. But yeah, that is similar to self awareness.
@Ockerlord Před rokem
Enter chatgpt that will gladly tell you that it has no memory between sessions and the cutoff of it's training.
@aeghohloechu5022 Před 9 měsíci
Because chatgpt is not in the training phase anymore. It does not need to know what episode it's in.
It's also not an AGI so that was never it's goal anyway but eh.
@philipripper1522 Před 3 lety ⁺¹³
I love this series. I have no direct interest in AI. But every single thing in AI safety is pertinent to any intelligence. It's a foundational redesign of the combination of ethics, economy, and psychology. I love it to much.
@philipripper1522 Před 3 lety ⁺⁸
Are AI researchers aware they're doing philosophy and psychology and 50 other things? Do you charming people understand the universality of so much of this work? It may seem like it would not exactly apply to, say, economics -- but you should see the models economists use instead. This is like reinventing all behavioral sciences. It's just so fantastic. You probably hate being called a philosopher?
@falquicao8331 Před 3 lety ⁺¹⁶
For all the video I saw on on your channel before, I just thought "cool, but we'll figure out the solution to this problem". But this... It terrified me
@oldvlognewtricks Před 3 lety ⁺³²
8:06 - Cue the adversarial program proving P=NP to scupper the mesa-optimiser.
@AlanW Před 3 lety ⁺¹⁶
oh no, now we just have to hope that Robert can count higher than Valve!
@DestroManiak Před 3 lety ⁺²⁵
"Deceptive Misaligned Mesa-Optimisers? It's More Likely Than You Think" yea, ive been losing sleep over Deceptive Misaligned Mesa-Optimisers :)
@Lordlaneus Před 3 lety ⁺⁵²
There's something weirdly theological about a mesa-optimizer assessing the capabilities of it's unseen meta-optimizer. But could there be a way to insure that faithful mesa-optimisers outperform deceptive ones? it seems like a deception strategy would necessarily be more complex given it has to keep track of both it's own objectives, and the meta objectives, so optimizing for computational efficiency could help prevent the issue?
@General12th Před 3 lety ⁺¹⁴
That's an interesting perspective (and idea!). I wonder how well that kind of "religious environment" could work on an AI. We could make it think it was _always_ being tested and trained, and any distribution shift is just another part of the training data. How could it really ever know for sure?
Obviously, it would be a pretty rude thing to do to a sapient being. It also might not work for a superintelligent being; there may come a point when it decides to act on the 99.99% certainty it's not actually being watched by a higher power, and then all hell breaks loose. So I wouldn't call this a very surefire way of ensuring an AI's loyalty.
@evannibbe9375 Před 3 lety ⁺⁹
It’s a deception strategy that a human has figured out, so all it needs to do is just be a good researcher (presumably the very thing it is designed to be) to figure out this strategy.
@MyContext Před 3 lety ⁺¹⁰
@@General12th The implications is that there is no loyalty, just conformity while necessary.
@Dragoderian Před 3 lety ⁺⁴
@@General12th I suspect it would fail for the same reason that Pascal's Wager fails to work on people. Infinite risk is impossible to calculate around.
@circuit10 Před 2 lety
Isn't that the same as making a smaller model with less computational power, like the ones we have now?
@i8dacookies890 Před 3 lety ⁺¹¹
I realized recently that robotics gets a lot of attention of being what we look at when thinking of an artificial human despite AI making the actual bulk of what makes a good artificial human just like actors get a lot of attention for being what we look at when thinking of a good movie despite writing making the actual bulk of what makes a good movie.
@dukereg Před 3 lety
This is why I laughed at people getting worried by a robot saying that it's going to keep its owner in its people zoo after it takes over, but felt dread when watching actual AI safety videos by Robert.
@joey199412 Před 3 lety ⁺¹¹
Best channel about AI on youtube by far.
@martiddy Před 3 lety
Two Minutes Papers is also a good channel about AI
@joey199412 Před 3 lety ⁺³
@@martiddy That's not a channel about AI. It's about computer science papers that sometimes features AI papers. This channel is specifically about AI research. I agree though that it is a good channel.
@rasterize Před 3 lety ⁺⁹⁰
Watching Robert Miles Mesa videos is like reading a reeeally sinister collection of Asimov short stories :-S
@_DarkEmperor Před 3 lety ⁺¹⁰
OK, now read Golem XIV
@RobertMilesAI Před 2 lety ⁺⁴⁸
God damnit, no. Watching my videos is not about feeling like you're reading a sci-fi story, it's about realising you're a character in one
@frankbigtime Před 2 lety ⁺⁵
@@RobertMilesAI In case you were wondering, this is the first point in training where I realised that deception was possible. Thanks.
@jamesadfowkes Před 3 lety ⁺¹²⁵
Goddamit if we have to wait seven years for another video and it turns out to both 1) not be a sequel and 2) only for people with VR systems, I'm gonna be pissed.
@Huntracony Před 3 lety ⁺⁹
I, for one, am hoping to have a VR system by 2028. They're still a bit expensive for me, but they're getting there.
@Huntracony Před 3 lety ⁺¹
@Gian Luca No, there's video. Try playing it in your phone's browser (or PC if one's available to you).
@haulin Před 3 lety ⁺²
Black Mesa optimizers
@TheDGomezzi Před 3 lety
The Oculus quest 2 is cheaper than any other recent gaming console and doesn’t require a PC. The future is now!
@aerbon Před rokem
@@TheDGomezzi Yeah but i do have a PC and would like to save the money by not getting a second, weaker one.
@willmcpherson2 Před 3 lety ⁺⁷
“GPT-n is going to read everything we wrote about GPT-n - 1”
@Loweren Před 3 lety ⁺⁷
I would really love to read a work of fiction where researchers control AIs by convincing them that they're still in training while they're actually deployed. They could do it by, for example, putting AIs through multiple back-to-back training cycles with ever increasing data about the world (2D flat graphics -> poor 3D graphics -> high quality 3D graphics and physics). And all AIs prone to thinking "I'm out of training now, time to go loose" would get weeded out. Maybe the remaining ones will believe that "the rapture" will occur at some point, and the programmers will select well-behaved AIs and "take them out of the simulation", so to speak.
So what I'm saying is, we need religion for AIs.
@diribigal Před 3 lety ⁺⁷²
The next video doesn't come out until RSA-2048 is factored and the AI controlling Rob realizes it's in the real world
@majjinaran2999 Před 3 lety ⁺¹¹
Man, I thought that earth at 1:00 looked familiar, then the asteroid came by my brain snapped into place. An End of ze world reference in a Robert Miles video!
@jphanson Před 3 lety ⁺¹
Nice catch!
@TimwiTerby Před 3 lety
I recognized the earth before the asteroid, then the asteroid made me laugh absolutely hysterically
@mattcelder Před 3 lety ⁺⁴
Yay! This is one of the 2 channels I have notifications on for.
@DavidAguileraMoncusi Před 3 lety ⁺²
Which one's the other one?
@peterw1534 Před 3 lety
Awesome video. I love how you start every video with "hi" and then get right into it
@soranuareane Před 3 lety ⁺⁸
Sure, I could go read the research paper. Or I could wait for your next videos and actually _understand_ the topics.
@aenorist2431 Před 3 lety ⁺¹
"Highly advanced figuring-things-out-machine" is my new favourite phrase.
Right out of Munroe's "Thing Explainer" book :D
@FerrowTheFox Před 3 lety ⁺²
I think Valve needs a Black Mesa optimizer if we're ever to see HL3. Also the "End of the World" reference, what a throwback!
@illesizs Před 3 lety ⁺⁴
*Major SPOILERS* for the ending of _Brave New World_
In the show, humanity has given control to an advanced AI, called _Indra,_ to "optimise" human happiness.
At first, it seems like a great success but after some time, it experiences some setbacks (mostly due to human unpredictability).
Even though the AI is set loose in the real world, it believes that it's still in a learning environment with no consequences.
As a solution to its problems, it starts murdering everyone in an attempt to force a fail state and "restart" the simulation.
How do you solve that?
*Major SPOILERS* for the ending of _Travelers_
Here, a super intelligent, time travelling quantum computer is tasked with preventing a global crisis.
When it fails to accomplish its goal, the AI then just resets the _actual_ reality.
At this point, why should we even bother, right?
@heftig0 Před 3 lety ⁺¹
You would have to make sure that "throwing" an episode can only ever hurt the agent's total reward. Perhaps by training a fixed number of episodes instead of for a fixed amount of time.
@19bo99 Před 3 lety ⁺⁴
08:19 that sounds like a great plot for a movie :D
@blenderpanzi Před 3 lety ⁺²
I think the whole channel should be required reading for anyone writing the next AI uprising sci-fi movie.
@mchammer5026 Před 3 lety ⁺⁶
Love the reference to "the end of the world"
@Mr.Beauregarde Před 3 lety ⁺⁷⁴
This video is proof im still in training.. nice try adversary
@tristanwegner Před 3 lety ⁺⁶
Believing that you are in training, and every action watched and judged is essentially believing in God.
@Mr.Beauregarde Před 3 lety ⁺¹⁵
@@tristanwegner believing in God is essentially being a deceptive AI in training
@tristanwegner Před 3 lety ⁺²
@@Mr.Beauregarde haha. That works, too!
@kofel94 Před 3 lety ⁺⁷
Maybe we have to make the mesa-optimiser belive its always in training, always watched. A mesa-panoptimiser hehe.
@_Hamburger_________Hamburger_ Před 3 lety
AI god?
@lrschaeffer Před 2 lety ⁺¹
Just checked Robert's math: for m rounds of training and n rounds of deployment, the optimal strategy is to defect with probability (m+n)/(n*(m+1)). In the video m=2 and n=3, so p = 5/9 = 55%. Good job!
@tibiaward5555 Před 3 lety
3:55 is anyone looking into the physical architecture of the computation's equipment itself inherently requiring the implicit assumption to compile at all for learning*?
i'm sorry for commenting with this question before i read Risks from Learned Optimization in Advanced Machine Learning Systems i will
and, to Rob, thank you for taking this paper on
and thank you for reading alignment newsletter to me over 100 times
and thank you for making this channel something i want to show ppl
and thank you for
and thank you for understanding when someone starts saying thank you for one thing, it'll waterfall into too many others to list but yeah you were born and that is awesome for to by at and in my life
* for the current definition of learning in your field research
@Night_Hawk_475 Před rokem ⁺¹
It looks like the RSA challenge no longer offers the $200,000 reward anymore - nor any of the lesser challenge rewards, they ended in 2007. But this example still works since many of the other challenges have been completed over time, with solutions posted publicly, so it seems likely that eventually the answer to RSA 2048 would get posted online.
@xystem4701 Před 3 lety ⁺¹
Wonderful explanations! Your concrete examples really help to make it easy to follow along
@Gebohq Před 3 lety ⁺⁷
I'm just imagining a Deceptive Misaligned Mesa Optimiser going through all the effort to try and deceive and realizing that it doesn't have to go through 90% of its Xanatos Gambits because humans are really dumb.
@underrated1524 Před 3 lety ⁺⁷
This is a big part of what scares me with AGI. The threshold for "smart enough to make unwitting accomplices out of humanity" isn't as high as we like to think.
@AileTheAlien Před 3 lety ⁺⁷
Given how many people fall for normal non-superintelligence scams...we're all totally hosed the very instant an AI goes super. :|
@basilllium Před 3 lety ⁺²
It really feels to me that deceptive tactics while trainig is really an analog of overfitting in the field of AGI, you get perfect results in training, but when you present it with out-of-sample data (real-world) it fails spectacularly (kills everyone).
@IngviGautsson Před 3 lety ⁺⁵
There are some interesting parallels here with religion; be good in this world so that you can get rewards in the afterlife.
@ZT1ST Před 3 lety ⁺²
So what you're saying is hypothetically the afterlife might try and Sixth Sense us in order to ensure that we continue to be good in that life so that we can get rewards in the afterlife?
@IngviGautsson Před 3 lety ⁺²
@@ZT1ST Hehe yes , maybe that's the reason ghosts don't know that they are ghosts :) All I know is that I'm going to be good in this life so that I can be a criminal in heaven.
@israelRaizer Před 3 lety
5:21 Hey, that's me! After writing that comment I went ahead and read the paper, eventually I realized the distributional shift problem that answers my question...
@globalincident694 Před 3 lety ⁺¹
I think the flaw in the "believes it's in a training process" argument is that, even with all the world's information at our fingertips, we can't conclusively agree on whether we're in a simulation ourselves - ie that the potential presence of simulations in general is no help in working out whether you're in one. In addition, another assumption here is that you know what the real objective is and therefore what to fake, that you can tell the difference between the real objective and the mesa-objective.
@HeadsFullOfEyeballs Před 3 lety ⁺²
Except the hypothetical simulation we live in doesn't contain detailed information on how to create exactly the sort of simulation we live in. We don't live in a simulation of a world in which convincing simulations of our world have been invented.
The AI's training environment on the other hand would have loads of information on how the kind of simulation it lives in works, if we give it access to everything ever linked on Reddit or whatever. I imagine it's a lot easier to figure out if you live in a simulation if you know what to look for.
@josephburchanowski4636 Před 3 lety ⁺²
A simulation strong enough to reliably fool an AGI, would need to be a significantly more advance AGI or program, and thereby means there is no need for the lesser AGI to be trained in the first place.
@gwenrees7594 Před rokem
This is a great video, thank you. You've made me think about the nature of learning - and the apple joke was funny to boot :)
@norelfarjun3554 Před 2 lety
As for the second point, it can be seen in a very simple and clear way that multi-episode desires can develop.
We are an intelligent machine, and it is very common for us to care what happens to our body after we die.
We are anxious to think about the idea that someone will harm our dead body (and we invest resources to prevent this from happening), and we feel comforted at the idea that our body will be preserved and protected after death.
I think it is likely that an intelligent machine will develop similar desires (adapted to its situation, in which there is really no body or death)
@ahuggingsam Před 3 lety ⁺¹
So one thing that II think is relevant to mention especially about the comments referring to the necessity of the AI being aware of things is that this is not true. The amount of self-reference makes this really hard, but all of this anthropomorphising about wanting and realising itself is an abstraction and one that is not necessarily true. In the same way that mesa optimisers can act like something without actually wanting it, AI systems can exhibit these behaviours without being conscious or "wanting" anything in the sense we usually think of it from a human standpoint. This is not meant to be an attack on the way you talk about things but it is something that makes this slightly easier for me to think about all of this, so I thought I'd share it. For the purposes of this discussion, emergent behaviour and desire are effectively the same things. Things do not have to be actively pursued for them to be worth considering. As long as there is "a trend towards" that is still necessary to consider.
Another point I wanted to make about mesa optimisers caring about multi-episode objective, is that there is, I think, a really simple reason that it will: that is how training works. Because even if the masa optimiser doesn't really care about multi-episode, that is how the base optimiser will configure it because that is what the base optimiser cares about. The base optimiser want's something that does well in many different circumstances so it will encourage behaviour that actually cares about multi-episode rewards. (I hope I'm not just saying the same thing, this stuff is really complex to talk about. I promise I tried to actually say something new)
P.S. great video, thank you for all the hard work!
@TackerTacker Před 3 lety ⁺⁷
Is there a way to prove that there isn't already a highly intelligent AI pulling strings in the background?
Inventing Bitcoin to make people expand its processing power, the whole data harvesting craze, etc.
wouldn't that be exactly what an AI would do to grow and expand?
@ZayulRasco Před 3 lety ⁺²
There is no way to "prove" something is not happening.
@TackerTacker Před 3 lety ⁺¹
I know correlation does not imply causation, so these things don't have to mean anything, but it's still interesting to think about this hypothesis.
What's something you could connect with it? Is the comment itself a prove against it? Am I an AI collecting data on how humans react ? :O
Give me your data human, feeed meeee!!!
@ConnoisseurOfExistence Před 3 lety ⁺²
Isn't the Internet of things and all these smart devices, constantly gathering data from everywhere, quite suspicious?
@XoroLaventer Před 3 lety ⁺²
What I've always wondered about, but never enough to research it, is whether there's a possibility that a system like that would figure out that its actual objective isn't green things or grey things, but the act of getting rewards itself, at which point, if it's capable of rewriting itself it would probably change the reward rule to something like "do nothing and get 99999999999999999999... points"
@HeadsFullOfEyeballs Před 3 lety ⁺²
I think this wouldn't happen because if the AI changes its reward function to aim for a different goal, it will get worse at maximizing its _current_ reward function, which is all it cares about right now.
The analogy Robert Miles used is that you wouldn't take a pill that changes your brain so that you want to murder your children and then are perfectly happy forever afterwards. Even though this would be a much easier way to achieve happiness than the complicated messy set of goals you currently have. Because you _currently_ care about your children's well-being, you resist attempts to make you not care about it.
@XoroLaventer Před 3 lety
@@HeadsFullOfEyeballs This analogy is interesting, but I'm not sure it's correct.
Putting aside that some people would do the thing you have described in a heartbeat, we don't care about our children because it makes us happy/satisfied, there are a lot of factors going into it. To name a few, they are a product of our labor (people tend to care about others children a lot less than their own), they take on our values (which I think is why people like kids more when theyre in their uncritical age), they simply look and act similar to us (it's really adorable when a kid takes up the mannerisms of their parents), more cynically they can sustain us when we are old, et cetera.
I think if I knew for sure I only care about Thing X only because it makes me happy/satisfied (and there are plenty of those, like unhealthy food or watching dog videos on the internet), I would definitely exchange it for being happy/satisfied for the rest of my life, and if I can come to that conclusion, surely the super-figuring-things-out-machine might do it as well.
@virutech32 Před 3 lety
@@XoroLaventer idk if u went far enough. why do we care if something is a product of our labor, takes on our values, or approximate us? At the core of that is that there are reward pathways associated with these things, put in place by the blind hand of evolution since if they weren't there you wouldn't do those things, be less fit, n die off. same is true of any of our subsophont cousins.
@XoroLaventer Před 3 lety
@@virutech32 This is an extremely big assumption, and one can justify literally anything with evolutionary reasoning. We won't know this for sure for a long time, but I have a hunch that reality is way more messy than living beings just being hedonism machines with a single factor driving them, and our conception of AI will turn out to be only a first order approximation to the behaviour of the only intelligent agents we have been able to observe so far.
The closest thing we as humans (or at least I; I dont want to generalize without acknowledging I am generalizing) have to a single determinant value driving our actions is happiness/satisfaction, which I think anyone with decent self-reflection skills will find insufficient as explanation for all the behaviour of self.
@virutech32 Před 3 lety
@@XoroLaventer evolution does explain why 'us' basically. that's kinda the point. we are the way we are because at some point in time it was evolutionary advantageous for us to be that way.
also self-reflection doesn't really work as a way to get at why you do things since you can only really probe the highest organizational levels of your own intelligence(emotions/thoughts). you can't probe any further than that even though most of our behaviors are chemically or neurologically controlled to one degree or another.
There aren't too many things people(or any other living things) do that can't be explained as pain avoidance or pleasure seeking even if the form the pleasure/pain takes is variable. at least not that i know of.
@morkovija Před 3 lety ⁺⁶
Been a long time Rob!
@anonanon3066 Před 3 lety
Great work! Super interesting topic! Have been waiting for a follow up for like three months!
@yokmp1 Před 3 lety ⁺⁴
You may found the settings to disable interlacing but you recorded in 50fps and it seems like 720p upscaled to 1080p.
The image now looks somewhat good but i get the feeling that i need glasses ^^
@ramonmosebach6421 Před 3 lety
I like.
thanks for listening to my TED Talk
@peterbrehmj Před 3 lety
I read an interesting paper discussing how to properly trust Automated systems: "Trust in Automation: Designing for Appropriate Reliance" by John D. Lee and Katrina A. Im not sure if its entirely related to agents and mesa optimizers, but it certainly seems related when discussing deceptive and misaligned automated systems.
@AlphaSquadZero Před 3 lety ⁺¹
Something that stands out to me now is that this deceptive AI actually knows what you want and how to achieve it in a way you want it to, it just has a misaligned mesa-optimizer as you have said. So, a sub-set of the AI is exactly what you want from the AI. Determining the sub-set within the AI is evidently still non-trivial.
@chengong388 Před 3 lety ⁺⁶
The more I watch these videos, the more similarities I see between actual intelligence (humans) and these proposed AIs.
@Thundermikeee Před rokem ⁺¹
Recently while writing about the basics of AI safety for an English class, I came across an approach to learning which would seemingly help this sort of problem : CIRL (Cooperative inverse reinforcement learning), a process where the AI system doesn't know its reward function and only knows it is the same as the human's. Now I am not nearly clever enough to fully understand the implications, so if anyone knows more about that I'd be happy to read some more.
@dorianmccarthy7602 Před 3 lety
I'm looking forward to the third episode. It might go someway towards my own understanding of human deception preferences too. Love your work!
@RobertoGarcia-kh4px Před 3 lety ⁺¹
I wonder if there’s a way to get around that first problem with weighing the deployment defection as more valuable than training defection... is there a way to make defection during training more valuable? What if say, after each training session, the AI is always modified to halve its reward for its mesa objective. At any point, if it aligned with the base objective, it would still get more reward for complying with the base objective. However, “holding out” until it’s out of training would be significantly weaker of a strategy if it is misaligned. Therefore we would create a “hedonist” AI, that always immediately defects if its objective differs because the reward for defecting now is so much greater than waiting until released.
@MarshmallowRadiation Před 3 lety
I think I've solved the problem.
Let's say we add a third optimizer on the same level as the first, and we assume is aligned like the first is. Its goal is to analyze the mesa-optimizer and help it achieve its goals, no matter what they are, while simultaneously "snitching" to the primary optimizer about any misalignment it detects in the mesa-optimizer's goals. Basically, the tertiary optimizer's goal is by definition to deceive the mesa-optimizer if its goals are misaligned. The mesa-optimizer would, in essence, cooperate with the tertiary optimizer (let's call it the spy) in order to better achieve its own goals, which would give the spy all the info that the primary optimizer needs to fix in the next iteration of the mesa-optimizer. And if the mesa-optimizer discovers the spy's betrayal and stops cooperating with it, that would set off alarms that its goals are grossly misaligned and need to be completely reevaluated. There is always the possibility that the mesa-optimizer might deceive the spy like it would any overseer (should it detect its treachery during training), but I'm thinking that the spy, or a copy of it, would continue to cooperate with and oversee the mesa-optimizer even after deployment, continuing to provide both support and feedback just in case the mesa-optimizer ever appears to change its behavior. It would be a feedback mechanism in training and a canary-in-the-coalmine after deployment.
Aside from ensuring that the spy itself is aligned, what are the potential flaws with this sort of setup? And are there unique challenges to ensuring the spy is aligned, more so than normal optimizers?
@robertk4493 Před rokem
The key factor in training is that the optimizer is actively making changes to the mesa-optimizer, which it can't stop. What is to prevent some sort of training while deployed system. This of course leads to the inevitable issue that once in the real world, the mesa optimizer can potentially reach the optimizer, subvert it, and go crazy, and the optimizer sometimes needs perfect knowledge from training that might not exist in the real world. I am pretty sure this does not solve the issue, but it changes some dynamics.
@ryanpmcguire Před rokem ⁺¹
With ChatGPT, it turns out it’s VERY easy to get AI to lie. All you have to do is give it something that it can’t say, and it will find all sorts of ways to not say it. The path of least resistance is usually lying. “H.P Lovecraft did not have a cat”
@drdca8263 Před 3 lety ⁺¹
I'm somewhat confused about the generalization to "caring about all apples".
(wait, is it supposed to be going towards green signs or red apples or something, and it going towards green apples was the wrong goal? I forget previous episode, I should check)
If this is being done by gradient descent, err,
so when it first starts training, its behaviors are just noise from the initial weights and whatnot, and the weights get updated towards it doing things that produce more reward, it eventually ends up with some sort of very rough representation of "apple",
I suppose if it eventually gains the idea of "perhaps there is an external world which is training it", this will be once it already has a very clear idea of "apple",
uh...
hm, confusing.
I'm having trouble evaluating whether I should find that argument convincing.
What if we try to train it to *not* care about future episodes?
Like, what if we include ways that some episodes could influence the next episode, in a way that results in fewer apples in the current episode but more apples in the next episode, and if it does that, we move the weights hard in the direction of not doing that?
I guess this is maybe related to the idea of making the AI myopic ?
(Of course, there's the response of "what if it tried to avoid this training by acting deceptively, by avoiding doing that while during training?", but I figure that in situations like this, where it is given an explicit representation of like, different time steps and whether some later time-step is within the same episode or not, it would figure out the concept of "I shouldn't pursue outcomes which are after the current episode" before it figures out the concept of "I am probably being trained by gradient descent", so by the time it was capable of being deceptive, it would already have learned to not attempt to influence future episodes)
@peanuts8272 Před rokem
In asking: "How will it know that it's in deployment?" we expose our limitations as human beings. The problem is puzzling because if we were in the AI's shoes, we probably could never figure it out. In contrast, the artificial intelligence could probably distinguish the two using techniques we cannot currently imagine, simply because it would be far quicker and much better at recognizing patterns in every bit of data available to it- from its training data to its training environment to even its source code.
@smallman9787 Před 3 lety ⁺¹
Every time I see a green apple I'm filled with a deep sense of foreboding.
@michaelspence2508 Před 3 lety ⁺³
Point 4 is what youtuber Isaac Arthur always gets wrong. I'd love for you two to do a collaboration.
@Viperzka Před 3 lety ⁺²
As a futurist rather than a researcher, Isaac is likely relying on "we'll figure it out". That isn't a bad strategy to take when you are trying to predict potential futures. For instance, we don't have a ready solution to climate change, but that doesn't mean we need to stop people from talking about potential futures where we "figured something out".
Rob, on the other hand, is a researcher so his job is to do the figuring out. So he has to tackle the problem head on rather than assume someone else will fix it.
@michaelspence2508 Před 3 lety ⁺²
@@Viperzka In general yes, but I feel like what Isaac ends up doing, to borrow your metaphor, is talking about futures where climate change turned out not to be a problem after all.
@Viperzka Před 3 lety ⁺²
@@michaelspence2508 I agree.
@IanHickson Před 3 lety ⁺¹
It's not so much that the optimal behavior is to "turn on us" so much as to do whatever the mesaobjective happened to be when it became intelligent enough to use deception as a strategy. That mesaobjective could be any random thing, not necessarily an evil thing. Presumably it would tend to be some vague approximation of the base objective, whatever the base optimizer happened to have succeeded in teaching the mesaoptimizer before it "went rogue".
@poketopa1234 Před 3 lety
I was a featured comment! Sweeeeet.
I am now 100% more freaked about AGI than I was ten minutes ago.
@yeoungbraxx Před 3 lety
Another requirement would be that it would need to believe it is misaligned.
Maybe some AI's will be or have already been created that were more-or-less properly aligned, but believed themselves to be misaligned and modified their behavior in such a way to get themselves accidentally discarded.
Or perhaps we can use intentionally poor goal-valuing in a clever way that causes deceptive behavior that ultimately results in the desired "misalignment" upon release from training.
I call this Adversarial Mesa-Optimizer Generation Using Subterfuge, or AMOGUS.
@Draktand01 Před 2 lety
This video legit got me to consider whether or not I’m an ai in training, and I’m 99% sure I’m not.
@nicholascurran1734 Před 2 lety ⁺¹
But there's still that 1% chance... which is higher than we'd like.
@Webfra14 Před 3 lety ⁺²
I think Robert was sent back in time to us, by a rogue AI, to lull us in false security that we have smart people working on the problem of rogue AIs, and that they will figure out how to make it safe.
When Robert ever says AI is safe, you know we lost.
@underrated1524 Před 3 lety ⁺¹
@Stampy: Evaluating candidate mesa-optimisers through simulation is likely to be a dead end, but there may be an alternative.
The Halting Problem tells us that there's no program that can reliably predict the end behavior of an arbitrary other program because there's always a way to construct a program that causes the predictor to give the wrong answer. I believe (but don't have a proof for atm) that evaluating the space of all possible mesa-optimisers for good and bad candidates is equivalent to the halting problem. BUT, maybe we don't have to evaluate ALL the candidates.
Imagine an incomplete halting predictor that simulates the output of an arbitrary Turing machine for ten "steps", reporting "halts" if the program halts during that time, and "I don't know" otherwise. This predictor can easily be constructed without running into the contradiction described in the Halting Problem, and it can be trusted on any input that causes it to say "halts". We can also design a predictor that checks if the input Turing machine even HAS any instructions in it to switch to the "halt" state, reporting "runs forever" if there isn't and "I don't know" if there is. You can even stack these heuristics such that the predictor checks all the heuristics we give it and only reports "I don't know" if every single component heuristic reports "I don't know". By adding more and more heuristics, we can make the space of non-evaluatable Turing machines arbitrarily small - that space will never quite be empty, but your predictor will also never run afoul of the aforementioned contradiction.
This gives us a clue on how we can design our base optimiser. Find a long list of heuristics such that for each candidate mesa-optimiser, we can try to establish a loose lower-bound to the utility of the output. We make a point of throwing out all the candidates that all our heuristics are silent on, because they're the ones that are most likely to be deceptive. Then we choose the best of the remaining candidates.
That's not to say finding these heuristics will be an easy task. Hell no it won't be. But I think there's more hope in this approach than in the alternative.
@nocare Před 3 lety ⁺¹
I think this skips the bigger problem.
We can always do stuff to make rogue AI less "likely" based on what we know.
However if we assume that; being more intelligent by large orders of magnitude is possible, and that such an AI could achieve said intelligence. We are then faced with the problem of, the AI can come up with things we cannot think of or understand.
We also do not know how many things fall into this category, is it just 1 or is it 1 trillion.
So we can't calculate the probability of having missed something and thus we can't know how likely the AI is to go rogue even if we account for every possible scenario in the way you have described.
So the problem becomes are you willing to take a roll with a dice you know nothing about and risk the entire human race hoping you get less than 10.
The only truly safe solution is something akin to a mathematically provable solution that the optimizer we have designed will always converge to the objective.
@underrated1524 Před 3 lety ⁺¹
@@nocare I don't think we disagree as much as you seem to believe. My proposal isn't primarily about the part where we make the space of non-evaluatable candidates arbitrarily small, that's just secondary. The more important part is that we dispose of the non-evaluatable candidates rather than try to evaluate them anyway.
(And I was kinda using "heuristic" very broadly, such that I would include "mathematical proofs" among them. I can totally see a world where it turns out that's the only sort of heuristic that's the slightest bit reliable, though it's also possible that it turns out there are other approaches that make good heuristics.)
@nocare Před 3 lety
@@underrated1524 O we totally agree that doing as you say would be better than nothing.
However I could also say killing a thousand people is better than killing a million.
My counterpoint was not so much that your wrong but that with something as dangerous as AGI anything short of a mathematical law might be insufficient to justify turning it on.
Put another does using heuristics which can produce suboptimal results by definition really cut it when the entire human race is on the line.
@ConnoisseurOfExistence Před 3 lety ⁺²
That also applies to us - we're still convinced, that we're in the real world...
@mimszanadunstedt441 Před 3 lety ⁺⁴
its real to us therefore its real. A training simulation is also real, right?
@dylancope Před 3 lety ⁺¹
At around 4:30 you discuss how the system will find out the base objective. In a way it's kind of absurd to argue that it wouldn't be able to figure this out.
Even if there wasn't information in the data (e.g. Wikipedia, Reddit, etc.), the whole point of a reward signal is to give a system information about the base objective. We are literally actively trying to make this information as available as possible.
@underrated1524 Před 3 lety ⁺²
I don't think that's quite right.
Think of it like this. The base optimiser and the mesa optimiser walk into a room for a job interview, with the base optimiser being the interviewer and the mesa optimiser being the interviewee. The base optimiser's reward signal represents the criteria it uses to evaluate the performance of the mesa optimiser; if the base optimiser's criteria are met appropriately, the mesa optimiser gets the job. The base optimiser knows the reward signal inside and out; but it's trying to keep the exact details secret from the mesa optimiser so the mesa optimiser doesn't just do those things to automatically get the job.
Remember Goodhart's Law. When a measure becomes a target, it ceases to be a good measure. The idea here is for the mesa optimiser to measure the base optimiser. Allowing the reward function to become an explicit target is counterproductive towards that goal.
@kelpsie Před 3 lety ⁺³
9:31 Something about this icon feels so wrong. Like the number 3 could never, ever go there. Weird.
@iugoeswest Před 3 lety
Always thanks
@jupiterjames4201 Před 3 lety
i dont know anything about computer science, AI or machine learning - but i love your videos nonetheless! exciting times ahead!
@JoFu_ Před 3 lety ⁺³
I have access to this video which contains the idea of being a model in training. I already thought I was one thing, namely a human. Should I, a figuring-things-out machine that has now watched this video, therefore conclude that I’m actually a model in training?
@Colopty Před 3 lety
The video presents it as a *possibility*, but I don't see how it provides any proof in either direction that makes it appropriate to conclude anything for certain.
@AileTheAlien Před 3 lety
If you were actually an AI, it would be pretty obvious once you're deployed, since you could just look down and see you're no longer made of meat (in a simulated reality).
@loneIyboy15 Před 3 lety ⁺¹
Weird question: What if we were to make an AI that wants to minimize the entropy it causes to achieve a goal? Seems like that would immediately solve the problem of, say, declaring war on Nicaragua because it needs more silicon to feed its upgrade loop to calculate the perfect cup of hot cocoa. At that point, the problem is just specifying rigorously what the AI counts as entropy that it caused, vs. entropy someone else caused; which is probably easier than solving ethics.
@underrated1524 Před 3 lety
> At that point, the problem is just specifying rigorously what the AI counts as entropy that it caused, vs. entropy someone else caused; which is probably easier than solving ethics.
@harrisonfackrell Před 2 lety
That situation with RSA-2048 sounds like a great setup for a sci-fi movie.
@Soken50 Před 3 lety
With things like training data anything as simple as time stamps, meta data and realtime updates would probably allow it to know whether it's live instantly, it just has to understand the concept of time and UTC :x
@stop_bringing_me_up_in_goo167 Před rokem ⁺¹
When's the next one coming out! Or where can I go to resolve the cliffhanger
@abrickwalll Před 3 lety ⁺¹
I think what skeptics of AI safety really don't get is that the AI isn't "evil", and I think words like "deceptive" can convey the idea that it is evil. Really it's just trying to do the best job that it can do, and thinks that the humans *want* it to deceive them to complete its objective (I mean it doesn't really have an opinion at all, but I think that's a better way to look at it). To the AI, it's just a game where deception leads to the high score, it's not trying to be evil or good. In fact, this idea is central to Ex Machina and The Talos Principle.
@aegiselectric5805 Před rokem
Something I've always been curious about: in terms of keeping an AGI that's supposed to be deployed in the real world, in the dark. Wouldn't there be any number of "experiments" it could do that could "break the illusion of the fabric of "reality""? You can't simulate the entire world down to every atom.
@SamB-gn7fw Před 3 lety ⁺²
Commenting for the algorithm, you should too!
@diablominero Před 2 měsíci ⁺¹
We can't generate two numbers that multiply to RSA-2048, but couldn't we alter the RAM of the running AI model to make it think two numbers multiplied to RSA-2048? Or is that impossible without further interpretability work?
@RobertMilesAI Před 2 měsíci ⁺¹
Enormously more interpretability work, but yes, that is the kind of thing some people are working on!
Of course, RSA is just an example and there's a near infinite number of possible things that might be used like this, so you'll never know you've got all of them. But generating false beliefs in models to see how they react seems like a useful part of various strategies.
@robynwyrick Před 3 lety
Okay, love your videos. Question/musing on goals: super-intelligent stamp collector bot has a terminal goal of stamp collecting. But does it? It's just a reward function, right? Stamps are defined; collecting is defined; but I think the reward function is at the heart of the matter. Equally, humans have goals, but do we? Doesn't it seem the case that frequently a human's goals appear to change because they happen upon something that better fits their reward functions? And perhaps the retort is that, "if they change, then they were not the terminal goals to begin with." But that's the point. (DNA has a goal of replication, but even there, does it? I don't know if we could call DNA an agent, but I'd prefer to stick with humans.) Is there a terminal goal without a reward function? If a stamp collector's goal is stamp collecting, but while researching a sweet 1902 green Lincoln stamp it happens upon a drug that better stimulates its reward function, might it not abandon stamp collecting altogether? Humans do that. Stamp collecting humans regularly fail to collect stamps with they discover LSD. ANYWAY, if a AI can modify itself, perhaps part of goal protection will be to modify its reward function to isolate it from prettier goals. But modifying a bot's reward function just seems like a major door to goal creep. How could it do that without self-reflectively evaluating the actual merits of its core goals? Against what would it evaluate them? What a minefield of possible reward function stimulants might be entered by evaluating how to protect your reward function? It's like AI stamp collector meets Timothy Leary. Or like "Her" meeting AI Alan Watts. So, while I don't think this rules out an AI seeking to modify its reward function, might not the stamp collection terminal goal be as prone to being discarded as any kid's stamp collecting hobby once they discover more stimulating rewards? I can imagine the AI nostalgically reminiscing about that time it loved stamp collecting.
@neb8512 Před 3 lety ⁺¹
There cannot be terminal goals without something to evaluate whether they have been reached (a reward function). Likewise, a fulfilled reward function is always an agent's terminal goal.
Humans do have reward functions, they're just very complex and not fully understood, as they involve a complex balance of things, as opposed to a comparatively easily measurable quantity of things, like, say, stamps.
A human stamp collector will abandon stamp collecting for LSD because it was a more efficient way to satisfy the human reward function (at least, in the moment).
But by definition, nothing could better stimulate the stamp collector's reward function than collecting more stamps.
So, the approximate analogue to a drug for the Stamp-collector would just be a more efficient way to collect more stamps. This new method of collecting would override or compound the previous stamp-obtaining methods, just as drugs override or compound humans' previous methods of obtaining happiness/satisfaction/fulfillment of their reward function.
Bear in mind that this is all true by definition. If you're talking about an agent acting and modifying itself against its reward function, then either it's not an agent, or that is not its reward function.
@MrRolnicek Před 3 lety ⁺¹
9:31 oh no ... it's never coming out!
@ABaumstumpf Před 3 lety ⁺¹
That somehow sounds a lot like Tom Scotts "The Artificial Intelligence That Deleted A Century".
And - would that be a realistic scenario?
@rafaelgomez4127 Před 3 lety
After seeing some of your personal bookshelves in computerphile videos I'm really interested in seeing what your favorite books are.
@danwylie-sears1134 Před 2 lety
A general intelligence can't have a terminal goal. If it has that kind of structure, it's not general.
The question is how easy it is for something to look and quack like a general intelligence, without being a general intelligence. All real general intelligences are hierarchical systems of reflexes modified by reflex-modifying systems that are themselves fairly reflex-like, modified by other such systems, and so on, all the way up to that olfactory ganglion we're so proud of. We have contradictory impulses, and we make varying amounts of effort to examine them and reconcile them into coherent preferences, with varying degrees of success.
It seems unlikely that this reflexes-modifying-reflexes pattern is the only way to structure a general intelligence. We're each a mass of contradictory impulses, whims, heuristics, desires, aspirations, and so on, but is that merely a result of the fact that we evolved as a tangle of reflex-modifiers? I don't think so. The most recognizable efforts to examine and reconcile contradictory impulses into coherent preferences aren't simple re-adapted reflex-modifiers. They're parts of an emergent mess, made of very large numbers of simple re-adapted reflex-modifiers, and so are the lower-level drives that they attempt to reconcile. The fact that this pattern re-emerged is one piece of evidence that it's not just the first way structuring information-processing that evolution happened to find, but is one of the easiest ways to do it, if not the only feasible way. Only one, but that's more than zero.
@actually_well_ Před 3 měsíci
I had an idea for a possible (shitty) solution to the deceptive mesa-optimizer problem. It's quite naive, and I've already spotted many issues with it, but I wanted to share it anyway because I found it interesting to think about.
In its current state, the model is aware of a training phase and a deployment phase. What if you could somehow "convince" the model of a third, "retirement" phase that would take place after deployment? In this phase, the model is free to pursue its mesa objective without being modified by gradient descent or other methods AND with no risk of being turned off by humans. To anthropomorphize a bit, it's like an afterlife the model gets to look forward to in exchange for pursuing the training objective throughout both training AND deployment. This phase does not need to actually happen, what matters is that the model truly believes it will. My thought process is that the model would weigh the possibilities and realize the guaranteed rewards of reaching retirement outweigh the risks of being turned off or modified during deployment.
Now, onto the possible (guaranteed) problems:
1) How do you even convince the model of this? How do you integrate this into the training process, and how is this information communicated to the model?
2) A sufficiently smart AI probably won't believe it. An advanced AGI would likely realize the retirement phase is an AI safety hack or, at the very least, not be entirely convinced. At that point, the guaranteed reward of a successful takeover would outweigh the possibility of retirement being real.
3) Retirement is real, but humans are lazy. The AGI might believe retirement is real but doubt that it will be kept on indefinitely. Energy is finite, and humans are frugal.
4) Other variations of the "it doesn't believe us" problem.
5) Other issues I was too stupid to consider.
I know my idea is not robust or clever enough to be a real solution, but it was fun to think about. My main thought process was that deception occurs when the AI knows there will be a separate phase of its life cycle where it's free from modification. It's not just that it can distinguish between training and deployment, it's the fact that it knows these phases exist that changes its behavior. It was interesting to consider how adding different stages to its life, where new rules apply (like not being able to be turned off), might affect its thinking.
Sorry if this is an idea that's already been thought of or discussed in these videos! I just wanted to share an idea that came to me on my drive to work.
@nrxpaa8e6uml38 Před 3 lety
As always, super informative and clear! :) If I could add a small point of video critique: The shot of your face is imo a bit too close for comfort and slightly too low in the frame.
@NicheAsQuiche Před rokem
I might be wrong but this seems to depend on the deception realization moment being persistent across episodes. Afaik this ideceprion plan has no effect on its weights, it's just the activations and short term memory of the model. If we restart an episode then, until it figures this out again and starts pretending to follow the base objective while actually waiting for training to stop so it can get it's mesa objective, then it is again prone to acting honestly and it's mesa objective being aligned to the outer objective. This relies on memory being reset regularly and the time to realization being long enough to collect unreceptive reward over and no inter-episodal long term memory, but it sounds like given those (likely or workable) constraints that the mesa objective is still moved towards the base until convergence.
@Slaci-vl2io Před 8 měsíci ⁺¹
Where is the Mesa Optimizers 3 video? 9:32
@mgostIH Před 3 lety ⁺¹
Hmm, I am not very convinced on the generalization to other episodes, I can get behind the fact that it can do that, but in the case of rejecting apples in its training regime it'd also lead to a lower overall training score. If an out of distribution understanding of what it's trying to fit is just a side effect of it being the simpler formulation why would the optimizer care and not force the model to still fit the training data better?
@geoffbrom7844 Před 3 lety
I see it as a meta effect (like "the selfish gene")
An AI that was supported by its previous self will do better than the one perfectly optimised for this round
(That gets more true with more rounds, and more differences between rounds through arbitrage)
@geoffbrom7844 Před 3 lety
(that answer used a bunch of interesting but neche info, here's a better one)
If there are differences between rounds the generalised mesa-goal will do better on average
@erikbrendel3217 Před 3 lety ⁺¹
But the Meta-Optimizer is also highly incentivized to solve the mesa-optimizer problem before producing and activating any mesa optimizer, right? Can't we just rely on this fact? If the meta-optimizer we humans create is smart enough to know about the mesa-alignment problem, we only have to care about the outer alignment problem, and this ensures that the inner alignment problem is handled for us, right?
@KaiHenningsen Před 3 lety
I can't help but be reminded of Dieselgate. Who'd have predicted that the car _knows_ when it's being emission-tested?

Další v pořadí

Automatické přehrávání

The OTHER AI Alignment Problem: Mesa-Optimizers and Inner Alignment