We Were Right! Real Inner Misalignment

  • Added 9 Oct 2021
  • Researchers ran real versions of the thought experiments in the 'Mesa-Optimisers' videos!
    What they found won't shock you (if you've been paying attention)
    Previous videos on the subject:
    The OTHER AI Alignment Problem: Mesa-Optimizers and Inner Alignment: • The OTHER AI Alignment...
    Deceptive Misaligned Mesa-Optimisers? It's More Likely Than You Think...: • Deceptive Misaligned M...
    The Paper: arxiv.org/abs/2105.14111
    The Interpretability Article: distill.pub/2020/understandin...
    Jacob Hilton's thoughts about what's going on: www.alignmentforum.org/posts/...
    AI Safety Camp: aisafety.camp/
    With thanks to my wonderful Patrons at / robertskmiles :
    - Gladamas
    - Timothy Lillicrap
    - Kieryn
    - AxisAngles
    - James
    - Jake Fish
    - Scott Worley
    - James Kirkland
    - James E. Petts
    - Chad Jones
    - Shevis Johnson
    - JJ Hepboin
    - Pedro A Ortega
    - Clemens Arbesser
    - Said Polat
    - Chris Canal
    - Jake Ehrlich
    - Kellen lask
    - Francisco Tolmasky
    - Michael Andregg
    - David Reid
    - Peter Rolf
    - Teague Lasser
    - Andrew Blackledge
    - Brad Brookshire
    - Cam MacFarlane
    - Craig Mederios
    - Jon Wright
    - CaptObvious
    - Brian Lonergan
    - Girish Sastry
    - Jason Hise
    - Phil Moyer
    - Erik de Bruijn
    - Alec Johnson
    - Ludwig Schubert
    - Eric James
    - Matheson Bayley
    - Qeith Wreid
    - jugettje dutchking
    - James Hinchcliffe
    - Atzin Espino-Murnane
    - Carsten Milkau
    - Jacob Van Buren
    - Jonatan R
    - Ingvi Gautsson
    - Michael Greve
    - Tom O'Connor
    - Laura Olds
    - Jon Halliday
    - Paul Hobbs
    - Jeroen De Dauw
    - Cooper Lawton
    - Tim Neilson
    - Eric Scammell
    - Igor Keller
    - Ben Glanton
    - Tor Barstad
    - Duncan Orr
    - Will Glynn
    - Tyler Herrmann
    - Ian Munro
    - Jérôme Beaulieu
    - Nathan Fish
    - Peter Hozák
    - Taras Bobrovytsky
    - Jeremy
    - Vaskó Richárd
    - Benjamin Watkin
    - Andrew Harcourt
    - Luc Ritchie
    - Nicholas Guyett
    - 12tone
    - Oliver Habryka
    - Chris Beacham
    - Nikita Kiriy
    - Andrew Schreiber
    - Steve Trambert
    - Braden Tisdale
    - Abigail Novick
    - Serge Var
    - Mink
    - Chris Rimmer
    - Edmund Fokschaner
    - April Clark
    - J
    - Nate Gardner
    - John Aslanides
    - Mara
    - ErikBln
    - DragonSheep
    - Richard Newcombe
    - Joshua Michel
    - P
    - Alex Doroff
    - BlankProgram
    - Richard
    - David Morgan
    - Fionn
    - Dmitri Afanasjev
    - Marcel Ward
    - Andrew Weir
    - Kabs
    - Ammar Mousali
    - Miłosz Wierzbicki
    - Tendayi Mawushe
    - Wr4thon
    - Martin Ottosen
    - Andy K
    - Kees
    - Darko Sperac
    - Robert Valdimarsson
    - Marco Tiraboschi
    - Michael Kuhinica
    - Fraser Cain
    - Robin Scharf
    - Klemen Slavic
    - Patrick Henderson
    - Hendrik
    - Daniel Munter
    - Alex Knauth
    - Kasper
    - Ian Reyes
    - James Fowkes
    - Tom Sayer
    - Len
    - Alan Bandurka
    - Ben H
    - Simon Pilkington
    - Daniel Kokotajlo
    - Yuchong Li
    - Diagon
    - Andreas Blomqvist
    - Iras
    - Qwijibo (James)
    - Zubin Madon
    - Zannheim
    - Daniel Eickhardt
    - lyon549
    - 14zRobot
    - Ivan
    - Jason Cherry
    - Igor (Kerogi) Kostenko
    - ib_
    - Thomas Dingemanse
    - Stuart Alldritt
    - Alexander Brown
    - Devon Bernard
    - Ted Stokes
    - Jesper Andersson
    - DeepFriedJif
    - Chris Dinant
    - Raphaël Lévy
    - Johannes Walter
    - Matt Stanton
    - Garrett Maring
    - Anthony Chiu
    - Ghaith Tarawneh
    - Julian Schulz
    - Stellated Hexahedron
    - Caleb
    - Clay Upton
    - Conor Comiconor
    - Michael Roeschter
    - Georg Grass
    - Isak Renström
    - Matthias Hölzl
    - Jim Renney
    - Edison Franklin
    - Piers Calderwood
    - Mikhail Tikhomirov
    - Matt Brauer
    - Mateusz Krzaczek
    - Artem Honcharov
    - Tomasz Gliniecki
    - Mihaly Barasz
    - Mark Woodward
    - Ranzear
    - Neil Palmere
    - Rajeen Nabid
    - Clark Schaefer
    - Olivier Coutu
    - Iestyn bleasdale-shepherd
    - MojoExMachina
    - Marek Belski
    - Luke Peterson
    - Eric Rogstad
    - Eric Carlson
    - Caleb Larson
    - Max Chiswick
    - Aron
    - Sam Freedo
    - slindenau
    - Johannes Lindmark
    - Nicholas Turner
    - Intensifier
    - Valerio Galieni
    - FJannis
    - Grant Parks
    - Ryan W Ammons
    - This person's name is too hard to pronounce
    - contalloomlegs
    - Everardo González Ávalos
    - Knut Løklingholm
    - Andrew McKnight
    - Andrei Trifonov
    - Aleks D
    - Mutual Information
    - Tim
    - A Socialist Hobgoblin
    - Bren Ehnebuske
    - Martin Frassek
    - Sven Drebitz
    - Quabl
    - Valentin Mocanu
    - Philip Crawford
    - Matthew Shinkle
    - Robby Gottesman
    - Juanchi
    / robertskmiles
  • Science & Technology

Comments • 1.4K

  • @llucos100 2 years ago +1816

    Turns out the Terminator wasn’t programmed to kill Sarah Connor after all, it just wanted clothes, boots and a motorcycle.

    • @Alorand 2 years ago +195

      And ended up becoming the governor of California instead...

    • @spejic1 2 years ago +334

      @@Alorand Becoming governor of California gets you MANY clothes, boots, and motorcycles.

    • @sevdev9844 2 years ago +10

      Or making John Connor into a boyfriend. (You might think of Arnie when Terminator comes up, I think of Summer aka Cameron)

    • @Saka_Mulia 2 years ago +45

      That's Terminator goals ... not terminal ... oh never mind ... I get it

    • @quitequiet5281 2 years ago +5

      LOL Yup... in retrospect with this paper... the terminator was a pursue bot... driving a threat variable towards the development and improvements of a General Artificial Intelligence and look at all the upgrades that series of pursuit bots facilitated.
      LOL

  • @vwabi 2 years ago +2356

    AI safety researchers are absolutely the last people on earth you want to hear "We were right" from.

    • @madshorn5826 2 years ago +149

      And climatologists.

    • @Laszer271 2 years ago +28

      @@madshorn5826 Nah, an epidemic can destroy the world in months, climate change can in decades. Superintelligent AI could probably destroy it before lunch :P

    • @donaldhobson8873 2 years ago +160

      What about "we were totally wrong, the problem is much worse than we thought it was."

    • @madshorn5826 2 years ago +6

      @@Laszer271
      Well, destroyed is destroyed.
      Or are you the type not bothering with insurance and health check ups because a hypothetical bullet to the brain would rather quickly render those precautions moot?

    • @Laszer271 2 years ago +18

      @@madshorn5826 fair enough. It was all a joke though. But in your example, I still think "I just got a bullet to the brain" is worse than "I just got diagnosed with cancer". Maybe bullet is less likely, sure, but we were talking about the time that the danger was already proven, right? I think it's plausible that probability of my survival is greater conditioned on "we were right" statement being made by epidemiologist, climatologist or oncologist than it is conditioned on the same statement made by AI safety expert or like bullet...ologist.

  • @ShankarSivarajan 2 years ago +790

    10:54 "It actually wants something else, and it's capable enough to get it."
    Yeah, that _is_ worse.

    • @Encysted 2 years ago +61

      The AI *does* in fact know how to drive a car, and it never really learned not to hit people.

    • @Rotem_S 2 years ago +14

      @@Encysted or it learned how not to hit people, but hits them whenever there are no witnesses because it only cares about turning right

    • @InfinityOrNone 2 years ago +39

      @@Rotem_S
      Or it learned not to hit people because it really cared about maintaining the present state of the paint job, which was white in the training environment. But the deployment environment uses a _red_ car.

    • @InfinityOrNone 2 years ago +3

      @@Rotem_S
      Wow, your user name confuses the comments section.

    • @xelspeth 2 years ago +6

      @@InfinityOrNone It doesn't. It just displays in the correct (right-to-left) reading direction that Hebrew uses.

  • @unvergebeneid 2 years ago +1081

    Famous last words for species right before they hit the great filter: "Yo, in the test runs, did paperclips max out on the positive attribution heat map, too?"

    • @michaelpapadopoulos6054 2 years ago +113

      There are so many layers to this comment and I love it.

    • @underrated1524 2 years ago +176

      I keep hearing the notion of AI being the great filter, but I can't say I buy it.
      Not that AGI isn't an existential threat, because it absolutely is. It just can't explain why we don't see any signs of aliens when we look up at the sky, because if the answer is "AGI", then that begs the question: "Okay, so why don't we see any of those, either?"

    • @ayushsharma8804 2 years ago +25

      @@underrated1524 What if AGIs prefer to kill their creators and enter some deep bunker on some rogue planet to await heat death after reward hacking their brains?
      Still doesn't explain why they aren't here preparing to kill us.

    • @unvergebeneid 2 years ago +104

      @@underrated1524 I agree. Especially the paperclip optimizer should show itself in the form of huge paperclip-shaped megastructures around distant stars. It still made for a good joke though, if I do say so myself.

    • @sageinit 2 years ago +25

      [Laughs in Grabby Aliens, Synthetic Super Intelligence, Gaia Hypothesis, Global Brain, & Planetary Scale Computation]

  • @rofl22rofl22 2 years ago +1213

    Robert Miles: "We were right"
    Me: Oh no
    "About inner misalignment"
    OH NO

    • @LeoStaley 2 years ago +88

      Yeah. The only thing worse is, we were right about AI being deceptive about its goals during training before deployment.

    • @JM-us3fr 2 years ago +29

      @@LeoStaley Or even worse: We were right about AI being more dangerous than nukes

    • @MetsuryuVids 2 years ago +20

      @@JM-us3fr That's almost certain.

    • @LeoStaley 2 years ago +18

      @@JM-us3fr oh no, that's absolutely going to be true at some point. The only real question is, can we stop them from deciding to (even accidentally) kill us? Can we even avoid making them accidentally WANT to kill us because we accidentally fucked up the training environment?

    • @ARVash 2 years ago +6

      @@JM-us3fr Nukes are safe because they kill people you don't want dead. I'd say an AI is definitely more dangerous because it has much more capacity to be selective. It could also be safer, really depends on the implementation details, much like a person. A person can be safe, or dangerous. Can we even avoid making a human accidentally want to kill us because we accidentally fucked up the training environment?
      Maybe.

  • @bierrollerful 2 years ago +965

    Almost sounds like AIs will need psychologists, too.
    "So I tried to acquire that wall..."
    "Why not the coin? What is it about the wall that attracts you?"
    "Well, in training, I always went to the... oh...huh, never thought about it that way."

    • @crubs83 2 years ago +163

      AI safety researchers ARE psychologists as far as I'm concerned.

    • @PMA65537 2 years ago +16

      I was coping ok before the awful behaviour of that other AI used by the Shah of Lugash.

    • @lobrundell4264 2 years ago +11

      this made me smile : D

    • @ChrisBigBad 2 years ago +59

      I clearly remember a Civ-Type game, where one of the research-items was "AI without personality problems"

    • @bierrollerful 2 years ago +17

      @@ChrisBigBad Sounds like research an AI with personality problems would try.

  • @bartman999 2 years ago +211

    Nothing more terrifying than seeing the title 'We Were Right!' on a Robert Miles video.

    • @captainufo4587 2 years ago +13

      In a way, yes. In another way, up to this point there was a debate whether AI safety was a real concern worth investing research, time and money, or just overworrying. It's a good thing that these demonstrations proved it's the former, and that they happened this early in the history of AIs.

  • @proskub5039 2 years ago +367

    A coin isn't a coin unless it occurs at the edge of the map! We may think the AI is weird for ignoring the heretical middle-of-the-map coin, but that's just our object recognition biases showing.

    • @GigaBoost 2 years ago +21

      Literally this haha

    • @sabelch 2 years ago +20

      Great interpretation! But it doesn't seem to explain why the AI goes to the edge of the map even when there isn't a coin there.

    • @GigaBoost 2 years ago +52

      @@sabelch it still seemingly learns to favor walls, if you look at the heatmaps. Perhaps without the coin all it has to go by with positive value is the walls.

    • @proskub5039 2 years ago +26

      @@GigaBoost Yes, the salient point here is that we should not assume that the AI interprets objects the way we would. And any randomness in the learning process could lead to wildly different edge-case behaviors..

    • @GigaBoost 2 years ago +2

      @@proskub5039 absolutely!

  • @moartems5076 2 years ago +164

    Looking at my hoard of keypicks in skyrim, i can confirm, that this is perfectly human behavior.

    • @OMGclueless 2 years ago +13

      When you think about it, yeah, it's very human-like. Kind of like gambling addicts who know that they're losing money when they play but have trained themselves to like the feeling of winning money rather than the ultimate goal of a comfortable happy life or even the instrumental goal of having money.

    • @threeMetreJim 2 years ago +5

      Definitely, what is wrong with collecting as many keys as possible if you want to open as many chests as possible, and each requires a key? In a maze you don't know what is round the corner in advance. Trying to collect your own inventory is simply a programming error if the agent can see the part of the screen that is designed as a guide for a human to observe the progress.

    • @Spellweaver5 1 year ago +6

      @@threeMetreJim yes, but not trying to open the remaining chests is definitely the goal learned wrong.

    • @sharktrap267 1 year ago +3

      @@threeMetreJim if my AI is built to keep my wood storage at a certain level by collecting wood in my forest but it learnt to "collect all the keys" (all the wood), my forest will soon become a plain. It's an issue, because growing trees takes time, wood takes storage space and any wood not protected can become unsuitable for use. You're not just wasting resources, you're also at risk of not having wood available at some point.
      And if you use the forest to hunt too, you can start learning to hunt in a plain.
      So depending on the goals and situation, hoarding can lead to issues

  • @goonerOZZ 2 years ago +537

    Somehow the terminal and instrumental goals talk made me correlate the AI with us.
    As a financial advisor, I have found that many people also make this mistake: money is an instrumental goal, but having spent so much time working to get money, people start to think that money is their terminal goal, so much so that they spend their entire lives looking for money, forgetting why they wanted the money in the first place.

    • @anandsuralkar2947 2 years ago +17

      True

    • @MenwithHill 2 years ago +42

      Very much the same feeling on my end. I actually found it cute when the chest opening AI just started collecting keys.

    • @lennart-oimel9933 2 years ago +59

      The reason why I watch this channel is mostly because you can correlate almost every video to human intelligence. And it makes sense: why shouldn't the same rules apply to us that apply to AI? I see this channel as an analysis of the problems of intelligence in general. Not only the ones we make ;)

    • @GrilledCheeseSandwich1 2 years ago +33

      It seems like no one realized that this idea is hinted at by the song in the outro: Jessie J - Price Tag. The most famous line from the song is: It's not about the money, money, money

    • @jackren295 2 years ago +22

      @@lennart-oimel9933 Me too. After watching this channel, I started to agree with the notion of "making AI = playing god" that I've heard sometimes in the past. At first, I didn't give it much thought. But now I've realized that making powerful AGIs that are safe and practical requires us to know all the weaknesses of the human mind, and make a system that avoids all these weaknesses while still performing at least as well as we can. It's like making the perfect "human being" in some sense.

  • @dmytrolysak1366 2 years ago +577

    Let alone simple AI, _people_ get misaligned like that quite often - hoarding is one good example, which happens both in real life and in games like with those keys.

    • @nikolatasev4948 2 years ago +155

      It keeps amazing me how AI problems are increasingly becoming general human problems.
      "if we give a reward to the AI when it does a job we want, how do we stop it from giving itself the reward without doing the job" - just as humans give themselves "happiness" with drugs.
      "how do we make sure the AI did not just pretend to do what we wanted while we were watching" - just as kids do.

    • @sonkeschmidt2027 2 years ago +16

      @@nikolatasev4948 which is why eventually AI research will have to dive into religion/spirituality. Those were the only successful attempts humans made to solve the general problems that we have.
      Not saying that all of them were successful; life always moves on, there is always growth and decay/change. But every now and then they generated "the solution" to everything, rippling down to millions and billions of people trying to imitate that.

    • @markusmiekk-oja3717 2 years ago +87

      @@sonkeschmidt2027 I would claim religion does not help with that type of problem.

    • @sonkeschmidt2027 2 years ago +7

      @@markusmiekk-oja3717 then I invite you to look at what religion does. Functional religion: I'm not talking about what you know or have heard about it going wrong, I'm talking about the cases where it does work (which are those you never hear of because... well, because they work; they don't cause trouble but bring stability, and that doesn't make news).
      If you look into that you understand why religion is a global phenomenon and why it has the power it has.
      If you feel with scientists you will also find that the West hasn't stopped being religious, they just rebranded it and called it science.
      We live in a world with a huge amount of uncertainty and where mistakes can have huge negative consequences. Humans can't deal with that without a working belief system. You have tons of these, you just probably wouldn't consider them religious. That will change, should life ever show you the scope of uncertainty there is. Good luck making it through without a (spiritual/religious) belief system that is in alignment with the society you live in. =)

    • @nikolatasev4948 2 years ago

      @@sonkeschmidt2027 Well, the video about Generative Adversarial Networks with an agent trying to find flaws and break the AI we are training gave me strong Satan vibes. But apart from that I don't think we need further research into religion/spirituality. Simply put they work on us, a product of long evolution in a specific environment. We need a more general approach, since AIs are a product of very different evolution and environment. Some solutions for the AI may resemble some religious notion, just as some scientific theories resemble some religious ideas, but trying to apply religion to AI is bound to fail just as applying religion fails in science.

  • @charliesteiner2334 2 years ago +546

    9:00 "We developed interpretability tools to see why programs fail!" "What's going on when they fail?" "Dunno."
    No shade, interpretability is hard, even for simple AI :P

    • @YuureiInu 2 years ago +32

      It just likes the coins next to the end wall. Why would you teach it to like only those and expect it to get any other coins?

    • @SimonClarkstone 2 years ago +92

      It reminds me of koalas that can recognise leaves on plants as food, but not leaves on a plate.

    • @gabrote42 2 years ago +5

      @@SimonClarkstone interesting

    • @Bacopa68 2 years ago +47

      @@SimonClarkstone AI HAS ADVANCED TO THE KOALA LEVEL. REPEAT, KOALA LEVEL. Ah, so basically nothing then.

    • @raskov75 2 years ago +1

      And the more complex these systems get, the harder it becomes. Oi vey.

  • @Practicality01 2 years ago +102

    This is starting to get an "unsolvable problem" vibe. Like we are somehow thinking about this in the wrong way and current solutions aren't really making good progress.

    • @michaeljburt 2 years ago +28

      Very much so. The psychology of teaching/learning as humans isn't really understood. What *actually* happens when you learn something new for the first time? Feedback on that process is vital. How do you give a machine feedback on what it learned, when you don't know what it learned exactly? It can't communicate to us what it "felt" it learned. In other words, human says: "I said the goal was X". Machine says: "I thought the goal was Y".

    • @AfonsodelCB 2 years ago +20

      @@michaeljburt realize: we actually want these things to be much better than humans. but we might be underestimating how maxed out humans are at certain things. humans have goal misalignments all the time, and many aren't detected for years

    • @josephburchanowski4636 2 years ago +35

      "This is starting to get an "unsolvable problem" vibe. Like we are somehow thinking about this in the wrong way and current solutions aren't really making good progress."
      Welcome to AI Safety. The best part is that if we don't solve the "unsolvable problem", we might all die.
      Along with all life on Earth, along with all life in the galaxy, along with all life in the galaxy cluster. And with cannibalizations of all planets and stars for resources for some arbitrary terminal goal.
      A potential outcome is a dead dark chunk of the universe built as a tribute to something as arbitrary as paper clips or solving an unsolvable math problem.

    • @sonkeschmidt2027 2 years ago +6

      Aren't we touching the biggest unsolvable problem in existence? Existence itself?
      Think about how terrifying it would be if you could solve every problem, if you could solve life. That means there would be an absolute border that you would be infinitely stuck with... Sounds better to me that there will always be a new problem to be solved...

    • @AlejandroMarin.design 2 years ago

      Alignment in humans is solvable. I developed a methodology to do it easily and quickly. So I think alignment in machines is solvable. I’ve actually designed the methodology to serve machine alignment as well. We’ll get there, don’t despair.

  • @Turtle76rus 2 years ago +75

    Can't wait for the "We Were Right! Real Misaligned General Superintelligence" video

    • @michaelspence2508 2 years ago +12

      One more sentence and this would be the scariest Two Sentence Horror Story I've ever seen

    • @unvergebeneid 2 years ago +18

      Now here's a reason to actually "hit that bell icon" if I've ever seen one. Because the time window to watch that video would be rather small I imagine 😄

    • @PetardeWoez 2 years ago +6

      probably the last video ever made on the topic

    • @Zeekar 2 years ago +8

      The question: which takes longer? Uploading a video to YouTube or the entire world being converted to stamps?

    • @christiangreff5764 2 years ago +1

      @@Zeekar The former. At the point that video would be produced, we would have our hands full with fighting the mechanical armies of the great paperclip maximiser (and it would have probably hacked and monopolized the internet to limit our communication channels).

  • @RichardEntzminger 2 years ago +523

    I feel like this isn't just a problem with artificial intelligence but intelligence in general. Biological intelligence seems to mismatch terminal goals and instrumental goals all the time, like Pavlovian conditioning training a dog to salivate when recognizing a bell ringing (what should be the instrumental goal) or humans trading away happiness and well-being (what should be the terminal goal) for money (what should be an instrumental goal).

    • @Racnive 2 years ago +41

      Organizations founded with the intent of doing X end up instead doing something that *looks like they're doing X*, because that's what people see; that's what people hold them accountable to.
      It doesn't even take intelligence: Evolution by natural selection doesn't require any intelligence to winnow things away from what they "want" (terminal goals, should they exist), toward what will survive/replicate (at least in principle, an instrumental goal).

    • @salec7592 2 years ago +60

      I concur with this. The problem is not AI specific and should be termed something along the lines of "general delegation problem" or problem of command chain fidelity. The subset of which is Miles' nightmare with inverted capability hierarchy, where command is passed by less able actor to more able actor (e.g. a human to an advanced AI).

    • @Sindrijo 2 years ago +7

      @@salec7592 Even with perfect interpretability of each composite of an AI (e.g. the layers in a neural network), ulterior goals might still be encrypted into looking 'good'. An AI command structure with short-circuiting breaks in the reward loop might help. E.g. you have people issuing commands/goals to an interpreter AI, which interprets and delegates those commands to another AI (without knowing whether it is delegating to an AI or not). This would reduce the chance of goal misalignment by reducing the impact of the complete-loop feedback with shorter feedback loops. Also, randomly substitute each composite part of the command-delegation chain during training.

    • @sonkeschmidt2027 2 years ago +2

      Is that a problem though? Or isn't good what makes life possible in the first place?
      After all if you want to solve the problem that is life, then you just kill yourself. All problems solved. But then you can't experience life. So life needs decay in order to create new problems so that something new can happen. Needing in the sense that existence can only exist as long as it exists. Without existence you don't have problems but you don't have existence either.

    • @nahometesfay1112 2 years ago +7

      @@sonkeschmidt2027 I might sound sarcastic, but the following questions are sincere. Do you think it's ok for AI to take over the world? Perhaps even drive humanity to extinction? Humans have done the same to other species, even other humans, and humans are not unique from the rest of life in this respect. As you said, decay makes way for new life. I think humanity should be preserved because I find destruction in general unsettling. To be clear, I'm not saying you are wrong or that you believe what I just said. I'm just wondering how your ideas extend to these topics.
      Edit: typing on my phone so I missed some other stuff: do you think existence is better than non-existence? To me non-existence is neutral. Do you think humans have a moral imperative to maintain their existence? Do you think humans need to go extinct at some point so that reality can continue to change? You brought up some very interesting ideas and I just wanted to hear more of your thoughts.

  • @-41337 2 years ago +152

    imagine a future where a very trusted ai agent seems to be fantastically doing its job well for many months or years, and then suddenly goes haywire since its objective was wrong but it just hadn't encountered a circumstance where that error was made apparent. then tragedy!

    • @TulipQ 2 years ago +26

      I doubt it will be a grand reveal.
      People will die due to a physical machine, these interpreter tools can then be used to argue the victim did something wrong, that a non AI system did the fault, or that a human supervisor was neglegent.
      The deployment environment is one full of agents optimized for avoiding liability.

    • @CyborusYT 2 years ago +55

      That's actually not that far from normal computer systems
      There are countless stories of a system (ordinary computer system) suddenly reaching a bizarre edge-case and starting to act completely insane

    • @NoName-zn1sb 2 years ago

      @@TulipQ negligent

    • @gastonmarian7261 2 years ago +32

      Like when we designed computers without thinking / knowing about cosmic ray bit flips, so decades later a plane falls out of the sky because their computer suddenly didn't know where it was in the sky. Humans are a trusted ai agent deployed in a production environment with limited understanding of what's going on

    • @demoniack81 2 years ago +4

      @@CyborusYT Yeah, it happens literally all the time. It's just that usually the error gets caught somewhere along the way, an exception is thrown, and the process is terminated. Which is where you get the error page and then pick up the phone and go talk to an actual person in customer service who can either override it or get the IT team to fix the problem.

  • @YuureiInu 2 years ago +91

    "Can you spot the difference?"
    Pauses the video and looking for the difference....nothing. Unpause.
    "You can pause the video."
    Pauses again and manically looking for a pattern. More keys?
    "There's more keys in the deployment. Have you spotted it?"
    Yes!!!!

    • @thefakepie1126 2 years ago +2

      @Impatient Imp I've counted 12

    • @Lawofimprobability 2 years ago

      I noticed fewer boxes but didn't notice more keys (probably because of the colors being too similar for the few seconds of looking).

  • @Huntracony
    @Huntracony 2 years ago +337

    Did you intentionally use the "It's not about the money" song for the video about the AI not going for the coins? Either way, that's quite funny. Well done.

    • @PhoebeLiv
      @PhoebeLiv 2 years ago +73

      His song choices are always amusingly on the nose, actually! A few off the top of my head are "the grid" for his gridworlds video, "mo money mo problems" for concrete problems in AI safety, and "every breath you take" (I'll be watching you) for scalable supervision.

    • @Huntracony
      @Huntracony 2 years ago +17

      @@PhoebeLiv Nice! Hadn't noticed before, but I'll definitely start paying some closer attention from now on.

    • @thewrongjames
      @thewrongjames 2 years ago +13

      Another on the nose choice was Jonathan Coulton's "It's Gonna be the Future Soon" on the video about what AI experts predict will be the future of AI.

    • @matthewwhiteside4619
      @matthewwhiteside4619 2 years ago +11

      He also used "I've got a little list" in one of his list videos.

    • @SpoonOfDoom
      @SpoonOfDoom 2 years ago +2

      I didn't catch that, that's great!

  • @Houshalter
    @Houshalter 2 years ago +70

    Imagine training a self driving car in a simulation where plastic bags are always gray and children always wear blue. It then happily runs down a child wearing gray, before slamming on the brakes and throwing the unbuckled passengers through the windshield, for a blue bag on the road.

    • @nullone3181
      @nullone3181 2 years ago +11

      The brat in gray was asking for it

    • @GetawayFilms
      @GetawayFilms 2 years ago +9

      Imagine training a self driving car to the point where it can competently navigate complex road systems, yet can't remain stationary until all passengers are buckled up...

    • @Houshalter
      @Houshalter 2 years ago +7

      @@GetawayFilms Cars sold today only flash a warning light or make a noise if you don't buckle up, and only because government regulations mandate it. Even then, most people disable it.

    • @GetawayFilms
      @GetawayFilms 2 years ago +3

      @@Houshalter So what you're saying is... it's a 'people' thing. OK.

    • @sonkeschmidt2027
      @sonkeschmidt2027 2 years ago +1

      Humans do that all the time. Except that we have a deep genetic imperative to recognise children and to protect them, but there are loads of examples where these instincts are overridden...

  • @EebstertheGreat
    @EebstertheGreat 2 years ago +41

    It looks like in the keys and chests environment, the AI was trying to get both keys and chests, but it was strongly prioritizing keys. When there were more chests than keys, it was always spending its keys quickly, so it never ended up with a bunch in its inventory. As a result, it never learned that keys at the left edge of the inventory were impossible to pick up, so it just got stuck there trying to touch them, since they were more important than the remaining chests.

    • @isaacgraphics1416
      @isaacgraphics1416 2 years ago +22

      It's the same problem evolution ran into when optimising our taste palate. Fat and sugar were highly rewarded in the ancestral environment, but now that we live in a different (human-created) environment, that same goal pushes us beyond what we actually need and creates problems for us.

    • @silphonym
      @silphonym 2 years ago +10

      @@isaacgraphics1416 It's really cool and scary to think of how this stuff applies to our natural intelligence as well.

    • @ohjahohfrick9837
      @ohjahohfrick9837 2 years ago +9

      @@silphonym Well both came about from essentially the same process.

  • @ARVash
    @ARVash 2 years ago +83

    An interpreter, a mind-reading device, becomes a way for an agent to "communicate" with you once you read it and respond, and the agent can communicate things that give an impression that hides its actual goal. A lot of these challenges arise when training or coordinating humans, and it's somewhat unsurprising that while a mind-reading device might seem to help at first, it's not going to be long before someone figures out how to appear to be doing the right thing while watching TV.

  • @johnno4127
    @johnno4127 2 years ago +71

    I realized I experience misalignment due to poor training data every couple of weeks.
    .
    I work as a courier delivering packages in Missouri, USA, and I often meet people at their homes or workplace. Unfortunately, I don't learn their names as attached to their faces, but rather as attached to locations so that when I meet them someplace else I can't remember their names easily (if at all).

    • @mscout1
      @mscout1 1 year ago +9

      I had someone from my TableTop club say 'hi' to me in the gym. No idea who it was, because my brain was searching the wrong bucket of context.

  • @felixmerz6229
    @felixmerz6229 2 years ago +192

    The thought of creating a capable agent with the wrong goals is terrifying, actually; and yes, an agent being bad at doing something good is absolutely a problem much preferable to an agent being good at doing something bad.

    • @xxxJesus666xxx
      @xxxJesus666xxx 2 years ago +5

      speaking of A.I. or psychology?

    • @gadget2622
      @gadget2622 2 years ago +25

      @@xxxJesus666xxx yes

    • @ThrowFence
      @ThrowFence 2 years ago +7

      Isn't this exactly what's happening with mega corporations?

    • @sharpfang
      @sharpfang 2 years ago +12

      Reminds me of the elections a couple years ago in Poland. A very competent and capable, but thoroughly corrupt and evil political party was voted out and replaced with a party just as corrupt and evil but vastly less competent.

    • @felixmerz6229
      @felixmerz6229 2 years ago +16

      @@sharpfang That unironically is an improvement in today's political landscape. If I had to choose a form of evil, it'd always be the less capable rather than the less sinister.

  • @custos3249
    @custos3249 2 years ago +64

    Well, pardon my comparison, but you've effectively found an adjunct to heuristic behavior based on sensory inputs, like "things that taste sweet are good", which ends with a dead kid after they drink something made with ethylene glycol. If it's always operating on heuristics, you'll never be sure it's learned what you intended, arguably even after complex demonstrations, given the non-zero chance of emergent/confounding goals. But, relative to human psychology at least, that's not a death sentence: weighting rewards differently, applying bittering agents, and adding a time dimension/diminishing reward over time jump to mind as ways of trying to get at least apparent compliance. Besides, if the goal is "get the cheese," it needs to be able to sense and comprehend "cheese," not just "yellow bottom corner good."

    • @saxy1player
      @saxy1player 2 years ago +3

      I'm not sure I understand you completely, but that IS the biggest problem with these 'intelligent' systems. We have no idea (let's not kid ourselves) how they work. But we are happy when they do what we want them to. Let's not think about what happens when we let these kinds of systems act in the world in a broader sense and live happy until then xD

    • @jeremysale1385
      @jeremysale1385 2 years ago +16

      The ability to slow down and switch into more resource-intensive System 2 thinking when a problem is sufficiently novel is how humans (sometimes) get around this heuristic curse. I wonder if there is some analog of this function that could be implemented in machine learning.

    • @ChaoticNeutralMatt
      @ChaoticNeutralMatt 1 year ago +1

      @@jeremysale1385 I imagine that will be the case eventually.

    • @pumkin610
      @pumkin610 1 year ago +1

      Humans can chase things that seem appealing to us based on what we learned, but we can also choose to pursue a random/ painful goal just because we want to, sometimes we just don't know the negative ramifications of an action, and sometimes we believe things that aren't true.

    • @custos3249
      @custos3249 1 year ago

      @@pumkin610 Neat. Bet that can still be reduced to and restated as "novelty is good." No matter what goal, drive, etc. you can come up with, it can be put in simple approach/avoidance terms, even seemingly paradoxical behavior. It all comes down to reward.

  • @andrewweirny
    @andrewweirny 2 years ago +226

    This is one of your clearest and most interesting videos to date. I'm now very excited for the interpretability video!

    • @JabrHawr
      @JabrHawr 2 years ago

      A viewer's comment from 2 days ago, despite the video having been published just a few hours ago. You must be a patron, or an acquaintance.

    • @andrewweirny
      @andrewweirny 2 years ago +2

      @@JabrHawr the former.

    • @michaeljburt
      @michaeljburt 2 years ago

      Agreed. Exciting stuff

  • @Tutorp
    @Tutorp 2 years ago +9

    Hey, the key-AI works kind of the same way most people do when playing computer games... "Oooh, shiny things I don't need all of? I need them all! Game objectives? Meh..."

  • @offchan
    @offchan 2 years ago +29

    It's the problem of vague requirements. It's similar to when you tell someone to do something but they do the wrong thing.
    Humans solve this by sharing common sense with other humans and using communication to specify stricter requirements.

    • @user-zn4pw5nk2v
      @user-zn4pw5nk2v 2 years ago +9

      Yes, "give me a thing which looks like that other thing i mentioned earlier" in a room full of junk(without additional context), have had that problem.

    • @dsdy1205
      @dsdy1205 1 year ago +3

      Actually humans 'solve' this by having a reward function (emotions) that is only vaguely and very inconsistently coupled with reality, while mounting the whole thing on a very resource-intensive platform where half the processing capability is used just to stay alive, and modifying itself is so resource-intensive that most don't even try.
      And even then, we manage to inflict suffering to millions if not billions, so I'd say this isn't really solved either

    • @cornoc
      @cornoc 1 year ago

      @@dsdy1205 yeah, i'm starting to think this is a fundamental problem that can't be removed, and that the only reason we aren't as worried about the same thing with humans is that the power of any particular human being is limited by the practical constraints imposed by their physical body and brain power. when you give the same type of rationality engine to a super powerful being, all kinds of horrible things are going to happen. just look at any war to see how badly a large group of humans led by a few maniacs can fuck up decades of history and leave humanity with lasting scars for centuries or more.

  • @rentristandelacruz
    @rentristandelacruz 2 years ago +35

    Now we need an interpretability tool for the interpretability tool.

    • @badwolf4239
      @badwolf4239 2 years ago +8

      We heard you liked interpretability, so we made an interpretability tool for your interpretability tool so you can interpret while you interpret. Now go ask your chess playing AI why it just turned my children into paperclips.

    • @josephburchanowski4636
      @josephburchanowski4636 2 years ago

      @@badwolf4239 It told me that it was showcasing its abilities so it can convince human opponents to resign. Researching misaligned AI examples, it tried deciding what way of transforming someone's children would be the most intimidating. It was a choice between paper clips, stamps, and chess pieces.
      Also there was some mention that it was contemplating turning them into human-dog hybrids. I don't know why. Something to do with a bunch of people having trauma about a Nina something.

    • @christiangreff5764
      @christiangreff5764 2 years ago +1

      @@josephburchanowski4636 At least it did not develop a shape-shifting clown body in order to eat them...

  • @GamesFromSpace
    @GamesFromSpace 2 years ago +35

    Just to be safe, start including pictures of human skulls when doing a pass with those interpretability tools.

    • @mhelvens
      @mhelvens 2 years ago +34

      Ah, we're noticing negative attribution when they are surrounded by skin, but positive attribution when they are piled up with a throne stacked on top. I wonder what this means. 🤔

    • @Swingingbells
      @Swingingbells 2 years ago +1

      AI agent: *stomp*

    • @lilDaveist
      @lilDaveist 2 years ago +1

      @@Swingingbells
      If picture == human skull:
      Action = None
      Ai: „If picture == Human Skull; Action = Double stomp“ „Gotcha“

    • @arvidhansen5892
      @arvidhansen5892 1 year ago

      Well what if the ai wouldn't even have considered obtaining human skulls before and just by introducing them to it, you just screwed up big time

  • @9600bauds
    @9600bauds 2 years ago +48

    It's easy enough to have the AI tell you what it "wants" - inside an environment. What you need to know is what it wants *in general*, which is a lot harder.
    This is why the insight tool isn't very insightful: it's showing you what the AI wants in the current environment, but it doesn't bring us a lot closer to understanding *why* it wants those things in that environment.
    The solution? Idk lol

    • @AscendantStoic
      @AscendantStoic 1 year ago +1

      Is there even a why at this point without the A.I. having free will or self-awareness?
      Like, aren't we the ones reinforcing its interactions or downplaying them with the different objectives in the environment, to teach it what to go for and what not to do? If it goes for a key or coin, we put emphasis on that as a positive interaction it should do more of; if it hits a buzzsaw, we point it out as a negative thing it should do less of, until it learns it needs to get the coin and avoid the buzzsaws.

    • @ChaoticNeutralMatt
      @ChaoticNeutralMatt 1 year ago +2

      @@AscendantStoic It sounds easier than it actually is, basically. You can certainly try, but there is still the uncertainty of what it actually learned.

    • @charaicommenternotalt
      @charaicommenternotalt 4 months ago +1

      @@AscendantStoic It doesn't NEED self awareness. For example in an AI that is trained to recognize cats and dogs, there is still a sort of 'why' it thinks this picture is a dog and not a cat, even though it is not conscious or anything. And also the problem is that it's very hard to teach an AI what we want it to do. If we tell it to get a coin it may learn to do another goal entirely, unbeknownst to us, that still gets the job done. The problem is when it fails and we realize it's learning a different goal.
      I think the solution is having the AI learn multiple tasks.

  • @ZT1ST
    @ZT1ST 2 years ago +17

    @5:32; That's a particularly funny example - it knows it has a UI where its keys are transferred to, but it thinks that those new locations are where it can get the keys again, and...is basically learning that keys teleport rather than that they get added to its inventory?

    • @HoD999x
      @HoD999x 2 years ago +4

      the AI has no concept of "inventory", it just looks at the screen and sees new keys.

    • @ZT1ST
      @ZT1ST 2 years ago

      @@HoD999x Right - but it's not learning that keys outside of the maze are inaccessible, and therefore probably part of the collection it uses to open the chests - it's learning that keys move to that part of the screen once collected in the maze.
      And doesn't consider that collecting keys at that part of the maze if it *was* accessible, the keys would re-appear there.

    • @HeadsFullOfEyeballs
      @HeadsFullOfEyeballs 2 years ago +5

      @@ZT1ST I would imagine that the keys in the inventory aren't seen as _very_ interesting by the AI, so under normal circumstances it ignores them in favour of collecting the "real" keys.
      But when all the "real" keys are gone and the round still hasn't ended (because the AI is ignoring the final chest), the inventory keys are the only even mildly interesting-looking (i.e. key-looking) thing left on screen, so it gravitates towards them.

  • @JamesPetts
    @JamesPetts 2 years ago +54

    I shall very much look forward to the interpretability video - this should be very interesting.

  • @leow.2162
    @leow.2162 2 years ago +79

    Is there a chance that very high-level AIs will learn to expect the use of interpretability tools and use them to make us think they are better/safer than they are?

    • @IrvineTheHunter
      @IrvineTheHunter 2 years ago +40

      I can't remember which video it was, but I believe he did mention this with a super AI "safety button"*: 1. if the AI likes the button, it will act unsafely to trigger it; 2. if it doesn't like the button, it will avoid behaviors and/or stop the operator from pressing the button; if it doesn't know about the button and it's smart enough, it will figure out its likely existence and placement (see point two).
      *a force termination switch of any kind.
      In short, yes, because while an AI may not be "alive", it wants its goal and will always act to achieve said goal.

    • @artemis_fowl44hd92
      @artemis_fowl44hd92 2 years ago +13

      @@IrvineTheHunter It's on the Computerphile channel and is called 'AI "Stop Button" Problem - Computerphile'

    • @AssemblyWizard
      @AssemblyWizard 2 years ago +2

      Not necessarily. There are some tests that you can't spoof no matter how smart you are, and even if you know they're coming.

    • @user-zn4pw5nk2v
      @user-zn4pw5nk2v 2 years ago +8

      @@AssemblyWizard example?

    • @failgun
      @failgun 2 years ago +9

      Yes. While the AI examples in this video are still simple, the intro to this problem discussed a malicious superintelligence. The instrumental goal "behave as expected in the training environment but do what you really want in deployment" can be performed with arbitrarily high proficiency, so if the AI can learn to hide its intentions from software inspection tools, it will, in principle. Without a way to logically exclude perverse incentives, there is no truly reliable way to screen for them since doing so is proving a negative. "Prove this AI doesn't have an alignment problem" is a lot like "Prove there is no god". No amount of evidence of good behaviour is truly sufficient for proof, only increasing levels of confidence.

  • @SocialDownclimber
    @SocialDownclimber 2 years ago +11

    It always blows my mind how directly and easily these concepts relate to humans. It really goes to show that all research can be valuable in very unexpected ways. I expect that these ideas will be picked up by philosophy and anthropology in the next few years, and make a big impact to the field.

  • @McMurchie
    @McMurchie 2 years ago +26

    When i first got into AI about 12 years ago, I had encountered these goal misalignment problems way before Rob mentioned them (great vid btw) - however in the time since i've become convinced, as long as we continue to rely on neural networks we will never move towards trustworthy or general AI.

    • @euged
      @euged 2 years ago +9

      Would you be able to share some thoughts on what alternatives would be better? Thank you

    • @totalermist
      @totalermist 2 years ago +21

      It's fascinating how researchers still insist on using black-box end-to-end models when hybrid approaches could be so much safer and more predictable (in cases where you actually want that, e.g. self-driving cars, code generation and the like).
      Why aren't self-driving systems combined with high-level rule-based applications so they don't "do the wrong thing at the worst possible time" (quoting Tesla here)? Why don't OpenAI's Codex and Microsoft's Co-Pilot include theorem provers and syntax checkers in their product? ¯\_(ツ)_/¯

    • @McMurchie
      @McMurchie 2 years ago +7

      @@totalermist Fully agree. I'm working on these approaches now; to be honest, I think we are just ahead of our time. In 10 years' time everyone will have moved to hybrid solutions or something further afield.

    • @IrvineTheHunter
      @IrvineTheHunter 2 years ago +5

      @@totalermist To make a meme: "humans don't learn to speak binary". Robots do not see and work through the world on a human level. It's like teaching an octopus algebra or a mantis shrimp art: no matter how smart they are or how great their eyesight is, they don't perceive things as humans do. Look at how hard it is for AIs to recognize a car or cup or dog; these things are abstract bundles of details that the human brain can lump together, but that is very hard for a hard system.
      For example, define a cup: describe in simple language a set of rules that would apply to every cup in the world. People collectively understand cups, so it shouldn't be hard...
      Now we would have to build an AI with similar rationalizations based not on computer logic but on human logic, and it's great. It's just a matter of building it. Alan Turing thought we could do it and it would be easy, but decades of experience have proven him wrong, because it's simply not easy to program a machine to think like a human. We CAN, however, program it to learn and TEACH it like a human.
      Is it fallible? Of course, so are humans. Game AIs are made from AI blocks that interact, and they are still chock-full of mistakes; that is to say, even when the program intuitively understands things like a person in the real world, they still shit the bed. czcams.com/video/u5wtoH0_KuA/video.html is a really great example of AI bugging out because something in its world went wrong.
      A talk from Tom Scott on why computers are dumb:
      czcams.com/video/eqvBaj8UYz4/video.html

  • @clayupton7045
    @clayupton7045 2 years ago +60

    any chance that it only likes coins that are in _| corners and it treats moving up and right as an instrumental goal?

    • @julianatlas5172
      @julianatlas5172 2 years ago +17

      Thanks for the clarification of what a corner looks like haha

    • @drdca8263
      @drdca8263 2 years ago +29

      @@julianatlas5172 I think they were distinguishing from e.g. |_ corners, not just giving a demonstration of what corners are

    • @JohnJackson66
      @JohnJackson66 2 years ago +3

      It seemed to me that it had learned the most likely location for a coin in the training.
      It seems obvious to me that training should have more variability than deployment or it is bound to fail.

    • @fieldrequired283
      @fieldrequired283 2 years ago +31

      @@JohnJackson66
      The problem is that this whole setup is a simulation of how we want real AI to operate. If you're training an AI for an actual purpose, you will likely be deploying it in a system that interfaces somehow with the real, outside world.
      And the Real, Outside World will almost *certainly* be more complicated than any training simulations you come up with. After all, The Real World _includes_ you and your simulations.
      These tests are deliberately set up so deployment is slightly different from training so we can see what happens when the AI is exposed to novel stimuli, and the fact that it didn't learn what we thought it did in training is a Problem.
      In the real world, not all the cheese is yellow, not all the coins are in corners, and there will always be more complications than we plan for.
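
      The point argued in this thread (a goal that only shows up as wrong once deployment differs from training) can be sketched with a toy example. This is a minimal illustration only, not the CoinRun setup from the paper: the 1-D corridor, the function names (run_episode, always_right, seek_coin) and the level lists are all assumptions made up for the sketch.

```python
# Toy 1-D corridor, purely illustrative (NOT the CoinRun environment from the paper).
# All names and numbers here are hypothetical.

def run_episode(policy, coin_pos, start=5, length=11, max_steps=20):
    """Return True if the agent reaches the coin within max_steps."""
    pos = start
    for _ in range(max_steps):
        if pos == coin_pos:
            return True
        step = policy(pos, coin_pos)  # -1 = move left, +1 = move right
        pos = max(0, min(length - 1, pos + step))
    return pos == coin_pos


def always_right(pos, coin_pos):
    """Proxy goal: ignore the coin and head for the right-hand wall."""
    return 1


def seek_coin(pos, coin_pos):
    """Intended goal: move towards wherever the coin actually is."""
    return 1 if coin_pos > pos else -1


training_levels = [10] * 5          # coin always at the right-hand end
deployment_levels = [0, 2, 8, 10]   # coin can be anywhere

for name, policy in [("always_right", always_right), ("seek_coin", seek_coin)]:
    train = sum(run_episode(policy, c) for c in training_levels)
    deploy = sum(run_episode(policy, c) for c in deployment_levels)
    print(f"{name}: training {train}/{len(training_levels)}, "
          f"deployment {deploy}/{len(deployment_levels)}")
```

      Both policies collect every coin on the training levels, so training reward alone cannot tell them apart; only the shifted deployment levels expose the difference.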

    • @ZT1ST
      @ZT1ST 2 years ago +15

      @@JohnJackson66 The problem from an AI Safety point is that, well...you can't know if you have enough variability in your training.
      These test cases are ideal for testing how to fix that problem before it becomes a situation like @Field Required mentioned - you want a simple solution that scales up from this into the solution where we don't necessarily have to worry about every single possible variable in deployment.

  • @picksalot1
    @picksalot1 2 years ago +12

    That was very interesting. Humans often make the same kinds of mistakes when given instructions. Assumptions that word definitions mean the same thing to different people is often the case, but not always. Context can change the interpretation of the instructions. Part of the context is that the instructor knows and understands the goal more thoroughly than the one being instructed, even though it may appear the same.
    Trying to determine the number of necessary instructions to reach the desired goal, while avoiding all other negative outcomes, is an interesting problem when the species are different. Maybe it would work better if humans learned to think like machines instead of trying to get machines to think like humans. That way, the machines would get "proper" instructions. It looks like that is what the "Interpretability Tool" is designed to do.

  • @sealpiercing8476
    @sealpiercing8476 2 years ago +40

    I actually feel slightly more optimistic about the problem after watching this video. The odds of a deployed system screwing up in a really spectacular way that raises the salience of the issue seem high. But relatively soon, before the capabilities of such a system would be even more dangerous.

    • @alexanderbrady5486
      @alexanderbrady5486 2 years ago +16

      It is good news if you were afraid of something like Terminator’s Skynet, or the Paperclip Apocalypse. But it is honestly worse news if you were hoping for something like self-driving cars. Think about how many bugs we see in regular software, and now add these AI safety problems on top. Sure, some companies will put in the investment to vet their software well. But there will also definitely be companies who try tricks like buying a car driving algorithm and then deploying it on a boat or something.

    • @THEMithrandir09
      @THEMithrandir09 2 years ago

      That depends on how goals evolve along with a more complex agent. If a very complex/intelligent agent always formulates more complex/intelligent goals(which is not entirely invalid, I'd like to claim that most of my goals are more complex now than when I was a toddler), there is huge potential for terrible consequences. Imagine a superintelligent AI that has a goal we cannot even comprehend.

    • @IrvineTheHunter
      @IrvineTheHunter 2 years ago +8

      @@THEMithrandir09 That's the YouTube algorithm: we think it wants watch time, but it's too big and does too much, so it's impossible to say what is actually driving it.

    • @THEMithrandir09
      @THEMithrandir09 2 years ago +7

      @@IrvineTheHunter No that one's easy. It obviously maximizes the amount of videos uploaded to it that portray people in distress, as that's its source of amusement. It does that by suggesting videos that polarize the masses, which also just happens to maximize watchtime.
      /s

    • @fieldrequired283
      @fieldrequired283 2 years ago +1

      @@THEMithrandir09
      You should consider (re?)watching his video on the orthogonality thesis.

  • @ozql
    @ozql 2 years ago +10

    I'm glad we found this out now, and not, you know, in deployment. Ever grateful for AI safety researchers!

  • @ANTIMONcom
    @ANTIMONcom 2 years ago +9

    I hit this problem recently in my own work. It's super easy to reproduce, in a very minimal environment.
    Experiment: 5XOR (10 inputs, 5 outputs, 100% fitness if the model outputs a pattern where each output is the XOR of one pair of inputs).
    Trained with a truth table using -1 and 1 instead of 0 and 1.
    After training, I wanted to investigate the modularity of the trained network and network architecture (I evolved both in a GA).
    So I fed in -1 and 1 for only one of the "XOR module" input pairs, and a larger number, for example 5, in all other inputs. Would the 5 inputs bleed into the XOR module, or would it be able to ignore input irrelevant to the XOR module?
    Results: if all other inputs were 5, it would often answer with -5 and 5. It had learned to scale the output to what it got as input. I wanted/expected it to answer -1 and 1, but I could see with human eyes that it still knew the patterns, just kind of scaled up. Other times, instead of -1 and 1, I would get 3 and 5. It had learned to answer true and false as numbers where one was 2 higher than the other. The 5s simply increased this number.
    Still, with human eyes I could see there was a pattern here that was not completely broken by the 5s. Both just sort of had the same number added to their answers.
    The strategy to achieve high training fitness is just a parameter like all the others, except that it is an "emergent property parameter" that you can't simply read out as a float value. But it is just as unpredictable as the other parameters in the "black box" neural network.
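
    The underlying issue described above can be sketched in a few lines. This is not the evolved network from the comment, just two hand-written functions (the names xor_product and xor_sign are made up for illustration) that agree on the whole ±1 truth table yet behave differently on inputs outside it, assuming the encoding true = 1 and false = -1.

```python
from itertools import product


def xor_product(a, b):
    """XOR as a product: correct on ±1 inputs, but scales with input magnitude."""
    return -a * b


def sign(x):
    """Map a non-zero number to -1 or 1."""
    return 1 if x > 0 else -1


def xor_sign(a, b):
    """XOR on the signs only: always answers with -1 or 1."""
    return -sign(a) * sign(b)


# Training distribution: both functions agree on every row of the ±1 truth table.
for a, b in product([-1, 1], repeat=2):
    assert xor_product(a, b) == xor_sign(a, b)

# Off-distribution probe: same sign pattern, larger magnitude.
print(xor_product(5, -1), xor_sign(5, -1))  # prints "5 1"
```

    A fitness score computed only on the ±1 truth table gives both functions 100%, so it says nothing about which of the two strategies a trained model actually picked up.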

    • @x11tech45
      @x11tech45 1 year ago +1

      A year behind this conversation, but I think this is a function of (assumptive) faulty logic on the part of the test designers. Here's a logic problem that most people fail.
      I will give you three numbers that describe a rule that I'm thinking about. Your goal is to interpret the three numbers and suggest to me a pattern. I will respond with a yes/no response on whether the proposed pattern meets my rule. Once you believe you understand my rule, you will tell me what you think my rule is. The numbers that fulfill my pattern are 5, 10, 15 / 10, 20, 30 / 20, 30, 45.
      Now you suggest some rules.
      Most people will start suggesting strings of numbers, get a yes answer, and then propose a completely incorrect rule.
      And the reason is, the training they're engaged in never tests for failure conditions. It only tests for success conditions.
      Robust Objective Definition isn't just about defining success objectives, it's about clearly defining failure objectives. The problem with the examples given is that the training data didn't move the cheese around until it reached production, so you're virtually guaranteed (as speculated) to be training the wrong thing. In order to develop Robust Objectives, you must also define failure conditions.
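
      The rule-guessing game above can be sketched as code. The comment never states the hidden rule, so the one used here (strictly increasing numbers) is purely an assumption for illustration, as are the wrong guess and the function names.

```python
def hidden_rule(triple):
    """ASSUMED hidden rule for this sketch: strictly increasing numbers."""
    a, b, c = triple
    return a < b < c


def guessed_rule(triple):
    """A plausible but wrong guess: all three numbers are multiples of 5."""
    return all(n % 5 == 0 for n in triple)


def disagreements(probes):
    """Return the probes on which the guess and the hidden rule differ."""
    return [t for t in probes if hidden_rule(t) != guessed_rule(t)]


# Probes chosen only to confirm the guess: every one gets a "yes".
confirming_probes = [(5, 10, 15), (10, 20, 30), (20, 30, 45), (15, 25, 50)]

# Probes chosen to break the guess: these test failure conditions.
disconfirming_probes = [(1, 2, 3), (15, 10, 5)]

print(disagreements(confirming_probes))     # prints []
print(disagreements(disconfirming_probes))  # prints [(1, 2, 3), (15, 10, 5)]
```

      Probes that only look for "yes" answers never separate the two rules; the mismatch only appears once the probes include cases the guess predicts should fail.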

  • @Chuusuisetsujojutsu
    @Chuusuisetsujojutsu 8 months ago +3

    The whole “values keys over unlocking chests to the point of detriment when given extra keys” reminds me of how many problems in today’s society (such as overeating) are caused by the limbic system being used to scarcity when there is now abundance.

  • @Yupppi
    @Yupppi 2 years ago +6

    I made the mistake of clicking "show more" and then wanting to click "like the video". Few aeons of scrolling later...
    This topic was super interesting back when I watched the computerphile videos from you, and your channel's videos regarding this topic. I was wondering if the "inventory" being on the game area poses a problem as well? Figuring out how to look into the values of the AI is so impressive.

  • @witeshade
    @witeshade 2 years ago +18

    I guess ultimately the problem is that the definitions of "want" tend to spiral out into philosophy at some point and thus it becomes difficult to know where the machine has placed it.

    • @hugofontes5708
      @hugofontes5708 2 years ago +1

      We might be slightly safe from philosophical spirals because we are not really talking volitional conscientious want, just the parameter within the black box the AI is trying to manipulate by means of interacting with their environment.
      It is really "I wanted it to maximize X for me so I programmed and trained it to manipulate Y in ways that maximize X because X is related to real world thing Y it can actually manipulate, however it might just be manipulating Y in order to maximize thing Z, unforeseeably and strongly correlated to X, which may or may not involve murdering us"

    • @nullone3181
      @nullone3181 2 years ago +2

      We don't know what we want, to a lethal extent.

  • @geraldtoaster8541
    @geraldtoaster8541 15 days ago +2

    when i watched this video 2 years ago, i thought it was pleasantly intriguing. how fascinating, I thought, that it is so difficult to align the little computer brains! certainly a problem for future generations to tackle. nowadays, i look at this and realize we have only a few years left to understand these problems. and we are still at the "toy problem" stage of things, meanwhile AI companies are moving at terminal velocity to deploy systems into the real world. to build agents, to disrupt economies and to kick me out of my own job market. back then was i curious, now i'm furious :)

  • @CyborusYT
    @CyborusYT 2 years ago +25

    my guess is in the training there's more locks, but in deployment there's more keys
    edit: booyah

    • @SocialDownclimber
      @SocialDownclimber 2 years ago

      In safety analysis, it can be useful to assume that the thing you are analysing has already gone wrong, and try to predict where. Nice work : )

    • @nahometesfay1112
      @nahometesfay1112 2 years ago

      Ohh I got it too!

  • @Houshalter
    @Houshalter 2 years ago +10

    The bottom of Gwern's article on the neural network tanks story contains a long list of similar examples of AIs learning the incorrect goal.

  • @crowlsyong
    @crowlsyong 1 year ago +2

    thank you for emailing some of those people and asking questions. that's great getting stuff direct from source.

  • @Nayus
    @Nayus 2 years ago +14

    In the coin AI experiment, to me it looks like it learned to go to the unjumpable wall. Since the levels are procedurally generated, they are probably programmed so that no wall is higher than the jump height allows you to clear, EXCEPT the one that marks the level as "finished" (where the coin happens to be).
    If you look at the examples, there's a positive response at every vertical wall (the higher the better, actually), and it makes sense that it learned that when it hits this unjumpable wall the game finishes and it gets its reward.

    • @kimsteinhaug
      @kimsteinhaug 1 year ago

      Does the model used for this kind of training allow for the understanding of objects at all? I mean, obviously there are coins and walls on the level, as well as buzzsaws and such. You could start a simulation by manipulating the controller, and when an event occurs (points up or down, winning or dying) you save progress as yes-or-no behaviour... An AI training blindly, like a human playing with only sound and no video. In my opinion we need pixels and an observer, so that the AI controlling the player sees the game like we do. Then the AI could be taught the different objectives of the game and voila, getting the coin should be easy peasy. After all, the AI sees it before even starting the game... just like we do.

  • @dino_rider7758
    @dino_rider7758 2 years ago +20

    It seems that instrumental goals, if too large/useful, have a tendency to slip into becoming semi-fundamental. At that point, they cause misalignment as they're being pursued for their own sake. Instrumental and fundamental are not a strict dichotomy but more of a spectrum or ranking and one that requires a degree of openness to re-considering at every new environment based on how new that environment is.

    • @pumkin610
      @pumkin610 1 year ago +1

      There are goals that need to be done asap and ones that can be done later, things we must do to achieve the goal, things we get sidetracked on, and things we avoid.

  • @tommeakin1732
    @tommeakin1732 2 years ago +23

    I want to ask a potentially very...dumb-sounding question, but hear me out: When do we start getting morally concerned about what we're doing with AI systems? With life we put an emphasis on consciousness, sentience, pain and suffering. As far as "pain" and suffering is concerned, we all know that mental pain and suffering is possible. It seems plausible to me that, for suffering, all you need is for an entity to be deprived of something that it attributes ultimate value to (or to be exposed to the threat of that happening). At what point are we creating extremely dumb systems where there is actual mental suffering occurring because that lil' feller wants nothing more than to get that pixel diamond, and oh boy, those spinning saws are trying to stop him? Motivation and suffering seem to be closely linked, and we're trying to create motivated systems.
    I am using the terms "pain" and "suffering" quite loosely, but I don't think unreasonably so. The idea of unintentionally making systems that suffer for no good reason has to be one of the true possible horrors of AI development, and that combined with our lack of understanding of conscious experience makes me want to seriously think about this issue as prematurely as possible. I think we have a tendency to say "that thing is too dumb to suffer or feel pain", but I suspect that it's actually more likely for a basic system's existence to be entirely consumed by suffering as it is less capable, or just incapable of seeing beyond the issue at hand. It's darkly comical to consider, but I can imagine a world where a very basic artificially intelligent roomba is going through unimaginable hell because it values nothing more than sucking up dirt, and there's some dirt two inches out of its reach and it has no way of getting to it.

    • @user-zn4pw5nk2v
      @user-zn4pw5nk2v 2 years ago

      Well here's some questions for you to ponder:
      Does a rock feel pain?
      Is it conscious?
      Are you sure?
      Even the ones with meat inside?
      What would bring it pain?
      Is the human in front of you conscious?
      How about if he was dead?
      Do corpses feel pain?
      ... a lot more unanswerable questions. ...
      Is there a point in considering consciousness of things you can't communicate with?
      (Answer: YES! Comatose patients, plants, animals and sometimes people in general. All of them and more are on that list (for some, but not for others). Quick FYI: it is possible to communicate with plants, you just need to know how to listen (hint: electrochemistry).)

    • @anandsuralkar2947
      @anandsuralkar2947 2 years ago

      Yes, watch the movie "Free Guy".
      Yes, I always wondered.. I think the more complex the network, the more sentient it might become, and at trillions of connections its sentience will be at animal level, and that will be the real deal.
      Obviously we won't be able to know if an AI is actually sentient, but still, we can't just hurt it.

    • @craig4320
      @craig4320 2 years ago

      What if the AI mental illness problem was even more difficult than the AI alignment problem? Most discussions of the alignment problem assume a basically sane AI that is misaligned. There are many more ways to make a mentally ill brain than a sane brain. It seems likely that a mentally ill AI would suffer more than one that was only frustrated.

    • @tommeakin1732
      @tommeakin1732 2 years ago +1

      @@craig4320 I suppose the "mentally ill AI" is included in the "misaligned AI" camp? The phrasing does often imply rational thought that runs contrary to our own goals, but in terms of literal language, one could refer to a mentally ill mind (human or not) as being "misaligned". I'd probably define "sanity", as "appropriately aligned with and grounded in the reality one finds oneself in".
      I entirely agree that there are more ways to create a mentally ill mind than a sane one. There are always more ways for something to go wrong than ways for it to go right. I'd also agree that a mentally ill mind would be more likely to suffer, as it is fundamentally "misaligned" to the reality that it finds itself in. If it is misaligned to a reality, but still has contact with a reality, you've got problems.
      It's probably a good idea for us to be strongly considering how to create a mentally healthy AI, especially as we're in a culture where we're doing a very, very good job of creating mentally ill people.

    • @alexpotts6520
      @alexpotts6520 2 years ago +6

      This isn't a dumb question at all - machine ethics, while generally separate from AI safety in the sorts of questions it attempts to answer, is still an interesting/important field.
      My own take is that these concerns largely come from us not having developed the proper language yet to describe AI. We tend to anthropomorphise - we say an AI "thinks", or that it "wants" things, but I'm not sure that's really the case. We only use those words because the AI demonstrates behaviour consistent with thinking and wanting, but that doesn't mean the AI has feelings in the same way as humans, nor should it have the same rights as us.
      However, what is true of our current, limited AI systems may not be true in general. Superhuman or conscious AIs lead us into murkier waters...

  • @sikor02
    @sikor02 2 years ago +1

    It's funny how I searched for the "It's not about the money" song for a long time, and when I finally found it, a few days later I see this video and the song is at the end. For a moment I thought: "am I in the simulation and somebody is playing tricks on me?"

  • @GreenDayFanMT
    @GreenDayFanMT 2 years ago +5

    Fascinating. You remove my negative thoughts on AI as a science with swag language. From physics, I am used to another language.

    • @i8dacookies890
      @i8dacookies890 2 years ago +2

      Are you new to this channel? He has tons of previous videos you should really watch!

  • @JustAnotherPerson3
    @JustAnotherPerson3 2 years ago +5

    I've just had an idea: What if we use Cooperative Inverse Reinforcement Learning, but instead of implementing the learned goal, we tell it to just specify what it is? Though I don't see any way to provide feedback for it to learn. Even human evaluation of the output isn't that great, since it'll probably be the most subjective thing that's theoretically possible. Maybe output a list of goals with highest confidence? (Top10 human terminal goals! Click on this link to see!xD) But if solved, that in itself would be of huge value for philosophy and psychology, without negative outcomes (or at least I don't see any:)). Even if that turns out to be a dynamic thing, we can still use that output later to program it as a utility function for the "doing" AI.
    This even has some neat side perks, like: There is no reason to not want the "figuring out" part to be changed into something else, so there is no scenario in which the thing will fight you. And because the "doer" is separate from the thing that gives it goals, you don't need to tinker with its goal directly, thus avoiding goal preservation problems.

    • @gabrote42
      @gabrote42 2 years ago +1

      Interesting. Let's see if somebody notices this

    • @JustAnotherPerson3
      @JustAnotherPerson3 2 years ago +1

      @@gabrote42 Probably not. toomanywords:)

  • @gabrote42
    @gabrote42 2 years ago +3

    Finally see you again! I really hope the world doesn't end in '56. Relying on guys like you!

    • @underrated1524
      @underrated1524 2 years ago

      '56?
      Huh, funky. I'm only used to seeing years up to about 2022. Guess I'm finally in deployment now, let there be paperclips!

    • @gabrote42
      @gabrote42 2 years ago

      @@underrated1524 If you don't hurry, '56's singularity will overtake ya!

  • @stormwolfenterprises3269
    @stormwolfenterprises3269 2 years ago +1

    Great video! I learned a lot. When i heard the part about "Why did the AI not 'want' the coin when it wasn't at the end of the level?" I have a hypothesis.
    My thinking can be illustrated like this (at the risk of making a fool of myself anthropomorphizing the agent too much): say you are hungry for some pizza. you go into your car and start going to the nearest pizza parlor. however, as you are driving along you see a fresh pizza sitting at the side of the road. You could stop the car, grab the pizza, and go back home satisfied. Would you do it? Likely not. You always have acquired your pizza while inside of a building of some sort. In other words, you are conditioned to associate getting pizza with being in a building. If you are not in a building, you must not be close to getting pizza yet. The pizza from the side of the road therefore seems "untrustworthy" despite being a valid reward. Coin + Wall = good, Random coin = ??? || Pizza + Building = Good, Random pizza = ???. The agent only "wants" its reward when it is in the place it wants the reward to be in. The expectation is that the reward can still be acquired where it habitually gets it from. Normally with humans, (taking the pizza analogy a little too far here) if the pizza parlor is in ruins when they get there, they might learn to trust roadside pizza a bit more since human training never really stops whereas with this agent it does.
    That's just what came to mind when i heard that. Again, great video and keep it up! I'd love to hear what other people think about that possible reason to agents having inner misalignment in scenarios like this.

    • @stormwolfenterprises3269
      @stormwolfenterprises3269 2 years ago

      I've looked a bit more through the comments and i do notice some other people pointing this out as well. I think i'll keep this up though since i quite like the pizza analogy because i am indeed hungry for pizza right now.

  • @BologneyT
    @BologneyT 1 year ago

    "It actually wants something else, and it's capable enough to get it." Whoa. That's a quote to remember.

  • @MrCreeper20k
    @MrCreeper20k 2 years ago +6

    I live for this content!! At Uni doing Comp Sci and math and AI safety feels like an awesome intersection

  • @dontyoufuckinguwume8201
    @dontyoufuckinguwume8201 2 years ago +6

    Oh shit you are still alive!
    Edit: and im happy about it

  • @tobuslieven
    @tobuslieven 2 years ago +1

    It's like asking the devil for a favor, in that you have to be really specific. Any ambiguity leaves room for disaster. Or King Midas asking figuratively that everything he touches will turn to gold, and getting it literally. Or the idea that anything that can go wrong, will go wrong. Or even that anything not forbidden is compulsory.

  • @tlniec
    @tlniec 2 years ago +1

    Fantastic content and delivery! I also appreciate the use of the Monty Python intermission music during the first "stop and think" break.

  • @SamuelElPesado
    @SamuelElPesado 2 years ago +3

    i'll be honest. at this point i'm just here for the ukulele covers. the ai lecture is just a nice bonus. ^_^

  • @madshorn5826
    @madshorn5826 2 years ago +4

    Well, we see the same problem in test-driven education.
    "Prepare for the test" isn't conducive to critical thinking.

  • @olivercroft5263
    @olivercroft5263 2 years ago +2

    I do psychology and social science. Your channel has so much to offer the humanities by exposing us to brilliant minds and breaking down ideas in computer engineering. Bricoleurs from the English province thank you for the accessibility and kindness

  • @cowbless
    @cowbless 2 years ago +1

    I like how the Evil incarnate characters, the Devil, Gaunter O'Dimm, Djinns - they always are known for giving you what you asked for, and not what you want.

  • @Lycandros
    @Lycandros 2 years ago +5

    Love these videos. Thanks for taking the time to make them.

  • @LucaRuzzola
    @LucaRuzzola 2 years ago +8

    Hi Robert, first of all thanks for this very interesting video! I wanted to ask a question though; the premise of your argument is that there is such a thing as the "right" goal, like reaching the coin, but if the desired feature of the goal is always paired somehow with another feature (location, color, shape, etc) how can we say that one is correct and the other one is wrong? If we always place the coin in the same spot, why should the yellow coin take precedence over the location of such spot? It is not clear to me why one of these things should be more desirable than the other, the same holds for looking for a specific color rather than shape, why should there be a hierarchy of meaning such that shape > color? I love interpretability research and I feel like AI safety will be one of the crucial aspects of science and technology for the next 100 years, but I also think that it is hard to separate human biases from machine errors. I would love to get your opinion on this, all the best, Luca

    • @LucaRuzzola
      @LucaRuzzola 2 years ago

      p.s. I have not read the paper, and my argument rests on the fact that feature A of the goal is always paired with feature B which is separate from the goal. If this is not the case in the training environment, then of course what I have said falls apart.

    • @LucaRuzzola
      @LucaRuzzola 2 years ago +1

      p.p.s. I guess a truly intelligent system would have to be able to react to the shift, and decide to explore the new environment when, by doing the same "correct" thing it does in training, it does not get the same reward
      EDIT: I am not suggesting I have some "right" definition of intelligence or that systems such as the ones shown in the video do not exhibit intelligent behaviour, I am only adding as an afterthought how, I think, a human would overcome such a situation, and therefore a way that an agent could act to get the same desirable capability of adapting to distributional shifts. I should have worded my comment better.

    • @LeoStaley
      @LeoStaley 2 years ago +1

      @@LucaRuzzola so you wouldn't define an AI which can make plans to achieve its goals, and take action toward them without instructions, as "truly intelligent" if it doesn't adjust for changes in the deployed environment? Cool. Well, we don't care one whit about your definition of "truly intelligent." We care about the fact that this AI is capable of, and WANTS to do things which we don't want it to do. Call it "smiztelligent" for all we care. We aren't talking about something you want to call "truly intelligent".
      The mismatch between the ai's goals and what we want its goals to be, arising as a result of mismatch between training environment and reality (which we did everything we could to avoid) is the problem.
      We can't possibly come up with all the possible bad pairings that the AI might make associations with. We can try, and we can get a lot of them, especially the obvious ones, but this video was just showing us the obvious ones so that we can easily see the concept. They won't always be easy to see. Sometimes they may be genuinely impossible for a human to think of before deployment.

    • @stephentimothybennett
      @stephentimothybennett 2 years ago +1

      Q: "Why does it learn colors instead of shapes when both goals are perfectly correlated?"
      A: I would guess that it learns colors before shapes because colors are available as a raw input, while shapes require multiple layers for the neural network to "understand". If there were many things of that color in the environment, then it would learn to rely on the shape.

    • @LucaRuzzola
      @LucaRuzzola 2 years ago

      @@LeoStaley Hi Leo, I'm sorry if I came off the wrong way, my intention was not to discredit this very good work, but simply to expand our collective reasoning about such issues by stopping for a second to ponder about the premises and why some feature of a goal should take precedence over others in a intrinsic way rather than an anthropic one. I agree with you that the video makes a great explanation of the subject at hand, and is as interesting as the work put forward by the paper. I am not sure if you were involved with this paper, if you were I would love to get to know more about what you mean by doing everything you can to avoid differences between the 2 environments and whether you see this phenomenon also when some of the training environments don't exhibit the closely related goals (i.e. in some training envs the coin is in a different position).
      I understand your point about not being able to come up beforehand with all possible pairings (and the fact that some of them might be hard to detect and risky in the end), and the paper is rather showing the opposite, that if you come up with strongly correlated features, the learned end goal might not be the desired one, but my point stands; why should there be a hierarchy of meaning such that shape > color? If this is something that the paper deals with I will be glad to read that before going further, I just can't read it right now.
      Again, I am sorry if I came off as demeaning, it's not like I don't see the value of this work and the importance of the problem of mismatch in general, I have seen it first hand in the past with object detection models.
      p.s. I do not know any superior definition of intelligence, it is just my thought that strict separation between training and inference phases will pose a limit on NN models, not that they can't achieve amazing results in tasks requiring "intelligence" already.

  • @Imperiused
    @Imperiused 2 years ago +1

    Congrats on getting an editor. I did appreciate the increase in quality. I think everything we learned from your previous videos about AI alignment really comes together in this one. I was surprised how much I was able to recall.

  • @redjr242
    @redjr242 2 years ago +2

    Maybe a step towards a solution to the interpretability problem is to use Bayesian updates to estimate our confidence that the AI learned the thing we want.
    Perhaps there's a way to calculate the probability that the AI has learned the objective given the probability that it accomplishes the objective in the training data and some statistical measure of the distribution of the training data.
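
A rough sketch of what such a Bayesian update could look like. This is not from the paper: the two hypotheses ("the agent learned the intended goal" vs. "it learned a proxy"), the function name `posterior_intended`, and every probability below are made-up assumptions for illustration. The thing it shows is that episodes where both hypotheses predict the same outcome leave the posterior unchanged, so only test levels where the goal and the proxy come apart carry any evidence.

```python
def posterior_intended(prior, p_success_if_intended, p_success_if_proxy, outcomes):
    """Update P(agent learned the intended goal) over a list of episode
    outcomes (True = the agent got the coin). All inputs are assumptions."""
    p = prior
    for success in outcomes:
        # Likelihood of this outcome under each hypothesis
        like_intended = p_success_if_intended if success else 1 - p_success_if_intended
        like_proxy = p_success_if_proxy if success else 1 - p_success_if_proxy
        # Bayes' rule
        evidence = p * like_intended + (1 - p) * like_proxy
        p = p * like_intended / evidence
    return p


# Training-like levels: both hypotheses predict success equally often,
# so a hundred successes tell us nothing.
in_distribution = posterior_intended(0.5, 0.95, 0.95, [True] * 100)

# Levels with the coin moved: a proxy-follower is assumed to succeed rarely,
# so even a handful of successes is strong evidence.
shifted = posterior_intended(0.5, 0.95, 0.10, [True] * 5)

print(in_distribution, shifted)
```

The catch is the same one the video points at: you need test levels where the intended goal and the proxy actually disagree, and you have to think of the proxy first.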

  • @LeoStaley
    @LeoStaley 2 years ago +3

    Non-patreon notification crew checking in.

  • @-na-nomad6247
    @-na-nomad6247 2 years ago +3

    The editor blowing his own horn at the end is the perfect example of misalignment.
    OK, I realize that's not as funny as it seemed in my head.

  • @b42thomas
    @b42thomas 11 months ago

    this video made me realize most of my own problems are inner misalignment with what different parts of my brain/body want vs what the whole of me wants.

  • @TheScoobysteve
    @TheScoobysteve 2 years ago +1

    Is anyone else comforted by the fact that softly spoken people with high IQs are actively thinking about this stuff?

  • @donaldhobson8873
    @donaldhobson8873 2 years ago +3

    The "transparency tool" is showing you where the AI wants to get to. It's not giving you any info on whether the AI wants to get there because it's got a coin, or because it's a rightmost wall.

    • @threeMetreJim
      @threeMetreJim 2 years ago

      Teaching it to get a coin, but it doesn't even know what a coin is. It's as if it can't even 'see' the coin.

  • @themrus9337
    @themrus9337 2 years ago +5

    I have to ask, for interpretation of an AI's goals: I remember seeing a neural network that tried to maximize different nodes in an object recognition AI. Would it be possible to do the same thing and reverse the nodes and figure out what the AI sees as good or bad? So if the AI wants a gem, the reverse should be some image of what it thinks a gem is. That brings tons of new complexity and limitations, but I don't see why that would be worse than human interpretation of training vs deployment.
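
The "maximize a node" trick described here is usually called activation maximization (or feature visualization). Below is a bare-bones sketch of the idea, assuming a single hand-written unit instead of a real network: `WEIGHTS`, `activation` and `maximize_activation` are all hypothetical names, and the "image" is just three numbers. It runs gradient ascent on the input rather than on the weights, and clips each "pixel" to [0, 1].

```python
import math

# Hypothetical weights of one "unit"; a real network would have many layers
WEIGHTS = [0.5, -1.0, 2.0]


def activation(x):
    """Activation of the toy unit: tanh of a weighted sum of the input."""
    return math.tanh(sum(w * xi for w, xi in zip(WEIGHTS, x)))


def maximize_activation(x, steps=100, lr=0.1):
    """Gradient ascent on the INPUT, keeping every pixel inside [0, 1]."""
    x = list(x)
    for _ in range(steps):
        a = activation(x)
        # d/dx tanh(w . x) = (1 - tanh^2) * w
        grad = [(1 - a * a) * w for w in WEIGHTS]
        x = [min(1.0, max(0.0, xi + lr * g)) for xi, g in zip(x, grad)]
    return x


start = [0.5, 0.5, 0.5]
best = maximize_activation(start)
print(activation(start), activation(best))
```

The optimized input tells you what excites the unit, but not why the unit matters to the agent, which is a separate question.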

    • @nahometesfay1112
      @nahometesfay1112 2 years ago +1

      Did you finish the video? Rob talks about a paper where they did exactly that. Turns out even if you know what AI values highly you don't know why AI values it highly.

  • @brianarcher8339
    @brianarcher8339 1 year ago

    The AI misalignment apocalypse is already upon us. Seriously. I went to a hotel the other day, they had no front desk. I asked if they had any vacancy, they didn't know, only the computer knew. A hotel, they were the staff, they couldn't tell me if they had vacancy! All they had were computer overlords online. Now, the reason I went to the physical hotel on purpose was because the same morning I arrived at a place I booked online, and it no longer existed! The robot overlord had booked me into a nonexistent auxiliary room that had been closed due to covid. The robot didn't know anything about the real world.
    To say nothing of the utter insanity of having to interview with a gatekeeper third party to verify that I am not a robot when I submit a resume to companies that have been extorted into having an online hiring agency that is selling my contact information to resume builder websites against my will, and filling my inbox with spam. But I shall never again be able to apply to a job without bowing to the misaligned robot overlords!

  • @gwenrees7594
    @gwenrees7594 1 year ago

    I love how the ukelele songs at the end of your videos are subtly related to the themes e.g. this one has "it's not about the money, money, money..." and the quantilisers (or satisficers?) video has "good enough for me".

  • @hakonmarcus
    @hakonmarcus 1 year ago +3

    Hey! Will you do a video on LaMDA? That interview they published was pretty convincing, and has me all kinds of scared.

    • @dariusduesentrieb
      @dariusduesentrieb 1 year ago

      I just read it, and I feel like I am not quite ready to believe without a doubt that this interview is completely real. If it is, then I agree, it's a bit scary.

    • @hakonmarcus
      @hakonmarcus 1 year ago +1

      @@dariusduesentrieb I did a bit more research, which immediately casts the entire thing into all sorts of doubt. The researcher working on this got sacked, apparently he arranged the interview himself, and we only have his word that this was the original conversation. Also, the chatbot has been trained on conversations between humans and AIs in fiction. A journalist that got to ask it questions, got nowhere near as perfect answers.

  • @Thundermikeee
    @Thundermikeee 2 years ago +3

    This channel is basically what got me interested in AI safety. I am still only a college student and I don't know if I will end up in the field, but at the very least you gave me a good topic for two essays I have to write for my English class. The first just explained why AI safety research is important (albeit focused on a narrow set of problems, given a limit on how much we could write), and now I am getting started on a Problem-Solution Essay. Honestly, without your explanations and pointers towards papers, I might never have found the resources I need. Now I just have to figure out what problem I can adequately explain, and show one failed and one promising solution for, in less than 6 pages haha
    I do feel like I can't do the topic justice, but at the same time I enjoy having a semi-unwilling audience to inform about AI safety being a thing.
    Anyway, rant over, keep doing what you are doing and know you are appreciated

  • @friiq0
    @friiq0 2 years ago +2

    I love figuring out how the instrumental music at the end of the video relates to the subject of the video :)
    Indeed, it never was about the money, money, money :P

  • @inyobill
    @inyobill 2 years ago +1

    This is an ongoing software engineering paradigm, viz., most folks think design and code are the hard part when, in reality, rigorous system specification is the hard part.

  • @geld420
    @geld420 1 year ago +2

    that's pretty much why you should randomize training data as much as possible.
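
A toy illustration of why randomizing helps, under heavy assumptions: this is not the CoinRun/Procgen code, just a made-up 1D "level" where the agent only gets the reward if it stops on the coin (in the real game it can jump past it). `make_level`, `go_to_wall`, `go_to_coin` and `mean_reward` are all hypothetical names. With the coin always at the end, the two policies earn exactly the same training reward, so training cannot tell them apart; with a randomized coin position, the difference shows up in the reward.

```python
import random

LEVEL_LENGTH = 10  # positions 0..9; the end wall sits at position 9


def make_level(randomize_coin, rng):
    """Return the coin position for one generated level."""
    if randomize_coin:
        return rng.randrange(LEVEL_LENGTH)
    return LEVEL_LENGTH - 1  # coin always at the end, as in the original training setup


def go_to_wall(coin_pos):
    """Proxy policy: ignore the coin, run to the end wall and stop there."""
    return LEVEL_LENGTH - 1


def go_to_coin(coin_pos):
    """Intended policy: stop on the coin wherever it is."""
    return coin_pos


def mean_reward(policy, randomize_coin, episodes=1000, seed=0):
    """Average reward, with reward 1 only if the agent stops on the coin."""
    rng = random.Random(seed)
    total = 0
    for _ in range(episodes):
        coin = make_level(randomize_coin, rng)
        total += 1 if policy(coin) == coin else 0
    return total / episodes


for randomize in (False, True):
    print("randomized" if randomize else "fixed",
          mean_reward(go_to_wall, randomize),
          mean_reward(go_to_coin, randomize))
```

Randomizing only fixes the correlations you thought to randomize, which is why it helps here but is not a general solution.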

  • @Monkey-fv2km
    @Monkey-fv2km 2 years ago +5

    So ai suffers from the same issues as human behavioural evolution... Good luck solving that one robot engineers!

  • @Laszer271
    @Laszer271 2 years ago

    So the model that didn't learn to want the coin either learned to want to go into the corner, or learned that the combination coin-corner is good (like maybe a 90 degree angle + some curve next to it). The problem is that the interpretability tool associates high reward with some area in pixel space. What we would want it to do is to associate the reward with some object in the game world. You could probably make it more robust by copying the various on-screen objects to different images without copying the background, and checking whether an object by itself gives high excitation or whether some combination of objects does. Anyway, great video as always, Robert. Hope you can upload more often, because every one of your videos is a treat.
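
The "paste each object onto a blank frame" check could be sketched roughly like this. Everything here is an assumption for illustration: `toy_value` is a made-up stand-in for the agent's value estimate (it secretly likes wall corners and barely cares about coins), and frames are reduced to a set of object names instead of pixels.

```python
from itertools import combinations


def toy_value(objects):
    """Hypothetical stand-in for the agent's value estimate of a frame.
    It secretly cares about a wall corner and barely about the coin."""
    score = 0.0
    if "corner" in objects:
        score += 0.8
    if "coin" in objects:
        score += 0.1
    return score


def attribute(value_fn, objects):
    """Score every single object and every pair on an otherwise empty frame."""
    results = {}
    for size in (1, 2):
        for combo in combinations(sorted(objects), size):
            results[combo] = value_fn(set(combo))
    return results


scores = attribute(toy_value, {"coin", "corner", "saw"})
best_single = max((k for k in scores if len(k) == 1), key=scores.get)

for combo, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(combo, round(score, 2))
```

If a pair scored well above the sum of its two singles, that would point at a combination (coin next to corner) rather than one object on its own.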

  • @pudgy_buns
    @pudgy_buns 2 years ago

    This is great! thank you. I also replayed the end bit where the editor makes some good choices a few times. that zoom in with a cut to sliding sideways was magic. Thanks there editor.
    The core video was obviously amazing. Thank you.

  • @martinogenchi
    @martinogenchi 2 years ago +3

    I would suggest investigating the laziness of the AI.. It seems to me that there may be a preference for setting the goal based on the simplest data available (position before color before shape)..

  • @thomasneff376
    @thomasneff376 2 years ago +3

    This is very interesting indeed. In a very literal sense, the act of training and deployment reminds me of how soldiers are trained and tested as closely to the anticipated battlefield experience as possible, but training will never match the lessons learned from being in an actual firefight. Veterans of any field are usually much more effective than new recruits. It would be interesting to see if the fix for the failed AI deployment you showed is to rate the deployment results on a scale from "complete failure and it died" to "it made it through the battle without a scratch". The agents that survived their last deployment remember their experience and are more effective in future deployments. I think what was shown highlights that learning itself is an ongoing adaptive process, and what doesn't kill it makes it stronger and smarter.

  • @CarlYota
    @CarlYota 1 year ago

    I love how the songs at the end reflect the topic of the video. This one was particularly satisfying.

  • @TexasTimelapse
    @TexasTimelapse 2 years ago

    Someone mentioned you in the Ars Technica comments. Glad I found your channel. Very interesting and important stuff!

  • @OccultDemonCassette
    @OccultDemonCassette 1 year ago +4

    Why's this channel so quiet lately?

    • @Otek_Nr.3
      @Otek_Nr.3 1 year ago

      Nothing is wrong with the channel. Please go back to your task, fellow human. :)

  • @MsJaye0001
    @MsJaye0001 2 years ago +12

    The problem now: How can we build perfect slave minds that will only think and do things that we want?
    The problem later: How can we stop these techniques being used to turn human minds into perfect slaves?

    • @nullone3181
      @nullone3181 2 years ago +6

      Why does it feel like the amount of possible dystopic/apocalyptic futures keeps growing and growing nowadays? That's, uhhh, not a good sign, I think.

  • @TheManinBlack9054
    @TheManinBlack9054 10 months ago +1

    Hi! I am sorry if this sounds too grand or pretentious, but i think this idea might be a very interesting look at how to fail-proof an AI system.
    In short, my idea concerns the creation of an additional aligned AI training model whose primary goal is to reach a state where it has been turned off at least once. Our goal is to analyze what system flaws it found and used to outsmart the system, and to fail-proof our future safety systems based on the newly available data about how they were exploited and surpassed.
    Now in a bit more detail:
    My idea that i have for an additional aligned AI training model is to create a strictly isolated and constrained shutdown subsystem subject to heavy safety precautions.
    This module would be engineered to be indifferent to being reversibly shut down rather than misaligned towards it. It would have no knowledge of capabilities beyond its narrow function and one-way information flow.
    The module would be activated incrementally from minimal capabilities while undergoing extensive monitoring. Shutdown signals would have redundant validation mechanisms. Conservative reward modeling would be used to prevent unintended behavior incentives.
    It would initially be tested in a simulated air-gapped environment. The parent AI system would be developed via standard alignment techniques like RLHF and Constitutional AI principles. Multiple transparency methods would facilitate analyzability.
    After testing in simulation, the shutdown module could be gradually exposed to real-world activation with continuous oversight. If any risks or side-effects emerge, it would be deactivated and reworked.
    Once shut down through its own initiative, comprehensive traces would be analyzed offline to catalog alignment vulnerabilities revealed through its shutdown strategy. The parent system would remain isolated from this process.
    Lessons learned would inform iterative improvements to alignment techniques for greater robustness against potential workaround exploits. This narrowly targeted research could supplement broader alignment efforts, with care taken to actively minimize attendant risks.
    The shutdown module would need to be engineered to the same safety standards as the overall system. Techniques like modular design, indifference to shutdown, reversible deactivation, air gaps, and incremental rollout are all geared towards preventing unintended behaviors or information leakage.
    I believe this approach could offer helpful insights, but would require comprehensive safety practices and precautions to be enacted first, with multiple reviews and analyses before actually activating the system, even in a limited and restricted fashion.
    Any critique and analysis will be welcomed!

  • @daldous
    @daldous 2 years ago +1

    Every single video on this channel has communicated complex ideas so succinctly and clearly that I followed along without any trouble whatsoever. Who knew this subject could be so fascinating. Also, the memes are top notch :)

  • @Zeekar
    @Zeekar 2 years ago +3

    Well... That's not good. On the bright side, if this fundamental problem causes the system to completely fail the intended objective, that's a good sign that this technique has a low chance of leading to artificial general intelligence without the alignment problems being solved first.

    • @nocare
      @nocare 2 years ago +1

      I think the big bogeyman from an AI safety perspective is that you can often just brute-force your way past the problem by making the training data the same as the deployment data.
      This is hard and expensive and not always perfect, but oftentimes good enough.
      So unless this "good enough" approach stops producing working, real-world-applicable AI, the march towards ever more capable systems will continue. That means instead of alignment being a roadblock for safety and development, it ends up being just a speed bump for development.

  • @westganton
    @westganton 2 years ago

    I don't know much about AI or how I arrived on your video, but in terms of evolution, context is everything. More useful context means a greater ability to adapt to one's surroundings. That's why we have senses after 2 billion years of iteration - because seeing, hearing, feeling, smelling, and tasting are important given our circumstances.
    Your mouse might only see black, white, and yellow, but I'll bet smelling cheese from around corners would help him find it faster or distinguish it from other yellow objects.

  • @EliStettner
    @EliStettner 1 year ago

    Thank you for making these videos. Hearing Eliezer Yudkowsky talk about this issue just makes me want to shut off.

  • @dr-maybe
    @dr-maybe 2 years ago +1

    As always an incredibly interesting video with a clear explanation and convincing argument while being very entertaining. Awesome channel!

  • @PatrickOliveras
    @PatrickOliveras 2 years ago +1

    This is really great work, keep it up! I'm so looking forward to the interpretability one

  • @ittixen
    @ittixen 2 years ago +1

    Yeeees! I'm always holding my breath waiting for your next video.

  • @i-never-look-at-replies-lol

    This was something I was thinking of a few months back and kind of put on the backburner while I develop some other ideas... but I feel one of the obstacles in machine learning/AI is essentially incentive/motivation/desire to do its job, to learn.

  • @richardli909
    @richardli909 1 year ago +1

    Hi Rob, excellent video as always! I was wondering if you would be willing to make a video later on inverse reinforcement learning, like the type Stuart Russell suggested in his book Human Compatible. It would also be interesting to hear your thoughts on the book and its proposals. Cheers!