This Algorithm Could Make a GPT-4 Toaster Possible

  • Published 19 June 2024
  • The Forward-Forward algorithm from Geoffrey Hinton is a backpropagation alternative inspired by learning in the cortex. It addresses several issues with backprop, which could allow it to run much more efficiently. Hopefully research like this continues to pave the way toward fully hardware-integrated AI chips in the future.
    Outline
    0:00 - Intro
    1:13 - ClearML
    2:17 - Motivation
    5:40 - Forward-Forward Explained
    13:54 - MNIST Example
    18:54 - Top-Down Interactions
    26:00 - More Examples / Results
    27:41 - Sleep & Phased Learning
    29:36 - Related Ideas
    30:38 - Learning Fast & Slow
    32:35 - Mortal Computation
    ClearML - bit.ly/3GtCsj5
    Social Media:
    YouTube - / edanmeyer
    Twitter - / ejmejm1
    Sources:
    Paper - www.cs.toronto.edu/~hinton/FF...
  • Science & Technology

Comments • 295

  • @maboesanman
    @maboesanman 1 year ago +231

    "Imagine your brain blacking out every couple seconds" I mean we do sleep for 8 hours a night

    • @nullbeyondo
      @nullbeyondo 1 year ago +11

      seconds = hours? Please watch the video, he talks about sleep too.

    • @ThompYT
      @ThompYT 1 year ago +51

      @@nullbeyondo when the joke is taken seriously

    • @Mr0rris0
      @Mr0rris0 1 year ago +1

      @@ThompYT Bono found the pub ya
      ENOUGH IS ENOUGH.
      kidding.

    • @andybrice2711
      @andybrice2711 1 year ago +7

      And I'd hazard a guess that dreams are mostly our brains doing some sort of data processing on the day's events.

    • @monad_tcp
      @monad_tcp 1 year ago +4

      Doesn't the brain black out every 100ms already?

  • @Waltonruler5
    @Waltonruler5 1 year ago +43

    "What if we abandoned computers altogether?"
    Bah gawd, that's the Butlerian Jihad's music

  • @nullbeyondo
    @nullbeyondo 1 year ago +50

    Watched the entire video. It's very well put. I also like that you're still showing parts of the paper so we can pause the video to read the rest, instead of just the highlighted parts. It actually got me interested in reading the entire paper, with a more vivid view of it as you go through it.

  • @gbfar
    @gbfar 1 year ago +101

    I just want to note that there have been many recent alternatives to traditional neural networks that do not rely on backpropagation or storing the results of the forward pass.
    Two examples are Neural Ordinary Differential Equations [Chen, 2018] and Deep Equilibrium Models (DEQ) [Bai, 2019]. The former performs both the forward pass and the backward pass by solving differential equations, while the latter does so by solving fixed-point equations.
    DEQs have been particularly successful. They are theoretically well understood, use significantly less memory than regular neural networks since they do not need to store the results of the forward pass, and have achieved comparable or SOTA results on many different tasks.
    There has also been some work on creating differentiable machine learning models based on optimizers such as Quadratic Programs [Amos, 2017], Semidefinite Programs [Wang, 2019] and even Integer Programs [Paulus, 2021].
    These models can be trained using gradient-based optimization without storing intermediate results or using backpropagation, although their results are overall less impressive.
    Interestingly, Zico Kolter (who has been involved with DEQs) mentioned a few years ago that specialized hardware could be developed to solve fixed-point equations. This could greatly favor Deep Equilibrium Models by providing some of the benefits outlined in Hinton's paper.
    Personally, I think there has been some overreaction to Hinton's paper.
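
    To illustrate the fixed-point idea mentioned above, here is a toy NumPy sketch of a DEQ-style forward pass (real DEQs use a root-finder such as Broyden's method and differentiate implicitly through the fixed point; the sizes and constants here are arbitrary):

      import numpy as np

      # One weight-tied "layer" iterated to a fixed point: the output of the
      # "infinitely deep" network is the z* solving z = tanh(W z + U x + b).
      def deq_forward(W, U, b, x, iters=50):
          z = np.zeros(W.shape[0])
          for _ in range(iters):
              z = np.tanh(W @ z + U @ x + b)
          return z  # approximate fixed point; no per-layer activations stored

      rng = np.random.default_rng(0)
      W = rng.normal(0.0, 0.05, (64, 64))  # small weights help the iteration converge
      U = rng.normal(0.0, 0.05, (64, 784))
      z_star = deq_forward(W, U, np.zeros(64), rng.random(784))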

    • @hr3nk
      @hr3nk 1 year ago +14

      True, I have also been interested in the papers you mentioned, and what's great about them, and about backprop in particular, is that they rely on very solid and very well understood math, where models are just a form of notation for expressing calculations that a computer can execute. However, Hinton's Forward-Forward relies on some heavily intuitive engineering. Not saying that to discredit Hinton's research, but even in this overview it is mentioned that there are too many broad terms, like "goodness" and "positive and negative" data. It would be really cool if an attempt were made to boil these vague terms down to statistical interpretations, like in-distribution and out-of-distribution data, and provide some theoretical grounding, but I guess that's out of scope for the paper.

    • @randylefebvre3151
      @randylefebvre3151 1 year ago +2

      I have also seen Bayesian inference neural networks, which are really interesting!

    • @kuretaxyz
      @kuretaxyz 1 year ago +1

      By the way, what happened to capsule networks?

    • @hr3nk
      @hr3nk 1 year ago +2

      @@kuretaxyz Well, they sort of aren't that relevant today, as the attention mechanism really rules most of today's architectures - it pretty much learns what to learn, while capsules tried to capture specific information through manual engineering of the network. They can be more efficient - but hardware is not an issue today (yet).

    • @revimfadli4666
      @revimfadli4666 1 year ago +2

      ​@@hr3nk aren't capsules more expensive to compute, but have the advantage of being orientation-independent? Which is probably one or a few clever hacks away from becoming practical, like how YOLO supplanted U-net for classification/detection and segmentation at once

  • @seedmole
    @seedmole 1 year ago +37

    Self-organizing FPGA chips will someday be a thing, very exciting stuff

    • @dylanbrophy1203
      @dylanbrophy1203 1 year ago

      I'm curious to see more on this - any good references? I love FPGAs.

    • @randalllionelkharkrang4047
      @randalllionelkharkrang4047 1 year ago

      Hi, what do you mean by FPGA? I am aware of Self-Organizing Maps, but what are FPGAs?

    • @JensNyborg
      @JensNyborg 1 year ago

      @@randalllionelkharkrang4047 Guessing Field Programmable Gate Array.
      I vaguely remember goofing around with one in the mid '80s, and I too was thinking that variations on that theme might pop up in this context.
      And that the weights of neural networks in solutions inspired by FPGAs might be at least partially digital, solving some of the bootstrapping issues.
      Then again, the '80s are a while back and I haven't kept up on the hardware side of things.

  • @nevinb60
    @nevinb60 1 year ago +22

    I heard about the Forward-Forward algo on the podcast Eye on AI. There was so much technical detail that I felt like I was lost. But you did a great job!

  • @Bencurlis
    @Bencurlis 1 year ago +48

    I have tried using the Forward-Forward algorithm as a way to learn universal probability distribution approximators, which could then be used to detect anomalies and generate new data, but it didn't work very well. Perhaps if I had used the more complex variant that has feedback it would have worked better, but it is very complicated for a small gain imo.
    But far be it from me to throw out forward propagation; I think it is the way to go for future AIs.
    Instead of Forward-Forward, there is a preprint called "Signal Propagation: A Framework for Learning and Inference In a Forward Pass", which is a very similar framework that actually has the Forward-Forward algorithm as a special case, in a way. SigProp also has a mechanism for feedback, which seems much more efficient.
    I hope you will make a video about that paper as well; this one was very well explained.

    • @JeffHykin
      @JeffHykin 1 year ago +2

      I still have yet to see an implementation of the more complex feedback version. I would really like to see it, but it's not the easiest thing to implement. I don't even think the performance will be great, but I would guess/hope that it's naturally resistant to adversarial attacks. That probably depends on the way negative data is generated though, which is kind of a major weakness overall.

    • @descai10
      @descai10 1 year ago

      @@JeffHykin You could look at it as a weakness or a benefit, the benefit being that figuring out how to generate good negative data could potentially make them even more effective than current backpropagated NNs

    • @Bencurlis
      @Bencurlis 2 months ago

      @@seanoconnor1984 Direct random feedback alignment works much better on multilayer perceptrons; I don't think the method will work if you use it on CNNs or transformers. You also have to correctly scale your random feedback matrices and make sure they stay constant throughout training. You also have to start from a randomly initialized model, and the last layer should be learned normally. Lastly, you should prevent gradients from flowing from a layer to the previous one.
      If you do all of these things, the model should train, but probably not to the level of an equivalent model with regular backprop.
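
      As a rough illustration of those conditions, here is a minimal NumPy sketch of direct feedback alignment on a small MLP (the layer sizes and learning rate are arbitrary choices, not from the comment):

        import numpy as np

        rng = np.random.default_rng(0)
        sizes = [784, 256, 128, 10]
        # forward weights (trained) and fixed random feedback matrices (never trained)
        W = [rng.normal(0, 0.05, (m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
        B = [rng.normal(0, 0.05, (m, sizes[-1])) for m in sizes[1:-1]]

        def dfa_step(x, target, lr=0.01):
            a = [x]
            for Wl in W[:-1]:
                a.append(np.tanh(Wl @ a[-1]))  # hidden activations
            e = W[-1] @ a[-1] - target          # output error
            W[-1] -= lr * np.outer(e, a[-1])    # last layer learned normally
            for l, Bl in enumerate(B):          # hidden layers: error comes through
                delta = (Bl @ e) * (1 - a[l + 1] ** 2)  # a fixed random projection (tanh')
                W[l] -= lr * np.outer(delta, a[l])      # no gradient chain between layers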

  • @joey199412
    @joey199412 1 year ago +64

    As someone with an original background in electrical engineering, I can immediately see the potential in this. There are some other benefits as well to stand-alone analog AI chips: lower latency, perhaps even the theoretical limit of the speed of light with photonic analog. Higher precision and divisibility for finer control. The downside would be state control and retention. There has to be a backup in case power runs out; otherwise you'd have AIs that "die" when the chip loses power.
    I think it's important for computer scientists to recognize that traditional distinctions like von Neumann architectures, the software-hardware divide, and the digitization of data are all, at the end of the day, just human inventions and tools developed for a reason. If this reason changes, then the tools and abstractions can also fade away. There's a reason evolution favored analog thinking machines: it's just way more power efficient, and I think this is the obvious future of computing in the far future.

    • @EdanMeyer
      @EdanMeyer  1 year ago +9

      I know hardly any electrical engineering, so it’s interesting to hear from others that know a lot more about this, I wish I knew more so I could dive deeper into these things in my videos

    • @nullbeyondo
      @nullbeyondo 1 year ago +8

      I do acknowledge the benefits, but it's much harder to research neuromorphic hardware, let alone for a company to invest in neuromorphic chips when it could just use cloud computing or buy already easily-scalable general computers. I admit it's really frustrating, the kind of world we live in; it'll happen one day I believe, but very slowly.

    • @sb_dunk
      @sb_dunk 1 year ago +7

      I'm not sure I'm convinced that analog is always superior to digital. When it comes to procedural processes, I would assume digital is more efficient. I am happy to be corrected.
      The benefits of analog hardware for things like neural nets seem obvious, but I would assume the future of computing would involve a combination of analog and digital.

    • @nullbeyondo
      @nullbeyondo 1 year ago +15

      @@sb_dunk You're getting the wrong idea. Analog is not superior, but when it comes to computations on narrow tasks, it's much faster, as it doesn't require cycles to perform computations. You can just let the laws of physics do the computations for you. Just like how quantum computers work in some narrow-purpose tasks.

    • @sb_dunk
      @sb_dunk 1 year ago +5

      @@nullbeyondo I understand that, but the original comment said "There's a reason evolution favored analog thinking machines. It's just way more power efficient and I think this is the obvious future of computing in the far future."
      I was challenging the idea that analog is the future. For AI and similar systems, maybe, but I doubt digital computing will be phased out (pun intended).

  • @harrysvensson2610
    @harrysvensson2610 1 year ago +6

    34:36 Exactly. Staying in the digital domain means that whether you do something on the morning of February 16th, 2023 in Sweden where it's snowing, or in the Sahara desert at noon where it's scorching hot, you'll get the same answer. With an analog computer you won't, because resistors are affected by temperature. The copper traces on the PCB that carry all the analog voltages will be susceptible to noise; any electromagnetic wave that gets absorbed by the circuit, small or large, goes into the circuit and affects the output. With a digital circuit you have a bit, and the voltage levels between a logic 1 and a logic 0 are so far apart that they will never be misinterpreted (sometimes they are, but that depends on your made-up scenario).

  • @General12th
    @General12th 1 year ago

    Hi Edan!
    This is very interesting! Lovely overview. I'm excited!

  • @eruiluvatar236
    @eruiluvatar236 1 year ago +10

    Regarding the analog-to-digital converters: they are only expensive and power-hungry because you need them for all the weights. If you just want to interface the input and output of the network with a digital system, the number of bits and the speed required are much more manageable, and so is the power. So it seems doable to have an AI accelerator using an analog chip for the Forward-Forward algorithm, connected to regular digital stuff that is as high- or low-power as required for the non-AI functions.

  • @QuadraticPerplexity
    @QuadraticPerplexity 1 year ago +6

    The "sleep is unlearning by negative examples" notion is decades old. I read it in a book about NNs about 30 years ago. :-)

  • @brian7android985
    @brian7android985 1 year ago +13

    "I toast, therefore I am"
    Red Dwarf

  • @oncedidactic
    @oncedidactic 1 year ago

    Great paper digest and comments, thanks!

  • @ofconsciousness
    @ofconsciousness 1 year ago +1

    Great dog! I love the pose with the legs and tail in motion, it really captures the forward direction of the dog. And the tilt of the head looks so curious and expressive!

  • @MickGardner-vc4us
    @MickGardner-vc4us 1 year ago +2

    Very RL-like in its local update rules! Thanks for bringing this up, Dr Meyer!

  • @harrysvensson2610
    @harrysvensson2610 1 year ago +8

    33:20 Multiplying two analog voltages is much more complicated than you'd expect. There are many, many different ways of doing analog-times-analog multiplication, but all of them are either more complicated than binary multiplication, slower than binary multiplication, or less accurate than binary multiplication (5x3 is 15, not 14.3 or 18.01, which certain analog multiplication methods could spit out).
    Multiplying an analog voltage by a constant is easy peasy though, and luckily this is what the weights usually are.

    • @DeruwynArchmage
      @DeruwynArchmage 1 year ago +1

      With adjustable weights it could correct for manufacturing tolerance issues. It could have an initial set of weights that are “about right” for its purpose, then it could learn from actual performance to adjust them to account for those inconsistencies and even adjust for issues of age or local EM noise environments, which would play hell with analog chips.

    • @harrysvensson2610
      @harrysvensson2610 1 year ago +2

      ​@@DeruwynArchmage And how would these "adjustable weights" work?

  • @JakeDownsWuzHere
    @JakeDownsWuzHere 1 year ago

    i was just wondering about this today. great video

  • @SellusionStar
    @SellusionStar 1 year ago +1

    I love how radical the ideas at the end of the paper are. Kind of vague still, but sometimes one has to challenge everything to get something better. To get out of a local maximum.

  • @kaoboom
    @kaoboom 1 year ago +1

    Nice spider-dog!
    It's been a long time since I touched NNs, but having taken a course of his some years back, I must say: Hinton is not only a great researcher but also a great teacher.
    Thanks for doing this rundown of the paper!
    Sounds like FF being more power-efficient and more easily parallelized, while slower to learn, might make it usable for battery-powered online learning for a start, with BP, or some other algorithm, coming in when charging in sleep mode.
    Speaking of dreams, they're a great example of how the brain settles over the years, with dreams initially being muddy mixtures of mostly negative learning data as a child; it really brings to mind that masking method for generating negative learning data shown in the MNIST example.
    Some decades ago there were analogue FPGAs called EPACs. Those could be a good way to copy these NNs. Although it might be more viable to have a settling mode over a set range of paired I/O.

  • @MultiScorpia
    @MultiScorpia 1 year ago

    Great Video!
    Kind of a bold move to be the only author and only mention people in the acknowledgements

  • @Quasarbooster
    @Quasarbooster 1 year ago +9

    The potential consequences of this algorithm are really cool! Thank you so much for sharing all these great papers and topics.

  • @WisdomCritFail
    @WisdomCritFail 1 year ago +1

    That idea about dreams at 29:00 is really interesting and makes me feel like I should probably get more sleep 😂

  • @henryD9363
    @henryD9363 1 year ago +6

    This is so wonderful and interesting to me. And thank you!
    I am trying to find the Hinton talk where he discusses the dream activity you mentioned.
    Can you provide a link? Please?

  • @Koroistro
    @Koroistro 1 year ago +13

    2:50 It doesn't sound unreasonable that the human brain would buffer backpropagation and then have most of it happen during sleep. It'd explain the biological necessity of sleep, and why we appear to recall information more easily after having slept on it.

    • @revimfadli4666
      @revimfadli4666 1 year ago +2

      Also, haven't researchers found neurons that propagate signals in "the reverse direction" (e.g. from the visual cortex towards the eyes)? So "backprop" could happen through them triggering certain activations, even though internal mechanisms might be insufficient?

    • @MrTeathyme
      @MrTeathyme 1 year ago +2

      There's also research showing that if you take a small break, where you sit there and do literally nothing for 30 seconds every few minutes during active learning, you learn at a rate faster by more than the margin of error.
      Which could be interpreted as forcing the backprop buffer to be cleared.

    • @kaoboom
      @kaoboom 1 year ago

      @@revimfadli4666 There are several neural layers behind each retina that, at the very least, seem to work as a sort of edge detection. Is that what you mean by "internal mechanisms"?

    • @revimfadli4666
      @revimfadli4666 1 year ago

      @@kaoboom no not edge detection, intracellular backprop

    • @kaoboom
      @kaoboom 1 year ago

      @@revimfadli4666 Sorry, I should've been more specific: my question was in regard to: what sort of "internal mechanism" do you consider to be insufficient for backprop to be functioning?
      The layers seem to work as a sort of edge detection, among other filters, just like how the AI focuses on hard edges when learning to identify numbers in MNIST.
      By that parallel, the signals found in "the reverse direction" would then be a form of backprop that trains these filters.
      It could also explain why, in some cases, myopia & hyperopia can be, so to speak, "learned away". But I digress…

  • @zemanntill
    @zemanntill 1 year ago +5

    New Hinton paper, let's go 😁

  • @TeamDman
    @TeamDman 1 year ago +1

    Amazing video holy cow

  • @revimfadli4666
    @revimfadli4666 1 year ago

    Reminds me of echo state networks (ESNs), where the earlier layers are just "whatever a random network happens to do" and the last layer linearly learns to "make sense" of it, but this time the "random layers" learn too

  • @RyanGosring
    @RyanGosring 1 year ago +39

    You do actually black out for 1/3 of the day, every day, and let your brain do the backprop and organize the data. It's called sleep.

  • @asdfghyter
    @asdfghyter 1 year ago +1

    7:23 I love your pretty doggo picture, I subscribed instantly and will hope for many more doggo pictures to come! 😍

  • @xdman2956
    @xdman2956 1 year ago

    The last part reminds me strongly of Conway's idea that you could do addition and other operations on a random bundle of electrical circuitry, just by learning how to input and read output.
    Although it's kind of reversed, because here we would adjust the weights (change the "circuitry", not how we input and interpret) so that it gives good output.

  • @Givinskey
    @Givinskey 1 year ago +72

    My biggest issue with this is that FF normally stands for Feed Forward

  • @suleymanemirakin
    @suleymanemirakin 9 months ago

    Thank you so much

  • @maxantson4837
    @maxantson4837 1 year ago +1

    Thanks so much for such an informative video on this - the best explanation of FF that I have seen, and I found the visual examples super useful! I have a question regarding adding black boxes - I can understand how this algorithm overcomes the problem, but what are the use cases? I'd love some concrete examples where adding a black box to the forward pass is useful/advantageous.

    • @JordanMetroidManiac
      @JordanMetroidManiac 1 year ago

      Chaotic behavior. Suppose a single forward pass has drastically different results from slightly perturbed inputs. This is a black box function as you have no idea how to predict what it will do unless you simulate it. This FF model would be forced to learn around that chaotic behavior, and it could be useful for modeling potentially related functions such as prime factor decomposition.

  • @Xavier-es4gi
    @Xavier-es4gi 1 year ago +3

    That's amazing popularization, thanks! Some parts were not easy to follow though, like the "6 video" - I didn't get what the x axis was for

    • @redpepper74
      @redpepper74 1 year ago

      I think the x-axis was time and the y-axis was how far through the model the information was

  • @zzador
    @zzador 1 year ago +1

    Sounds a bit like "Contrastive Divergence", which is the traditional training algorithm for Restricted Boltzmann Machines (RBMs).

  • @jamesgl
    @jamesgl 1 year ago +9

    This naming convention is gonna give us things like FoFo Vs FeFo

    • @Wertsir
      @Wertsir 1 year ago +7

      And don’t even get me started on FeeFi vs FoFum

    • @2k7u
      @2k7u 1 year ago

      And don't translate the first one from Portuguese to English

    • @revimfadli4666
      @revimfadli4666 1 year ago

      ​@@Wertsir feerless fidelity?

    • @Wertsir
      @Wertsir 1 year ago

      @@revimfadli4666 I smell the blood of an englishman. Be he living, or be he dead, I'll grind his bones to make my bread.

  • @TheKdcool
    @TheKdcool 1 year ago +3

    Your description of backprop makes me think of somebody sleeping/dreaming

  • @monad_tcp
    @monad_tcp 1 year ago

    23:03 This is neat. If you model it that way, you don't need massive amounts of memory; actually, you can design electrical circuits that will do it with very little cell memory.

  • @nigelbrown-ok2ve
    @nigelbrown-ok2ve 1 year ago

    Reminded me of the work of psychologist Robyn Dawes, who was keen on improper linear models (e.g. weights of +1 or -1) which performed similarly to experts if the right feature set was being input. Each row is a bit like a set of improper linear models scoring the previous row. His work was often in areas where the between-expert and even repeat-expert agreement wasn't perfect, e.g. medical tasks. The improper linear models approached between-expert reliability. Simple linear scoring models are often used in medical diagnostics, again with the important part being which features are chosen. Sounds very promising.

  • @Waitwhat469
    @Waitwhat469 1 year ago

    The ability to use black-box data, such as inputs from other nets, seems like a fantastic way to do sympathetic multimodal networks. For example, audio and video of a person talking, catching lip-reading and body language.

  • @danielbrockerttravel
    @danielbrockerttravel 1 year ago

    This is better explained than the Hinton interview with Eye on AI.

  • @DdesideriaS
    @DdesideriaS 1 year ago

    The problem with LLMs is not only the resources for training (backprop), but also inference itself (the forward pass). It takes a lot of GPU FLOPs to get a token through the system.

  • @alengm
    @alengm 1 year ago +4

    22:40 Oooh, looks like a cellular automaton!
    Beautiful idea and a great explanation. I follow AI research a bit in my free time, and I feel that just watching your videos gives me more understanding than if I'd read the papers myself, for much less effort. Thank you

    • @alengm
      @alengm 1 year ago +2

      On second thought, it's weird that early layers are updated directly on whether the data is positive or negative (hot dog vs not hot dog). That's a high-level piece of information.

    • @whannabi
      @whannabi 1 year ago

      @@alengm 🥺

    • @DeruwynArchmage
      @DeruwynArchmage 1 year ago

      I think the reason you want a high output on the first layer (sum of squares) is that it means it has identified some characteristics somewhere. You don't care which neurons activate at that stage, so long as at least some of them do. If none of them do, then whatever it's looking at, it's not recognizing it at all.

  • @povilasn8239
    @povilasn8239 1 year ago

    What is the talk about the Forward-Forward algorithm referenced in the video? I did not find it in the description 😢

  • @arashsol1064
    @arashsol1064 9 months ago

    thanks a million, perfect

  • @kiwitou420
    @kiwitou420 1 year ago +3

    Could you (theoretically) have a "2-layer" chip, where one layer is the actual chip and the second one is just many sensors and "writers" that can read out the weights and biases in digital form, so that you can later "copy" them via the writers, which write the biases back to the local cells? This would be like taking a human brain and saying: I want this neuron to connect to this neuron in this way.
    Basically copying the whole chip?

  • @TeamDman
    @TeamDman 1 year ago +1

    Great video! What annotation software do you use?

  • @harrysvensson2610
    @harrysvensson2610 1 year ago +14

    35:17 That's called an FPGA. That's when you describe hardware with software and you get absurd, like, really absurd, throughput. Massive parallelism.

  • @JordanMetroidManiac
    @JordanMetroidManiac 1 year ago +1

    If the forward pass consists of a black box, then you can't do a single level of gradient descent. How do you encourage that black box to return higher values for positive data and lower values for negative data? Hinton mentions how you do not need to know the precise structure of the forward pass calculation, so you can save on computational resources and/or power, but that takes me straight back to my original question.

  • @Amipotsophspond
    @Amipotsophspond 1 year ago

    So this seems like it would work really well for small data sets, because you can make more negative data from less positive data. In the example in the video a 7 image is mixed with a 6 image to make a negative 7 image, but you can use the same 7 image to make a new negative image from a 5 image, so you have a (data set) worth of positive images, and you have (data set)*(data set) worth of negative images. Over-fitting might be a problem.
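
    For what it's worth, here is a rough NumPy sketch of that mixing trick (the blur-and-threshold mask only loosely follows the paper's description of hybrid images; blur_steps is a made-up knob):

      import numpy as np

      def hybrid_negative(img_a, img_b, blur_steps=4, rng=None):
          """Blend two digit images through a smooth random binary mask."""
          rng = np.random.default_rng() if rng is None else rng
          mask = rng.random(img_a.shape)
          for _ in range(blur_steps):  # repeated box blur -> large smooth blobs
              mask = (mask + np.roll(mask, 1, 0) + np.roll(mask, -1, 0)
                           + np.roll(mask, 1, 1) + np.roll(mask, -1, 1)) / 5.0
          mask = mask > mask.mean()    # binarize into contiguous regions
          return np.where(mask, img_a, img_b)  # e.g. a 7/6 hybrid as negative data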

  • @dustinandrews89019
    @dustinandrews89019 1 year ago

    With a 3D printer with the right properties you could "program" an analog computer on the fly (this is currently feasible, but cutting edge). You could also use one expensive computer that simulates the hardware computers for quick iterations and once you are happy you can print out cheap physical computers with the tested design.

  • @zxuiji
    @zxuiji 1 year ago

    25:45 Nah, I'd say that big thing in the middle is just taking a copy of the results from the previous layer and identifying how closely the combination of them matches the specific thing it is supposed to match (the result can only be between 0% and 100%), then passing the merged data into the next node, which is receiving from other local nodes. If no node in the chain matches more than 50%, then a new node is created to identify a new pattern with. In other words, every node set is a growable array, never a fixed static array.

  • @briancunning423
    @briancunning423 1 year ago +1

    Great drawing of the dog.

  • @JinKee
    @JinKee 1 year ago

    FF sounds like it is amenable to also going up in scale, with the local interactions split up between different cloud clusters.

  • @aspergale9836
    @aspergale9836 1 year ago +1

    (Until ~13:00) - Is this not (a close variant of) Direct Feedback Alignment?

  • @andybrice2711
    @andybrice2711 1 year ago

    I'm not sure digital computers are inherently more precise than analogue circuits for this kind of processing, since even in DSP we're working with floating-point numbers, heuristics, and approximations.
    Also, surely there would be a way to export the weights from a neural network hardware chip? Like you can save an image of an FPGA.

  • @6lack5ushi
    @6lack5ushi 1 year ago

    It's like an integrated neural network ASIC, a NN-ASIC. Great video!

  • @gurukhan1344
    @gurukhan1344 12 days ago

    24:11 I noticed that hidden layers combine top-down inputs and bottom-up inputs, but why are there blue arrows from hidden layers to top layers?

  • @6lack5ushi
    @6lack5ushi 1 year ago

    Love the video, seriously! But you do it a disservice by not at least briefly explaining backprop; relative to how high-level the rest of it was, it feels like you're expected to already know it. But even then I think it's the bow on top of a great video! Keep it up

  • @graham8316
    @graham8316 1 year ago +6

    Incredible drawing 😍

  • @askii2004
    @askii2004 1 year ago

    Honestly, the description of this training method reminds me of phase kickback and the irrelevance of global phase vs. relative phase in quantum computing, but I may be misconstruing the algorithm here.

  • @sharks1349
    @sharks1349 1 year ago +2

    Your doggo is impeccable

  • @SufficingPit
    @SufficingPit 1 year ago +11

    That was an amazing doggo. 7:24

  • @VivekPatel-sj3up
    @VivekPatel-sj3up 1 year ago +1

    I like your Drawing. I'mma sub

  • @vixguy
    @vixguy 1 year ago +1

    Great dog drawing ❤

  • @tweak3871
    @tweak3871 1 year ago +3

    My question is, how necessary is it really to do learning on an analog chip? Most of the energy cost is in inference, and my initial impression of "chips that dynamically learn" is that I'd expect them to require a ton of maintenance to perform well. The human brain has a constant stimulus and energy source, while a very important feature of a toaster or any other electrical device is that I'd like to be able to unplug it or not use it for a while. To me, a much better way of going about this seems to be a more hardcoded analog chip that is somehow update-able. Development still happens on digital computers, so you don't incur all the logistical and manufacturing cost of training individual analog chips, and we find a way of transferring from a digitally trained model to an analog chip for the sake of inference.
    This does feel like such a simple idea, though, that I assume there are big issues with what I'm saying; I'm obviously no expert in analog computing. I know the transfer from digital to analog is expensive, but (as I understand it) this is primarily a problem in backprop because of how often you need to update with backprop, whereas if I had to update "my toaster" only, like, once every few years or whatever, that seems significantly more preferable.
    Analog is certainly one of the directions we could go to greatly reduce energy cost, but my immediate feeling is that we're gonna crack more efficient use of CPUs with large NNs a lot sooner.

    • @EdanMeyer
      @EdanMeyer  1 year ago +2

      It certainly comes with many cons as you mention, and for that reason I’m sure we will always have digital, but specifically for a continual learning on the fly setting (as opposed to an update once every few years) I would say this direction seems more promising. I would imagine the future will be a mix of both, using them where they make the most sense

    • @aureliusmarcusantoninus3441
      @aureliusmarcusantoninus3441 1 year ago

      @@EdanMeyer It seems cool, but designing analog hardware is hard from my experience

  • @nicolettileo
    @nicolettileo 1 year ago

    It really sounds like the whole statistical learning field, e.g. Boltzmann machines, reservoir computing, the Hebbian rule, etc.

  • @BooleanDisorder
    @BooleanDisorder 6 months ago

    Nice! ❤

  • @bujin5455
    @bujin5455 1 year ago

    35:40. I actually think this is the wrong takeaway. The reason we don't use analog computers (we tried to in the beginning) is because of how difficult discrimination is. When you have a range of voltages for different values, such as ten for base 10, it's difficult to identify whether a voltage is really a "1" and not, say, a "0" or a "2". Whereas with digital logic: you have a voltage? 1. You don't? 0. That's not too complicated; you don't need to measure how high it is, you don't need to compare it against a discriminator, you don't have to implement complex error detection and correction for simple register states, and you don't have to worry about how finely tuned the hardware is and how much it's drifting over time (and all components in the system are given to drift, and not necessarily in step with one another).
    Neural nets, on the other hand, are far more forgiving about voltages (weights) drifting a little, and besides that, implementing NNs in some sort of analog system would be ideal. Especially if you're using a spike-driven neuromorphic chip. But that doesn't mean the weights have to die with the hardware. There is nothing stopping us from designing the chip to be able to import and export its weights. After all, the chip itself has to be able to tune these weights, which means the chip has the ability to "read and write" them. You may need a DAC (assuming you want digital conversion) present for these operations, but that doesn't mean the DAC would have to be part of its regular duty cycle; it would be reserved for only those times when the weights needed to be imported or exported, and otherwise there would be no power going to it and it wouldn't be in the loop.
    I really have no idea why Hinton thinks that a hardware/software intermingling necessitates the NN dying with the hardware. It really doesn't make any sense, for the reasons explained above. Moreover, what are you going to do, train every one of these intermingled systems from scratch? You're going to want to establish the foundational weights before you deploy to the field (you know, where power budgets really matter). You're going to want to be able to mass produce these systems. And I really see no technical reason why we can't read/write analog weights from/to the system. I question how strong Hinton's knowledge is on this particular aspect of the subject.

  • @xntumrfo9ivrnwf
    @xntumrfo9ivrnwf 1 year ago +3

    I presume you've seen the Veritasium video from a few months ago called "Future Computers Will Be Radically Different (Analog Computing)" - if not, it's relevant and interesting

  • @brll5733
    @brll5733 1 year ago +12

    The description of the algorithm itself was way too fast. We have gradient descent between layers, and in the end everything is dumped into a linear classifier? Or something?
    How is this more biologically plausible? If there is an error signal between layers, then it might well be passed "further back", right? So error backprop is still possible under that assumption.
    Also, I don't think the brain has extra linear classification modules.
    And this example was only for classification, correct? How would this work for regression?

    • @revimfadli4666
      @revimfadli4666 1 year ago +1

      Maybe what's more biologically plausible is local vs backprop gradient descent/ascent?

  • @syfontenot7427
    @syfontenot7427 1 year ago +3

    Weird question: what software are you using to highlight the paper and draw on it?

    • @jordan59961
      @jordan59961 1 year ago

      Microsoft paint

    • @skop6321
      @skop6321 1 year ago +2

      PDF opened in Microsoft Edge can highlight and annotate

  • @jonbrand5068
    @jonbrand5068 1 year ago

    Impeccable drawing sir. Eh hem. Exemplary.

  • @tophat593
    @tophat593 1 year ago

    About backprop being implausible because it needs to take a break... you know, sleep. I very much doubt we backprop in our sleep, but we do something akin to it, and the activity-rest cycle is well established.

  • @BleachWizz
    @BleachWizz 1 year ago

    3:10 - I always thought of a second process backpropping and updating; also, we do get sleep.

  • @brianj7204
    @brianj7204 1 year ago

    So this is the last component for the singularity?

  • @AjSmit1
    @AjSmit1 1 year ago

    Forward-Forward just strikes me as "dreaming", as far as one understanding of it goes. The negative data seems like the way the brain (allegedly) hallucinates implausible but familiar episodes to better anchor the concept of reality (positive data)

  • @osbernperson
    @osbernperson 1 year ago +1

    Why have only one? Why not have both?
    Both a chip in a toaster (being efficient) and a home hub computer (heavy work) able to update these chips when necessary.

  • @servrcube6932
    @servrcube6932 1 year ago +4

    Nice Doggo

  • @DJWESG1
    @DJWESG1 1 year ago

    It's all being developed in a way that I didn't foresee, but having said that, the trajectories of all attempts seem to be on a similar track.
    It was supposed to be at least 4 distinct layers. But it won't matter, as the realisation will have already occurred to others.
    Also, the digital analogue machines can be clusters or stand-alone units. But they need a new language, and that would need a new understanding of old techniques.

  • @giantbee9763
    @giantbee9763 1 year ago +9

    Hi Edan, I'm looking for potential postgraduate degrees in Alberta doing model-based RL. Do you have any recommendations for supervisors?

    • @EdanMeyer
      @EdanMeyer  1 year ago +6

      The only prof I know who is definitely doing MBRL is Mike Bowling, though most of the profs in RLAI are generally interested in RL, including MBRL. Though I do not know who is taking postdocs. Best of luck!

    • @giantbee9763
      @giantbee9763 1 year ago +1

      @@EdanMeyer thanks!

  • @petevenuti7355
    @petevenuti7355 1 year ago

    That explanation at ~20:00 looks just like something I was thinking of, but haven't learned enough to express or implement yet.

  • @harrysvensson2610
    @harrysvensson2610 1 year ago

    7:27 nice looking dawg, my dawg.

  • @tylertheeverlasting
    @tylertheeverlasting 1 year ago

    I don't understand how the "recurrent network on the same input multiple times" is not just a fancy way to do approximately the same thing as backpropagation.
    You can imagine it as not storing the full forward-pass activations like regular backprop, and instead just calculating the activations multiple times, based on the number of layers.
    The only difference is that each layer gets updated multiple times instead of once, but even that is approximately close to an activity regularizer.

    • @EdanMeyer
      @EdanMeyer  1 year ago +1

      It is essentially trying to do something similar; the purpose of the algorithm is to pass information backwards (and forwards) through multiple layers of the network. The difference is in the information needed to update. Backprop requires storing activations at every layer until the whole forward pass is done, and then doing an update for each layer that is dependent on all the layers that come after it. While yes, FF also passes information backwards through multiple layers, those updates are derived from only local layers (the layer in front and the layer behind), so it can be done on the fly (it doesn't have to wait for the forward pass to finish to at least start updating) and without having to store activations.
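
      To make the local updates concrete, here is a minimal PyTorch-style sketch of one layer's FF step (the softplus loss and theta value are one common formulation, not necessarily the exact one in the paper):

        import torch
        import torch.nn.functional as F

        def ff_layer_step(layer, opt, x, is_positive, theta=2.0):
            y = F.relu(layer(x))
            goodness = (y ** 2).sum(dim=1)       # per-sample sum of squared activities
            sign = 1.0 if is_positive else -1.0  # push goodness above theta for
            loss = F.softplus(sign * (theta - goodness)).mean()  # positive, below for negative
            opt.zero_grad()
            loss.backward()                      # gradients stay within this layer
            opt.step()
            # detach so the next layer trains on this output without a gradient chain
            # (the paper also length-normalizes y before feeding the next layer)
            return y.detach()

      Chaining these per-layer steps trains the whole stack without ever building a global computation graph, which is the "on the fly" property described above.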

  • @EngineerNick
    @EngineerNick 1 year ago

    "My roomba was lucky in the chip binning process. It has the unique ability to execute a perfect kick flip down the stairs. It seems immensely proud of itself. What it doesn't know is that next year's model is supposed to be smart enough to climb back up the stairs again with no additional hardware"

  • @Bencurlis
    @Bencurlis 1 year ago

    I'm coming back to say that I think this algorithm is exactly Hebbian learning on positive data and anti-Hebbian learning on negative data. You can derive the non-linear Hebbian learning rule for the weights if you take the L2 loss as the local loss.
    I may have made a mistake though, don't take my word for it.
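
    For what it's worth, the claim is easy to sanity-check. Assuming a layer y = f(Wx) with goodness G = \sum_i y_i^2 and pre-activations z = Wx, the local gradient is

      \frac{\partial G}{\partial W_{ij}} = 2 \, y_i \, f'(z_i) \, x_j

    i.e. the update is proportional to post-synaptic activity times pre-synaptic input, with the sign flipped for negative data: Hebbian and anti-Hebbian, respectively.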

  • @JordanMetroidManiac
    @JordanMetroidManiac 1 year ago

    Weights & Biases changed their name to Clear ML?

  • @CyberwizardProductions
    @CyberwizardProductions 1 year ago +8

    It's really hard to download an application if it's hard-coded onto hardware

    • @zakuro8532
      @zakuro8532 1 year ago +2

      There will be a way to train the chip, if I understood correctly?

    • @revimfadli4666
      @revimfadli4666 1 year ago +1

      aren't instruction sets hardcoded onto hardware too? yet programs can still be installed

    • @aureliusmarcusantoninus3441
      @aureliusmarcusantoninus3441 1 year ago

      @@revimfadli4666 I mean basic computations carried out by the design of a circuit. The rule of thumb is to do a lot of simple operations to make a complicated operation. More complex instructions take longer; multiplication and division take longer than addition. Having a bunch of little simple components doing simple operations that result in a complex output is "better" than a big complex component doing a big operation, when doing common operations. Having a customized circuit for just a neural network is a cool idea, but I would rather design a more general computing machine where the heavy AI instructions are digital and the orders are relayed to the hardware. I was taught that having many identical circuits all doing a little bit of the computation is better design than a custom design. Maybe there is a research institute looking into this design.

  • @nevokrien95
    @nevokrien95 1 year ago

    This really reminds me of synthetic gradients

  • @sinanisler1
    @sinanisler1 1 year ago +2

    Can't wait to use GPT on my machine like I use SD right now :)
    Imagine everyone having a Google on their machine. Holy, can't wait...
    This is the way.

  • @revimfadli4666
    @revimfadli4666 1 year ago

    I wonder if mortal computation was the inspiration behind Overwatch's omnic irreplaceability

  • @03Krikri
    @03Krikri 1 year ago +7

    Can someone explain to me why we take the sum of the squares of the y_i minus theta, and not just the sum of the y_i minus theta, in the goodness function, please?

    • @EdanMeyer
      @EdanMeyer  1 year ago +5

      Good question. There actually are many goodness functions that would in theory work. I'm not sure if they tested the one you mention, but just thinking about it now, I would imagine that the interference from top-down layers in the RNN is more interesting when you use the squared variant, though really I'm just guessing here.

    • @oncedidactic
      @oncedidactic 1 year ago +4

      The L2 norm is a regularization that promotes sparseness in the hidden-layer latents. This usually results in better generality, aka more robust learned representations (avoids overfitting).
      [edit] I hope this helps; it is my intuition from what I've learned on the topic. Used all the buzzwords so you have some handles to google 😅 these ideas go by different names in various contexts. For example, I recommend looking up Steve Brunton on the L2 norm

    • @phoneticalballsack
      @phoneticalballsack 1 year ago +2

      There are two answers. 1) The original equation in Yi et theta = mu was intended to approximate goodness in a population. That is, the emphasis in the original equations was on the shape of the function as a population, i.e., as the average or median over all pairs within the population, and the deceptive appearance of adding the two triangles no matter the order does nothing to change that. 2) However, if you wanted to study goodness of fit between a single data point and a population, that bend in function at theta = Pi is a serious issue, of which many models have been developed (including lognormal dance moves). To compute goodness of fit with respect to a single point location, you need to know precisely where it is in the population. Note that for one particular individual, location will be given by i = x / (2sigma), where x is the data value and Y is the location; hence in this formula,newsum((i-theta)^2 I moved the average of theta to screen out this incongruity.

    • @RebelKeithy
      @RebelKeithy 1 year ago +1

      The sum of squares is the more natural way in my mind. It is analogous to taking the squared distance of a vector; for a 3D vector the distance is sqrt(x^2 + y^2 + z^2). And in computation you often use the squared distance, since the sqrt is slow to compute and it's mathematically equivalent if you make theta the square of the threshold you would have used if you took the sqrt.

    • @firefly618
      @firefly618 1 year ago

      First because it penalizes large deviations, but just as importantly because it's not linear. Otherwise it would just simplify away and you'd get a null result. Same reason why every artificial neuron needs some sort of non-linearity, be it a fancy sigmoid or a simple max, aka ReLU.
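
      For reference, the goodness-based probability as written in the paper is

        p(\text{positive}) = \sigma\left( \sum_j y_j^2 - \theta \right)

      where y_j are the layer's (ReLU) activities, \theta is the threshold, and \sigma is the logistic function: the squared activities act as an unnormalized score that the sigmoid turns into a probability, and the squaring supplies the non-linearity that keeps the objective from simplifying away, as noted above.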

  • @NotASpyReally
    @NotASpyReally 1 year ago +1

    I'm gonna cry
    for so long I believed that robot characters dying in science fiction was unrealistic because "Well they are robots! They are immortal! You can just send the data to another body!"
    AND NOW YOU'RE TELLING ME IT'S REALISTIC??
    ***ROBOTS WILL DIE!??!? 😭***
    EDIT: AND YOU'RE GONNA PUT GENERAL AI IN A FREAKING TOASTER
    toaster: "what is my purpose"
    human: "you toast bread"
    toaster: "oh my god"

  • @StefanReich
    @StefanReich 1 year ago +10

    Hello. I am a toaster powered by GPT-4. What would you like me to toast?
    (Sorry for the stupid joke ^^ I am actually interested in the topic too)

    • @DeruwynArchmage
      @DeruwynArchmage 1 year ago +1

      Robot: “What is my purpose?”
      Rick: “You pass butter.”
      Robot: “Oh. My. God.”
      Rick: “Yeah, welcome to the club, pals”

    • @boggless2771
      @boggless2771 1 year ago

      @@DeruwynArchmage lol, that's exactly what I was gonna comment.
      Toaster: What is my purpose?
      Rick: You make toast
      Toaster: Oh my god.
      Rick: Yeah, welcome to the club, pal.

  • @TiagoTiagoT
    @TiagoTiagoT 1 year ago +3

    Does the increase in energy efficiency compensate for the added energy cost of raising each individual AI from "birth" instead of copy-pasting pre-trained AIs into new units?

    • @EdanMeyer
      @EdanMeyer  1 year ago +1

      If you want to mass produce an AI that does the exact same thing, who knows (though I would guess yes)
      If you want a bunch of AI that learn individually to adjust to their scenarios on the fly, then yes absolutely

    • @TiagoTiagoT
      @TiagoTiagoT 1 year ago

      @@EdanMeyer What sort of use would there be for untested AI that may not have learned the right lesson yet?

    • @JorgetePanete
      @JorgetePanete 1 year ago +3

      I don't get the point of mortal software; can't you save the resistances of the memristors and use them on a new instance?

    • @TiagoTiagoT
      @TiagoTiagoT 1 year ago +1

      @@JorgetePanete It is practically impossible to make perfect copies with analog systems; each copy will come out slightly different, and if you make copies of copies, errors will add up over each iteration. That is, if you can even read individual components like that in the first place without tearing up the chip.

    • @JorgetePanete
      @JorgetePanete 1 year ago +1

      @@TiagoTiagoT Memristors allow reading the state without (or barely) changing it, and you can always base the following instances on the same source. Having slightly different prints from the same copy isn't ideal, but in this case it may be good enough

  • @jackryan8588
    @jackryan8588 1 year ago +2

    Well, time to start some Neurals from scratch.

  • @Tomyb15
    @Tomyb15 1 year ago +1

    This is pretty exciting. I always thought that analog AI accelerators would be limited to just running the network, while all the training would still be done with traditional computing, but this changes the landscape quite a bit.
    But I have to say that I find the idea of mortal computation a bit stupid. Basically, we'd have to re-train every single new mortal computer to be able to use it? At that point you are essentially copying the human way of learning a bit too much, where we have to spend decades from birth to degree just to form a new engineer.
    It ends up feeling to me like chip fabrication in mortal computers is a sort of middleman to simply training actual neural tissue on a pizza box to form a new "AI" aircraft pilot, Neuromancer style. It's hard to beat biology on energy efficiency.

    • @hi-gf5yl
      @hi-gf5yl Před rokem

      Cortical labs has to maintain a perfusion circuit that feeds neurons, removes waste and provides antibiotics to prevent infection. Neuromorphic chips won’t have these issues.
      I think you can reduce training time compared to a human by specializing where possible.

    • @Tomyb15
      @Tomyb15 Před rokem

      @@hi-gf5yl In startrek voyager, the "gel packs" that contain neural tissue (or something very much like biological neural tissue) that provide much of the computational power aboard the ship also needed a system to keep the packs clean and they end up getting infected in one of the episodes (as in a biological infection). But If you can literally grow the computer with essentially blended pizzas then maybe it'd still be useful? Especially if with enough genetic engineering the computer can even heal itself and possibly even grow after being made.
      And regarding training, I'm sure it would be faster than training a human but my point was that it'd no longer be an overhead cost but more of a variable cost with the number of computers made. You'd have to do it every time.

    • @hi-gf5yl
      @hi-gf5yl Před rokem

      @@Tomyb15
      I’m skeptical of growing a bigger computer because you’d run into communication limits between neurons.
      rain neuromorphics originally planned to use randomly deposited nanowires to form artificial synapses Although it’s not truly random as the mask is the same for every chip. I think they can observe the weight changes in a chip after training and then create a mask containing only those relevant synapses for the next generation of chips (along with some redundant connections). So this is another way to bypass the training cost.
      I think memristors and other technologies can scale more than neurons.

  • @trolly4233
    @trolly4233 Před rokem

    I’ve always had a hunch that sleep was just data training for living beings. Probably more complicated than that but it would make sense. Gather all the experiences of a day, wait for sleep, and then use the data to retrain at night, then clear the “dream storage” for another day of collection.

    • @Jerryfan271
      @Jerryfan271 Před rokem

      There's research that supports this pretty strongly so yeah. I think even Deepmind did some on it, or at least cited it at some point.

  • @frankweiler7121
    @frankweiler7121 1 year ago +3

    Doggo👍