The Absolutely Simplest Neural Network Backpropagation Example

  • Added 5. 06. 2024
  • I'm (finally after all this time) thinking of new videos. If I get attention in the donate button area, I will proceed:
    www.paypal.com/donate/?busine...
    sorry there is a typo: @3.33 dC/dw should be 4.5w - 2.4, not 4.5w-1.5
    NEW IMPROVED VERSION AVAILABLE: The Absolut...
    The absolutely simplest gradient descent example with only two layers and a single weight; a short runnable sketch of this setup is included right after the description. Comment below and click like!
  • Science & Technology
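
A minimal plain-Python sketch (not the author's code) of the single-weight example described above, assuming input i = 1.5, target y = 0.8 (the value implied by the corrected derivative in the description), a learning rate of 0.1, and an arbitrary starting weight:

    # network: a = w * i, cost: C = (a - y)^2
    i = 1.5    # input
    y = 0.8    # desired output
    w = 0.5    # arbitrary starting weight
    lr = 0.1   # learning rate

    for step in range(10):
        a = w * i                  # forward pass
        C = (a - y) ** 2           # cost
        dC_dw = 2 * (a - y) * i    # chain rule: dC/da * da/dw, i.e. 4.5w - 2.4 here
        w = w - lr * dC_dw         # gradient descent step
        print(step, round(w, 4), round(C, 6))

Each iteration moves w toward y / i ≈ 0.533, where the cost is zero.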

Comments • 185

  • @GustavoMeschino
    @GustavoMeschino Před měsícem

    GREAT, it was a perfect inspiration for me to explain this critical subject in a class. Thank you!

  • @markneumann381
    @markneumann381 Před měsícem

    Really nice work. Thank you so much for your help.

  • @animatedzombie64
    @animatedzombie64 Před měsícem

    Best video ever about back propagation on the internet 🛜

  • @Vicente75480
    @Vicente75480 Před 5 lety +184

    Dude, this was just what I needed to finally understand the basics of Back Propagation

    • @webgpu
      @webgpu Před měsícem

      if you _Really_ liked his video, just click the first link he put on the description 👍

  • @ilya5782
    @ilya5782 Před 6 měsíci +5

    To understand mathematics, I need to see an example.
    And this video from start to end is awesome with quality presentation.
    Thank you so much.

  • @lazarus8011
    @lazarus8011 Před měsícem

    Unreal explanation

  • @justinwhite2725
    @justinwhite2725 Před 3 lety +18

    @8:06 this was super useful. That's a fantastic shorthand. That's exactly the kind of thing I was looking for, something quick I can iterate over all the weights and find the most significant one for each step.

  • @SamuelBachorik-Mrtapo8-ApeX

    Hi, I have a question for you: at 3:42 you have 1.5*2(a-y) = 4.5*w-1.5. How did you get this result?

    • @nickpelov
      @nickpelov Před rokem +16

      ... in case someone missed it like me - it's in the description (it's a typo). y=0.8; a=i*w = 1.5*w, so 1.5*2(a-y) =3*(1.5*w - 0.8) = 4.5*w - 3*0.8 = 4.5*w - 2.4 is the correct formula.
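
A quick numerical check of the corrected formula in this thread (i = 1.5, y = 0.8 as above; the test point w = 0.6 is an arbitrary choice), comparing the chain-rule result with a finite-difference estimate:

    i, y, w = 1.5, 0.8, 0.6
    C = lambda w: (i * w - y) ** 2               # cost as a function of the weight

    analytic = 4.5 * w - 2.4                     # dC/dw from the corrected formula
    h = 1e-6
    numeric = (C(w + h) - C(w - h)) / (2 * h)    # central finite difference
    print(analytic, numeric)                     # both come out to about 0.3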

  • @bedeamadi9317
    @bedeamadi9317 Před 3 lety +6

    My long search ends here, you simplified this a great deal. Thanks!

  • @xflory26x
    @xflory26x Před měsícem

    Not kidding. This is the best explanation of backpropagation on the internet. The way you're able to simplify this "complex" concept is *chef's kiss* 👌

  • @mateoacostarojas6031
    @mateoacostarojas6031 Před 5 lety +7

    just perfect, simple and with this we can extrapolate easier when in each layer there are more than one neuron! thaaaaankksss!!

  • @adoughnut12345
    @adoughnut12345 Před 3 lety +16

    This was great. Removing non-linearity and including basic numbers for context helped drive this material home.

  • @whywhatwherein
    @whywhatwherein Před měsícem

    finally, a proper explanation.

  • @saral123
    @saral123 Před 3 lety +3

    Fantastic. This is the most simple and lucid way to explain backprop. Hats off

  • @arashnozarinejad9915
    @arashnozarinejad9915 Před 4 lety +5

    I had to write a comment and thank you for your very precise yet simple explanation, just what I needed. Thank you sir.

  • @gautamdawar5067
    @gautamdawar5067 Před 3 lety +3

    After a long frantic search, I stumbled upon this gold. Thank you so much!

  • @riccardo700
    @riccardo700 Před 3 měsíci +1

    I have to say it. You have done the best video about backpropagation because you chose to explain the easiest example, no one did that out there!! Congrats prof 😊

    • @webgpu
      @webgpu Před měsícem

      did you _really_ like his video? Then, i'd suggest you click the first link he put on the description 👍

  • @Freethinker33
    @Freethinker33 Před 2 lety +3

    I was just looking for this explanation to align derivatives with gradient descent. Now it is crystal clear. Thanks Mikael

  • @sparkartsdistinctions1257

    I watched almost every video on back propagation, even Stanford's, but never got such a clear idea until I saw this one ☝️.
    Best and clean explanation.
    My first 👍🏼 which I rarely give.

    • @webgpu
      @webgpu Před měsícem

      a 👍is very good, but if you click on the first link on the description, it would be even better 👍

    • @sparkartsdistinctions1257
      @sparkartsdistinctions1257 Před měsícem

      @@webgpu 🆗

  • @praneethaluru2601
    @praneethaluru2601 Před 3 lety +2

    The best short video explanation of the concept on CZcams till now...

  • @EthanHofton
    @EthanHofton Před 3 lety +6

    Very clearly explained and easy to understand. Thank you!

  • @drummin2dabeat
    @drummin2dabeat Před 3 měsíci

    What a breakthrough, thanks to you. BTW, not to nitpick, but you are missing a close paren on f(g(x), which should be f(g(x)).

  • @TrungNguyen-ib9mz
    @TrungNguyen-ib9mz Před 3 lety +9

    Thank you for your video. But I'm a bit confused about 1.5*2(a-y) = 4.5*w-1.5. Might you please explain that? Thank you so much!

    • @user-gq7sv9tf1m
      @user-gq7sv9tf1m Před 3 lety +9

      I think this is how he got there :
      1.5 * 2(a - y) = 1.5 * 2 (iw - 0.5) = 1.5 * 2 (1.5w - 0.5) = 1.5 * (3w - 1) = 4.5w - 1.5

    • @christiannicoletti9762
      @christiannicoletti9762 Před 3 lety +2

      @@user-gq7sv9tf1m dude thanks for that, I was really scratching my head over how he got there too

    • @Fantastics_Beats
      @Fantastics_Beats Před 2 lety

      I am also confused by this error

    • @morpheus1586
      @morpheus1586 Před rokem +2

      @@user-gq7sv9tf1m y is 0.8 not 0.5
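
Both derivations in this thread are consistent once the assumed target is made explicit; with i = 1.5 the chain rule gives

    dC/dw = i * 2(a - y) = 1.5 * 2(1.5w - y) = 4.5w - 3y
    y = 0.5  ->  4.5w - 1.5   (the expression shown in the video)
    y = 0.8  ->  4.5w - 2.4   (the correction in the description)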

  • @SureshBabu-tb7vh
    @SureshBabu-tb7vh Před 5 lety +3

    You made this concept very simple. Thank you

  • @javiersanchezgrinan919
    @javiersanchezgrinan919 Před měsícem

    Great video. Just one question: this is for a 1 x 1 input and a batch size of 1, right? If we have, let's say, a batch size of 2, we just add (b-y)^2 to the loss function (C = (a-y)^2 + (b-y)^2), don't we? Here b = w * j and j is the input of the second batch element. Then you just perform the backpropagation with partial derivatives. Is that correct?
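
A small sketch of the batch-of-two case asked about above (the second input/target pair is an invented example): the batch cost is the sum of the per-example squared errors, and its gradient is the matching sum of per-example gradients.

    w = 0.5                            # current weight (arbitrary)
    batch = [(1.5, 0.8), (1.2, 0.6)]   # (input, target) pairs

    C = sum((w * i - y) ** 2 for i, y in batch)           # C = (a-y1)^2 + (b-y2)^2
    dC_dw = sum(2 * (w * i - y) * i for i, y in batch)    # sum of per-example gradients
    print(C, dC_dw)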

  • @polybender
    @polybender Před 26 dny

    best on internet.

  • @santysayantan
    @santysayantan Před 2 lety +2

    This makes more sense than anything I ever heard in the past! Thank you! 🥂

    • @brendawilliams8062
      @brendawilliams8062 Před 9 měsíci

      It beats the 1002165794 thing and 1001600474 jumping and calculating with 1000325836 and 1000564416. Much easier 😊

    • @jameshopkins3541
      @jameshopkins3541 Před 9 měsíci

      You are wrong: tell me, what is deltaW?

  • @ronaldmercado4768
    @ronaldmercado4768 Před 8 měsíci

    Absolutely simple. Very useful illustration not only to understand Backpropagation but also to show gradient descent optimization. Thanks a lot.

  • @bettercalldelta
    @bettercalldelta Před 2 lety +1

    I'm currently programming a neural network from scratch, and I am trying to understand how to train it, and your video somewhat helped (didn't fully help cuz I'm dumb)

  • @ExplorerSpace
    @ExplorerSpace Před rokem

    @Mikael Laine even though you say that @3:33 has a typo, I can't see the typo. 1.5 is correct because y is the actual desired output and it is 0.5, so 3.0 * 0.5 = 1.5.

  • @adriannyamanga1580
    @adriannyamanga1580 Před 4 lety +3

    dude please make more videos. this is amazing

  • @AjitSingh147
    @AjitSingh147 Před rokem

    GOD BLESS YOU DUDE! SUBSCRIBED!!!!

  • @outroutono4937
    @outroutono4937 Před rokem

    Thank you bro! Its so easier to visualize it when its presented like that.

  • @DaSticks
    @DaSticks Před 5 měsíci

    Great video, going to spend some time working out how it looks for multiple neurons, but a demonstration of that would be awesome.

  • @SuperYtc1
    @SuperYtc1 Před 17 dny

    4:03 Shouldn't 3(a - y) be 3(1.5*w - 0.8) = 4.5w - 2.4? Where have you got -1.5 from?

  • @riccardo700
    @riccardo700 Před 3 měsíci

    My maaaaaaaannnnn TYYYY

  • @bhlooli
    @bhlooli Před rokem

    Thanks very helpful.

  • @OviGomy
    @OviGomy Před 5 měsíci

    I think there is a mistake. 4.5w -1.5 is correct.
    On the first slide you said 0.5 is the expected output.
    So "a" is the computed output and "y" is the expected output. 0.5 * 1.5 * 2 = 1.5 is correct.
    You need to correct the "y" next to the output neuron to 0.5.

  • @user-mc9rt9eq5s
    @user-mc9rt9eq5s Před 3 lety +15

    Thanks! This is awesome. I have a question: if we make the NN a little bit more complicated (adding an activation function for each layer), what will be the difference?
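
A sketch of what changes when an activation function is added (sigmoid used here as an illustration; values are the same as in the linear example, with an arbitrary weight): the chain rule simply gains one more factor, the derivative of the activation.

    import math

    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    i, y, w = 1.5, 0.8, 0.5

    z = w * i                        # pre-activation
    a = sigmoid(z)                   # activation
    C = (a - y) ** 2

    dC_da = 2 * (a - y)
    da_dz = a * (1 - a)              # sigmoid'(z)
    dz_dw = i
    dC_dw = dC_da * da_dz * dz_dw    # one extra factor compared with the linear case
    print(C, dC_dw)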

  • @sunilchoudhary8281
    @sunilchoudhary8281 Před 2 měsíci +1

    I am so happy that I can't even express myself right now

    • @webgpu
      @webgpu Před měsícem

      there's a way you can express your happiness AND express your gratitude: by clicking on the first link in the description 🙂

  • @popionlyone
    @popionlyone Před 5 lety +24

    You made it easy to understand. Really appreciated it. You also earned my first CZcams comment.

  • @anirudhputrevu3878
    @anirudhputrevu3878 Před 2 lety

    Thanks for making this

  • @giuliadipalma5042
    @giuliadipalma5042 Před 2 lety

    thank you, this is exactly what I was looking for, very useful!

  • @rdprojects2954
    @rdprojects2954 Před 3 lety +1

    Excellent , please continue we need this kind of simplicity in NN

  • @TruthOfZ0
    @TruthOfZ0 Před 21 dnem

    If we take the derivative dC/dw directly from C=(a-y)^2, it is the same thing, right? Do we really have to split da/dw and dC/da individually???

  • @btmg4828
    @btmg4828 Před měsícem +1

    I don't get it, you write 1.5*2(a-y) = 4.5w - 1.5.
    But why? It should be 4.5w - 2.4,
    because 2*0.8*(-1.5) = -2.4.
    Where am I wrong?

  • @formulaetor8686
    @formulaetor8686 Před rokem

    Thats sick bro I just implemented it

  • @mahfuzurrahman4517
    @mahfuzurrahman4517 Před 7 měsíci

    Bro this is awesome, I was struggling to understand chain rule, now it is clear

  • @mixhybrid
    @mixhybrid Před 4 lety +1

    Thanks for the video! Awesome explanation

  • @srnetdamon
    @srnetdamon Před 3 měsíci +1

    man, at 4:08 I don't understand how you find the value 4.5 in the expression 4.5*w-1.5.

  • @ApplepieFTW
    @ApplepieFTW Před rokem

    It clicked after just 3 minutes. Thanks a lot!!

  • @thiagocrepaldi6071
    @thiagocrepaldi6071 Před 5 lety +7

    Great video. I believe there is a typo at 1:10. y should be 0.5 and not 0.8. That might cause some confusion, especially at 3:34, when we use numerical values to calculate the slope (C) / slope (w)

    • @mikaellaine9490
      @mikaellaine9490  Před 5 lety

      Thanks for pointing that out; perhaps time to make a new video!

    • @mikaellaine9490
      @mikaellaine9490  Před 5 lety

      yes, that should say a=1.2

    • @Vicente75480
      @Vicente75480 Před 5 lety +2

      +Mikael Laine I would be so glad if you could make more videos explaining these kinds of concepts and how they actually work at the code level.

    • @mikaellaine9490
      @mikaellaine9490  Před 5 lety +2

      Did you have any particular topic in mind? I'm planning to make a quick video about the mathematical basics of backpropagation: automatic differentiation. Also I can make a video about how to implement the absolutely simplest neural network in Tensorflow/Python.
      Let me know if you have a specific question. I do have quite a bit of experience in TF.

    • @mychevysparkevdidntcatchfi1489
      @mychevysparkevdidntcatchfi1489 Před 5 lety

      @@mikaellaine9490 How about adding that to description? Someone else asked that question.

  • @satishsolanki9766
    @satishsolanki9766 Před 3 lety

    Awesome dude. Much appreciate your effort.

  • @hamedmajidian4451
    @hamedmajidian4451 Před 3 lety

    Great illustrated, thanks

  • @grimreaperplayz5774
    @grimreaperplayz5774 Před rokem

    This is absolutely awesome. Except..... Where did that 4.5 come from???

    • @delete7316
      @delete7316 Před 10 měsíci

      You’ve probably figured it out by now but just in case: i = 1.5, y=0.8, a = i•w. This means the expression for dC/dw = 1.5 • 2(1.5w - 0.8). Simplify this and you get 4.5w - 2.4. This is where the 4.5 comes from. Extra note: in the description it says -1.5 was a typo and the correct number is -2.4.

  • @zeljkotodor
    @zeljkotodor Před 2 lety

    Nice and clean. Helped me a lot!

  • @paurodriguez5364
    @paurodriguez5364 Před rokem

    best explanation i had ever seen, thanks.

  • @aorusaki
    @aorusaki Před 4 lety

    Very helpful tutorial. Thanks!

  • @kitersrefuge7353
    @kitersrefuge7353 Před 6 měsíci

    Brilliant. What would be awesome is to then further expand, if you would, and explain multiple rows of nodes... in order to try and visualise, if possible, multiple routes to a node and so on... I stress "if possible...".

  • @zh4842
    @zh4842 Před 4 lety

    excellent video, simple & clear many thanks

  • @evanparshall1323
    @evanparshall1323 Před 3 lety +1

    This video is very well done. Just need to understand implementation when there is more than one node per layer

    • @mikaellaine9490
      @mikaellaine9490  Před 3 lety

      Have you looked at my other videos? I have a two-dimensional case in this video: czcams.com/video/Bdrm-bOC5Ek/video.html

  • @giorgosmaragkopoulos9110
    @giorgosmaragkopoulos9110 Před 2 měsíci

    So what is the clever part of backprop? Why does it have a special name and isn't just called "gradient estimation"? How does it save time? It looks like it just calculates all derivatives one by one.

  • @shirish3008
    @shirish3008 Před 3 lety

    This is the best tutorial on back prop👏

  • @Controlvers
    @Controlvers Před 3 lety +1

    Thank you for sharing this video!

  • @fredfred9847
    @fredfred9847 Před 2 lety

    Great video

  • @RaselAhmed-ix5ee
    @RaselAhmed-ix5ee Před 3 lety +1

    In the final eqn, why is it 4.5w-1.5? It should instead be 4.5w-2.4, since y=0.8, so 3*0.8 = 2.4.

  • @lhyd7hak
    @lhyd7hak Před 2 lety

    Thanks for a very explanatory video.

  • @svtrilogywestsail3278
    @svtrilogywestsail3278 Před 2 lety

    this was kicking my a$$ until i watched this video. thanks

  • @shilpatel5836
    @shilpatel5836 Před 3 lety

    Bro i just worked it through and it makes so much sense once you do the partial derivatives and do it step by step and show all the working

  • @RohitKumar-fg1qv
    @RohitKumar-fg1qv Před 5 lety +3

    Exactly what i needed

  • @st0a
    @st0a Před 9 měsíci

    Great video! One thing to mention is that the cost function is not always convex, in fact it is never truly convex. However, as an example this is really well explained.

  • @jks234
    @jks234 Před 3 měsíci

    I see.
    As previously mentioned, there are a few typos. For anyone watching, please note there are a few places where 0.8 and 0.5 are swapped for each other.
    That being said, this explanation has opened my eyes to the fully intuitive explanation of what is going on...
    Put simply, we can view each weight as an "input knob" and we want to know how each one creates the overall Cost/Loss.
    In order to do this, we link (chain) each component's local influence together until we have created a function that describes weight to overall cost.
    Once we have found that, we can adjust that knob with the aim of lowering total loss a small amount based on what we call "learning rate".
    Put even more succinctly, we are converting each weight's "local frame of reference" to the "global loss" frame of reference and then adjusting each weight with that knowledge.
    We would only need to find these functions once for a network.
    Once we know how every knob influences the cost, we can tweak them based on the next training input using this knowledge.
    The only difference between each training set will just be the model's actual output, which is then used to adjust the weights and lower the total loss.
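
A compact sketch of the "knob" picture in this comment for a chain of two weights (the numbers are illustrative, not the video's): each local derivative is computed on its own and then chained into the gradient for every weight.

    i, y = 1.5, 0.8
    w1, w2 = 0.5, 0.7          # two knobs in series

    a1 = w1 * i                # first layer
    a2 = w2 * a1               # second layer
    C = (a2 - y) ** 2          # cost

    dC_da2  = 2 * (a2 - y)     # local influences
    da2_dw2 = a1
    da2_da1 = w2
    da1_dw1 = i

    dC_dw2 = dC_da2 * da2_dw2              # outer knob
    dC_dw1 = dC_da2 * da2_da1 * da1_dw1    # inner knob, chained through a1
    print(dC_dw1, dC_dw2)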

  • @jakubpiekut1446
    @jakubpiekut1446 Před 2 lety

    Absolutely amazing 🏆

  • @sabinbaral4132
    @sabinbaral4132 Před rokem

    Good content sir keep making these i subscribe

  • @ahmetpala7945
    @ahmetpala7945 Před 4 lety

    Thank you for the easiest expression for backpropagation dude

  • @malinyamato2291
    @malinyamato2291 Před rokem

    thanks a lot... a great start for me to learn NNs :)

  • @mysteriousaussie3900
    @mysteriousaussie3900 Před 3 lety +4

    Are you able to briefly describe how the calculation at 8:20 works for a network with multiple neurons per layer?
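
A hedged NumPy sketch (not from the video) of the same calculation for layers with several neurons; the weights and input values below are made up. The scalar products become matrix-vector products, and the error is propagated back one layer at a time.

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.array([1.5, 0.3])          # two inputs (made-up values)
    y = np.array([0.8])               # one target
    W1 = rng.normal(size=(3, 2))      # hidden layer with 3 neurons
    W2 = rng.normal(size=(1, 3))      # output layer with 1 neuron

    # forward pass (still linear, as in the video)
    h = W1 @ x
    a = W2 @ h
    C = np.sum((a - y) ** 2)

    # backward pass
    delta_out = 2 * (a - y)           # dC/da
    dW2 = np.outer(delta_out, h)      # dC/dW2
    delta_hid = W2.T @ delta_out      # dC/dh
    dW1 = np.outer(delta_hid, x)      # dC/dW1
    print(C, dW1.shape, dW2.shape)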

  • @dcrespin
    @dcrespin Před rokem

    The video shows what is perhaps the simplest case of a feedforward network, with all the advantages and limitations that extreme simplicity can have. From here to full generalization several steps are involved.
    1.- More general processing units.
    Any continuously differentiable function of inputs and weights will do; these inputs and weights can belong not only to Euclidean spaces but to any Hilbert spaces as well. Derivatives are linear transformations and the derivative of a unit is the direct sum of the partial derivatives with respect to the inputs and with respect to the weights.
    2.- Layers with any number of units.
    Single unit layers can create a bottleneck that renders the whole network useless. Putting together several units in a layer is equivalent to taking their product (as functions, in the set theoretical sense). Layers are functions of the totality of inputs and weights of the various units. The derivative of a layer is then the product of the derivatives of the units. This is a product of linear transformations.
    3.- Networks with any number of layers.
    A network is the composition (as functions, and in the set theoretical sense) of its layers. By the chain rule the derivative of the network is the composition of the derivatives of the layers. Here we have a composition of linear transformations.
    4.- Quadratic error of a function.
    ---
    This comment is becoming too long. But a general viewpoint clarifies many aspects of BPP.
    If you are interested in the full story and have some familiarity with Hilbert spaces please Google for papers dealing with backpropagation in Hilbert spaces.
    Daniel Crespin

  • @JAYSVC234
    @JAYSVC234 Před 9 měsíci

    Thank you. Here is a PyTorch implementation.

    import torch
    import torch.nn as nn

    class C(nn.Module):
        def __init__(self):
            super(C, self).__init__()
            r = torch.zeros(1)
            r[0] = 0.8
            self.r = nn.Parameter(r)

        def forward(self, i):
            return self.r * i

    class L(nn.Module):
        def __init__(self):
            super(L, self).__init__()

        def forward(self, p, t):
            loss = (p - t) * (p - t)
            return loss

    class Optim(torch.optim.Optimizer):
        def __init__(self, params, lr):
            defaults = {"lr": lr}
            super(Optim, self).__init__(params, defaults)
            self.state = {}
            for group in self.param_groups:
                for par in group["params"]:
                    # print("par: ", par)
                    self.state[par] = {"mom": torch.zeros_like(par.data)}

        def step(self):
            for group in self.param_groups:
                for par in group["params"]:
                    grad = par.grad.data
                    # print("grad: ", grad)
                    mom = self.state[par]["mom"]
                    # print("mom: ", mom)
                    mom = mom - group["lr"] * grad
                    # print("mom update: ", mom)
                    par.data = par.data + mom
                    print("Weight: ", round(par.data.item(), 4))

    # r = torch.ones(1)
    x = torch.zeros(1)
    x[0] = 1.5
    y = torch.zeros(1)
    y[0] = 0.5
    c = C()
    o = Optim(c.parameters(), lr=0.1)
    l = L()
    print("x:", x.item(), "y:", y.item())
    for j in range(5):
        print("_____Iter ", str(j), " _______")
        o.zero_grad()
        p = c(x)
        loss = l(p, y).mean()
        print("prediction: ", round(p.item(), 4), "loss: ", round(loss.item(), 4))
        loss.backward()
        o.step()
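
Two notes on the script above: despite the "mom" name, the stored state tensor is never written back, so each step applies plain gradient descent (w ← w − lr·grad); and it uses the slide's numbers (input 1.5, target 0.5, starting weight 0.8), so the printed weight should approach 0.5/1.5 ≈ 0.333.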

  • @user-og9zn9vf4k
    @user-og9zn9vf4k Před 4 lety +1

    thanks a lot for that explanation :)

  • @hegerwalter
    @hegerwalter Před měsícem

    Where and how did you get the learning rate?

  • @rachidbenabdelmalek3098

    Thank you

  • @meanderthalensis
    @meanderthalensis Před 2 lety

    Helped me so much!

  • @smartdev1636
    @smartdev1636 Před 7 měsíci +12

    Thank you so much! I'm 14 years old and I'm now trying to build a neural network with python without using any kind of libraries, and this video made me understand everything much better.

    • @Banana-anim8ions
      @Banana-anim8ions Před 4 měsíci

      No way me too

    • @smartdev1636
      @smartdev1636 Před 4 měsíci

      Brooo WW I ended up coding something which looked good to me but for some reason It didn't work so I just gave up on it. I wish you good luck man@@Banana-anim8ions

  • @Leon-cm4uk
    @Leon-cm4uk Před 7 měsíci

    The error should be (1.2 - 0.5)^2 = 0.7^2 = 0.49. So y is 0.49 and not 0.8 as it is displayed after minute 01:08.

  • @alexandrmelnikov5126
    @alexandrmelnikov5126 Před 7 měsíci

    man, thanks!

  • @LunaMarlowe327
    @LunaMarlowe327 Před 2 lety

    very clear

  • @banpridev
    @banpridev Před měsícem

    Ow, you did not lie in the title.

  • @sameersahu3987
    @sameersahu3987 Před rokem

    Thanks

  • @elgs1980
    @elgs1980 Před 3 lety

    Thank you so much!

  • @Nova-Rift
    @Nova-Rift Před 3 lety

    Hmm, if y = 0.8 then shouldn't dC/dw = 4.5w - 2.4? Because 0.8 * 3 = 2.4, not 1.5. What am I missing?

  • @zemariagp
    @zemariagp Před 9 měsíci

    why do we ever need to consider multiple levels, why not just think about getting the right weight given the output "in front" of it

  • @samiswilf
    @samiswilf Před 3 lety

    This video is gold.

  • @AAxRy
    @AAxRy Před 3 měsíci

    THIS IS SOO FKING GOOD!!!!

  • @caifang324
    @caifang324 Před 3 lety +1

    I thought it is just similar to LMS, widely used in communications, right? LMS was developed by Bernard Widrow back in the '60s.

  • @glaswasser
    @glaswasser Před 3 lety

    Okay I am better with language than with maths, so I'll try to sum it up:
    We basically look for a desired weight in order to get a certain output unit. And we get this desired weight by setting the weight equal to C, which again is the x-value of the minimum of some function that we get by deriving the function containing the original (faulty) output continuously (by steps determined by a "learning rate") until it is very close to zero. That correct?

  • @user-tt1hl6sk8y
    @user-tt1hl6sk8y Před 4 lety +2

    Thanks bro, I finally figured out what happens after the last layer :)

  • @TheRainHarvester
    @TheRainHarvester Před rokem

    6:55 but it's NOT the same terms. Is that da0/dw1 term correct?

  •  Před 4 lety

    Very helpful

  • @user-pr7de7jq2v
    @user-pr7de7jq2v Před 4 lety +1

    I apologize in advance. I don't quite understand why we can't set the derivative to 0 instead of doing gradient descent. If it is nonlinear, it will have several zeros; then we can choose the one that suits us.

    • @mikaellaine9490
      @mikaellaine9490  Před 4 lety

      Deep learning is a numerical method of finding the features / classes appropriate for the given problem. You are correct in that here - in this naive example - you could calculate the closed-form solution, but in the general / complex case that would not be feasible.
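
For this particular example the closed-form minimum mentioned in the reply is easy to write down: setting the derivative to zero,

    dC/dw = 4.5w - 2.4 = 0   ->   w = 2.4 / 4.5 ≈ 0.533

(or w = 1.5 / 4.5 ≈ 0.333 with the slide's target of 0.5), which is exactly the weight the gradient-descent iterations approach.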