Backpropagation explained | Part 3 - Mathematical observations

  • Published 9 Jul 2024
  • In the previous video, we focused on the mathematical notation and definitions that we'll be using going forward to show how backpropagation works mathematically to calculate the gradient of the loss function. In this video, we'll start making use of what we learned and applying it, so it's crucial that you have a full understanding of everything covered in that video first.
    Here, we're going to make some mathematical observations about the training process of a neural network. The observations are facts we already know conceptually; we'll now just express them mathematically. We're making them because the math for backprop that comes next, particularly the differentiation of the loss function with respect to the weights, will make use of these observations.
    We'll first make an observation about how we can express the loss function mathematically. We'll then make observations about how we express the input and the output of any given node mathematically. And lastly, we'll note the method we'll be using to differentiate the loss function via backpropagation.
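    As a rough preview of the "composition of functions" observation, here is a minimal sketch in LaTeX using the notation from the previous episode (squared-error loss, bias omitted as in the video; the exact indexing used on screen may differ slightly):

    C_0 = \sum_{j=0}^{n-1} \left( a_j^{(L)} - y_j \right)^2, \qquad
    a_j^{(L)} = g\!\left( z_j^{(L)} \right), \qquad
    z_j^{(L)} = \sum_{k} w_{jk}^{(L)} \, a_k^{(L-1)}

    \implies \quad C_0 = C_0\!\left( a_j^{(L)}\!\left( z_j^{(L)}\!\left( w_{jk}^{(L)} \right) \right) \right)

    Because the loss is a composition of functions of the weights, differentiating it with respect to a weight calls for the chain rule, which is exactly what the upcoming episodes apply.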
    🕒🦎 VIDEO SECTIONS 🦎🕒
    00:00 Welcome to DEEPLIZARD - Go to deeplizard.com for learning resources
    01:15 Outline for the episode
    01:44 Mathematical Observations
    05:30 Expressing the loss as a composition of functions
    10:15 Summary
    10:56 Collective Intelligence and the DEEPLIZARD HIVEMIND
    💥🦎 DEEPLIZARD COMMUNITY RESOURCES 🦎💥
    👋 Hey, we're Chris and Mandy, the creators of deeplizard!
    👉 Check out the website for more learning material:
    🔗 deeplizard.com
    💻 ENROLL TO GET DOWNLOAD ACCESS TO CODE FILES
    🔗 deeplizard.com/resources
    🧠 Support collective intelligence, join the deeplizard hivemind:
    🔗 deeplizard.com/hivemind
    🧠 Use code DEEPLIZARD at checkout to receive 15% off your first Neurohacker order
    👉 Use your receipt from Neurohacker to get a discount on deeplizard courses
    🔗 neurohacker.com/shop?rfsn=648...
    👀 CHECK OUT OUR VLOG:
    🔗 / deeplizardvlog
    ❤️🦎 Special thanks to the following polymaths of the deeplizard hivemind:
    Tammy
    Mano Prime
    Ling Li
    🚀 Boost collective intelligence by sharing this video on social media!
    👀 Follow deeplizard:
    Our vlog: / deeplizardvlog
    Facebook: / deeplizard
    Instagram: / deeplizard
    Twitter: / deeplizard
    Patreon: / deeplizard
    CZcams: / deeplizard
    🎓 Deep Learning with deeplizard:
    Deep Learning Dictionary - deeplizard.com/course/ddcpailzrd
    Deep Learning Fundamentals - deeplizard.com/course/dlcpailzrd
    Learn TensorFlow - deeplizard.com/course/tfcpailzrd
    Learn PyTorch - deeplizard.com/course/ptcpailzrd
    Natural Language Processing - deeplizard.com/course/txtcpai...
    Reinforcement Learning - deeplizard.com/course/rlcpailzrd
    Generative Adversarial Networks - deeplizard.com/course/gacpailzrd
    🎓 Other Courses:
    DL Fundamentals Classic - deeplizard.com/learn/video/gZ...
    Deep Learning Deployment - deeplizard.com/learn/video/SI...
    Data Science - deeplizard.com/learn/video/d1...
    Trading - deeplizard.com/learn/video/Zp...
    🛒 Check out products deeplizard recommends on Amazon:
    🔗 amazon.com/shop/deeplizard
    🎵 deeplizard uses music by Kevin MacLeod
    🔗 / @incompetech_kmac
    ❤️ Please use the knowledge gained from deeplizard content for good, not evil.

Comments • 66

  • @deeplizard
    @deeplizard  6 years ago +9

    Backpropagation explained | Part 1 - The intuition
    czcams.com/video/XE3krf3CQls/video.html
    Backpropagation explained | Part 2 - The mathematical notation
    czcams.com/video/2mSysRx-1c0/video.html
    Backpropagation explained | Part 3 - Mathematical observations
    czcams.com/video/G5b4jRBKNxw/video.html
    Backpropagation explained | Part 4 - Calculating the gradient
    czcams.com/video/Zr5viAZGndE/video.html
    Backpropagation explained | Part 5 - What puts the “back” in backprop?
    czcams.com/video/xClK__CqZnQ/video.html
    Machine Learning / Deep Learning Fundamentals playlist: czcams.com/play/PLZbbT5o_s2xq7LwI2y8_QtvuXZedL6tQU.html
    Keras Machine Learning / Deep Learning Tutorial playlist: czcams.com/play/PLZbbT5o_s2xrwRnXk_yCPtnqqo4_u2YGL.html

  • @rewangtm
    @rewangtm 4 years ago +24

    Andrew Ng - "I can explain everything"
    Deep lizard - "Hold my backpropagation"

    • @ssffyy
      @ssffyy 3 years ago

      "if you can't explain it simply, you don't understand it well enough" - Einstein

    • @sohailape
      @sohailape 2 years ago +2

      @@ssffyy He does, but the problem with a lot of professors is that they assume students already know everything, when in reality they know nothing. Sometimes I wonder if we pay for college just for the "tag" and not for the teaching.

  • @jamestuckett5285
    @jamestuckett5285 5 years ago +27

    These videos are awesome. Finally, someone who can break the steps down sufficiently for those less fluent in maths to grasp easily. Great work.

  • @scottthornley5405
    @scottthornley5405 5 years ago +26

    I just want to say thanks. I've seen the 3Blue1Brown videos, some of Ng's videos and several different articles. Yours is the first content that is allowing me to get a handle on understanding the math behind back propagation.

    • @deeplizard
      @deeplizard  5 years ago

      You're welcome, Scott! Glad to hear that :D Thanks for letting me know.

  • @tymothylim6550
    @tymothylim6550 3 years ago +2

    Thank you very much for this video! I learnt how to explain complicated math to others through your simple-to-understand series! I also learnt how to understand the loss as a function of all those things!

  • @Uditsinghparihar
    @Uditsinghparihar 5 years ago +1

    A crisp and precise description, right down to the sub/superscripts.
    Thanks for the upload

  • @kushagrachaturvedy2821
    @kushagrachaturvedy2821 4 years ago +4

    Such a great video. I understood everything perfectly. You guys definitely need more subs.

  • @ujjwalkumar8173
    @ujjwalkumar8173 3 years ago +2

    I am once again commenting: "I love you :)" Awesome lectures!!

  • @datasciencestory15
    @datasciencestory15 4 years ago +1

    Speechless! You are kind...

  • @ricardofrancalaccisavarisr4364

    Great job. SUGGESTION: when you mention one of the indices, you could show where it is in the neural network.

  • @mariodurndorfer6996
    @mariodurndorfer6996 5 years ago +1

    awesome style of explanation! all thumbs up!

  • @gero8049
    @gero8049 3 years ago +1

    That was a really good explanation. It really helped me. Thanks a lot.

  • @gero8049
    @gero8049 3 years ago +1

    Hey, this really helped me get a better understanding of Andrew's course. Thank you.

  • @nerkulec
    @nerkulec 6 years ago +2

    Great! Thanks!

  • @danielrodriguezgonzalez2982

    This is the best!

  • @vishakdm7728
    @vishakdm7728 3 years ago +1

    Great work, really helped me a lot :)

  • @todianmishtaku6249
    @todianmishtaku6249 4 years ago +1

    Superb! Really liked it.

  • @lightningblade9347
    @lightningblade9347 6 years ago +2

    Hi deeplizard, before posting my question I first want to thank you (again) for these amazing videos and this valuable playlist; it's one of the main resources I use on my journey to mastering deep learning. I also think it's very important to know the mathematical foundations of neural nets to actually understand them as well as possible. My question is (it's a bit long, I hope it won't bother you): say we have a 3-by-1 neural net (an input layer with 3 neurons and an output layer with one neuron, no hidden layer). I have no problem calculating the outputs of forward and back propagation when feeding the network one training sample (namely the feature1, feature2, feature3 inputs), and I know exactly how my initial weights get optimized. The problem I have is when feeding the NN with multiple training inputs; here, I don't know exactly how the initial weights get optimized.
    I would be grateful if you could explain how the initial weights get modified when feeding the NN with multiple training inputs.
    (For example, we have training inputs as a 3 × 3 matrix:
    [[195, 90, 41],
    [140, 50, 30],
    [180, 85, 43]]
    where the first column is the height, the 2nd the weight, and the 3rd the shoe size, and we feed the NN the first row, then the second, then the third.)
    I know that to calculate the new weights when feeding the NN one training sample we rely on this formula:
    New_weights = Initial_weights - learning_rate × (derivative of the loss function wrt the weights)
    But when we feed the NN more than one training example, which formula do we use? Do we calculate the average of all dw (the derivatives of the loss function wrt the weights), or do we sum them all, multiply by the learning rate, and subtract them from the initial weights, or what?
    I'm a bit confused here.
    Thanks in advance.

    • @deeplizard
      @deeplizard  6 years ago +2

      Hey Lightning Blade - Glad to see you're progressing through the content! In regards to your question, your first thought is correct. We calculate the average derivative of the loss over all training samples. I touch on this in the next video starting at 11:56: czcams.com/video/Zr5viAZGndE/video.html
      Let me know if this helps clarify!
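      To illustrate the averaging described above, here is a minimal NumPy sketch (hypothetical code, not from the video or the course files): a single linear output node with a squared-error loss and no bias, updated with the gradient averaged over all training samples.

      import numpy as np

      # Hypothetical data from the question above: 3 samples x 3 features
      # (height, weight, shoe size). The targets y are made up for illustration.
      X = np.array([[195.0, 90.0, 41.0],
                    [140.0, 50.0, 30.0],
                    [180.0, 85.0, 43.0]])
      y = np.array([1.0, 0.0, 1.0])

      w = np.zeros(3)   # initial weights for the single output node (no bias)
      lr = 1e-6         # learning rate, kept small because the features are unscaled

      for step in range(1000):
          preds = X @ w                                      # forward pass: one linear output per sample
          # Per-sample loss is (pred_i - y_i)^2, so dLoss_i/dw = 2 * (pred_i - y_i) * x_i
          per_sample_grads = 2.0 * (preds - y)[:, None] * X  # shape: (3 samples, 3 weights)
          # One update uses the gradient AVERAGED over all training samples
          w -= lr * per_sample_grads.mean(axis=0)

      print(w)  # weights after training on the batch-averaged gradients

      Summing the per-sample gradients and dividing by the number of samples amounts to the same thing; the point is simply that a single weight update is driven by the batch-averaged gradient rather than by any one sample.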

  • @loneWOLF-fq7nz
    @loneWOLF-fq7nz 5 years ago +2

    keep up great videos :)

  • @panwong9624
    @panwong9624 6 years ago +1

    This video is great!

  • @codeXcycle
    @codeXcycle a year ago +1

    you're the best

  • @samhithbarlaya23
    @samhithbarlaya23 4 years ago +3

    @deeplizard, thanks for this great series. I have a query: at 8:03, you represent the input for node j as a function of all the weights connected to j. But my understanding is that the input is the weighted sum of the activation outputs of the previous layer. So shouldn't the activation outputs of the previous layer also be considered when representing the input function?

    • @SKULDROPR
      @SKULDROPR a year ago

      Our end goal is getting the derivative of the loss with respect to weights, not the output activations of the previous layer (for the reasons explained in the part 1 video). You could definitely make a valid expression using the output activations of the previous layer (good spot!); however, it is not useful for the goal at hand. Also, remember that the output activations are directly influenced by the weights, so they don't need to be in the expression either. Apologies for resurrecting, hopefully someone finds this useful, as it confused me at first too.

  • @WheatleyOS
    @WheatleyOS 3 years ago

    This video is giving me the urge to make a 3-node-per-layer, 3-layer neural network in Excel

  • @assemblyorganization522

    C (sub zero) is the loss for a particular sample. Suppose we have two classes (male and female) and four samples (two for each class). Will C (sub zero) represent the loss for the two female samples or the two male samples? Please explain this concept. Thanks.

  • @Boldalt
    @Boldalt 4 years ago

    Hi. Why do you say n-1 in the cost function?

  • @adesiph.d.journal461
    @adesiph.d.journal461 3 years ago +1

    Amazing Stuff! The way you break down the math is really neat. I love the way you spend time in walking through the notations, their meanings, and what they stand for. Would love to see a more comprehensive series on the math behind various loss functions, regularization techniques and maybe in general concepts from Ian Goodfellow's Deep Learning Book

  • @thespam8385
    @thespam8385 4 years ago +2

    {
      "question": "Use of the chain rule is required because:",
      "choices": [
        "the loss function is a composition of functions.",
        "of the vast number of weights.",
        "of the sign of the gradient.",
        "the gradient must flow backward."
      ],
      "answer": "the loss function is a composition of functions.",
      "creator": "Chris",
      "creationDate": "2020-04-17T17:32:03.668Z"
    }

    • @deeplizard
      @deeplizard  4 years ago

      More great questions, thanks Chris!
      Just added your question to deeplizard.com/learn/video/G5b4jRBKNxw :)

  • @ramiro6322
    @ramiro6322 3 years ago +1

    0:00 Introduction
    1:15 Outline for the episode
    1:44 Observations
    5:30 Expressing C0 as a composition of functions
    10:15 Next video and Outro

    • @deeplizard
      @deeplizard  3 years ago +1

      Added to the description. Thanks so much!

    • @ramiro6322
      @ramiro6322 3 years ago

      @@deeplizard Thank you for the videos, they're great!

  • @Jxordan
    @Jxordan 6 years ago +8

    ETA on part 4? I have a midterm Wednesday :/

    • @deeplizard
      @deeplizard  6 years ago +2

      Hey Jordan - I'm _hoping_ to have it released by Tuesday evening.

    • @Jxordan
      @Jxordan 6 years ago +1

      Thank you!

  • @hailhuskz
    @hailhuskz 2 years ago +1

    Hey deeplizard, this video made all my doubts on this topic go away :). However, just asking: what happened to the bias of each neuron? Should z(j) not be sum[w(jk) × a(k)] + bias(j)?

    • @deeplizard
      @deeplizard  2 years ago +1

      Yes, bias was eliminated (or assumed to be 0) here for simplicity since we hadn't yet covered bias in the course. With bias included, your assumption for the calculation of z(j) is correct. We cover bias in a later episode here:
      czcams.com/video/HetFihsXSys/video.html

    • @hailhuskz
      @hailhuskz 2 years ago

      @@deeplizard thank you so much for replying!
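    For reference, a minimal LaTeX statement of the weighted input discussed in this thread, with the bias term written in (the video itself leaves it out):

    z_j^{(l)} = \sum_{k} w_{jk}^{(l)} \, a_k^{(l-1)} + b_j^{(l)}

    Setting b_j^{(l)} = 0 recovers the bias-free expression used in the episode.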

  • @tostupidforname
    @tostupidforname 4 years ago +1

    I assume the bias is implicit in w?

  • @lucavoros8073
    @lucavoros8073 2 years ago

    Wouldn't it be (yi - aj(l))^2 instead of (aj(l) - yi)^2 ?

  • @mariodurndorfer6996
    @mariodurndorfer6996 5 years ago

    Maybe it is somewhat confusing to use subscript k for layer l-1 and j for layer l, but not use another subscript for the output layer L (j is used again)? Did I miss something?

    • @georgepalafox5967
      @georgepalafox5967 4 years ago

      Same here. I think another subscript should be used. For layer l-1 it's k, for layer l it's j, and for layer L it's j again??? Shouldn't it be a different index letter?

  • @kaushikkn
    @kaushikkn 5 years ago

    It would be great to actually introduce the Einstein summation convention instead of the sum; it looks a lot neater and less clumsy. Otherwise, thanks for the fantastic explanation.

  • @chavankoppa
    @chavankoppa 6 years ago +1

    Hi, it's a great explanation, thanks a lot. I have a quick question (9:05): zj(L), which is the input to the activation function aj(L), is a function of the weights wj(L) and of the activation output ak(l-1) of the previous layer, right?
    If that is correct, then C0j = C0j(aj(L)(zj(L)(wj(L), ak(l-1))))

    • @deeplizard
      @deeplizard  6 years ago +1

      Hey Chavan - Yes, that's right.
      Notation-wise, however, if we include ak(l-1) as you did above, then that would lead us to needing to express ak(l-1) as a function of the weights and the input of the previous layer, l-2. Then we'd go through the same process again, expressing ak(l-2) as a function of the weights and the input of the previous layer, l-3. We'd continue this over and over until we reached the start of the network.
      This is indeed correct, but it just gets a little messy with the notation if we continue expressing each function as a function of a function of a function of a ...
      I illustrate the concept and the math behind this idea in part 5 of the backpropagation series here: czcams.com/video/xClK__CqZnQ/video.html

    • @ssffyy
      @ssffyy 3 years ago

      @@deeplizard Awesome, that was the answer I was looking for. Great explanation BTW.
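    To make the composition discussed in this thread concrete, here is the chain-rule expansion for an output-layer weight that the later episodes build toward, written in LaTeX with the thread's notation (a sketch, not a transcription of the video):

    \frac{\partial C_0}{\partial w_{jk}^{(L)}}
      = \frac{\partial C_0}{\partial a_j^{(L)}}
        \cdot \frac{\partial a_j^{(L)}}{\partial z_j^{(L)}}
        \cdot \frac{\partial z_j^{(L)}}{\partial w_{jk}^{(L)}}

    Each factor differentiates one link in the chain from C_0 through a_j^(L) and z_j^(L) down to the weight, which is why expressing the loss as a composition of functions matters.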

  • @asadali4153
    @asadali4153 4 years ago

    The loss function is the same as in linear regression, and the derivative is also used in LR. So how is a NN different from linear regression?

    • @jsarvesh
      @jsarvesh 3 years ago

      We use non-linear activation functions in a NN to introduce non-linearity. You can also think of linear regression as a network with a linear activation in the final output layer.
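      A small sketch of that point in LaTeX (hypothetical two-layer notation, not from the video): if the activation were linear, g(z) = z, a stack of layers would collapse into ordinary linear regression,

      a^{(2)} = W^{(2)} \left( W^{(1)} x \right) = \left( W^{(2)} W^{(1)} \right) x = \widetilde{W} x ,

      so a non-linear g between the layers is what prevents this collapse and lets the network model non-linear relationships.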

  • @transolve9726
    @transolve9726 5 years ago

    At this point you should have given an example of the activation function g(). Also, it would be helpful to have a diagram/illustration on the right while you are explaining.

    • @deeplizard
      @deeplizard  5 years ago +2

      Hey Transolve - In parts 4 and 5 of the backprop videos, I use a diagram to illustrate the math. Be sure to check those out!
      Also, here is our video/blog on activation functions if you're interested: deeplizard.com/learn/video/m0pIlLfpXWE
      Any activation function can be substituted in for g().

    • @transolve9726
      @transolve9726 5 years ago

      I later saw the illustration in the next video (part 5), and I now see you have a separate video on activation functions.

  • @MaahirGupta
    @MaahirGupta 3 years ago

    Damn......

  • @poulamikar5921
    @poulamikar5921 4 years ago

    Found these too basic. I'm trying to train a CNN on an artist's drawings to classify different features from them. I don't think the formulations alone are going to be of any help; they're in every DL lecture. Could you make a lecture on training with features of an image?

  • @fupopanda
    @fupopanda 5 years ago

    So I assume n is the number of nodes in a layer. That's the only logical answer that works here.

  • @rohtashbeniwal9202
    @rohtashbeniwal9202 4 years ago

    video good, why lizard?

    • @deeplizard
      @deeplizard  4 years ago +1

      *I could a tale unfold whose lightest word*
      *Would harrow up thy soul.*
      👻🦎

  • @markcuello5
    @markcuello5 a year ago

    HELP

  • @sukantdebnath4463
    @sukantdebnath4463 5 years ago

    Very Complicated..

  • @sprajapati2011
    @sprajapati2011 4 years ago

    change the name of the channel