Tutorial 8- Exploding Gradient Problem in Neural Network

  • Published on 22. 07. 2019
  • After completing this video, you will know:
    What exploding gradients are and the problems they cause during training.
    How to know whether you may have exploding gradients with your network model.
    How you can fix the exploding gradient problem with your network (a short code sketch follows at the end of this description).
    Below are the various playlists created on ML, Data Science and Deep Learning. Please subscribe and support the channel. Happy Learning!
    Deep Learning Playlist: • Tutorial 1- Introducti...
    Data Science Projects playlist: • Generative Adversarial...
    NLP playlist: • Natural Language Proce...
    Statistics Playlist: • Population vs Sample i...
    Feature Engineering playlist: • Feature Engineering in...
    Computer Vision playlist: • OpenCV Installation | ...
    Data Science Interview Question playlist: • Complete Life Cycle of...
    You can buy my book on Finance with Machine Learning and Deep Learning from the below url
    amazon url: www.amazon.in/Hands-Python-Fi...
    🙏🙏🙏🙏🙏🙏🙏🙏
    YOU JUST NEED TO DO
    3 THINGS to support my channel
    LIKE
    SHARE
    &
    SUBSCRIBE
    TO MY YOUTUBE CHANNEL
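
    A minimal sketch of the detect-and-fix points above, assuming TensorFlow 2.x / Keras; the layer sizes and the clipnorm value are illustrative, not from the video. Exploding gradients typically show up as NaN or wildly oscillating loss values, and the usual fixes are sensible weight initialization plus gradient clipping.

        import tensorflow as tf

        # Small initial weights (Glorot/Xavier) keep the backprop factors modest.
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(64, activation="relu",
                                  kernel_initializer="glorot_uniform",
                                  input_shape=(10,)),
            tf.keras.layers.Dense(1, activation="sigmoid"),
        ])

        # clipnorm rescales any gradient whose L2 norm exceeds 1.0, so one huge
        # gradient cannot throw the weights far away from the minimum.
        optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3, clipnorm=1.0)
        model.compile(optimizer=optimizer, loss="binary_crossentropy")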

Comments • 174

  • @khalidal-reemi3361
    @khalidal-reemi3361 Před 3 lety +33

    I never got such clear explanation for deep learning concepts.
    I took the Coursera deep learning course. They make it more difficult than it is.
    Thank you Krish.

  • @midhileshmomidi2434
    @midhileshmomidi2434 Před 4 lety +38

    From now on, if anyone asks me about the Vanishing Gradient or Exploding Gradient problem, I will not just answer; I could even teach a class on it.
    The best video I've ever seen

    • @manishsharma2211
      @manishsharma2211 Před 3 lety

      Exactly

    • @kiruthigakumar8557
      @kiruthigakumar8557 Před 3 lety +3

      I have a small doubt: in the vanishing case the values were very small, but here they are high, yet both have the same equation, right? Or is it because the weights in the vanishing case were normal while in the exploding case they are high? Your help is really appreciated.

    • @sargun_narula
      @sargun_narula Před 3 lety

      @@kiruthigakumar8557 even I have the same doubt if anyone can help it would be really appreciated

    • @chiragchauhan8429
      @chiragchauhan8429 Před 3 lety +4

      @@sargun_narula As he said, with sigmoid the activations lie between 0 and 1 and the derivative between 0 and 0.25. If the weights are initialized small, a shallow network with 1 or 2 hidden layers won't have a problem, but with many layers (say 10) the backpropagated derivative keeps shrinking layer after layer, so the optimizer moves extremely slowly towards the minimum; that is the vanishing gradient. For the exploding gradient, if the weights are large, the derivative keeps growing during backpropagation, which can make the optimizer diverge instead of reaching the minimum. Simply put, weights shouldn't be initialized too high or too low.
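
      A rough numeric sketch of the point made above: each backprop factor is roughly sigmoid'(z) * w, and the gradient is a product of such factors across layers. Here 0.25 is the maximum of the sigmoid derivative and the weight values are illustrative.

          sigmoid_deriv_max = 0.25
          for w, label in [(0.5, "small weights"), (500.0, "large weights")]:
              factor = 1.0
              for _ in range(10):                     # 10 hidden layers
                  factor *= sigmoid_deriv_max * w
              print(f"{label}: product of 10 backprop factors ~ {factor:.3g}")

          # small weights -> ~9.3e-10 (vanishing: updates become negligible)
          # large weights -> ~9.3e+20 (exploding: updates overshoot and never converge)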

    • @babupatil2416
      @babupatil2416 Před 3 lety

      ​@@kiruthigakumar8557​Irrespective of your activation function your weights causes the Exploding/Vanishing gradient descent problem. Weights shouldn't be initialized so high or so low. Here is the Andrew Ng video for the same czcams.com/video/qhXZsFVxGKo/video.html

  • @winviki123
    @winviki123 Před 4 lety +38

    Loving this playlist
    Most of these abstract concepts are explained very elegantly
    Thank you so much

  • @tarun4705
    @tarun4705 Před rokem +2

    This playlist is like a treasure.

  • @pushkarajpalnitkar1695

    Best explanation for EXPLODING gradient problem on the internet I have encountered so far. Awesome!

  • @skviknesh
    @skviknesh Před 3 lety +5

    9:32 peak of interest! Happiness in explaining why it will not converge... I love that reaction!!!😍😍😍

  • @rukeshshrestha5938
    @rukeshshrestha5938 Před 4 lety +6

    I really love your videos. Today only i started watching your tutorial. It was really helpful. Thank you so much for sharing your knowledge.

  • @raidblade2307
    @raidblade2307 Před 4 lety +2

    Deep Concepts are getting clear.
    Thank you sir. Such a beautiful explanation

  • @whitemamba7128
    @whitemamba7128 Před 3 lety +3

    Sir, your videos are very educational, and you put a lot of energy into making them. They make the learning process easy, and they also let me develop an interest in deep learning. That's the best I could have asked for, and you delivered it. Thank you, Sir.

  • @slaozturk47
    @slaozturk47 Před rokem

    Your classes are quite clear, thank you so much !!!!

  • @somanathking4694
    @somanathking4694 Před 2 měsíci

    How did I miss this class all these years?
    How come you are able to simplify the topics like this?
    👏

  • @annalyticsannalizaramos5890

    Congrats for a well explained topic. Now I know the effect of exploding gradients

  • @143balug
    @143balug Před 4 lety +1

    Excellent videos, bro. I am getting a clear picture of these concepts. Thank you very much for making the videos in such a clear, understandable manner.
    I am following your every video.

  • @vincenzo3908
    @vincenzo3908 Před 4 lety +1

    Very well explained, and the writings and drawings are very clear too by the way

  • @aravindpiratla2443
    @aravindpiratla2443 Před rokem

    Love the explanation bro... I used to initialize weights randomly but after watching this, I came to know the impact of such initializations...

  • @bigbull266
    @bigbull266 Před 2 lety +5

    The Exploding Gradient Problem is caused by initializing the weights too high. If the weights are large, the gradients computed during backprop will also be large, which makes every weight update [Wnew = Wold - lr * Grad] jump drastically. Because the weights swing by a large amount at every epoch, gradient descent never converges.
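
    One worked step of the update rule quoted above, with purely illustrative numbers (not from the video):

        w_old, lr, grad = 0.5, 0.01, 1e6   # an exploded gradient
        w_new = w_old - lr * grad          # 0.5 - 0.01 * 1e6 = -9999.5
        print(w_new)

    A single update throws the weight far past the minimum, and the next one can throw it just as far back, so the loss oscillates instead of converging.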

  • @emirozgun3368
    @emirozgun3368 Před 4 lety +1

    Pure passion, appreciate it.

  • @ArthurCor-ts2bg
    @ArthurCor-ts2bg Před 4 lety +1

    Very passionate and articulate lecture well done

  • @basharfocke
    @basharfocke Před 11 měsíci

    Best explanation so far. No doubt !!!

  • @janekou2482
    @janekou2482 Před 3 lety

    Awesome explanation! Best video I have seen for this problem.

  • @yogenderkushwaha5523
    @yogenderkushwaha5523 Před 4 lety

    Amazing explanation sir. I am going to learn whole deep learning from your videos only

  • @farzanehparvar_
    @farzanehparvar_ Před 3 lety

    That was one of the best explanations of the exploding gradient problem. But please mention the next video in the description box; I found it hard to locate.

  • @4abdoulaye
    @4abdoulaye Před 4 lety +1

    YOU ARE JUST KIND DUDE. THANKS

  • @DanielSzalko
    @DanielSzalko Před 5 lety +2

    Please keep making videos like this!

  • @tinumathews
    @tinumathews Před 5 lety +3

    This is super, Krish. It's like a story that you explain... at 9:35 the whole picture jumps into your mind. Neat explanation. Nice work, Krish... awaiting more videos. Meet you on Saturday.. till then, cheers.

  • @adityashewale7983
    @adityashewale7983 Před 11 měsíci

    Hats off to you sir, your explanation is top level. Thank you so much for guiding us...

  • @ne2514
    @ne2514 Před 2 lety

    love your video of machine learning algorithms, kudos

  • @anshulzade6355
    @anshulzade6355 Před rokem

    keep up the good work, disrupting the education system. Lots of love

  • @PeyiOyelo
    @PeyiOyelo Před 4 lety +1

    Another Great Video. Namaste

  • @ganeshkharad
    @ganeshkharad Před 4 lety

    Best explanation... thanks for making this video.

  • @harshsharma-jp9uk
    @harshsharma-jp9uk Před 2 lety

    great work.. Kudos to u!!!!!!!!!!

  • @-birigamingcallofduty2219

    Very very effective video sir 👍👍👍👍👍👍....my love and gratitude to you 🙏...

  • @pdteach
    @pdteach Před 4 lety

    Very nice explanation.thanks

  • @praneethcj6544
    @praneethcj6544 Před 4 lety +1

    Excellent ..!!!

  • @indrashispowali
    @indrashispowali Před 2 lety

    thanks Krish... nice explanations

  • @16876
    @16876 Před 4 lety

    awesome video, much respect

  • @Mustafa-jy8el
    @Mustafa-jy8el Před 4 lety +1

    I love the energy

  • @rajaramk1993
    @rajaramk1993 Před 5 lety +1

    excellent and to the point explanation sir. Waiting for your future videos in Deep Learning.

  • @tarunbhatia8652
    @tarunbhatia8652 Před 3 lety

    Best video. Hands down

  • @nitayg1326
    @nitayg1326 Před 4 lety

    Exploding GD explained nicely!

  • @sushantshukla6673
    @sushantshukla6673 Před 4 lety

    u doing great job man

  • @bangarrajumuppidu8354
    @bangarrajumuppidu8354 Před 2 lety

    super explanation sir !!

  • @thunder440v3
    @thunder440v3 Před 4 lety

    Awesome video!

  • @sushmitapoudel8500
    @sushmitapoudel8500 Před 3 lety

    You're great!

  • @nareshbabu9517
    @nareshbabu9517 Před 5 lety +4

    Do tutorial based on machine learning like regression ,classification and clustering sir

  • @emilyme9478
    @emilyme9478 Před 3 lety

    great video !

  • @sarrae100
    @sarrae100 Před 2 lety

    Excellent.

  • @pranjalgupta9427
    @pranjalgupta9427 Před 2 lety +1

    Awesome 😊👏👍

  • @jsverma143
    @jsverma143 Před 4 lety

    just excellent :-)

  • @brindapatel1750
    @brindapatel1750 Před 4 lety

    excellent krish
    love to watch your videos

  • @nitishkumar-bk8kd
    @nitishkumar-bk8kd Před 3 lety

    beautiful explanation

  • @kueen3032
    @kueen3032 Před 3 lety +41

    One correction: dL/dW'11 should be (dL/dO31. dO31/dO21. dO21/dO11. dO11/dW'11)

    • @vikrambharadwaj7072
      @vikrambharadwaj7072 Před 3 lety +3

      In tutorial 6 also there was a correction...!
      is there an explanation

    • @adarshyadav340
      @adarshyadav340 Před 3 lety

      You are right @kueen, krish has missed out the first term in the chain rule.

    • @vvek27
      @vvek27 Před 3 lety

      yes you are right

    • @manojsamal7248
      @manojsamal7248 Před 2 lety

      But what goes into "dL"? Is it (y-Y)^2 or the log loss function that comes into "dL"?

    • @indrashispowali
      @indrashispowali Před 2 lety

      Just wanted to know... does the chain rule here refer to partial derivatives?
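
      For reference, the corrected chain rule from the comment at the top of this thread, written out in full; every factor is a partial derivative, taken while holding the other inputs fixed:

          \frac{\partial L}{\partial w'_{11}}
            = \frac{\partial L}{\partial O_{31}} \cdot
              \frac{\partial O_{31}}{\partial O_{21}} \cdot
              \frac{\partial O_{21}}{\partial O_{11}} \cdot
              \frac{\partial O_{11}}{\partial w'_{11}}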

  • @benvelloor
    @benvelloor Před 3 lety

    Thanks a lot sir

  • @vd.se.17
    @vd.se.17 Před 3 lety

    Thank you.

  • @dhruvajpatil8359
    @dhruvajpatil8359 Před 4 lety

    Too good man !!! #BohotHard

  • @shamussim137
    @shamussim137 Před 3 lety +4

    Question:
    Hi Krish. dO21/dO11 is large because we multiply the derivative of the sigmoid (between 0 and 0.25) by a large weight. However, in Tutorial 7 we didn't use this formula (the chain-rule expansion); we directly said dO21/dO11 is between 0 and 0.25. Please can you clarify this? (See the derivation after this thread.)

    • @hritiknandanwar5095
      @hritiknandanwar5095 Před 2 lety

      Even I have the same question, sir can you please explain this section?

    • @shrikotha3899
      @shrikotha3899 Před 2 lety

      even I have the same doubt.. can u explain this?

    • @aadityabhardwaj4036
      @aadityabhardwaj4036 Před 7 měsíci

      That is because O21 = sigmoid(ff21), and when we take the derivative of the sigmoid with respect to its input, we know it ranges between 0 and 0.25, because the derivative of sigmoid(x) ranges from 0 to 0.25 for any x.
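
      A short derivation for the question in this thread, using the video's notation (the weight value 500 is the one used in the video's example):

          O_{21} = \sigma(z_{21}), \quad z_{21} = w_{21} O_{11} + b_{2}
          \;\;\Rightarrow\;\;
          \frac{\partial O_{21}}{\partial O_{11}} = \sigma'(z_{21}) \, w_{21}

      The sigmoid derivative \sigma'(z_{21}) is always in (0, 0.25], but the weight factor w_{21} is unbounded, so with w_{21} = 500 the product can be as large as 0.25 * 500 = 125, which is the value shown around 8:30. In the vanishing-gradient video the weights were assumed to be small (normally initialized), so the whole factor stayed below 0.25; here the large weights blow it up instead.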

  • @sandipansarkar9211
    @sandipansarkar9211 Před 4 lety

    Superb video once again. But I need to study a little bit of theory. Still no idea how questions are framed in interviews with regard to deep learning.

  • @ronishsharma8825
    @ronishsharma8825 Před 4 lety +16

    The chain rule has a mistake; please correct it.

  • @sindhuorigins
    @sindhuorigins Před 4 lety +2

    The activation function is denoted by phi, not to be confused with the symbol for a cyclic (closed-loop) integral.

  • @sahilsaini3783
    @sahilsaini3783 Před 3 lety +2

    At 08:30, the derivative of O21 w.r.t. O11 is 125, but O21 is a sigmoid output. How can its derivative be 125 when the derivative of the sigmoid function ranges from 0 to 0.25?

  • @quranicscience9631
    @quranicscience9631 Před 4 lety

    very good content

  • @kishanpandey4798
    @kishanpandey4798 Před 4 lety +8

    Please see, the chain rule has missed something at 2:55. @krish naik

    • @omkarrane1347
      @omkarrane1347 Před 4 lety +8

      Yes, there is a mistake; it is missing the dL/dO31 term onwards.

    • @amrousimen684
      @amrousimen684 Před 3 lety

      @@omkarrane1347 yes this is a miss

  • @louerleseigneur4532
    @louerleseigneur4532 Před 3 lety

    Thanks krish

  • @afsheenmaroof6209
    @afsheenmaroof6209 Před 4 lety

    Write a model function to predict y when given weights wi and input x, where
    y = w0 + w1*x
    How can I model this function?
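
    A minimal sketch of what the comment above asks for, assuming NumPy; the function name and example numbers are illustrative:

        import numpy as np

        def predict(weights, x):
            # linear model y = w0 + w1*x
            w0, w1 = weights
            return w0 + w1 * np.asarray(x)

        print(predict([2.0, 0.5], [1.0, 2.0, 3.0]))   # -> [2.5 3.  3.5]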

  • @saikiran-mi3jc
    @saikiran-mi3jc Před 5 lety +1

    Waiting for future videos on DL

  • @YoutubePremium-ny2ys
    @YoutubePremium-ny2ys Před 3 lety

    Request for a video on side by side comparison of vanishing gradient and exploding gradient...

  • @narayanjha3488
    @narayanjha3488 Před 4 lety

    Great videoo

  • @ankurmodi4588
    @ankurmodi4588 Před 3 lety +1

    This likes turn into 1M likes after mid 2021. People do not understand the effort and hard work as they are also not doing anything right now. wait and watch

  • @alphonseinbaraj7602
    @alphonseinbaraj7602 Před 4 lety

    In this video at 5:30 you mentioned w21'. Is this correct? I think it should be w11'', so that z = O11*w11'' + b2 instead of O11*w21 + b2. Am I right or wrong? Please clarify.

  • @y.mamathareddy8699
    @y.mamathareddy8699 Před 4 lety +1

    Sir please make a video on bayes theorem and its concepts learning....

  • @jasbirsingh8849
    @jasbirsingh8849 Před 4 lety +4

    In the vanishing gradient video you directly put values between 0 and 0.25, since the derivative lies in that range, but why not put direct values here?
    I mean, couldn't we have done the same in the vanishing gradient case as well, i.e. expanded the equation and multiplied by its weight?

    • @anshul8258
      @anshul8258 Před 3 lety

      Even i am having the same doubt. After watching this video, I cannot understand why (d O21 / d 011) was directly put between 0 to 0.25 in Vanishing Gradient Problem video.

    • @souravsaha1973
      @souravsaha1973 Před 2 lety

      @krish naik sir, can you please help clarify this doubt

    • @elileman6599
      @elileman6599 Před rokem

      yes it made me confused too

  • @karunasagargundiga5821
    @karunasagargundiga5821 Před 4 lety +3

    Hello sir,
    In the vanishing gradient problem you mentioned that the derivative of sigmoid is always between 0 and 0.25. So the derivative of O21 w.r.t. O11 should be in the range 0-0.25, but when you expanded it we got the answer 125. I did not understand how the derivative of the sigmoid exceeded the range 0-0.25; it seems contradictory. Hope you can clear my doubt, sir.

    • @priyanath2754
      @priyanath2754 Před 4 lety +1

      I am having the same doubt. Can anyone please explain it?

    • @reachDeepNeuron
      @reachDeepNeuron Před 4 lety

      Even I had this question

    • @praneetkuber7210
      @praneetkuber7210 Před 3 lety

      He multiplied 0.25 by the initial weight value w21, which was 500. w21 is the derivative of z w.r.t. O11 in this case.

  • @prakashprasad9218
    @prakashprasad9218 Před 3 lety +1

    The same analysis can be done to explain vanishing gradients. So why do we say ReLU solves the vanishing gradient problem? Low weights can be a problem there as well, even when the derivative is 1, right?

  • @samyakjain8079
    @samyakjain8079 Před 3 lety +1

    @7:47 d(w_21 * O_11) = O_11 dw_21 + w_21 dO_11 (why are you assuming w_21 is constant)
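
    For the question above: backpropagation takes the partial derivative with respect to O_{11}, so w_{21} is held constant at that step and the product rule reduces to

        \frac{\partial}{\partial O_{11}} \left( w_{21} O_{11} + b_{2} \right) = w_{21}

    The derivative of the same term with respect to w_{21}, which gives O_{11}, is used separately when that weight itself is updated.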

  • @pratikkhadse732
    @pratikkhadse732 Před 4 lety

    Doubt: the BIAS that is added, what constitutes this bias.
    For instance Learning rate was found by optimization models, what methodology is used to introduce bias?

  • @ashwinsenthilvel4976
    @ashwinsenthilvel4976 Před 4 lety

    I'm getting confused by what you said at 3:20. Why do you expand dO21/dO11 in this exploding gradient video but not in the vanishing gradient one?

  • @kalpeshnaik8826
    @kalpeshnaik8826 Před 4 lety +1

    Is the exploding gradient problem only for the sigmoid activation function, or for all activation functions?

  • @sumeetseth22
    @sumeetseth22 Před 4 lety

    Love your videos and can't thank you enough. Thank you so much for the awesome lessons.

  • @jagadeeswarareddy9726
    @jagadeeswarareddy9726 Před 3 lety

    Really very good videos. One doubt: high weight values cause this exploding problem, but W-old might also be a large value, right? If so, wouldn't W-old minus dL/dW still not cause a big variance? Please help me.

  • @soodipaj6477
    @soodipaj6477 Před 4 lety

    How do you define O_11 in the first hidden layer?

  • @KamalkaGermany
    @KamalkaGermany Před 2 lety +2

    Shouldn't the derivative be dl/ dw'11 = dl/dO31 and then the rest? Could someone please clarify? Thanks

  • @invisible2836
    @invisible2836 Před 13 dny

    So overall you're saying that if you choose high values of weights, it'll cause problem to reach or maybe will never reach global minima

  • @Adinasa2
    @Adinasa2 Před 4 lety

    On what basis are the weights initialized?

  • @sohamdutta3086
    @sohamdutta3086 Před 6 měsíci

    👍👍

  • @rmn7086
    @rmn7086 Před 3 lety

    Krish Naik bester Mann!

  • @shambhuthakur5562
    @shambhuthakur5562 Před 4 lety +5

    Thanks Krish for the video; however, I didn't understand how you replaced the loss function with the output of the output layer. It should actually be based on the real output minus the predicted output. Please suggest.

    • @shashwatsinha4170
      @shashwatsinha4170 Před 3 lety

      He has just shown that the predicted output will be made input to the loss function (not that predicted output is loss function as you have comprehended)

  • @user-kp5fi6hr5j
    @user-kp5fi6hr5j Před 3 měsíci

    So basically Exploding and vanishing are dependent on how the weights are initialised?

  • @omkarrane1347
    @omkarrane1347 Před 4 lety +3

    Sir, please note that in the last two videos there was a wrong application of the chain rule. Even our teacher, who referred to the video, has written the same mistake in her notes (from the dL/dO31 term onwards).

    • @krishnaik06
      @krishnaik06  Před 4 lety

      I probably made a mistake in the last part

    • @shubhammaurya2658
      @shubhammaurya2658 Před 4 lety +1

      can you explain what is wrong briefly. so I can understand

    • @chinmaybhat9636
      @chinmaybhat9636 Před 4 lety

      Which one is correct then, the one used in this video or the one used in the previous video?

  • @shahariarsarkar3433
    @shahariarsarkar3433 Před 3 lety

    Sir, maybe there is a problem in the chain rule that you explained. Something is missing: the derivative of L with respect to O31.

  • @SimoneIovane
    @SimoneIovane Před 4 lety +2

    Very well explained thanks! I have a doubt tho: Are vanishing and exploding gradient coexistent phenomena? As they both happen in the BP does their happening depend exclusively on the value of the loss at a particular epoch? Hope my question is clear

    • @reachDeepNeuron
      @reachDeepNeuron Před 4 lety

      Even I hv the same question. Appreciate if you can clear

  • @subrataghosh735
    @subrataghosh735 Před 2 lety

    Thanks for the great explanation. One small doubt/clarification would be helpful. Since we have sigmoid, if the weight value is around 2, then dO21/dO11 will be 0.25 * 2 = 0.5, and the chain product ((dO21/dO11) * (dO11/dW11)) will be 0.5 * 0.5 = 0.25, taking the dO11/dW11 weight to be 2 as well. Then instead of exploding, the gradient will be shrinking. Can you please suggest what the thinking is for this scenario?

  • @komandoorideekshith85
    @komandoorideekshith85 Před 3 měsíci

    A small doubt: in another video you said the derivative of the loss w.r.t. the weight equals the derivative of the loss w.r.t. the output, and so on, but in this video you started directly from the output on the RHS. Could you please confirm?

  • @jayanthkothapalli9.2
    @jayanthkothapalli9.2 Před rokem

    Sir, why are you not writing the term dL/d(O31) with the other terms?

  • @muntazirmehdi503
    @muntazirmehdi503 Před 3 lety

    Why are we multiplying O11 by the weights?

  • @jt007rai
    @jt007rai Před 4 lety

    Thanks for this amazing video sir!
    Just to summarize, can I say that I will face this problem only if my weight initialization is very high, the activation function is sigmoid, and the learning rate is also very high, and in no other case?

    • @32deepan
      @32deepan Před 4 lety

      The activation function doesn't matter for exploding gradients to occur. High-magnitude weight initialization alone can cause this problem.

    • @songs-jn1cf
      @songs-jn1cf Před 4 lety

      deepan chakravarthi
      The activation's input is proportional to the weights applied, so the exploding gradient depends indirectly on the activation function and directly on the weights.

    • @manishsharma2211
      @manishsharma2211 Před 3 lety

      The derivative should also be high.

  • @revanthshalon5626
    @revanthshalon5626 Před 4 lety +1

    Sir, is my assumption correct that the exploding gradient problem occurs only when the weights are high, and the vanishing gradient problem occurs when the weights are too low?

  • @anirbandas6122
    @anirbandas6122 Před rokem

    @2:37 you have missed a derivative, dL/dO31, on the RHS.

  • @chaitanyauppuluri6181
    @chaitanyauppuluri6181 Před 3 lety

    why isn't there any weight like W31 for O21

  • @jibinsebastian187
    @jibinsebastian187 Před 2 lety

    How can we assign the weight value as 500? Normalized values are in (-1, 1).

  • @muhammadiqbalbazmi9275

    I don't think that we will face Exploding Gradient problem ever because we use the standard way of initializing weights like Xavier/Glorot(sigmoid, tanh) and 'He_uniform/normal'(ReLU).
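
    A small illustration of the point above, assuming NumPy (the layer sizes are illustrative): Glorot/Xavier draws the initial weights from a small range, so weights like 500 never appear at initialization and the backprop factors start out modest.

        import numpy as np

        fan_in, fan_out = 256, 256
        limit = np.sqrt(6.0 / (fan_in + fan_out))     # Glorot/Xavier uniform limit, ~0.108 here
        w = np.random.uniform(-limit, limit, size=(fan_in, fan_out))
        print(limit, np.abs(w).max())                 # every initial weight is far smaller than 500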