Vanishing & Exploding Gradient explained | A problem resulting from backpropagation
- Added 5 Jul 2024
- Let's discuss a problem that crops up time and time again during the training process of an artificial neural network. This is the problem of unstable gradients, and it is most popularly referred to as the vanishing gradient problem.
We're first going to answer the question: what is the vanishing gradient problem, anyway? Here, we'll cover the idea conceptually. We'll then move on to discussing how this problem occurs. Then, with the understanding we'll have developed up to that point, we'll discuss the problem of exploding gradients. We'll see that it's actually very similar to the vanishing gradient problem, so we'll be able to take what we learned about that problem and apply it to this new one.
VIDEO SECTIONS
00:00 Welcome to DEEPLIZARD - Go to deeplizard.com for learning resources
00:28 Gradient review
01:18 Agenda
01:45 The vanishing gradient problem
03:27 The cause of the vanishing gradients
05:30 Exploding gradient
07:13 Collective Intelligence and the DEEPLIZARD HIVEMIND
DEEPLIZARD COMMUNITY RESOURCES
Hey, we're Chris and Mandy, the creators of deeplizard!
Check out the website for more learning material:
deeplizard.com
ENROLL TO GET DOWNLOAD ACCESS TO CODE FILES
deeplizard.com/resources
Support collective intelligence, join the deeplizard hivemind:
deeplizard.com/hivemind
Use code DEEPLIZARD at checkout to receive 15% off your first Neurohacker order
Use your receipt from Neurohacker to get a discount on deeplizard courses
neurohacker.com/shop?rfsn=648...
CHECK OUT OUR VLOG:
/ deeplizardvlog
Special thanks to the following polymaths of the deeplizard hivemind:
Tammy
Mano Prime
Ling Li
Boost collective intelligence by sharing this video on social media!
Follow deeplizard:
Our vlog: / deeplizardvlog
Facebook: / deeplizard
Instagram: / deeplizard
Twitter: / deeplizard
Patreon: / deeplizard
CZcams: / deeplizard
Deep Learning with deeplizard:
Deep Learning Dictionary - deeplizard.com/course/ddcpailzrd
Deep Learning Fundamentals - deeplizard.com/course/dlcpailzrd
Learn TensorFlow - deeplizard.com/course/tfcpailzrd
Learn PyTorch - deeplizard.com/course/ptcpailzrd
Natural Language Processing - deeplizard.com/course/txtcpai...
Reinforcement Learning - deeplizard.com/course/rlcpailzrd
Generative Adversarial Networks - deeplizard.com/course/gacpailzrd
Other Courses:
DL Fundamentals Classic - deeplizard.com/learn/video/gZ...
Deep Learning Deployment - deeplizard.com/learn/video/SI...
Data Science - deeplizard.com/learn/video/d1...
Trading - deeplizard.com/learn/video/Zp...
Check out products deeplizard recommends on Amazon:
amazon.com/shop/deeplizard
deeplizard uses music by Kevin MacLeod
/ @incompetech_kmac
Please use the knowledge gained from deeplizard content for good, not evil.
Machine Learning / Deep Learning Fundamentals playlist:
czcams.com/play/PLZbbT5o_s2xq7LwI2y8_QtvuXZedL6tQU.html
Keras Machine Learning / Deep Learning Tutorial playlist:
czcams.com/play/PLZbbT5o_s2xrwRnXk_yCPtnqqo4_u2YGL.html
BACKPROP VIDEOS:
Backpropagation explained | Part 1 - The intuition
czcams.com/video/XE3krf3CQls/video.html
Backpropagation explained | Part 2 - The mathematical notation
czcams.com/video/2mSysRx-1c0/video.html
Backpropagation explained | Part 3 - Mathematical observations
czcams.com/video/G5b4jRBKNxw/video.html
Backpropagation explained | Part 4 - Calculating the gradient
czcams.com/video/Zr5viAZGndE/video.html
Backpropagation explained | Part 5 - What puts the "back" in backprop?
czcams.com/video/xClK__CqZnQ/video.html
I was stuck on this concept for hours and didn't click on this video because of the views, but I was wrong. This is the clearest and simplest explanation I've found. Thanks a lot!
The voice is so nice and confident.
Thanks, yu!
@@deeplizard lol, was that intentional? xD
I landed here after checking Andrew's videos about this (which were confusing), but this video explained it very clearly and simply.
Same here. I didn't like his explanation. This was very clear. Thanks!
Yep, his explanations aren't clear sometimes. It's frustrating.
Same. Now I watch other videos first, then move to his lectures.
I fiiiinally understand how the vanishing gradient problem occurs
Very good channel indeed!!!
The best and the simplest explanation of Vanishing Gradient I have found so far.
The best explanation of exploding and vanishing gradients I have come across so far. Great job!
Perhaps a small addition to the explanation for vanishing gradients in this video, from a computer architecture point of view. When a network is trained on an actual computer system the variable types (e.g. floats) have a limited 'resolution' that they can express due to their numerical representation. This means that adding or subtracting a very small value from a value 'x' could actually result in 'x' (unchanged), meaning that the network stopped training that particular weight.
For example: 0.29 - 0.000000001 could become 0.29.
With neural networks moving towards smaller variable types (e.g. 16 bit floats instead of 32) this problem is becoming more pronounced. For a similar reason, floating point representations usually do not become zero, they just approach zero.
Dude, this is super insightful. Thanks!
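To make the limited-resolution point above concrete, here is a minimal NumPy sketch (assuming NumPy is available; the specific numbers are illustrative):

import numpy as np

# In 32-bit floats, subtracting a tiny gradient update from a weight can
# leave the weight bit-for-bit unchanged, which silently stops training it.
w = np.float32(0.29)
update = np.float32(1e-9)

print(w - update == w)   # True: the update is below float32 resolution

# 64-bit floats still resolve the same update.
print(np.float64(0.29) - np.float64(1e-9) == np.float64(0.29))   # False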
What an easy explanation!
I am just loving this playlist and don't want it to ever end!
Thanks, deepak! Be sure to check out the Keras series as well:
czcams.com/play/PLZbbT5o_s2xrwRnXk_yCPtnqqo4_u2YGL.html
Underrated channel! Thanks for posting these videos :)
Wow this explanation was really clear and to the point, subbed immediately, going to check out all the videos over time.
This channel never lets us down. Great work!
Great job explaining this, understood something I was unsure about in a very decisive and clear way. Thanks!
Your videos are just perfect! The voice, the explanation, the animation! Genius of pedagogy :)
Thank you very much for this video! I learned about the similar problems of vanishing and exploding gradients and how they affect the convergence of weight values to their optimal values!
I am having my deep learning exam tomorrow. Started studying just one day before the exam. Couldn't understand anything. Then found your video. Now I understand this concept. Thanks a lot!
The intro is so relaxing. It is like you are in another dimension.
Thank you so much for your series and explanations!
The best source to understand machine learning concepts in an easy way. =)
cannot wait to watch the next video that addresses the vanishing and exploding gradient problem. :)
Have you got around to it yet?
This is the one - Weight initialization: czcams.com/video/8krd5qKVw-Q/video.html
I finally understand vanishing and exploding gradients from your video!! Thanks :)
Well explained. Thanks for your work.
Thanks for explaining the concept so clearly :D
Thank you for such a nice video. Understood the concept.
Crystal Clear Explanation.
Great video, thanks for that! But I have one long-time question about backpropagation. I can adjust which layer's weight I hit by summing up the components. But which weight will actually be updated then? Will the layers in between the components of my chain rule update as well? Would be very grateful for an answer, thanks!
Best explanation!!! Good work.
Yet to see a video that explained this so clearly. One question: do (1) both vanishing and exploding gradients lead to underfitting, or does (2) vanishing lead to underfitting and exploding lead to overfitting?
Thank you so much!!!
Thanks for an amazing tutorial, love from Afghanistan
Awesome video!
good explanation. keep up the great work :)
awesome explanation
Thank you!
In the case of an exploding gradient, when it gets multiplied by the learning rate (between 0.0001 and 0.01), the result will be much smaller (usually less than 1). When this is then subtracted from the existing weight, wouldn't the updated weight still be less than 1? In that case, how is it different from a vanishing gradient?
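For what it's worth, here is the rough arithmetic behind the question above, with made-up magnitudes. The key point is that an exploded gradient can be so large that even after scaling by a small learning rate, the update still dwarfs the weight:

# Hypothetical values, just to illustrate the scales involved.
lr = 0.0001

exploded_grad = 1e6    # a gradient grown huge through many factors > 1
vanished_grad = 1e-7   # a gradient shrunk through many factors < 1

print(lr * exploded_grad)   # 100.0 -> a massive update that overshoots the optimum
print(lr * vanished_grad)   # ~1e-11 -> an update too small to matter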
00:30 What is the gradient
01:18 Introduction
01:45 What is the vanishing gradient problem?
03:28 How does the vanishing gradient problem occur?
05:31 What about exploding gradient?
Added to the description. Thanks so much!
There is a bigger problem with gradient descent, and it is not the vanishing or exploding thing: it is that it gets stuck in local minima, and that hasn't been solved yet. There are only partial solutions like simulated annealing or GAs, UKF, Monte Carlo, and approaches like that, which involve randomness. The only way to find a better minimum is to introduce randomness.
Good description
Hello!
Question - I might sound silly here, but do we ever have a weight update in the positive direction? I mean, say the weight was 0.3, and after the update it turned to 0.4. Since while updating we always subtract gradient * (a very low learning rate), the product that we subtract from the actual weight will always be very small. So unless this product is negative (which only happens when the gradient is negative), we will never add some value to the current weight, but always reduce it, right?
So to make some sense out of it, when do we get negative gradients? Does this generally happen?
Yes, we do get negative gradients. If you look at the formula for backprop, you will easily find when the gradient will turn out to be negative.
Thank you so much
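A tiny sketch of the point settled in this thread, with illustrative numbers: the update rule w = w - lr * gradient moves a weight upward whenever its gradient is negative.

lr = 0.5

# Negative gradient: the subtraction adds to the weight.
w = 0.3
grad = -0.2
w = w - lr * grad
print(round(w, 2))   # 0.4 -> the weight increased

# Positive gradient: the weight decreases.
w = 0.3
grad = 0.2
w = w - lr * grad
print(round(w, 2))   # 0.2 -> the weight decreased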
Does this problem occur in simple neural networks, or in RNN/LSTM networks?
Vanishing gradient is dependent on the learning rate of the model, right?
Excellent #28 follow-up to your playlist #23-27. Thanks.
I would have labeled your layers or edges "a", "b", "c", etc when you were discussing the cumulative effect on gradients that are earlier in the network (gradient = a * b * c * d, etc.). It can be a bit confusing since convention has us thinking one way and notation is reinforcing that while the conversation is about backprop which runs counter to that convention. The groundwork is laid for a very basic misunderstanding that could be cured with simple labels.
Great video, btw.
Appreciate your feedback, Samsung Blues.
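Picking up the gradient = a * b * c * d picture from the comment above, here is a short sketch (with made-up factors) of why long products of per-layer terms tend toward extremes:

# Ten hypothetical per-layer chain-rule factors.
small_factors = [0.5] * 10   # all < 1
large_factors = [1.5] * 10   # all > 1

def product(factors):
    result = 1.0
    for f in factors:
        result *= f
    return result

print(product(small_factors))   # ~0.001 -> the gradient vanishes
print(product(large_factors))   # ~57.7  -> the gradient explodes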
Why is this an issue? If the partial derivative of the loss w.r.t. a weight is small, its change should also be small so that we step in the direction of steepest descent of the loss function. Is the problem of vanishing gradients that we effectively lose the ability to train certain weights of our network, reducing dimensionality of our model?
I spotted a couple slight typos in the article for this video.
we don't perse,
→
we don't per se,
further away from it's optimal value
→
further away from its optimal value
Fixed, thanks Chris! :D
What can we do to prevent it????
I think we should use the ReLU activation function.
The answer is revealed in the following video.
deeplizard.com/learn/video/8krd5qKVw-Q
loved it
Great explanation! Background gives me motion sickness though.
Thanks for the feedback! I'll keep that in mind.
{
"question": "Vanishing Gradient is mostly related to âŠâŠâŠâŠ and is usually caused by having too many values âŠâŠâŠâŠâŠ in calculating the âŠâŠâŠâŠ",
"choices": [
"earlier weights, less than one, gradient",
"earlier weights, less than one, loss",
"back propagation, less than one, gradient",
"back propagation, less than one, loss"
],
"answer": "earlier weights, less than one, gradient",
"creator": "Hivemind",
"creationDate": "2020-04-21T21:46:35.500Z"
}
Thanks, Gideon! Just added your question to deeplizard.com/learn/video/qO_NLVjD6zE :)
A gradient gets subtracted from the weights to update them. This gradient can be really small and hence have no impact. It can also become really large. How come, since the gradient gets subtracted, an exploding gradient creates values that are larger than their former values? Shouldn't it be something like a negative weight then? Nice videos btw. :)
Hey Dennis - Good question and observation. Let me see if I can help clarify. Essentially, vanishing gradient = small gradient update to the weight. Exploding gradient = large gradient update to the weight. With exploding gradient, the large gradient causes a relatively large weight update, which possibly makes the weight completely "jump" over its optimal value. This update could indeed result in a negative weight, and that's fine. The "exploding" is just in terms of how large the gradient is, not how large the weight becomes. In the video, I did illustrate the exploding gradient update with a larger positive number, but it probably would have been more intuitive to show the example with a larger negative number. Does this help clarify?
@@deeplizard I think that gradients can't be greater than 0.25 when using sigmoid as the activation function, as its derivative ranges from 0 to 0.25, thus it will never exceed 1 by any means. I think gradient explosion comes from a weight initialization problem, where weights are initialized with large values. Correct or clarify me, please?
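For what it's worth, a quick NumPy check of the claim above: the sigmoid derivative is indeed capped at 0.25, but each backprop factor is roughly weight * sigmoid'(z), so a sufficiently large weight can still push a single factor past 1, which is consistent with the weight-initialization point. The weight value below is illustrative:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_deriv(z):
    s = sigmoid(z)
    return s * (1.0 - s)

print(sigmoid_deriv(0.0))        # 0.25, the derivative's maximum

w = 8.0                          # a hypothetical large weight
print(w * sigmoid_deriv(0.0))    # 2.0 -> this single factor exceeds 1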
Vanishing gradient is not a problem. It is a feature of stacking something upon something that depends on something else, and so on. It is like falling dominoes, but with an increasing piece at each step. Because you are chaining function after function after function, where all the functions are summing, and at the end doing a ReLU, your output values are going to blow up! Vanishing gradients are not a problem; it is how it is supposed to work. The math is right: at the lower layers you can't use big gradients, because they are going to affect the output layer exponentially.
And also, cut the first minute and a half of the video, because it is just a waste of time.
If the gradient (of loss) is small, doesn't it imply that a very small update is required?
Hey sandesh - Good question.
Michael Nielsen addresses this question in chapter 5 of his book, and I think it's a nice explanation. I'll link to the full chapter, but I've included the relevant excerpt below. Let me know if this helps clarify.
neuralnetworksanddeeplearning.com/chap5.html
"One response to vanishing (or unstable) gradients is to wonder if they're really such a problem. Momentarily stepping away from neural nets, imagine we were trying to numerically minimize a function f(x) of a single variable. Wouldn't it be good news if the derivative fâČ(x) was small? Wouldn't that mean we were already near an extremum? In a similar way, might the small gradient in early layers of a deep network mean that we don't need to do much adjustment of the weights and biases?
Of course, this isn't the case. Recall that we randomly initialized the weight and biases in the network. It is extremely unlikely our initial weights and biases will do a good job at whatever it is we want our network to do. To be concrete, consider the first layer of weights in a [784,30,30,30,10] network for the MNIST problem. The random initialization means the first layer throws away most information about the input image. Even if later layers have been extensively trained, they will still find it extremely difficult to identify the input image, simply because they don't have enough information. And so it can't possibly be the case that not much learning needs to be done in the first layer. If we're going to train deep networks, we need to figure out how to address the vanishing gradient problem."
As the gradients of the sigmoid and tanh functions are very low, wouldn't they lead to the vanishing gradient effect? And why are these chosen as activation functions, deeplizard?
@@deeplizard so what is the reason for the derivative to come out small, if it is not that the function is at an optimum w.r.t. that particular weight?
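To put a rough number on Nielsen's point, here is a sketch under the best-case assumption that every per-layer sigmoid factor reaches its 0.25 maximum (unit weights, pre-activations at 0):

# Best-case sigmoid factor per layer: derivative max 0.25 with weight 1.
n_layers = 20
per_layer_factor = 0.25

gradient_scale = per_layer_factor ** n_layers
print(gradient_scale)   # ~9.1e-13 -> almost no learning signal at layer 1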
But exploding gradients can be handled by batch normalisation, can't they?
OMG I UNDERSTAND NOW
What if some intermediate gradients are large (> 1) and some are small (< 1)? They could balance out to give the early gradients still-normal sizes. The described problem seems to stem not from a defect in the theory, but rather from practical numerical implementation concerns. The heuristic explanation is a bit hand-wavy. When should we expect a vanishing problem, and when should we expect an exploding problem? Is one vs. the other purely random? If they are not random but depend on the nature of the data or the NN architecture, what are the dependencies and why?
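One way to probe the could-they-balance-out question above is to multiply random per-layer factors together. The log of the product is a sum of independent terms, a random walk, so the product tends to drift far from 1 as depth grows rather than hovering near it. The distribution below is an arbitrary choice for illustration:

import numpy as np

rng = np.random.default_rng(0)

# 50 random per-layer factors per trial, centered so the median factor is 1.
for trial in range(5):
    factors = rng.lognormal(mean=0.0, sigma=0.5, size=50)
    print(np.prod(factors))   # typically far above or far below 1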
As the gradients of the sigmoid and tanh functions are very low, wouldn't they lead to the vanishing gradient effect? And why are these chosen as activation functions?
Hey Vinay - Check out the video on bias below to see how the result from a given activation function may actually be "shifted" to a greater number, which in turn might help to reduce vanishing gradient. czcams.com/video/HetFihsXSys/video.html
Let me know if this helps.
4:22 this is all wrong. Updates can be positive or negative. This depends on the direction of the slope: if the slope is positive, the update is going to be subtracted; if the slope is negative, the weight update is going to be added. So weights don't just get smaller and smaller; they are updated in small quantities, that's it. If you wait long enough (like weeks or months), your ANN training will complete just fine. You should learn about derivatives of composite functions, and then you will understand that vanishing gradient is not a problem. It can be a problem, though, if you use the float32 data type (single precision), because it has considerable error over long chains of calculations. Switching to double will help with the vanishing gradient "problem" (in quotes).
Nothing about the sigmoid function? I thought this was also one of the causes of a vanishing/exploding gradient?
Hmm... possible solutions, my guess is (A) too many layers in the net, or (B) a separate Learning Rate for each layer used in conjunction with a function to normalize the Learning Rate and thus the gradient.
But this is just my guess.
Nice! Not sure what type of impact these potential solutions may have. A known solution is revealed in the next episode.
@@deeplizard after spending a further ten seconds thinking about this, I decided I didn't like my "solutions". But I did think that "every first solution should be: use fewer layers" was not a bad philosophy.
May I know who edits your videos? Because whoever does, he/she has humor.
Haha thanks, kareem! There are two of us behind deeplizard. Myself, which you hear in this video, and my partner, who you'll hear in other videos like our PyTorch series. Together, we run the entire assembly line from topic creation and recording to production and editing, etc. :)
Is there any reason behind the name of your channel?
_I could a tale unfold whose lightest word_
_Would harrow up thy soul._
{
"question": "Vanishing gradient impair our training by making our weight being updated and their values getting further away from the optimal weights value",
"choices": [
"False",
"True",
"BLANK_SPACE",
"BLANK_SPACE"
],
"answer": "False",
"creator": "Hivemind",
"creationDate": "2020-04-21T21:53:20.326Z"
}
Thanks, Gideon! Just added your question to deeplizard.com/learn/video/qO_NLVjD6zE :)
math-less machine learning is so good :D
Love the series, but please refrain from zoom-in zoom-out background animation - it makes me distracted, nauseous even.
Thank you for the feedback!
The background image is really unnecessary and is rather disturbing
You are great. I feel like I want to marry you for your intelligence.
I wish you wouldn't talk so fast. Can't keep up. Or subtitles would be nice.
There are English subtitles automatically created by CZcams that you can turn on for this video. Also, you can slow down the speed on the video settings to 75% or 50% of the real-time speed to see if that helps.
Please talk slowly.
I want to marry you