Vanishing & Exploding Gradient explained | A problem resulting from backpropagation
- Added 5 Jul 2024
- Let's discuss a problem that crops up time and time again during the training process of an artificial neural network. This is the problem of unstable gradients, and it is most popularly referred to as the vanishing gradient problem.
We're first going to answer the question: what is the vanishing gradient problem, anyway? Here, we'll cover the idea conceptually. We'll then move on to discussing how this problem occurs. Then, with the understanding we'll have developed up to that point, we'll discuss the problem of exploding gradients. We'll see that it's actually very similar to the vanishing gradient problem, so we'll be able to take what we learned about that problem and apply it to this new one.
VIDEO SECTIONS
00:00 Welcome to DEEPLIZARD - Go to deeplizard.com for learning resources
00:28 Gradient review
01:18 Agenda
01:45 The vanishing gradient problem
03:27 The cause of the vanishing gradients
05:30 Exploding gradient
07:13 Collective Intelligence and the DEEPLIZARD HIVEMIND
DEEPLIZARD COMMUNITY RESOURCES
Hey, we're Chris and Mandy, the creators of deeplizard!
Check out the website for more learning material:
deeplizard.com
ENROLL TO GET DOWNLOAD ACCESS TO CODE FILES
deeplizard.com/resources
Support collective intelligence, join the deeplizard hivemind:
deeplizard.com/hivemind
Use code DEEPLIZARD at checkout to receive 15% off your first Neurohacker order
Use your receipt from Neurohacker to get a discount on deeplizard courses
neurohacker.com/shop?rfsn=648...
CHECK OUT OUR VLOG:
/ deeplizardvlog
Special thanks to the following polymaths of the deeplizard hivemind:
Tammy
Mano Prime
Ling Li
Boost collective intelligence by sharing this video on social media!
Follow deeplizard:
Our vlog: / deeplizardvlog
Facebook: / deeplizard
Instagram: / deeplizard
Twitter: / deeplizard
Patreon: / deeplizard
CZcams: / deeplizard
Deep Learning with deeplizard:
Deep Learning Dictionary - deeplizard.com/course/ddcpailzrd
Deep Learning Fundamentals - deeplizard.com/course/dlcpailzrd
Learn TensorFlow - deeplizard.com/course/tfcpailzrd
Learn PyTorch - deeplizard.com/course/ptcpailzrd
Natural Language Processing - deeplizard.com/course/txtcpai...
Reinforcement Learning - deeplizard.com/course/rlcpailzrd
Generative Adversarial Networks - deeplizard.com/course/gacpailzrd
Other Courses:
DL Fundamentals Classic - deeplizard.com/learn/video/gZ...
Deep Learning Deployment - deeplizard.com/learn/video/SI...
Data Science - deeplizard.com/learn/video/d1...
Trading - deeplizard.com/learn/video/Zp...
Check out products deeplizard recommends on Amazon:
amazon.com/shop/deeplizard
deeplizard uses music by Kevin MacLeod
/ @incompetech_kmac
Please use the knowledge gained from deeplizard content for good, not evil.
Machine Learning / Deep Learning Fundamentals playlist:
czcams.com/play/PLZbbT5o_s2xq7LwI2y8_QtvuXZedL6tQU.html
Keras Machine Learning / Deep Learning Tutorial playlist:
czcams.com/play/PLZbbT5o_s2xrwRnXk_yCPtnqqo4_u2YGL.html
BACKPROP VIDEOS:
Backpropagation explained | Part 1 - The intuition
czcams.com/video/XE3krf3CQls/video.html
Backpropagation explained | Part 2 - The mathematical notation
czcams.com/video/2mSysRx-1c0/video.html
Backpropagation explained | Part 3 - Mathematical observations
czcams.com/video/G5b4jRBKNxw/video.html
Backpropagation explained | Part 4 - Calculating the gradient
czcams.com/video/Zr5viAZGndE/video.html
Backpropagation explained | Part 5 - What puts the "back" in backprop?
czcams.com/video/xClK__CqZnQ/video.html
I was stuck on this concept for hours and didn't click on this video because of the views, but I was wrong. This is the clearest and simplest explanation I've found. Thanks a lot!
The voice is so nice and confident.
Thanks, yu!
@@deeplizard lol, was that intentional? xD
I landed here after checking Andrew's videos about this (which were confusing), but this video explained it very clearly and simply.
Same here. I didn't like his explanation. This was very clear. Thanks!
Yep, his explanations aren't clear sometimes. It's frustrating.
Same. Now I watch other videos first, then move to his lectures.
I fiiiinally understand how the vanishing gradient problem occurs
Very good channel indeed!!!
The best and the simplest explanation of Vanishing Gradient I have found so far.
The best explanation of exploding and vanishing gradients I have come across so far. Great job!
Perhaps a small addition to the explanation for vanishing gradients in this video, from a computer architecture point of view. When a network is trained on an actual computer system the variable types (e.g. floats) have a limited 'resolution' that they can express due to their numerical representation. This means that adding or subtracting a very small value from a value 'x' could actually result in 'x' (unchanged), meaning that the network stopped training that particular weight.
For example: 0.29 - 0.000000001 could become 0.29.
With neural networks moving towards smaller variable types (e.g. 16 bit floats instead of 32) this problem is becoming more pronounced. For a similar reason, floating point representations usually do not become zero, they just approach zero.
Dude, this is super insightful. Thanks!
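To make the limited-resolution point above concrete, here is a minimal NumPy sketch (assuming NumPy is available; the specific numbers are illustrative):

import numpy as np

# In 32-bit floats, subtracting a tiny gradient update from a weight can
# leave the weight bit-for-bit unchanged, which silently stops training it.
w = np.float32(0.29)
update = np.float32(1e-9)

print(w - update == w)   # True: the update is below float32 resolution

# 64-bit floats still resolve the same update.
print(np.float64(0.29) - np.float64(1e-9) == np.float64(0.29))   # False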
What an easy explanation!
I am just loving this playlist and don't want it to ever end!
Thanks, deepak! Be sure to check out the Keras series as well:
czcams.com/play/PLZbbT5o_s2xrwRnXk_yCPtnqqo4_u2YGL.html
Underrated channel! Thanks for posting these videos :)
Wow this explanation was really clear and to the point, subbed immediately, going to check out all the videos over time.
This channel never lets us down. Great work!
Great job explaining this, understood something I was unsure about in a very decisive and clear way. Thanks!
Your videos are just perfect! The voice, the explanation, the animation! Genius of pedagogy :)
Thank you very much for this video! I learned about the similar problems of vanishing and exploding gradients and how they affect the convergence of weight values to their optimal values!
I am having my deep learning exam tomorrow. Started studying just one day before the exam. Couldn't understand anything. Then found your video. Now I understand this concept. Thanks a lot!
The intro is so relaxing. It is like you are in another dimension.
Thank you so much for your series and explanations!
The best source to understand machine learning concepts in an easy way. =)
cannot wait to watch the next video that addresses the vanishing and exploding gradient problem. :)
Have you got around to it yet?
This is the one - Weight initialization: czcams.com/video/8krd5qKVw-Q/video.html
I finally understand vanishing and exploding gradients from your video!! Thanks :)
Well explained. Thanks for your work.
Thanks for explaining the concept so clearly :D
Thank you for such a nice video. Understood the concept.
Crystal Clear Explanation.
Great video, thanks for that! But I have one long-time question about backpropagation. I can adjust which layer's weight I hit by summing up the components. But which weight will actually be updated then? Will the layers in between the components of my chain rule update as well? Would be very grateful for an answer, thanks!
Best explanation!!! Good work.
Yet to see a video that explained this so clearly. One question: do (1) both vanishing and exploding gradients lead to underfitting, or does (2) vanishing lead to underfitting and exploding lead to overfitting?
Thank you so much!!!
Thanks for an amazing tutorial, love from Afghanistan
Awesome video!
good explanation. keep up the great work :)
awesome explanation
Thank you!
In the case of an exploding gradient, when it gets multiplied by the learning rate (between 0.0001 and 0.01), the result will be much smaller (usually less than 1). When this is then subtracted from the existing weight, wouldn't the updated weight still be less than 1? In that case, how is it different from a vanishing gradient?
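For what it's worth, here is the rough arithmetic behind the question above, with made-up magnitudes. The key point is that an exploded gradient can be so large that even after scaling by a small learning rate, the update still dwarfs the weight:

# Hypothetical values, just to illustrate the scales involved.
lr = 0.0001

exploded_grad = 1e6    # a gradient grown huge through many factors > 1
vanished_grad = 1e-7   # a gradient shrunk through many factors < 1

print(lr * exploded_grad)   # 100.0 -> a massive update that overshoots the optimum
print(lr * vanished_grad)   # ~1e-11 -> an update too small to matter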
00:30 What is the gradient
01:18 Introduction
01:45 What is the vanishing gradient problem?
03:28 How does the vanishing gradient problem occur?
05:31 What about exploding gradient?
Added to the description. Thanks so much!
There is a bigger problem with gradient descent, and it is not the vanishing or exploding thing: it is that it gets stuck in local minima, and that hasn't been solved yet. There are only partial solutions like simulated annealing or GAs, UKF, Monte Carlo, and approaches like that, which involve randomness. The only way to find a better minimum is to introduce randomness.
Good description
Hello!
Question - I might sound silly here, but do we ever have a weight update in the positive direction? I mean, say the weight was 0.3, and after the update it turned to 0.4. Since while updating we always subtract gradient * (a very low learning rate), the product that we subtract from the actual weight will always be very small. So unless this product is negative (which only happens when the gradient is negative), we will never add some value to the current weight, but always reduce it, right?
So to make some sense out of it, when do we get negative gradients? Does this generally happen?
Yes, we do get negative gradients. If you look at the formula for backprop, you will easily find when the gradient will turn out to be negative.
Thank you so much
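A tiny sketch of the point settled in this thread, with illustrative numbers: the update rule w = w - lr * gradient moves a weight upward whenever its gradient is negative.

lr = 0.5

# Negative gradient: the subtraction adds to the weight.
w = 0.3
grad = -0.2
w = w - lr * grad
print(round(w, 2))   # 0.4 -> the weight increased

# Positive gradient: the weight decreases.
w = 0.3
grad = 0.2
w = w - lr * grad
print(round(w, 2))   # 0.2 -> the weight decreased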
Does this problem occur in simple neural networks, or in RNN/LSTM networks?
Vanishing gradient is dependent on the learning rate of the model, right?
Excellent #28 follow-up to your playlist #23-27. Thanks.
I would have labeled your layers or edges "a", "b", "c", etc when you were discussing the cumulative effect on gradients that are earlier in the network (gradient = a * b * c * d, etc.). It can be a bit confusing since convention has us thinking one way and notation is reinforcing that while the conversation is about backprop which runs counter to that convention. The groundwork is laid for a very basic misunderstanding that could be cured with simple labels.
Great video, btw.
Appreciate your feedback, Samsung Blues.
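Picking up the gradient = a * b * c * d picture from the comment above, here is a short sketch (with made-up factors) of why long products of per-layer terms tend toward extremes:

# Ten hypothetical per-layer chain-rule factors.
small_factors = [0.5] * 10   # all < 1
large_factors = [1.5] * 10   # all > 1

def product(factors):
    result = 1.0
    for f in factors:
        result *= f
    return result

print(product(small_factors))   # ~0.001 -> the gradient vanishes
print(product(large_factors))   # ~57.7  -> the gradient explodes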
Why is this an issue? If the partial derivative of the loss w.r.t. a weight is small, its change should also be small so that we step in the direction of steepest descent of the loss function. Is the problem of vanishing gradients that we effectively lose the ability to train certain weights of our network, reducing dimensionality of our model?
I spotted a couple slight typos in the article for this video.
we don't perse,
→
we don't per se,
further away from it's optimal value
→
further away from its optimal value
Fixed, thanks Chris! :D
What can we do to prevent it????
I think we should use the ReLU activation function.
The answer is revealed in the following video.
deeplizard.com/learn/video/8krd5qKVw-Q
loved it
Great explanation! Background gives me motion sickness though.
Thanks for the feedback! I'll keep that in mind.
{
"question": "Vanishing Gradient is mostly related to âŠâŠâŠâŠ and is usually caused by having too many values âŠâŠâŠâŠâŠ in calculating the âŠâŠâŠâŠ",
"choices": [
"earlier weights, less than one, gradient",
"earlier weights, less than one, loss",
"back propagation, less than one, gradient",
"back propagation, less than one, loss"
],
"answer": "earlier weights, less than one, gradient",
"creator": "Hivemind",
"creationDate": "2020-04-21T21:46:35.500Z"
}
Thanks, Gideon! Just added your question to deeplizard.com/learn/video/qO_NLVjD6zE :)
A gradient gets subtracted from the weights to update them. This gradient can be really small and hence have no impact. It can also become really large. How come, since the gradient gets subtracted, an exploding gradient creates values that are larger than their former values? Shouldn't it be something like a negative weight then? Nice videos btw. :)
Hey Dennis - Good question and observation. Let me see if I can help clarify. Essentially, vanishing gradient = small gradient update to the weight. Exploding gradient = large gradient update to the weight. With exploding gradient, the large gradient causes a relatively large weight update, which possibly makes the weight completely "jump" over its optimal value. This update could indeed result in a negative weight, and that's fine. The "exploding" is just in terms of how large the gradient is, not how large the weight becomes. In the video, I did illustrate the exploding gradient update with a larger positive number, but it probably would have been more intuitive to show the example with a larger negative number. Does this help clarify?
@@deeplizard I think that gradients can't be greater than 0.25 when using sigmoid as the activation function, as its derivative ranges from 0 to 0.25, thus it will never exceed 1 by any means. I think gradient explosion comes from a weight initialization problem, where weights are initialized with large values. Correct or clarify me, please?
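For what it's worth, a quick NumPy check of the claim above: the sigmoid derivative is indeed capped at 0.25, but each backprop factor is roughly weight * sigmoid'(z), so a sufficiently large weight can still push a single factor past 1, which is consistent with the weight-initialization point. The weight value below is illustrative:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_deriv(z):
    s = sigmoid(z)
    return s * (1.0 - s)

print(sigmoid_deriv(0.0))        # 0.25, the derivative's maximum

w = 8.0                          # a hypothetical large weight
print(w * sigmoid_deriv(0.0))    # 2.0 -> this single factor exceeds 1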
Vanishing gradient is not a problem. It is a feature of stacking something upon something that depends on something else, and so on. It is like falling dominoes, but with an increasing piece at each step. Because you are chaining function after function after function, where all the functions are summing, and at the end doing a ReLU, your output values are going to blow up! Vanishing gradients are not a problem; it is how it is supposed to work. The math is right: at the lower layers you can't use big gradients, because they are going to affect the output layer exponentially.
And also, cut the first minute and a half of the video, because it is just a waste of time.
If the gradient (of loss) is small, doesn't it imply that a very small update is required?
Hey sandesh - Good question.
Michael Nielsen addresses this question in chapter 5 of his book, and I think it's a nice explanation. I'll link to the full chapter, but I've included the relevant excerpt below. Let me know if this helps clarify.
neuralnetworksanddeeplearning.com/chap5.html
"One response to vanishing (or unstable) gradients is to wonder if they're really such a problem. Momentarily stepping away from neural nets, imagine we were trying to numerically minimize a function f(x) of a single variable. Wouldn't it be good news if the derivative fâČ(x) was small? Wouldn't that mean we were already near an extremum? In a similar way, might the small gradient in early layers of a deep network mean that we don't need to do much adjustment of the weights and biases?
Of course, this isn't the case. Recall that we randomly initialized the weight and biases in the network. It is extremely unlikely our initial weights and biases will do a good job at whatever it is we want our network to do. To be concrete, consider the first layer of weights in a [784,30,30,30,10] network for the MNIST problem. The random initialization means the first layer throws away most information about the input image. Even if later layers have been extensively trained, they will still find it extremely difficult to identify the input image, simply because they don't have enough information. And so it can't possibly be the case that not much learning needs to be done in the first layer. If we're going to train deep networks, we need to figure out how to address the vanishing gradient problem."
As the gradients of the sigmoid and tanh functions are very low, wouldn't they lead to the vanishing gradient effect? And why are these chosen as activation functions, deeplizard?
@@deeplizard so what is the reason for the derivative to come out small, if it is not that the function is at an optimum w.r.t. that particular weight?
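To put a rough number on Nielsen's point, here is a sketch under the best-case assumption that every per-layer sigmoid factor reaches its 0.25 maximum (unit weights, pre-activations at 0):

# Best-case sigmoid factor per layer: derivative max 0.25 with weight 1.
n_layers = 20
per_layer_factor = 0.25

gradient_scale = per_layer_factor ** n_layers
print(gradient_scale)   # ~9.1e-13 -> almost no learning signal at layer 1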
But exploding gradients can be handled by batch normalisation, can't they?
OMG I UNDERSTAND NOW
What if some intermediate gradients are large (> 1) and some are small (< 1)? They could balance out to give the early gradients still-normal sizes. The described problem seems to stem not from a defect in the theory, but rather from practical numerical implementation concerns. The heuristic explanation is a bit hand-wavy. When should we expect a vanishing problem, and when should we expect an exploding problem? Is one vs. the other purely random? If they are not random but depend on the nature of the data or the NN architecture, what are the dependencies and why?
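One way to probe the could-they-balance-out question above is to multiply random per-layer factors together. The log of the product is a sum of independent terms, a random walk, so the product tends to drift far from 1 as depth grows rather than hovering near it. The distribution below is an arbitrary choice for illustration:

import numpy as np

rng = np.random.default_rng(0)

# 50 random per-layer factors per trial, centered so the median factor is 1.
for trial in range(5):
    factors = rng.lognormal(mean=0.0, sigma=0.5, size=50)
    print(np.prod(factors))   # typically far above or far below 1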
As the gradients of the sigmoid and tanh functions are very low, wouldn't they lead to the vanishing gradient effect? And why are these chosen as activation functions?
Hey Vinay - Check out the video on bias below to see how the result from a given activation function may actually be "shifted" to a greater number, which in turn might help to reduce vanishing gradient. czcams.com/video/HetFihsXSys/video.html
Let me know if this helps.
4:22 this is all wrong. Updates can be positive or negative. This depends on the direction of the slope: if the slope is positive, the update is going to be subtracted; if the slope is negative, the weight update is going to be added. So weights don't just get smaller and smaller; they are updated in small quantities, that's it. If you wait long enough (like weeks or months), your ANN training will complete just fine. You should learn about derivatives of composite functions, and then you will understand that vanishing gradient is not a problem. It can be a problem, though, if you use the float32 data type (single precision), because it has considerable error over long chains of calculations. Switching to double will help with the vanishing gradient "problem" (in quotes).
Nothing about the sigmoid function? I thought this was also one of the causes of a vanishing/exploding gradient?
Hmm... possible solutions, my guess is (A) too many layers in the net, or (B) a separate Learning Rate for each layer used in conjunction with a function to normalize the Learning Rate and thus the gradient.
But this is just my guess.
Nice! Not sure what type of impact these potential solutions may have. A known solution is revealed in the next episode.
@@deeplizard after spending a further ten seconds thinking about this, I decided I didn't like my "solutions". But I did think that "every first solution should be: use fewer layers" was not a bad philosophy.
May I know who edits your videos? Because whoever does, he/she has humor.
Haha thanks, kareem! There are two of us behind deeplizard. Myself, which you hear in this video, and my partner, who you'll hear in other videos like our PyTorch series. Together, we run the entire assembly line from topic creation and recording to production and editing, etc. :)
Is there any reason behind the name of your channel?
_I could a tale unfold whose lightest word_
_Would harrow up thy soul._
{
"question": "Vanishing gradient impair our training by making our weight being updated and their values getting further away from the optimal weights value",
"choices": [
"False",
"True",
"BLANK_SPACE",
"BLANK_SPACE"
],
"answer": "False",
"creator": "Hivemind",
"creationDate": "2020-04-21T21:53:20.326Z"
}
Thanks, Gideon! Just added your question to deeplizard.com/learn/video/qO_NLVjD6zE :)
math-less machine learning is so good :D
Love the series, but please refrain from zoom-in zoom-out background animation - it makes me distracted, nauseous even.
Thank you for the feedback!
The background image is really unnecessary and is rather disturbing
You are great. I feel like I want to marry you for your intelligence.
I wish you wouldn't talk so fast. Can't keep up. Or subtitles would be nice.
There are English subtitles automatically created by CZcams that you can turn on for this video. Also, you can slow down the speed on the video settings to 75% or 50% of the real-time speed to see if that helps.
Please talk slowly.
I want to marry you