Tutorial 14- Stochastic Gradient Descent with Momentum
- Added 24 Jul 2024
- In this post I'll talk about a simple addition to the classic SGD algorithm, called momentum, which almost always works better and faster than plain Stochastic Gradient Descent. Momentum, or SGD with momentum, is a method that helps accelerate the gradient vectors in the right directions, leading to faster convergence. It is one of the most popular optimization algorithms, and many state-of-the-art models are trained using it. Before jumping to the update equations of the algorithm, let's look at some of the math that underlies how momentum works.
Below are the various playlists created on ML, Data Science and Deep Learning. Please subscribe and support the channel. Happy learning!
Deep Learning Playlist: • Tutorial 1- Introducti...
Data Science Projects playlist: • Generative Adversarial...
NLP playlist: • Natural Language Proce...
Statistics Playlist: • Population vs Sample i...
Feature Engineering playlist: • Feature Engineering in...
Computer Vision playlist: • OpenCV Installation | ...
Data Science Interview Question playlist: • Complete Life Cycle of...
You can buy my book on Finance with Machine Learning and Deep Learning from the URL below
amazon url: www.amazon.in/Hands-Python-Fi...
🙏🙏🙏🙏🙏🙏🙏🙏
YOU JUST NEED TO DO
3 THINGS to support my channel
LIKE
SHARE
&
SUBSCRIBE
TO MY YOUTUBE CHANNEL
You're doing really great. It's really good that you're focusing on the theory part and making it crystal clear for everyone.
Krish, you are doing a really great job. Even though I completed my MSc. in Data Science and have some work experience, I am learning so much more from your tutorials. Lots of love from Saudi Arabia 😃
Madam, are there any job opportunities for data scientists or IT experts in Saudi Arabia?
Understanding the concept is very important. When I started deep learning, I was not able to understand any terminology. After watching your tutorials, I am able to correlate everything. Thank you so much.
If you are confused at 11:30 by the SGD momentum equation, I will try to write out all the equations again.
Weight update formula:
w2 = w1 - (learning_rate * dL/dw1)
Define a new variable g1 = dL/dw1
and v1 = learning_rate * g1,
so you can write the weight update formula again as
w2 = w1 - v1
Now come to the exponential moving average part:
v1 = learning_rate * g1
v2 = gamma * v1 + (learning_rate * g2)
v_n = gamma * v_(n-1) + (learning_rate * g_n)
So the final equation will be
w_n = w_(n-1) - v_n
Case 1: if gamma is 0, then
w_n = w_(n-1) - learning_rate * g_n
Case 2: if gamma is not 0, then
w_n = w_(n-1) - v_n = w_(n-1) - (gamma * v_(n-1) + (learning_rate * g_n))
Thanks!!!
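The update equations in the comment above can be sketched in plain Python (a minimal illustration; the learning_rate and gamma values and the toy loss are made up):

```python
# A minimal plain-Python sketch of the SGD-with-momentum equations above
# (learning_rate and gamma values are illustrative, not from the video).

def sgd_momentum_step(w, v, grad, learning_rate=0.01, gamma=0.9):
    """One update: v_n = gamma * v_(n-1) + learning_rate * g_n; w_n = w_(n-1) - v_n."""
    v = gamma * v + learning_rate * grad
    w = w - v
    return w, v

# Toy problem: minimise L(w) = w^2, whose gradient is dL/dw = 2 * w.
w, v = 5.0, 0.0
for _ in range(100):
    w, v = sgd_momentum_step(w, v, grad=2 * w)
```

With gamma = 0 this reduces to plain SGD (case 1 above); with gamma > 0 each step carries a decaying memory of earlier gradients (case 2).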
If w2 = w1 - v1, then shouldn't it be w_n = w_(n-1) - v_(n-1)?
Great work ! Thanks
@@bijayadhikari3904 right no
So basically, optimizers do the weight calculation?
Yes, we need to understand the basic concepts and then we can apply them practically. Well-organized lecture topics. Great, keep going, sir.
Awesome videos :) I was always confused by the momentum concept in the optimizer; now I understand it crystal clear.
I just love you, Krish. No need to search the Web, just Krish Naik is there to clear all the ideas. I like your approach of teaching theory first and then practical. Doing practical without clearing theory is useless. Thank you.
Thank you for explaining SGD+Momentum. I have a much more intuitive understanding of the method now.
Awesome work, sir! Your sequence of topics is very well organized.
That was a great video. Hope my understanding continues till the end. Only need to know one thing: you don't have to remember all the things, just know what is going on. That's all. Thanks
Loved this different take on SGD
Awesome work dude. Really Like your videos..keep going
You are amazing. Please do not stop making videos.
Utmost respect, sir... I was looking for this theory, and the way you explained it is just great.
So crystal clear!! Good job
GREAT WORK BY YOU SIR!
Nice video. Very intuitive.
Thank you very much!!! very helpful
Thanks!! Great video
Excellent Lecture, Krish.....
Hi sir, I have a doubt: what is the failure mechanism in existing systems for deep learning network optimization?
Thank you lovely guy !
Thanks a lot sir!
Do we drop the learning rate from the weight update equation?
Nicely Explained.
Continue your work. The theoretical concepts are very important; the practical implementations won't take much time.
Good sir, you are brilliant
very nice video thanks
What about Nesterov momentum? Is it similar to the moving average concept?
Hi Krish, do we want to find the global minimum for each batch of data?
Very well explained. I have not seen any other tutorial with so much emphasis on the foundations. By the way, your video goes out of focus at times; maybe your camera is set to autofocus.
excellent bro
neat video!
So helpful for me
Could you please explain about Adam optimizer ?????
I do not yet understand how the gamma connects when using a batch selection of rewards/outputs; there is no way to give an order, and all of them have the same gamma applied.
Good explanation
Krish, in mini-batch SGD, is the weight updated after every batch? For example, considering 100 data points, will the weight be updated after those 100 data points? I was confused about this.
really helpful
Thanks Krish
well explained.
Any suggestions for books with a practical approach to deep learning, NLP, and generative AI? I am mainly looking for coding references for after I complete this playlist. Please suggest some easily understandable and practical books.
So can we say that the reduction in noise depends on the value of gamma? The lesser the value of gamma, the more the reduction in noise?
When you say time interval, does it refer to an epoch with a mini-batch? Also, is the noise the noise created by varying loss values?
It represents each iteration in an epoch.
For example, if the data set has 100 data points and we choose the mini-batch size to be 10, the number of iterations per epoch will be 100/10 = 10.
Once 10 iterations are completed, one epoch gets completed.
Noise refers to the deflected paths followed by the weights on their way to the global minimum. The noise is induced because the neurons are only exposed to a portion of the data set per iteration.
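The iteration count described in the reply above can be written as a tiny sketch (the sample and batch counts are the hypothetical numbers from that example):

```python
# Hypothetical numbers from the comment above: 100 data points, batch size 10.
n_samples, batch_size = 100, 10
iterations_per_epoch = n_samples // batch_size  # one epoch = 10 iterations
```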
These videos are really helpful for understanding the basics of deep learning. Keep going, sir. Where can I find the practical implementation? I'm doing a project on deep learning; where can I start my coding? If you can suggest something, that would be of great help.
At 10:30, why is the learning rate not multiplied by the term \gamma V_t?
SGD with momentum: in the last part at 11:30, it should be V(t+1), because we are predicting a future value, and hence V(t) will be the most recent known value.
No, it will be V(t), since V(t2) = gamma * V(t1) + learning_rate * g(t2). Similarly you can calculate for V(t) as well.
@@robinredhu1995 yeah, it will be V(t)
very good
At 4:20, are the points supposed to be on the curve or not?
I thought that momentum was used to prevent converging to a local minimum. I wasn't aware that it also helped with noise reduction for SGD. It does both, right?
I don't believe any descent method in the non convex scenario will take us easily to a global minimizer. Momentum only improves the speed of convergence. Steepest descent is very slow in general.
Not sure the oscillation would be along the surface; it should be on both sides of the minimum.
Shouldn't the last equation be V(t) instead of V(t-1)?
agreed
thought the same as well
It should be V(t+1), because we are predicting a future value, and hence V(t) will be the most recent known value.
@@adityachandra2462 We are calculating V(t-1)
For the first data point he considers it as '1'; while calculating the momentum for the 2nd data point he uses V(t-1); if it were the 3rd data point it would be V(t-2), and so on.
This is just my understanding; I haven't done any research.
The contour plot on the first screen: is it L(w) vs w? Should it not be w1 vs w2 (or b), with L(w) perpendicular to the screen?
Yes you are right .
At 10:54... not sure how the equation is formed...
Sir, please make a video on time series analysis and the ARIMA model.
What was that plot called?
Guys, help me. I am confused about the exponential moving average. In the equation, is it (beta and beta squared) or (beta and (1 - beta))?
Beta and (1 - beta)
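A minimal sketch of an exponential moving average using beta and (1 - beta) as the coefficients (the usual convention; the function and variable names are my own):

```python
def ema(values, beta=0.9):
    """Exponentially weighted moving average: v_t = beta * v_(t-1) + (1 - beta) * g_t."""
    v = 0.0
    smoothed = []
    for g in values:
        v = beta * v + (1 - beta) * g  # old value weighted by beta, new by (1 - beta)
        smoothed.append(v)
    return smoothed
```

Larger beta gives more weight to the past, and hence a smoother curve.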
Continue sir,
I'm understanding all this theory....... This is awesome
Thank you, sir, for this free educational video; this help means a lot to us...
Keep continuing....
And I click the ads so that you can get money as a reward... 🙏
Sir, why are we multiplying the points by gamma?
So that compensation of the vivid strength can accelerate the weight pinnacle of structure multiplied by the t2
Thank you for the good explanation. I think "moving average" is not the correct term here; it would be better to say "weighted average".
Krish, can you write a book on Deep Learning? You are the best
Better than Andrew Ng. Thanks
I am also waiting for the practical implementation, but I know, sir, you are trying to give in-depth knowledge...
When you say noise, it would be clearer if you explained how we get the noise when we select 100 or 200 records for mini-batch gradient descent. In general, please don't skip the explanations of those key points.
Is noise introduced because of the random selection of samples from the whole data set? Or because the selected samples do not represent the relationship properly? Correct me if I am wrong.
How do the batches get created? What do we consider?
I believe batches are created under the hood with some kind of stratified sampling, or at least without changing any distribution.
Sir, do these 34 videos complete deep learning, or are you going to upload more videos?
More videos will come
@@krishnaik06 okk sir
You made a mess of it at the end, brother...
How do we exponentiate the gamma value over all previous partial derivatives when we are in the current loop calculating V(t-1)? Wouldn't this add a lot of work to the computation if we have even just 100 partial derivatives?
So in my head the pseudocode looks like:
v(t-1) = dL/dw_n + (for k = 1; k < num_iterations; k++: sum of gamma^k * dL/dw_(n-k))
So we would need to store a gradient vector of all the previous partial derivatives for each neuron, which probably means we have to do this with mini-batches; otherwise we would end up with vectors of size > 1000s.
Is this correct?
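The expansion described in the comment above can be checked numerically. A minimal sketch (gamma, learning_rate, and the gradient values are made up) showing that the recursive update v_t = gamma * v_(t-1) + learning_rate * g_t equals the expanded geometric sum, so only the previous v needs to be stored, not the whole gradient history:

```python
# Check that the recursive momentum update equals the expanded geometric sum
# (illustrative values; not from the video).
gamma, learning_rate = 0.9, 0.1
grads = [1.0, 2.0, 3.0, 4.0]

# Recursive form: only the previous v is carried between iterations.
v = 0.0
for g in grads:
    v = gamma * v + learning_rate * g

# Expanded form: v_n = sum over k of gamma^k * learning_rate * g_(n-k).
expanded = sum((gamma ** k) * learning_rate * g
               for k, g in enumerate(reversed(grads)))

assert abs(v - expanded) < 1e-12
```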
The concept of momentum is not very clear, though the formula etc. is understood! Why "momentum"?
Best tutorial. One small issue with your camera: unnecessary zooming in and out causes eye strain. Thank you for the wonderful lectures.
Another reason for using momentum is to jump out of a local minimum if we are not using batch normalisation. That is something not covered here.
the subscript notation in the formula at the end probably isn't written correctly
Better than Andrew Ng on this topic
Why is the value of gamma between 0 and 1?
It's like 0 to 100%; the "point something" decides what portion of the weight to be considered.
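One way to see the 0-to-1 restriction: a gradient from k steps ago ends up weighted by gamma^k, which only shrinks when gamma is between 0 and 1. A small illustrative sketch (the gamma value is made up):

```python
gamma = 0.9  # illustrative value between 0 and 1

# Weight given to a gradient from k steps ago is gamma ** k.
weights = [gamma ** k for k in range(5)]
# With 0 < gamma < 1 these decay geometrically, so older gradients count less;
# with gamma > 1 they would grow without bound instead.
```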
The last part of this video is a little difficult.
You didn't tell us what exactly gamma is.
Hi Krish. I am wondering whether you might be open to tutoring me in building and deploying ML models with Pytorch. Or, if you know anyone that might be interested. I have a background in basic Data science and basic Pytorch. Compensated of course :)
Good job. You got one part wrong though: 0.5^2 = 0.25, not 0.025.
You have made it complicated - mini-batch SGD or SGD?
Mini-batch SGD
I didn't understand anything in this; it is too high-level. Do I need to learn anything else before this video? I don't know where the formula gamma * V(t) comes from.
Try it at 1.25x speed
so much advertisement 😔😔😔
8 ads in a 13-minute video
After being impressed by 13 videos, I was unimpressed by this one, as here one can clearly see that you yourself are not clear in depth. No offence.
Thanks :)
Agree
Come on, guys. It was good enough to give more than a basic concept. Everyone is not perfect, but this is more than average. Thumbs up, KN.