Tutorial 14- Stochastic Gradient Descent with Momentum
- Added 24 Jul 2024
- In this post I'll talk about a simple addition to the classic SGD algorithm, called momentum, which almost always works better and faster than plain Stochastic Gradient Descent. Momentum, or SGD with momentum, is a method that helps accelerate the gradient vectors in the right directions, leading to faster convergence. It is one of the most popular optimization algorithms, and many state-of-the-art models are trained using it. Before jumping to the update equations of the algorithm, let's look at some of the math that underlies how momentum works.
Below are the various playlists created on ML, Data Science and Deep Learning. Please subscribe and support the channel. Happy learning!
Deep Learning Playlist: • Tutorial 1- Introducti...
Data Science Projects playlist: • Generative Adversarial...
NLP playlist: • Natural Language Proce...
Statistics Playlist: • Population vs Sample i...
Feature Engineering playlist: • Feature Engineering in...
Computer Vision playlist: • OpenCV Installation | ...
Data Science Interview Question playlist: • Complete Life Cycle of...
You can buy my book on Finance with Machine Learning and Deep Learning from the URL below
amazon url: www.amazon.in/Hands-Python-Fi...
🙏🙏🙏🙏🙏🙏🙏🙏
YOU JUST NEED TO DO
3 THINGS to support my channel
LIKE
SHARE
&
SUBSCRIBE
TO MY YOUTUBE CHANNEL
You're doing really great. It's really good that you're focusing on the theory part and making it crystal clear for everyone.
Krish, you are doing a really great job. Even though I completed my MSc. in Data Science and have some work experience, I am learning so much more from your tutorials. Lots of love from Saudi Arabia 😃
Madam, are there any job opportunities for data scientists or IT experts in Saudi Arabia?
Understanding the concept is very important. When I started deep learning, I was not able to understand any terminology. After watching your tutorials, I am able to correlate everything. Thank you so much.
If you are confused at 11:30 by the SGD momentum equation, I will try to write out all the equations again.
Weight update formula:
w2 = w1 - (learning_rate * dL/dw1)
Define a new variable g1 = dL/dw1
and v1 = learning_rate * g1,
so you can write the weight update formula again as
w2 = w1 - v1
Now come to the exponential moving average part:
v1 = learning_rate * g1
v2 = gamma * v1 + (learning_rate * g2)
v_n = gamma * v_(n-1) + (learning_rate * g_n)
So the final equation will be
w_n = w_(n-1) - v_n
Case 1: if gamma is 0, then
w_n = w_(n-1) - learning_rate * g_n
Case 2: if gamma is not 0, then
w_n = w_(n-1) - v_n = w_(n-1) - (gamma * v_(n-1) + (learning_rate * g_n))
Thanks!!!
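The update equations in the comment above can be sketched in plain Python (a minimal illustration; the learning_rate and gamma values and the toy loss are made up):

```python
# A minimal plain-Python sketch of the SGD-with-momentum equations above
# (learning_rate and gamma values are illustrative, not from the video).

def sgd_momentum_step(w, v, grad, learning_rate=0.01, gamma=0.9):
    """One update: v_n = gamma * v_(n-1) + learning_rate * g_n; w_n = w_(n-1) - v_n."""
    v = gamma * v + learning_rate * grad
    w = w - v
    return w, v

# Toy problem: minimise L(w) = w^2, whose gradient is dL/dw = 2 * w.
w, v = 5.0, 0.0
for _ in range(100):
    w, v = sgd_momentum_step(w, v, grad=2 * w)
```

With gamma = 0 this reduces to plain SGD (case 1 above); with gamma > 0 each step carries a decaying memory of earlier gradients (case 2).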
If w2 = w1 - v1, then shouldn't it be w_n = w_(n-1) - v_(n-1)?
Great work ! Thanks
@@bijayadhikari3904 right no
So basically, optimizers do the weight calculation?
Yes, we need to understand the basic concepts and then we can apply them practically. Well-organized lecture topics. Great, keep going, sir.
Awesome videos :) I was always confused by the momentum concept in the optimizer; now I understand it crystal clear.
I just love you, Krish. No need to search the Web, just Krish Naik is there to clear all the ideas. I like your approach of teaching theory first and then practical. Doing practical without clearing theory is useless. Thank you.
Thank you for explaining SGD+Momentum. I have a much more intuitive understanding of the method now.
Awesome work, sir! Your sequence of topics is very well organized.
That was a great video. Hope my understanding continues till the end. Only need to know one thing: you don't have to remember all the things, just know what is going on. That's all. Thanks
Loved this different take on SGD
Awesome work dude. Really Like your videos..keep going
You are amazing. Please do not stop making videos.
Utmost respect, sir... I was looking for this theory, and the way you explained it is just great.
So crystal clear!! Good job
GREAT WORK BY YOU SIR!
Nice video. Very intuitive.
Thank you very much!!! very helpful
Thanks!! Great video
Excellent Lecture, Krish.....
Hi sir, I have a doubt: what is the failure mechanism in existing systems for deep learning network optimization?
Thank you lovely guy !
Thanks a lot sir!
Do we drop the learning rate from the weight update equation?
Nicely Explained.
Continue your work. The theoretical concepts are very important; the practical implementations won't take much time.
Good sir, you are brilliant
very nice video thanks
What about Nesterov momentum? Is it similar to the moving average concept?
Hi Krish, do we want to find the global minimum for each batch of data?
Very well explained. I have not seen any other tutorial with so much emphasis on the foundations. By the way, your video goes out of focus at times; maybe your camera is set to autofocus.
excellent bro
neat video!
So helpful for me
Could you please explain about Adam optimizer ?????
I do not yet understand how the gamma connects when using a batch selection of rewards/outputs; there is no way to give an order, and all of them have the same gamma applied.
Good explanation
Krish, in mini-batch SGD, is the weight updated after every batch? For example, considering 100 data points, will the weight be updated after those 100 data points? I was confused about this.
really helpful
Thanks Krish
well explained.
Any suggestions for books with a practical approach to deep learning, NLP, and generative AI? I am mainly looking for coding references for after I complete this playlist. Please suggest some easily understandable and practical books.
So can we say that the reduction in noise depends on the value of gamma? The lesser the value of gamma, the more the reduction in noise?
When you say time interval, does it refer to an epoch with a mini-batch? Also, is the noise the noise created by varying loss values?
It represents each iteration in an epoch.
For example, if the data set has 100 data points and we choose the mini-batch size to be 10, the number of iterations per epoch will be 100/10 = 10.
Once 10 iterations are completed, one epoch gets completed.
Noise refers to the deflected paths followed by the weights on their way to the global minimum. The noise is induced because the neurons are only exposed to a portion of the data set per iteration.
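The iteration count described in the reply above can be written as a tiny sketch (the sample and batch counts are the hypothetical numbers from that example):

```python
# Hypothetical numbers from the comment above: 100 data points, batch size 10.
n_samples, batch_size = 100, 10
iterations_per_epoch = n_samples // batch_size  # one epoch = 10 iterations
```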
These videos are really helpful for understanding the basics of deep learning. Keep going, sir. Where can I find the practical implementation? I'm doing a project on deep learning; where can I start my coding? If you can suggest something, that would be of great help.
At 10:30, why is the learning rate not multiplied by the term \gamma V_t?
SGD with momentum: in the last part at 11:30, it should be V(t+1), because we are predicting a future value, and hence V(t) will be the most recent known value.
No, it will be V(t), since V(t2) = gamma * V(t1) + learning_rate * g(t2). Similarly you can calculate for V(t) as well.
@@robinredhu1995 yeah, it will be V(t)
very good
At 4:20, are the points supposed to be on the curve or not?
I thought that momentum was used to prevent converging to a local minimum. I wasn't aware that it also helped with noise reduction for SGD. It does both, right?
I don't believe any descent method in the non convex scenario will take us easily to a global minimizer. Momentum only improves the speed of convergence. Steepest descent is very slow in general.
Not sure the oscillation would be along the surface; it should be on both sides of the minimum.
Shouldn't the last equation be V(t) instead of V(t-1)?
agreed
thought the same as well
It should be V(t+1), because we are predicting a future value, and hence V(t) will be the most recent known value.
@@adityachandra2462 We are calculating V(t-1)
For the first data point he considers it as '1'; while calculating the momentum for the 2nd data point he uses V(t-1); if it were the 3rd data point it would be V(t-2), and so on.
This is just my understanding; I haven't done any research.
The contour plot on the first screen: is it L(w) vs w? Should it not be w1 vs w2 (or b), with L(w) perpendicular to the screen?
Yes you are right .
At 10:54... not sure how the equation is formed...
Sir, please make a video on time series analysis and the ARIMA model.
What was that plot called?
Guys, help me. I am confused about the exponential moving average. In the equation, is it (beta and beta squared) or (beta and (1 - beta))?
Beta and (1 - beta)
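A minimal sketch of an exponential moving average using beta and (1 - beta) as the coefficients (the usual convention; the function and variable names are my own):

```python
def ema(values, beta=0.9):
    """Exponentially weighted moving average: v_t = beta * v_(t-1) + (1 - beta) * g_t."""
    v = 0.0
    smoothed = []
    for g in values:
        v = beta * v + (1 - beta) * g  # old value weighted by beta, new by (1 - beta)
        smoothed.append(v)
    return smoothed
```

Larger beta gives more weight to the past, and hence a smoother curve.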
Continue sir,
I'm understanding all this theory....... This is awesome
Thank you, sir, for this free educational video; this help means a lot to us...
Keep continuing....
And I click the ads so that you can get money as a reward... 🙏
Sir, why are we multiplying the points by gamma?
So that compensation of the vivid strength can accelerate the weight pinnacle of structure multiplied by the t2
Thank you for the good explanation. I think "moving average" is not the correct term here; it would be better to say "weighted average".
Krish, can you write a book on Deep Learning? You are the best
Better than Andrew Ng. Thanks
I am also waiting for the practical implementation, but I know, sir, you are trying to give in-depth knowledge...
When you say noise, it would be clearer if you explained how we get the noise when we select 100 or 200 records for mini-batch gradient descent. In general, please don't skip the explanations of those key points.
Is noise introduced because of the random selection of samples from the whole data set? Or because the selected samples do not represent the relationship properly? Correct me if I am wrong.
How do the batches get created? What do we consider?
I believe batches are created under the hood with some kind of stratified sampling, or at least without changing any distribution.
Sir, do these 34 videos complete deep learning, or are you going to upload more videos?
More videos will come
@@krishnaik06 okk sir
You made a mess of it at the end, brother...
How do we exponentiate the gamma value over all previous partial derivatives when we are in the current loop calculating V(t-1)? Wouldn't this add a lot of work to the computation if we have even just 100 partial derivatives?
So in my head the pseudocode looks like:
v(t-1) = dL/dw_n + (for k = 1; k < num_iterations; k++: sum of gamma^k * dL/dw_(n-k))
So we would need to store a gradient vector of all the previous partial derivatives for each neuron, which probably means we have to do this with mini-batches; otherwise we would end up with vectors of size > 1000s.
Is this correct?
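The expansion described in the comment above can be checked numerically. A minimal sketch (gamma, learning_rate, and the gradient values are made up) showing that the recursive update v_t = gamma * v_(t-1) + learning_rate * g_t equals the expanded geometric sum, so only the previous v needs to be stored, not the whole gradient history:

```python
# Check that the recursive momentum update equals the expanded geometric sum
# (illustrative values; not from the video).
gamma, learning_rate = 0.9, 0.1
grads = [1.0, 2.0, 3.0, 4.0]

# Recursive form: only the previous v is carried between iterations.
v = 0.0
for g in grads:
    v = gamma * v + learning_rate * g

# Expanded form: v_n = sum over k of gamma^k * learning_rate * g_(n-k).
expanded = sum((gamma ** k) * learning_rate * g
               for k, g in enumerate(reversed(grads)))

assert abs(v - expanded) < 1e-12
```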
The concept of momentum is not very clear, though the formula etc. is understood! Why "momentum"?
Best tutorial. One small issue with your camera: unnecessary zooming in and out causes eye strain. Thank you for the wonderful lectures.
Another reason for using momentum is to jump out of a local minimum if we are not using batch normalisation. That is something not covered here.
the subscript notation in the formula at the end probably isn't written correctly
Better than Andrew Ng on this topic
Why is the value of gamma between 0 and 1?
It's like 0 to 100%; the "point something" decides what portion of the weight to be considered.
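One way to see the 0-to-1 restriction: a gradient from k steps ago ends up weighted by gamma^k, which only shrinks when gamma is between 0 and 1. A small illustrative sketch (the gamma value is made up):

```python
gamma = 0.9  # illustrative value between 0 and 1

# Weight given to a gradient from k steps ago is gamma ** k.
weights = [gamma ** k for k in range(5)]
# With 0 < gamma < 1 these decay geometrically, so older gradients count less;
# with gamma > 1 they would grow without bound instead.
```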
The last part of this video is a little difficult.
You didn't tell us what exactly gamma is.
Hi Krish. I am wondering whether you might be open to tutoring me in building and deploying ML models with Pytorch. Or, if you know anyone that might be interested. I have a background in basic Data science and basic Pytorch. Compensated of course :)
Good job. You got one part wrong though: 0.5^2 = 0.25, not 0.025.
You have made it complicated - mini-batch SGD or SGD?
Mini-batch SGD
I didn't understand anything in this; it is too high-level. Do I need to learn anything else before this video? I don't know where the formula gamma * V(t) comes from.
Try it at 1.25x speed
so much advertisement 😔😔😔
8 ads in a 13-minute video
After being impressed by 13 videos, I was unimpressed by this one, as here one can clearly see that you yourself are not clear in depth. No offence.
Thanks :)
Agree
Come on, guys. It was good enough to give more than a basic concept. Everyone is not perfect, but this is more than average. Thumbs up, KN.