Tutorial 14- Stochastic Gradient Descent with Momentum

  • Uploaded 24. 07. 2024
  • In this post I’ll talk about a simple addition to the classic SGD algorithm, called momentum, which almost always works better and faster than plain Stochastic Gradient Descent. Momentum, or SGD with momentum, is a method that helps accelerate gradient vectors in the right directions, leading to faster convergence. It is one of the most popular optimization algorithms, and many state-of-the-art models are trained using it. Before jumping into the update equations of the algorithm, let’s look at some math that underlies the work of momentum. (A small code sketch follows at the end of this description.)
    Below are the various playlists created on ML, Data Science and Deep Learning. Please subscribe and support the channel. Happy Learning!
    Deep Learning Playlist: • Tutorial 1- Introducti...
    Data Science Projects playlist: • Generative Adversarial...
    NLP playlist: • Natural Language Proce...
    Statistics Playlist: • Population vs Sample i...
    Feature Engineering playlist: • Feature Engineering in...
    Computer Vision playlist: • OpenCV Installation | ...
    Data Science Interview Question playlist: • Complete Life Cycle of...
    You can buy my book on Finance with Machine Learning and Deep Learning from the URL below.
    Amazon URL: www.amazon.in/Hands-Python-Fi...
    🙏🙏🙏🙏🙏🙏🙏🙏
    YOU JUST NEED TO DO
    3 THINGS to support my channel
    LIKE
    SHARE
    &
    SUBSCRIBE
    TO MY YOUTUBE CHANNEL
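
    To make the idea concrete, here is a minimal Python sketch (illustrative values only, not from the video) comparing plain SGD with SGD plus momentum on a toy quadratic loss with noisy gradients:

        import random

        # Toy quadratic loss L(w) = w**2; Gaussian noise stands in for
        # the mini-batch noise discussed in the video.
        def noisy_grad(w):
            return 2 * w + random.gauss(0, 0.5)

        random.seed(0)
        lr, gamma = 0.1, 0.9      # illustrative hyperparameters
        w_sgd = 5.0               # weight trained with plain SGD
        w_mom, v = 5.0, 0.0       # weight and velocity for SGD with momentum

        for _ in range(200):
            w_sgd -= lr * noisy_grad(w_sgd)         # plain SGD update
            v = gamma * v + lr * noisy_grad(w_mom)  # momentum: EMA of gradient steps
            w_mom -= v                              # momentum update
        print(w_sgd, w_mom)  # both settle near the minimum at w = 0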

Comments • 113

  • @allenalex4861
    @allenalex4861 4 years ago +17

    You're doing really great. It's really good that you're focusing on the theory part and making it crystal clear for everyone.

  • @story_teller_1987
    @story_teller_1987 3 years ago +23

    Krish, you are really doing a great job. Even though I completed my MSc in Data Science and have some work experience, I am learning so much more from your tutorials. Lots of love from Saudi Arabia 😃

    • @webStream258
      @webStream258 1 year ago

      Madam, are there any job opportunities for data scientists or IT experts in Saudi Arabia?

  • @pravinkaushikbsp
    @pravinkaushikbsp 4 years ago +1

    Understanding the concepts is very important. When I started deep learning, I was not able to understand any terminology. After watching your tutorials, I am able to correlate everything. Thank you so much.

  • @shahrukhsharif9382
    @shahrukhsharif9382 3 years ago +21

    If you are confused at 11:30 by the SGD momentum equation, I will try to write out all the equations again.
    Weight update formula:
    w2 = w1 - (learning_rate * dl/dw1)
    Define a new variable g1 = dl/dw1
    and v1 = learning_rate * g1,
    so you can write the weight update formula again as
    w2 = w1 - v1
    Now come to the exponential moving average part:
    v1 = learning_rate * g1
    v2 = gamma * v1 + (learning_rate * g2)
    v_n = gamma * v_(n-1) + (learning_rate * g_n)
    So the final equation will be
    w_n = w_(n-1) - v_n
    Case 1: if the gamma value is 0, then
    w_n = w_(n-1) - learning_rate * g_n
    Case 2: if the gamma value is not 0, then
    w_n = w_(n-1) - v_n = w_(n-1) - (gamma * v_(n-1) + (learning_rate * g_n))
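
    A quick sanity check of exactly these equations in Python (the gradient values are made-up stand-ins for dl/dw, not numbers from the video):

        # Transcribing the equations above:
        #   v_n = gamma * v_(n-1) + learning_rate * g_n
        #   w_n = w_(n-1) - v_n
        learning_rate = 0.1
        gamma = 0.9
        gradients = [4.0, 3.5, 3.1]  # hypothetical g1, g2, g3

        w = 1.0  # w1
        v = 0.0  # no previous velocity yet
        for g in gradients:
            v = gamma * v + learning_rate * g  # exponential moving average step
            w = w - v                          # weight update
        print(w, v)

        # Case 1 check: with gamma = 0 each pass reduces to plain SGD,
        # w_n = w_(n-1) - learning_rate * g_n.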

  • @brindhasenthilkumar7871
    @brindhasenthilkumar7871 4 years ago +2

    Yes, we need to understand the basic concepts first and then apply them practically. Well-organized lecture topics. Great, keep going, sir.

  • @rishabhkumar-qs3jb
    @rishabhkumar-qs3jb 2 years ago

    Awesome videos :) I was always confused by the momentum concept in the optimizer; now I understand it crystal clear.

  • @sukumarroychowdhury4122
    @sukumarroychowdhury4122 3 years ago +2

    I just love you, Krish. No need to search the web; Krish Naik is there to clear up all the ideas. I like your approach of teaching theory first and then the practical part. Doing the practical without clearing up the theory is useless. Thank you.

  • @melodytune5619
    @melodytune5619 2 years ago

    Thank you for explaining SGD+Momentum. I have a much more intuitive understanding of the method now.

  • @alikalair7031
    @alikalair7031 4 years ago +1

    Awesome work, sir! Your sequence of topics is very well organized.

  • @sandipansarkar9211
    @sandipansarkar9211 4 years ago +1

    That was a great video. Hope my understanding continues till the end. Only need to know one thing: you don't have to remember everything, just know what is going on. That's all. Thanks.

  • @vgaurav3011
    @vgaurav3011 4 years ago +1

    Loved this different take on SGD

  • @abhishekkaushik9154
    @abhishekkaushik9154 5 years ago +2

    Awesome work, dude. Really like your videos. Keep going.

  • @raminehlopezyazdani6603

    You are amazing. Please do not stop making videos.

  • @swapnilkushwaha5772
    @swapnilkushwaha5772 4 months ago

    Utmost respect, sir. I was looking for this theory, and the way you explained it is just great.

  • @gabriellakorchmaros4165
    @gabriellakorchmaros4165 4 years ago +1

    So crystal clear!! Good job.

  • @ektamarwaha5941
    @ektamarwaha5941 4 years ago +1

    GREAT WORK BY YOU SIR!

  • @Matias-eh2pn
    @Matias-eh2pn 1 year ago

    Nice video. Very intuitive.

  • @fpl8648
    @fpl8648 2 years ago

    Thank you very much!!! Very helpful.

  • @gustavorocha6592
    @gustavorocha6592 4 years ago

    Thanks!! Great video

  • @aniruddhapal1997
    @aniruddhapal1997 2 years ago

    Excellent Lecture, Krish.....

  • @user-pj6su5lk6o
    @user-pj6su5lk6o 4 years ago +1

    Hi sir, I have a doubt: what is the failure mechanism in existing systems for deep learning network optimization?

  • @blackyogurt
    @blackyogurt 2 years ago

    Thank you, lovely guy!

  • @kushh7550
    @kushh7550 1 year ago

    Thanks a lot sir!

  • @sudeepnellur
    @sudeepnellur 4 years ago +1

    Do we drop the learning rate from the weight update equation?

  • @avikasliwal4283
    @avikasliwal4283 4 years ago

    Nicely Explained.

  • @abhishekkaushik9154
    @abhishekkaushik9154 5 years ago +11

    Continue your work. The theoretical concepts are very important; the practical implementations won't take much time.

  • @vishalgupta3175
    @vishalgupta3175 3 years ago

    Good sir, you are brilliant

  • @sagessevaldesdongmovoufo3101

    Very nice video, thanks.

  • @foxfinance9362
    @foxfinance9362 3 years ago

    What about Nesterov momentum? Is it similar to the moving average concept?

  • @kaviarasu.thuraiarasu89
    @kaviarasu.thuraiarasu89 3 years ago +1

    Hi Krish, do we want to find the global minimum for each batch of data?

  • @ranjithmadhavan
    @ranjithmadhavan 5 years ago +5

    Very well explained. I have not seen any other tutorial with so much emphasis on the foundations. Btw, your video goes out of focus at times; maybe your camera is set to autofocus.

  • @maYYidtS
    @maYYidtS 4 years ago +1

    excellent bro

  • @moudhafferbouallegui
    @moudhafferbouallegui 1 year ago

    neat video!

  • @chikhang5122
    @chikhang5122 4 years ago

    So helpful for me

  • @techspoc7442
    @techspoc7442 4 years ago +1

    Could you please explain the Adam optimizer?

  • @Dan-uf2vh
    @Dan-uf2vh 3 years ago

    I do not yet understand how the gamma connects when using a batch selection of rewards/outputs; there is no way to give an ordering, and all of them have the same gamma applied.

  • @mangaenfrancais934
    @mangaenfrancais934 4 years ago

    Good explanation

  • @dharmatejasingampalli6480

    Krish, in mini-batch SGD, are the weights updated after every batch? For example, considering 100 data points per batch, are the weights updated after those 100 data points? I was confused by this.

  • @rachitsonthalia6747
    @rachitsonthalia6747 3 years ago

    really helpful

  • @louerleseigneur4532
    @louerleseigneur4532 3 years ago

    Thanks Krish

  • @LiangyueLi
    @LiangyueLi 4 years ago

    well explained.

  • @sriramvaidyanathan5094
    @sriramvaidyanathan5094 8 months ago

    Any suggestions for books on a practical approach to deep learning, NLP, and generative AI? I am mainly looking for coding references for after I complete this playlist. Please suggest some easily understandable and practical books.

  • @robinredhu1995
    @robinredhu1995 4 years ago +1

    So can we say that the reduction in noise depends on the value of gamma? The smaller the value of gamma, the greater the reduction in noise?

  • @gowthamprabhu122
    @gowthamprabhu122 4 years ago +1

    When you say time interval, does it refer to an epoch with a mini-batch? Also, is the noise the noise created by varying loss values?

    • @benvelloor
      @benvelloor 4 years ago +2

      It represents each iteration in an epoch.
      For example, if the data set has 100 data points and we choose the mini-batch size to be 10, the number of iterations per epoch will be 100/10 = 10.
      Once 10 iterations are completed, one epoch is completed.
      Noise is the deflected path followed by the weights on their way to the global minimum. The noise is induced because the neurons are only exposed to a portion of the data set per iteration.
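
      In code, that bookkeeping looks like this (a tiny sketch of the numbers above):

          # 100 data points, mini-batch size 10 -> 10 weight updates per epoch.
          num_samples = 100
          batch_size = 10
          iterations_per_epoch = num_samples // batch_size  # 100 / 10 = 10
          for it in range(iterations_per_epoch):
              # indices of the data points seen in this iteration (one update)
              batch = range(it * batch_size, (it + 1) * batch_size)
          print(iterations_per_epoch, "iterations complete one epoch")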

  • @richatiwari9922
    @richatiwari9922 3 years ago

    These videos are really helpful for understanding the basics of deep learning. Keep going, sir. And where can I find the practical implementation? I'm doing a project on deep learning; where can I start my coding? If you can suggest something, that would be of great help.

  • @MohandAlbaz
    @MohandAlbaz 3 years ago

    At 10:30, why is the learning rate not multiplied by the term gamma * V_t?

  • @adityachandra2462
    @adityachandra2462 4 years ago +3

    SGD with momentum: in the last part at 11:30, it should be V(t+1), because we are predicting the future value, and hence V(t) will be the most recent known value.

    • @robinredhu1995
      @robinredhu1995 4 years ago +5

      No, it will be V(t), since V_t2 = gamma * V_t1 + (learning_rate * g_t2). Similarly, you can calculate V(t) as well.

    • @debarshibhattacharya9141
      @debarshibhattacharya9141 3 years ago

      @robinredhu1995 Yeah, it will be V(t).

  • @quranicscience9631
    @quranicscience9631 4 years ago

    very good

  • @ItachiUchiha-fo9zg
    @ItachiUchiha-fo9zg 3 years ago

    At 4:20, are the points supposed to be on the curve or not?

  • @martijnbos9873
    @martijnbos9873 4 years ago +1

    I thought that momentum was used to prevent converging to a local minimum. I wasn't aware that it also helped with noise reduction for SGD. It does both, right?

    • @kamrupexpress
      @kamrupexpress 3 years ago

      I don't believe any descent method in the non-convex scenario will take us easily to a global minimizer. Momentum only improves the speed of convergence. Steepest descent is very slow in general.

  • @strippingdatascience9168
    @strippingdatascience9168 4 years ago +1

    Not sure the oscillation would be along the surface; it should be on both sides of the minimum.

  • @darshmehta3476
    @darshmehta3476 4 years ago +20

    Shouldn't the last equation be V(t) instead of V(t-1)?

    • @gael2010
      @gael2010 4 years ago +2

      agreed

    • @morpheus6172
      @morpheus6172 4 years ago

      thought the same as well

    • @adityachandra2462
      @adityachandra2462 4 years ago +2

      It should be V(t+1), because we are predicting the future value, and hence V(t) will be the most recent known value.

    • @darshmehta3476
      @darshmehta3476 4 years ago

      @adityachandra2462 We are calculating V(t-1).

    • @gujjalapatiraju7435
      @gujjalapatiraju7435 3 years ago +2

      For the first data point he considers it as '1'; while he is calculating the momentum for the 2nd data point he uses V(t-1); if it is the 3rd data point it may be V(t-2), and so on.
      This is just my understanding; I haven't done any research.

  • @siddharthachatterjee9959
    @siddharthachatterjee9959 4 years ago +1

    The contour plot on the first screen: is it L(w) vs. w? Should it not be w1 vs. w2 (or b), with L(w) perpendicular to the screen?

  • @dharmendrabhojwani
    @dharmendrabhojwani 4 years ago +2

    At 10:54, I am not sure how the equation is formed.

  • @sunnysavita9071
    @sunnysavita9071 4 years ago

    Sir, please make a video on time series analysis and the ARIMA model.

  • @bibekgupta4134
    @bibekgupta4134 3 years ago

    What is that plot called?

  • @prafulbs7216
    @prafulbs7216 3 years ago +1

    Guys, help me. I am confused about the exponential moving average. In the equation, is it (beta + beta squared) or (beta + (beta - 1))?

  • @Artista1010
    @Artista1010 3 years ago +2

    Continue, sir,
    I'm understanding all this theory. This is awesome.
    Thank you, sir, for this free educational video; this help means a lot to us.
    Keep continuing.
    And I'm clicking ads so that you can get money as a reward 🙏

  • @saurabhmukherjee3801
    @saurabhmukherjee3801 4 years ago

    Sir, why are we multiplying the points by gamma?

    • @uniquetobin4real
      @uniquetobin4real 4 years ago

      So that compensation of the vivid strength can accelerate the weight pinnacle of structure multiplied by the t2

  • @rezarawassizadeh4601
    @rezarawassizadeh4601 3 years ago

    Thank you for the good explanation. I think the moving average is not the correct term here; it is better to say weighted average.

  • @eliashossain4327
    @eliashossain4327 1 year ago

    Krish, can you write a book on Deep Learning? You are the best

  • @sametozenc
    @sametozenc 10 months ago

    Better than Andrew Ng. Thanks.

  • @gopalakrishna9510
    @gopalakrishna9510 4 years ago +1

    I am also waiting for the practical implementation, but I know, sir, that you are trying to give in-depth knowledge.

  • @sml9360
    @sml9360 3 years ago +1

    When you say noise, it would be clearer if you explained how we get the noise when we select 100 or 200 records for mini-batch gradient descent. In general, please don't skip the explanations of those key points.

    • @sml9360
      @sml9360 3 years ago

      Is the noise introduced because of the random selection of samples from the whole data set, or because the selected samples do not represent the relationship properly? Correct me if I am wrong.

  • @HarshPatel-iy5qe
    @HarshPatel-iy5qe 9 months ago

    How do batches get created? What do we consider?

    • @HarshPatel-iy5qe
      @HarshPatel-iy5qe 2 months ago

      I believe batches are created under the hood with some kind of stratified sampling, or without changing any kind of distribution.

  • @sumitkumarsah8782
    @sumitkumarsah8782 4 years ago +1

    Sir, do these 34 videos complete deep learning, or are you going to upload more videos?

  • @ashkraze
    @ashkraze 1 year ago

    You created chaos at the end, brother..

  • @doyugen465
    @doyugen465 3 years ago

    How do we exponentiate the gamma value over all the previous partial derivatives when we are in the current loop calculating V(t-1)? Would this not add a lot of work to the computation if we have even just 100 partial derivatives?
    So in my head the pseudocode looks like:
    v_(t-1) = dl/dw_n + (for (k = 1; k < num_iterations; k++) sum of: gamma^k * dl/dw_(n-k))
    So we would need to store a gradient vector of all the previous partial derivatives for each neuron, which probably means we have to do this with mini-batches; otherwise we would end up with vectors of size > 1000s.
    Is this correct?
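
    (For comparison, a sketch with hypothetical values, grounded in the update equations quoted earlier in the comments: the summation above telescopes into a single running velocity, so no history of partial derivatives needs to be stored.)

        # The expanded sum lr*g_n + gamma*lr*g_(n-1) + gamma^2*lr*g_(n-2) + ...
        # collapses into one running velocity per weight; only the previous v is kept.
        gamma = 0.9
        learning_rate = 0.01
        gradients = [0.5, 0.4, 0.3]  # hypothetical stand-ins for dl/dw per iteration

        w, v = 1.0, 0.0
        for g in gradients:
            v = gamma * v + learning_rate * g  # no gradient history needed
            w = w - v
        print(w, v)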

  • @nitayg1326
    @nitayg1326 4 years ago

    The concept of momentum is not very clear, though the formula etc. is understood! Why "momentum"?

  • @bharathamma7279
    @bharathamma7279 4 years ago

    Best tutorial. Small issue with your camera: unnecessary zooming in and out is causing eye strain. Thank you for the wonderful lectures.

  • @anandhasrivi
    @anandhasrivi 4 years ago

    Another reason for using momentum is to jump out of a local minimum if we are not using batch normalisation. This is something not covered here.

  • @wilsvenleong96
    @wilsvenleong96 2 years ago

    The subscript notation in the formula at the end probably isn't written correctly.

  • @user-lz2kz5kc4t
    @user-lz2kz5kc4t 2 years ago

    Better than Andrew Ng on this topic.

  • @Adinasa2
    @Adinasa2 4 years ago

    Why is the value of gamma between 0 and 1?

    • @sudeepnellur
      @sudeepnellur 4 years ago

      It's like 0 to 100%; the 'point something' value decides what portion of the weight is considered.

  • @quranicscience9631
    @quranicscience9631 4 years ago

    The last part of this video is a little difficult.

  • @nikhil7129
    @nikhil7129 3 years ago

    You didn't tell us what exactly gamma is.

  • @seanmcgowan9154
    @seanmcgowan9154 2 years ago

    Hi Krish. I am wondering whether you might be open to tutoring me in building and deploying ML models with Pytorch. Or, if you know anyone that might be interested. I have a background in basic Data science and basic Pytorch. Compensated of course :)

  • @yessinekhanfir4157
    @yessinekhanfir4157 1 year ago

    Good job. You got one part wrong though: 0.5^2 = 0.025, not 0.25.

  • @ahasanhabibsajeeb1979
    @ahasanhabibsajeeb1979 3 years ago

    You have made it complicated: mini-batch SGD or SGD?

  • @karthickd537
    @karthickd537 2 years ago

    I didn't understand anything in this; it is too high-level. Do I need to learn anything else before this video? I don't know where the gamma * V(t) formula comes from.

  • @aksadhamirani7868
    @aksadhamirani7868 3 years ago

    Try it at 1.25x speed.

  • @suvratshukla5943
    @suvratshukla5943 1 year ago

    So many advertisements 😔😔😔

  • @k_anu7
    @k_anu7 4 years ago

    After being impressed by 13 videos, I was unimpressed by this one, as here one can clearly see that you yourself are not clear in depth. No offence.

    • @krishnaik06
      @krishnaik06 4 years ago +1

      Thanks :)

    • @vijaypatneedi
      @vijaypatneedi 4 years ago

      Agree

    • @latifbhanger
      @latifbhanger 4 years ago +1

      Come on, guys. It was good enough to give more than a basic concept. No one is perfect, but this is more than average. Thumbs up. KN.