The Unreasonable Effectiveness of Stochastic Gradient Descent (in 3 minutes)

  • Published 6. 09. 2024

Comments • 43

  • @dudelookslikealady12
    @dudelookslikealady12 2 years ago +77

    Your saddle point animation took two seconds to illustrate why SGD might outperform vanilla GD. Amazing

  • @hnbmm
    @hnbmm 8 months ago +2

    2:22 and after is just magical. Thanks for the amazing video.

  • @josepht4799
    @josepht4799 2 years ago +7

    Didn't expect to see Dota gameplay lol. Very useful video btw

  • @PeppeMarino
    @PeppeMarino 2 years ago +5

    Awesome explanation, better than many books

  • @anikdas567
    @anikdas567 3 months ago +3

    Very nice animations, and well explained. But just to be a bit technical, isn't what you described called "mini-batch gradient descent"? Because for stochastic gradient descent, don't we use just one training example per iteration (see the sketch below)?? 😅😅

    • @suyog8955
      @suyog8955 1 hour ago

      Are you also coming from Karpathy's video?
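
A minimal sketch of the distinction raised in this thread, assuming NumPy and a toy least-squares loss (both are illustrative choices, not anything from the video): with a batch of size 1 the update is textbook SGD, while with a larger random batch it is what is usually called mini-batch gradient descent, although "SGD" is commonly used for both.

```python
import numpy as np

# Toy least-squares problem (placeholder data, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 3.0, 0.0]) + 0.1 * rng.normal(size=1000)

def minibatch_grad(w, idx):
    """Gradient of the mean squared error over the examples indexed by idx."""
    Xb, yb = X[idx], y[idx]
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(idx)

def sgd(batch_size, steps=2000, lr=0.01):
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        # Sample a random subset of the training examples for this update.
        idx = rng.choice(len(X), size=batch_size, replace=False)
        w -= lr * minibatch_grad(w, idx)
    return w

w_sgd = sgd(batch_size=1)    # "stochastic gradient descent" in the strict sense
w_mini = sgd(batch_size=32)  # what is usually called mini-batch gradient descent
print(w_sgd, w_mini, sep="\n")
```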

  • @kaynkayn9870
    @kaynkayn9870 9 months ago

    I love watching these videos when I just need a short refresher. Great content.

  • @JuanCamiloAcostaArango
    @JuanCamiloAcostaArango 7 months ago +1

    Finds better solutions? Isn't it just offering faster convergence?

  • @anoojjilladwar203
    @anoojjilladwar203 2 days ago

    Hello Mr. Bachir El Khadir,
    I recently came across your channel and was truly impressed by your videos and your clear explanations. I've just started working with AI and am also using the Manim library (created by Grant Sanderson) to make animated explanations.
    I would really appreciate any advice you could offer, and I'm also curious to learn more about how you create your videos.

  • @seasong7655
    @seasong7655 2 years ago +9

    If the outcome of the SGD step is random, do you think it could be done multiple times so we could choose the best step?

    • @VisuallyExplained
      @VisuallyExplained 2 years ago +6

      Absolutely, this can help sometimes.

    • @cristian-bull
      @cristian-bull 9 months ago

      If you want to try N candidate steps and see which point is better, that requires running N forward passes, N backward passes, N parameter updates, and N more forward passes to see which one gave the best result (sketched below).
      Not only does the computation increase N times, you would also need N copies of the model (which can mean a lot of GPU memory), or keep a temporary copy of the model and do the N copies one at a time, which can also mean a lot of extra time.
      I'm not saying it's not technically possible, but I doubt anyone would use it.
      I don't know what @VisuallyExplained meant by saying *that* can help, as if it were common practice.
      Am I wrong, or am I missing something here?
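
For concreteness, a hedged sketch of the "try N candidate steps and keep the best" idea being debated here, on a toy least-squares problem with NumPy (nothing in the video prescribes this). With a small parameter vector the N candidates are cheap; the reply above is pointing out that for a real network each candidate would mean an extra set of parameters plus extra forward/backward passes.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([2.0, -1.0, 0.5])

def loss(w):
    return np.mean((X @ w - y) ** 2)

def minibatch_grad(w, batch_size=16):
    idx = rng.choice(len(X), size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    return 2.0 * Xb.T @ (Xb @ w - yb) / batch_size

def best_of_n_step(w, lr=0.05, n=4):
    # n gradient evaluations + n candidate parameter vectors + n loss
    # evaluations per update -- the cost the reply above is objecting to.
    candidates = [w - lr * minibatch_grad(w) for _ in range(n)]
    return min(candidates, key=loss)

w = np.zeros(3)
for _ in range(300):
    w = best_of_n_step(w)
print(loss(w))   # small: the toy problem is easy either way
```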

  • @jessielesbian6791
    @jessielesbian6791 9 months ago

    TinyGPT uses Adam (an SGD variant) with a small batch size for pre-training warmup and a large batch size for fine-tuning (see the sketch below)
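
A hedged PyTorch-style sketch of that small-batch-then-large-batch pattern; the model, data, learning rate, and batch sizes below are placeholders, not TinyGPT's actual configuration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder model and data; only the optimizer and batch-size pattern matter.
model = torch.nn.Linear(128, 128)
data = TensorDataset(torch.randn(4096, 128), torch.randn(4096, 128))
opt = torch.optim.Adam(model.parameters(), lr=3e-4)   # Adam: an SGD variant

def run_epoch(batch_size):
    loader = DataLoader(data, batch_size=batch_size, shuffle=True)
    for x, target in loader:
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(model(x), target)
        loss.backward()
        opt.step()

run_epoch(batch_size=8)     # small batches: noisier updates (warmup phase)
run_epoch(batch_size=256)   # large batches: smoother updates (fine-tuning phase)
```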

  • @MrKohlenstoff
    @MrKohlenstoff 6 months ago

    These are very nice visualizations, and a great explanation of the fundamental idea of SGD. But I'm very skeptical of some of the intuitive-seeming explanations of why it's better than regular gradient descent. In particular, the saddle-point example seems extremely constructed (a toy version of it is sketched below). It works for simple R²->R functions like the one we see, but even there only if the starting point (= the model weights) is placed perfectly on the line, the probability of which is basically 0. Given that model weights are usually initialized randomly, and that we're in R^n with n >> 2, I doubt such cases ever happen in actual deep learning.
    Secondly, of course you can argue that SGD, thanks to its noisiness, may be better at escaping local minima. But 1) do local minima actually exist in these extremely high-dimensional spaces? If you have a billion dimensions, it's exceedingly unlikely that the partial derivatives along all of them are 0 at the same time, and it may be almost impossible to run into such a point with a discrete method. I think all these R²->R visualizations build very strong yet incorrect intuitions about what high-dimensional gradient descent actually looks like. And 2) by the same mechanism SGD could also _miss_ a _global_ minimum that regular GD would find (or avoid a local minimum but never reach any point better than the local minimum GD would have found, so avoiding it was a _bad_ thing). Noise is not inherently good. We can construct specific examples where it happens to help, but we could just as well find many examples where GD beats SGD; in the end the reality is probably that the noise in SGD _hurts_ less than the gained performance helps, which makes it the better option overall.
    It kind of makes sense: evaluating all your training samples to compute the gradient has diminishing returns, so the first, say, 10% of samples matter much more than the last 10%. But if you always used the same 10% of samples, you would of course lose a lot of information overall, and if you chose them by some systematic process, you might introduce strange biases. So naturally you pick them randomly. And hence, SGD.
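
To make the saddle-point case being questioned here concrete, a toy numerical sketch with NumPy, assuming the simple saddle f(x, y) = x² − y² (not the exact surface from the video; the step size and noise scale are also illustrative). It shows both sides of the argument: started exactly on the line y = 0, plain gradient descent converges to the saddle and stays there while a small amount of gradient noise lets the iterate drift off and escape; started even slightly off that line, plain gradient descent escapes on its own.

```python
import numpy as np

# Toy saddle f(x, y) = x**2 - y**2; the saddle point is at the origin.
def grad_f(p):
    x, y = p
    return np.array([2.0 * x, -2.0 * y])

rng = np.random.default_rng(0)
lr, steps = 0.1, 100

p_gd = np.array([1.0, 0.0])        # exactly on the "stuck" line y = 0
p_noisy = np.array([1.0, 0.0])     # same start, but with gradient noise
p_offline = np.array([1.0, 1e-3])  # plain GD, started slightly off the line
for _ in range(steps):
    p_gd -= lr * grad_f(p_gd)
    p_noisy -= lr * (grad_f(p_noisy) + rng.normal(scale=0.01, size=2))
    p_offline -= lr * grad_f(p_offline)

print(p_gd)       # ~[0, 0]: converges to the saddle and stays there
print(p_noisy)    # |y| has grown large: the noise pushed it off the line
print(p_offline)  # |y| also grows: a tiny offset is enough for plain GD
```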

  • @NikolajKuntner
    @NikolajKuntner 2 years ago +15

    I enjoy slow and sloppy the most.

  • @evyats9127
    @evyats9127 2 years ago

    Thanks a lot, this great short video closed that gap for me

  • @trendish9456
    @trendish9456 6 months ago

    Watching these videos is way more enjoyable than memes.

  • @DG123z
    @DG123z 3 months ago

    It's like being less restrictive keeps you from optimizing the wrong thing and getting stuck in the wrong valley (or on the wrong hill, for evolution). Feels a lot like how I kept trying to optimize being a nice guy because there were some positive responses, and without some chaos I never would have seen the other valley of being a bad boy, which has much less cost and better results

  • @handlenull
    @handlenull 2 years ago +1

    Great channel. Thanks!

  • @sidhpandit5239
    @sidhpandit5239 1 year ago

    amazing explanation

  • @ashimov1970
    @ashimov1970 3 months ago

    Brilliantly Genius!

  • @xuanthanhnguyen6741
    @xuanthanhnguyen6741 1 year ago +1

    nice explanation

  • @sinasec
    @sinasec 1 year ago +1

    Such great work. May I ask which software you used for the animation?

  • @EdeYOlorDSZs
    @EdeYOlorDSZs 2 years ago

    You're awesome, subbed!

  • @Gapi505
    @Gapi505 1 year ago +1

    I'm trying to program my own neural network, but the training algorithms just wouldn't stick in my head. Thanks.

  • @chinokyou
    @chinokyou 2 years ago +1

    good one

  • @igorg4129
    @igorg4129 1 year ago

    Sorry, I do not get something.
    What exactly do you take randomly? Random observations? A random set of features? Or a random number of weights (= a random number of neurons)?

  • @chogy7875
    @chogy7875 1 year ago

    Hello,
    I don't understand the meaning of the part at 2:38
    (the formula itself I can follow).
    Can you explain more?

    • @radhikadesai7781
      @radhikadesai7781 1 year ago

      Has something to do with momentum, I guess? Search YouTube for "SGD with momentum". It's basically a technique that smooths out the updates using a running average of the gradients (sketched below)
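
A minimal sketch of that smoothing idea, assuming NumPy and a toy quadratic objective (both are illustrative choices): momentum keeps an exponentially decaying running average of past gradients and steps along that average instead of the raw noisy gradient.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_grad(w):
    # Stochastic gradient of the toy objective f(w) = ||w||^2.
    return 2.0 * w + rng.normal(scale=0.5, size=w.shape)

w = np.ones(3)
v = np.zeros(3)              # running average of gradients ("velocity")
lr, beta = 0.05, 0.9
for _ in range(1000):
    v = beta * v + (1.0 - beta) * noisy_grad(w)   # exponential moving average
    w -= lr * v                                   # step along the smoothed gradient
print(w)   # fluctuates near the minimizer at 0
```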

  • @fatihburakakcay5026
    @fatihburakakcay5026 2 years ago

    Perfect

  • @bennicholl7643
    @bennicholl7643 2 years ago +1

    Stochastic gradient descent doesn't take some constant number of terms; it takes one training example at random, then performs the forward pass and backpropagation with that one training example.

    • @VisuallyExplained
      @VisuallyExplained 2 years ago +1

      Sure, that's how it is usually defined. But in practice, it's way more common to pick a random mini-batch of size > 1 for training.

    • @bennicholl7643
      @bennicholl7643 2 years ago +8

      @@VisuallyExplained Yes, then that would be called mini batch gradient descent, not stochastic gradient descent

    • @nathanwycoff4627
      @nathanwycoff4627 1 year ago

      @@bennicholl7643 In general optimization, SGD is defined in any situation where we have a random gradient; it doesn't even have to be a finite-sum problem. Restricting the term "stochastic gradient descent" to batch-size-1 approximations of finite-sum problems is terminology specific to machine learning (see the sketch below).
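
A small sketch of that broader definition, assuming NumPy (the objective ||w||² and the noise model are placeholders): the only requirement is a stochastic oracle whose expectation is the true gradient, so here the randomness is plain additive noise rather than sampling from a finite training set.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_grad(w):
    return 2.0 * w                 # gradient of f(w) = ||w||^2

def stochastic_grad(w):
    # Unbiased oracle: E[g(w)] = true_grad(w), but no finite sum in sight.
    return true_grad(w) + rng.normal(size=w.shape)

w = np.ones(4)
for t in range(1, 5001):
    w -= (0.5 / t) * stochastic_grad(w)   # decaying steps tame the noise
print(w)   # close to the minimizer at 0
```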

  • @ianthehunter3532
    @ianthehunter3532 11 months ago

    How do you use emoji in Manim?

  • @yashmundada2483
    @yashmundada2483 8 days ago

    Was that comment at the end what I thought it was 💀

  • @tariq_dev3116
    @tariq_dev3116 2 years ago +1

    Those animations are insane. Please tell me which software you use to make them.

    • @VisuallyExplained
      @VisuallyExplained 2 years ago +4

      Thanks Tariq! I use Blender3D for all of my 3D animations.

    • @tariq_dev3116
      @tariq_dev3116 2 years ago

      @@VisuallyExplained thank you 💜💜❤❤

  • @tsunningwah3471
    @tsunningwah3471 1 month ago

    Continued support; administrative procedures still in progress

  • @tsunningwah3471
    @tsunningwah3471 1 month ago

    summizaitok

  • @tsunningwah3471
    @tsunningwah3471 1 month ago

    hdj😂