Your saddle point animation took two seconds to illustrate why SGD might outperform vanilla GD. Amazing
2:22 and after is just magical. Thanks for the amazing video.
Didn't expect to see Dota gameplay lol. Very useful video btw
Awesome explanation, better than many books
Very nice animations, and well explained. But just to be a bit technical, isn't what you described called "mini-batch gradient descent"? Because for stochastic gradient descent, don't we just use one training example per iteration? 😅😅
Are you also coming here from Karpathy's video?
I love watching these videos when I just need a short refresher. Great content.
Finds better solutions? Isn't it just offering faster convergence?
Hello Mr. Bachir El Khadir,
I recently came across your channel and was truly impressed by your videos and your clear explanations. I've just started working with AI and am also using the Manim library (created by Grant Sanderson) to make animated explanations.
I would really appreciate any advice you could offer, and I'm also curious to learn more about how you create your videos.
If the outcome of the SGD step is random, do you think it could be done multiple times so we could choose the best step?
Absolutely, this can help sometimes.
If you want to try N times per step to see which point "is better", that would require running N forward passes, N backward passes, N parameter updates, and N more forward passes to see which one gave the best result.
Not only does computation increase N times, you would also need N copies of the model (which can mean a lot of GPU memory), or keep a temporary copy of the model and do the N copies one at a time, which can also mean a lot of extra time.
I'm not saying it's not "technically possible", but I doubt anyone would use that.
I don't know what @VisuallyExplained was talking about when saying *that* can help, like it's a common practice or something.
Am I wrong or am I missing something here?
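For what it's worth, the "best-of-N" idea being debated could be sketched as below (a hypothetical toy, not a standard training recipe; the dataset, learning rate, and batch size are made up for illustration). It also makes the cost argument concrete: each outer step does N gradient computations plus N extra full-loss evaluations.

```python
import numpy as np

# Hypothetical "best-of-N" SGD step: try N stochastic gradient steps,
# evaluate each candidate on the full loss, and keep the best one.
# Note the cost: N gradient computations + N loss evaluations per step.

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)

def loss(w):
    # full-dataset mean squared error (one "forward pass")
    return np.mean((X @ w - y) ** 2)

def stochastic_grad(w, batch):
    # gradient of the MSE on a random mini-batch
    Xb, yb = X[batch], y[batch]
    return 2 * Xb.T @ (Xb @ w - yb) / len(batch)

w = np.zeros(3)
lr, N = 0.1, 4
for step in range(50):
    candidates = []
    for _ in range(N):  # N forward/backward passes per outer step
        batch = rng.choice(len(X), size=8, replace=False)
        w_cand = w - lr * stochastic_grad(w, batch)
        candidates.append((loss(w_cand), w_cand))  # N extra evaluations
    w = min(candidates, key=lambda c: c[0])[1]     # keep the best candidate
```

Keeping N temporary copies of the parameters is trivial for a toy model like this, but for a large network each `w_cand` is a full copy of the weights, which is exactly the memory cost described above.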
TinyGPT uses Adam (an SGD variant) with a small batch size for pre-training warmup, and a large batch size for fine-tuning
These are very nice visualizations, and a great explanation of the fundamental idea of SGD. But I'm very skeptical of some of the intuitive-seeming explanations of why it's better than regular gradient descent. In particular, the saddle point example seems extremely constructed. It works with simple R²->R functions like the one we see, but even there only if the starting point (= the model weights) is placed perfectly on the line, the probability of which is basically 0. Given that model weights are usually initialized randomly, and we're in R^n space with n >> 2, I doubt that such cases ever happen in actual deep learning.
Secondly, of course you can argue that SGD, due to its noisiness, may better escape local minima. But 1) do local minima actually exist in these extremely high-dimensional spaces? If you have a billion dimensions, it's exceedingly unlikely that the derivative of all of them is 0 at the same time, and it may be almost impossible to run into them with a discrete approach. I think all these R²->R visualizations build some very strong yet incorrect intuitions about what high-dimensional gradient descent actually looks like. And 2) it could via the same process also _miss_ a _global_ minimum that regular GD would find (or avoid a local minimum but never make it to any point that's better than the local minimum GD would have found - so avoiding it was then a _bad_ thing). Noise is not inherently good. We can construct specific examples where it happens to help, but we could just as well find many examples where GD would win against SGD. In the end, the reality is probably that the noise in SGD _hurts_ less than the gained performance helps, meaning overall it's the better option.
It kind of makes sense: evaluating all your training samples to compute the gradient has diminishing returns. So the first, say, 10% of samples are much more important than the last 10%. But if you always used the same 10% of samples, you would of course lose a lot of information overall. And if you chose some systematic process for selecting them, you might get strange biases. So naturally you pick them randomly. And hence, SGD.
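The "fresh random subset each step" point can be sketched in a few lines (a toy least-squares problem with made-up data, not anything from the video): every step estimates the full gradient from a newly sampled 10% of the data, which avoids both the information loss of a fixed subset and the bias of a systematic selection rule.

```python
import numpy as np

# Toy SGD loop: each step uses a fresh random 10% of the samples to
# estimate the full gradient of a least-squares loss.

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
true_w = rng.normal(size=5)
y = X @ true_w  # noiseless linear data, so the optimum is exactly true_w

def grad_on_subset(w, idx):
    # unbiased estimate of the full MSE gradient from a subset of rows
    Xb, yb = X[idx], y[idx]
    return 2 * Xb.T @ (Xb @ w - yb) / len(idx)

w = np.zeros(5)
for step in range(300):
    idx = rng.choice(len(X), size=20, replace=False)  # fresh random 10%
    w -= 0.05 * grad_on_subset(w, idx)

final_mse = np.mean((X @ w - y) ** 2)
```

Because each subset is drawn uniformly at random, the subset gradient is an unbiased estimate of the full gradient, which is exactly why the random choice beats any fixed or systematic 10%.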
I enjoy slow and sloppy the most.
Came here to find this comment.
Thanks a lot, this great short video filled in that corner for me
Watching these videos is way more enjoyable than memes.
It's like being less restrictive keeps you from optimizing the wrong thing and getting stuck in the wrong valley (or hill, for evolution). Feels a lot like how I kept trying to optimize being a nice guy because there were some positive responses, and without some chaos I never would have seen another valley, being a bad boy, which has much less cost and better results
Great channel. Thanks!
Amazing explanation
Brilliantly Genius!
nice explanation
Such great work. May I ask which software you used for the animation?
You're awesome, subbed!
I'm trying to program my own neural network, but the training algorithms just wouldn't stick in my head. Thanks!
good one
Sorry, I don't get something.
What do you take randomly: random observations? A random set of features? Or a random number of weights (= a random number of neurons)?
Hello,
I don't understand the meaning of 2:38
(I can follow the formula itself).
Can you explain more?
It has something to do with momentum, I guess? Search "SGD with momentum" on YouTube. It is basically a math technique that smooths out the updates using a running average
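A minimal sketch of what "SGD with momentum" does (a toy quadratic objective with artificial gradient noise, just to show the running-average smoothing; the learning rate and momentum constant are illustrative):

```python
import numpy as np

# SGD with momentum: the update direction is an exponentially decaying
# running average of past stochastic gradients, which smooths out the
# noise of any individual gradient estimate.

rng = np.random.default_rng(2)

def noisy_grad(w):
    # gradient of f(w) = ||w||^2 / 2, corrupted by noise to mimic SGD
    return w + 0.5 * rng.normal(size=w.shape)

w = np.array([5.0, -3.0])
v = np.zeros_like(w)       # the running average ("velocity")
lr, beta = 0.1, 0.9
for _ in range(200):
    v = beta * v + (1 - beta) * noisy_grad(w)  # running average of gradients
    w -= lr * v

final_norm = np.linalg.norm(w)  # ends up far below the starting norm
```

Without the averaging, each step would jerk around with the raw noisy gradient; with it, the noise largely cancels while the consistent downhill component accumulates.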
Perfect
stochastic gradient descent doesn't take some constant number of terms, it takes one training example at random, then performs the feed forward and back propagation with that one training example.
Sure, that's how it is usually defined. But in practice, it's way more common to pick a random mini-batch of size > 1 for training.
@@VisuallyExplained Yes, then that would be called mini batch gradient descent, not stochastic gradient descent
@@bennicholl7643 In general optimization, SGD is defined in any situation where we have a random gradient; it doesn't even have to be a finite-sum problem. The restriction of the term "stochastic gradient descent" to batch-size-1 approximations of finite-sum problems is terminology specific to machine learning.
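One way to see the terminology point: both updates below are stochastic gradient methods in the general optimization sense (each uses an unbiased random estimate of the full gradient); ML usage just reserves "SGD" for the batch-size-1 case. The toy least-squares data, step counts, and learning rates here are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 4))
y = X @ np.array([2.0, -1.0, 0.5, 3.0])  # noiseless linear data

def stochastic_gd(batch_size, steps, lr):
    """Gradient descent using an unbiased random estimate of the full gradient."""
    w = np.zeros(4)
    for _ in range(steps):
        idx = rng.choice(len(X), size=batch_size, replace=False)
        Xb, yb = X[idx], y[idx]
        w -= lr * 2 * Xb.T @ (Xb @ w - yb) / batch_size
    return w

w_sgd = stochastic_gd(batch_size=1, steps=3000, lr=0.01)  # "SGD" in ML usage
w_mb = stochastic_gd(batch_size=32, steps=500, lr=0.05)   # "mini-batch GD"
```

The update rule is identical in both calls; only the batch size changes, which is why the general-optimization view treats them as the same algorithm.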
How do you use emoji in Manim?
Was that comment at the end what I thought it was 💀
You are insane with those animations, please tell me which software you use for doing that
Thanks Tariq! I use Blender3D for all of my 3D animations.
@@VisuallyExplained thank you 💜💜❤❤
Continued support, administrative process ongoing