Your saddle point animation took two seconds to illustrate why SGD might outperform vanilla GD. Amazing
2:22 and after is just magical. Thanks for the amazing video.
Didn't expect to see Dota gameplay lol. Very useful video btw
Awesome explanation, better than many books
Very nice animations, and well explained. But just to be a bit technical, isn't what you described called "mini-batch gradient descent"? Because for stochastic gradient descent, don't we just use one training example per iteration? 😅😅
Are you also coming here from Karpathy's video?
I love watching these videos when I just need a short refresher. Great content.
Finds better solutions? Isn't it just offering faster convergence?
Hello Mr. Bachir El Khadir,
I recently came across your channel and was truly impressed by your videos and your clear explanations. I've just started working with AI and am also using the Manim library (created by Grant Sanderson) to make animated explanations.
I would really appreciate any advice you could offer, and I'm also curious to learn more about how you create your videos.
If the outcome of the SGD step is random, do you think it could be done multiple times so we could choose the best step?
Absolutely, this can help sometimes.
If you want to try N times per step to see which point "is better", that would require running N forward passes, N backward passes, N parameter updates, and N more forward passes to see which one gave the best result.
Not only does computation increase N times, you would also need N copies of the model (which can mean a lot of GPU memory), or keep a temporary copy of the model and do the N copies one at a time, which can also mean a lot of extra time.
I'm not saying it's not "technically possible", but I doubt anyone would use that.
I don't know what @VisuallyExplained was talking about when saying *that* can help, like it's a common practice or something.
Am I wrong or am I missing something here?
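For what it's worth, the "best-of-N" idea being debated could be sketched as below (a hypothetical toy, not a standard training recipe; the dataset, learning rate, and batch size are made up for illustration). It also makes the cost argument concrete: each outer step does N gradient computations plus N extra full-loss evaluations.

```python
import numpy as np

# Hypothetical "best-of-N" SGD step: try N stochastic gradient steps,
# evaluate each candidate on the full loss, and keep the best one.
# Note the cost: N gradient computations + N loss evaluations per step.

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)

def loss(w):
    # full-dataset mean squared error (one "forward pass")
    return np.mean((X @ w - y) ** 2)

def stochastic_grad(w, batch):
    # gradient of the MSE on a random mini-batch
    Xb, yb = X[batch], y[batch]
    return 2 * Xb.T @ (Xb @ w - yb) / len(batch)

w = np.zeros(3)
lr, N = 0.1, 4
for step in range(50):
    candidates = []
    for _ in range(N):  # N forward/backward passes per outer step
        batch = rng.choice(len(X), size=8, replace=False)
        w_cand = w - lr * stochastic_grad(w, batch)
        candidates.append((loss(w_cand), w_cand))  # N extra evaluations
    w = min(candidates, key=lambda c: c[0])[1]     # keep the best candidate
```

Keeping N temporary copies of the parameters is trivial for a toy model like this, but for a large network each `w_cand` is a full copy of the weights, which is exactly the memory cost described above.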
TinyGPT uses Adam (an SGD variant) with a small batch size for pre-training warmup, and a large batch size for fine-tuning
These are very nice visualizations, and a great explanation of the fundamental idea of SGD. But I'm very skeptical of some of the intuitive-seeming explanations of why it's better than regular gradient descent. In particular, the saddle point example seems extremely constructed. It works with simple R²->R functions like the one we see, but even there only if the starting point (= the model weights) is placed perfectly on the line, the probability of which is basically 0. Given that model weights are usually initialized randomly, and we're in R^n space with n >> 2, I doubt that such cases ever happen in actual deep learning.
Secondly, of course you can argue that SGD, due to its noisiness, may better escape local minima. But 1) do local minima actually exist in these extremely high-dimensional spaces? If you have a billion dimensions, it's exceedingly unlikely that the derivative of all of them is 0 at the same time, and it may be almost impossible to run into them with a discrete approach. I think all these R²->R visualizations build some very strong yet incorrect intuitions about what high-dimensional gradient descent actually looks like. And 2) it could via the same process also _miss_ a _global_ minimum that regular GD would find (or avoid a local minimum but never make it to any point that's better than the local minimum GD would have found - so avoiding it was then a _bad_ thing). Noise is not inherently good. We can construct specific examples where it happens to help, but we could just as well find many examples where GD would win against SGD. In the end, the reality is probably that the noise in SGD _hurts_ less than the gained performance helps, meaning overall it's the better option.
It kind of makes sense: evaluating all your training samples to compute the gradient has diminishing returns. So the first, say, 10% of samples are much more important than the last 10%. But if you always used the same 10% of samples, you would of course lose a lot of information overall. And if you chose some systematic process for selecting them, you might get strange biases. So naturally you pick them randomly. And hence, SGD.
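The "fresh random subset each step" point can be sketched in a few lines (a toy least-squares problem with made-up data, not anything from the video): every step estimates the full gradient from a newly sampled 10% of the data, which avoids both the information loss of a fixed subset and the bias of a systematic selection rule.

```python
import numpy as np

# Toy SGD loop: each step uses a fresh random 10% of the samples to
# estimate the full gradient of a least-squares loss.

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
true_w = rng.normal(size=5)
y = X @ true_w  # noiseless linear data, so the optimum is exactly true_w

def grad_on_subset(w, idx):
    # unbiased estimate of the full MSE gradient from a subset of rows
    Xb, yb = X[idx], y[idx]
    return 2 * Xb.T @ (Xb @ w - yb) / len(idx)

w = np.zeros(5)
for step in range(300):
    idx = rng.choice(len(X), size=20, replace=False)  # fresh random 10%
    w -= 0.05 * grad_on_subset(w, idx)

final_mse = np.mean((X @ w - y) ** 2)
```

Because each subset is drawn uniformly at random, the subset gradient is an unbiased estimate of the full gradient, which is exactly why the random choice beats any fixed or systematic 10%.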
I enjoy slow and sloppy the most.
Came here to find this comment.
Thanks a lot, this great short video filled in that corner for me
Watching these videos is way more enjoyable than memes.
It's like being less restrictive keeps you from optimizing the wrong thing and getting stuck in the wrong valley (or hill, for evolution). Feels a lot like how I kept trying to optimize being a nice guy because there were some positive responses, and without some chaos I never would have seen another valley, being a bad boy, which has much less cost and better results
Great channel. Thanks!
Amazing explanation
Brilliantly Genius!
nice explanation
Such great work. May I ask which software you used for the animation?
You're awesome, subbed!
I'm trying to program my own neural network, but the training algorithms just wouldn't stick in my head. Thanks!
good one
Sorry, I don't get something.
What do you take randomly: random observations? A random set of features? Or a random number of weights (= a random number of neurons)?
Hello,
I don't understand the meaning of 2:38
(I can follow the formula itself).
Can you explain more?
It has something to do with momentum, I guess? Search "SGD with momentum" on YouTube. It is basically a math technique that smooths out the updates using a running average
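A minimal sketch of what "SGD with momentum" does (a toy quadratic objective with artificial gradient noise, just to show the running-average smoothing; the learning rate and momentum constant are illustrative):

```python
import numpy as np

# SGD with momentum: the update direction is an exponentially decaying
# running average of past stochastic gradients, which smooths out the
# noise of any individual gradient estimate.

rng = np.random.default_rng(2)

def noisy_grad(w):
    # gradient of f(w) = ||w||^2 / 2, corrupted by noise to mimic SGD
    return w + 0.5 * rng.normal(size=w.shape)

w = np.array([5.0, -3.0])
v = np.zeros_like(w)       # the running average ("velocity")
lr, beta = 0.1, 0.9
for _ in range(200):
    v = beta * v + (1 - beta) * noisy_grad(w)  # running average of gradients
    w -= lr * v

final_norm = np.linalg.norm(w)  # ends up far below the starting norm
```

Without the averaging, each step would jerk around with the raw noisy gradient; with it, the noise largely cancels while the consistent downhill component accumulates.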
Perfect
stochastic gradient descent doesn't take some constant number of terms, it takes one training example at random, then performs the feed forward and back propagation with that one training example.
Sure, that's how it is usually defined. But in practice, it's way more common to pick a random mini-batch of size > 1 for training.
@@VisuallyExplained Yes, then that would be called mini batch gradient descent, not stochastic gradient descent
@@bennicholl7643 In general optimization, SGD is defined in any situation where we have a random gradient; it doesn't even have to be a finite-sum problem. The restriction of the term "stochastic gradient descent" to batch-size-1 approximations of finite-sum problems is terminology specific to machine learning.
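One way to see the terminology point: both updates below are stochastic gradient methods in the general optimization sense (each uses an unbiased random estimate of the full gradient); ML usage just reserves "SGD" for the batch-size-1 case. The toy least-squares data, step counts, and learning rates here are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 4))
y = X @ np.array([2.0, -1.0, 0.5, 3.0])  # noiseless linear data

def stochastic_gd(batch_size, steps, lr):
    """Gradient descent using an unbiased random estimate of the full gradient."""
    w = np.zeros(4)
    for _ in range(steps):
        idx = rng.choice(len(X), size=batch_size, replace=False)
        Xb, yb = X[idx], y[idx]
        w -= lr * 2 * Xb.T @ (Xb @ w - yb) / batch_size
    return w

w_sgd = stochastic_gd(batch_size=1, steps=3000, lr=0.01)  # "SGD" in ML usage
w_mb = stochastic_gd(batch_size=32, steps=500, lr=0.05)   # "mini-batch GD"
```

The update rule is identical in both calls; only the batch size changes, which is why the general-optimization view treats them as the same algorithm.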
How do you use emoji in Manim?
Was that comment at the end what I thought it was 💀
You are insane with those animations, please tell me which software you use for doing that
Thanks Tariq! I use Blender3D for all of my 3D animations.
@@VisuallyExplained thank you 💜💜❤❤
Continued support, administrative process ongoing