L11.6 Xavier Glorot and Kaiming He Initialization
- Added Mar 10, 2021
- IMPORTANT NOTE: In the video, I talk about the number of input units in the denominator ("fan in"), but to be correct, it should have been the sum of the number of input units and output units of the layer ("fan in" + "fan out").
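For reference, a minimal NumPy sketch of the corrected formula (my own illustration, not taken from the slides; the function name is mine):

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, rng=None):
    """Glorot/Xavier uniform init: weight variance 2 / (fan_in + fan_out)."""
    rng = np.random.default_rng() if rng is None else rng
    # U(-limit, limit) has variance limit**2 / 3 = 2 / (fan_in + fan_out)
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

W = xavier_uniform(256, 128, rng=np.random.default_rng(0))
print(W.shape)            # (256, 128)
print(round(W.var(), 4))  # close to 2 / (256 + 128) ≈ 0.0052
```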
Slides: sebastianraschka.com/pdf/lect...
Papers:
Xavier Glorot and Yoshua Bengio. "Understanding the difficulty of training deep feedforward neural networks." Proceedings of the thirteenth international conference on artificial intelligence and statistics. 2010. proceedings.mlr.press/v9/gloro...
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Delving deep into rectifiers: Surpassing human-level performance on imagenet classification." In Proceedings of the IEEE international conference on computer vision, pp. 1026-1034. 2015. arxiv.org/abs/1502.01852
-------
This video is part of my Introduction to Deep Learning course.
Next video: • L11.7 Weight Initializ...
The complete playlist: • Intro to Deep Learning...
A handy overview page with links to the materials: sebastianraschka.com/blog/202...
-------
If you want to be notified about future videos, please consider subscribing to my channel: / sebastianraschka - Science & Technology
The terms fan-in and fan-out come from digital electronics. Fan-in is the maximum number of logic gates that can be connected to the input of a particular gate; fan-out is the analogous limit for its output.
At 6:12, on the second line of equations (the part marked with the blue circle), can someone please clarify how the variance of the product of two independent variables can be expanded into the product of their variances? I can't seem to find any such property. Can someone point me to some helpful material? Thank you.
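For what it's worth, a sketch of the missing step: it relies on the zero-mean assumption made in the derivation and is not true for general random variables. For independent X and Y with E[X] = E[Y] = 0:

```latex
\mathrm{Var}(XY) = E\big[(XY)^2\big] - \big(E[XY]\big)^2
                 = E[X^2]\,E[Y^2] - \big(E[X]\,E[Y]\big)^2
                 = E[X^2]\,E[Y^2]
                 = \mathrm{Var}(X)\,\mathrm{Var}(Y).
```

Without the zero-mean assumption, the general identity for independent variables picks up extra terms: Var(XY) = Var(X)Var(Y) + Var(X)E[Y]^2 + Var(Y)E[X]^2.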
At 5:46, should the summation iterator variable be 'k' instead of 'j'?
I think you are not describing Xavier initialization. Xavier initialization is equation (16) in the paper. Equation (1) is what you are showing, with only fan_in, and that is what they argue was a common but bad heuristic.
Thanks for the note, you are right. Wasn't careful here. Will make a note to fix that.
Thanks for the mention. After reading the paper, I was wondering why no one talks about equation (16). I must say I see so many different interpretations that I am totally confused. Also, where does He take the nonlinearity of ReLU into account? We see a square root in both formulas; the multiplication by 2 is due to the fact that ReLU cuts off the half below 0, right?
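Regarding the factor of 2: yes, roughly speaking, ReLU zeroes the negative half of the pre-activation, which halves the second moment passed forward, and He et al. compensate by doubling the weight variance. A sketch, assuming the pre-activation z is symmetric around 0:

```latex
E\big[\mathrm{ReLU}(z)^2\big]
  = \int_{0}^{\infty} z^2\, p(z)\, dz
  = \tfrac{1}{2} \int_{-\infty}^{\infty} z^2\, p(z)\, dz
  = \tfrac{1}{2}\, E\big[z^2\big].
```

Setting Var(w) = 2/fan_in then restores the variance in the forward pass.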
I noticed that too. It's a shame, as I would like to understand better where the root 6 comes from in the actual Xavier initialization equation.
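If it helps: the sqrt(6) is just the conversion from the target variance to the bounds of a uniform distribution, assuming the uniform variant of Glorot initialization. A uniform variable on (-a, a) has variance a^2/3, so:

```latex
W \sim \mathcal{U}(-a, a) \;\Rightarrow\; \mathrm{Var}(W) = \frac{a^2}{3},
\qquad
\frac{a^2}{3} = \frac{2}{n_{\text{in}} + n_{\text{out}}}
\;\Rightarrow\;
a = \sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}}.
```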