Batch Normalization - EXPLAINED!

  • Uploaded 2 Aug 2024
  • What is Batch Normalization? Why is it important in Neural networks? We get into math details too. Code in references.
    Follow me on M E D I U M: towardsdatascience.com/likeli...
    REFERENCES
    [1] 2015 paper that introduced Batch Normalization: arxiv.org/abs/1502.03167
    [2] The paper that claims Batch Norm does NOT reduce internal covariate shift as claimed in [1]: arxiv.org/abs/1805.11604
    [3] Using BN + Dropout: arxiv.org/abs/1905.05928
    [4] Andrew Ng on why normalization speeds up training: www.coursera.org/lecture/deep...
    [5] Ian Goodfellow on how Batch Normalization helps regularization: www.quora.com/Is-there-a-theo...
    [6] Code Batch Normalization from scratch: kratzert.github.io/2016/02/12...

Comments • 129

  • @ssshukla26
    @ssshukla26 Před 4 lety +48

    Shouldn't it be that gamma should approximate the true variance of the neuron activation and beta should approximate the true mean of the neuron activation? I am just confused...

    • @CodeEmporium
      @CodeEmporium  Před 4 lety +25

      You're right. Misspoke there. Nice catch!

    • @ssshukla26
      @ssshukla26 Před 4 lety

      @@CodeEmporium Cool

    • @dhananjaysonawane1996
      @dhananjaysonawane1996 Před 3 lety +1

      How is this approximation happening?
      And how do we use beta, gamma at test time? We have only one example at a time during testing.

    • @FMAdestroyer
      @FMAdestroyer Před 2 lety +1

      @@dhananjaysonawane1996 In most frameworks, when you create a BN layer, gamma and beta (the learned scale and shift) are learnable parameters, usually represented as the layer's weight and bias, while the batch mean and variance are statistics that are tracked rather than learned. You can deduce that from the torch BatchNorm2d description below:
      "The mean and standard-deviation are calculated per-dimension over the mini-batches and γ and β are learnable parameter vectors of size C (where C is the input size)."

    • @AndyLee-xq8wq
      @AndyLee-xq8wq Před rokem

      Thanks for clarification!

  • @efaustmann
    @efaustmann Před 4 lety +23

    Exactly what I was looking for. Very well researched and explained in a simple way with visualizations. Thank you very much!

  • @sumanthbalaji1768
    @sumanthbalaji1768 Před 4 lety +9

    Just found your channel and binged through all your videos, so here's a general review. As a student, I assure you your content is on point and goes in depth, unlike other channels that just skim the surface. Keep it up and don't be afraid to go more in depth on concepts. We love it. Keep it up brother, you have earned a supporter till your channel's end.

    • @CodeEmporium
      @CodeEmporium  Před 4 lety +2

      Thanks ma guy. I'll keep pushing up content. Good to know my audience loves the details ;)

    • @sumanthbalaji1768
      @sumanthbalaji1768 Před 4 lety

      @@CodeEmporium damn did not actually expect you to reply lol. Maybe let me throw a topic suggestion then. More NLP please, take a look at summarisation tasks as a topic. Would be damn interesting.

  • @jodumagpi
    @jodumagpi Před 4 lety

    This is good! I think that giving an example as well as the use cases (advantages) before diving into the details always gets the job done

  • @maxb5560
    @maxb5560 Před 4 lety +1

    Love your videos. They help me a lot in understanding machine learning more and more

  • @parthshastri2451
    @parthshastri2451 Před 4 lety +9

    Why did you plot the cost against height and age? Isn't it supposed to be a function of the weights in a neural network?

  • @ultrasgreen1349
    @ultrasgreen1349 Před 2 lety

    That's actually a very, very good and intuitive video. Honestly, thank you

  • @balthiertsk8596
    @balthiertsk8596 Před 2 lety

    Hey man, thank you.
    I really appreciate this quality content!

  • @yeripark1135
    @yeripark1135 Před 2 lety

    I clearly understand the need for batch normalization and its advantages! Thanks!!

  • @Slisus
    @Slisus Před 2 lety

    Awesome video. I really like how you go into the actual papers behind it.

  • @EB3103
    @EB3103 Před 3 lety +2

    The loss is not a function of the features but a function of the weights

  • @luisfraga3281
    @luisfraga3281 Před 4 lety

    Hello, I wonder: what if we don't normalize the image input data (RGB 0-255) and then use batch normalization? Is it going to work smoothly, or is it going to mess up the learning?

  • @ahmedshehata9522
    @ahmedshehata9522 Před 2 lety

    You are really good, because you reference the papers and introduce the idea

  • @JapiSandhu
    @JapiSandhu Před 2 lety

    Can I add a Batch Normalization layer after an LSTM layer in PyTorch?
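
    One way this is typically wired up (a sketch assuming PyTorch; the sizes are made up for illustration): nn.BatchNorm1d expects the channel dimension second, so the LSTM output has to be permuted, and nn.LayerNorm is a common alternative for recurrent outputs.

      import torch
      import torch.nn as nn

      lstm = nn.LSTM(input_size=10, hidden_size=32, batch_first=True)
      bn = nn.BatchNorm1d(num_features=32)    # normalizes each of the 32 channels over batch and time
      ln = nn.LayerNorm(normalized_shape=32)  # often preferred after recurrent layers

      x = torch.randn(8, 20, 10)              # (batch, seq_len, features)
      out, _ = lstm(x)                        # (8, 20, 32)

      out_bn = bn(out.permute(0, 2, 1)).permute(0, 2, 1)  # move the hidden dim into the channel slot
      out_ln = ln(out)                                    # LayerNorm applies to the last dim directly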

  • @mohammadkaramisheykhlan9

    How can we use batch normalization in the test set?

  • @SillyMakesVids
    @SillyMakesVids Před 4 lety

    Sorry, but where did gamma and beta come from, and how are they used?

  • @dragonman101
    @dragonman101 Před 3 lety +1

    Quick note: at 6:50 there should be brackets after 1/3 (see below)
    Yours: 1/3 (4 - 5.33)^2 + (5 - 5.33)^2 + (7 - 5.33)^2
    Should be: 1/3 [(4 - 5.33)^2 + (5 - 5.33)^2 + (7 - 5.33)^2]
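
    A quick numpy check of the corrected expression (population variance, i.e. dividing by the batch size, as in the per-batch statistics of [1]):

      import numpy as np

      x = np.array([4.0, 5.0, 7.0])
      mu = x.mean()                       # ≈ 5.33
      var = ((x - mu) ** 2).mean()        # = 1/3 * [(4 - 5.33)^2 + (5 - 5.33)^2 + (7 - 5.33)^2]
      print(mu, var)                      # ≈ 5.333, 1.556
      print(np.isclose(var, np.var(x)))   # True: matches numpy's population variance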

  • @user-nx8ux5ls7q
    @user-nx8ux5ls7q Před 2 lety

    Do we calculate the mean and SD across a mini-batch for a given neuron, or across all the neurons in a layer? Andrew Ng says it's across each layer. Thanks.
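
    In the formulation of [1], the mean and SD are computed per neuron (per feature), across the examples in the mini-batch; averaging across all the neurons of a layer for each example is what layer normalization does instead. A small numpy sketch of the difference (shapes are illustrative):

      import numpy as np

      rng = np.random.default_rng(0)
      x = rng.normal(size=(32, 4))            # mini-batch of 32 examples, layer with 4 neurons

      mu = x.mean(axis=0)                     # shape (4,): one mean per neuron, over the batch
      sd = x.std(axis=0)                      # shape (4,): one SD per neuron
      x_hat = (x - mu) / (sd + 1e-5)          # batch norm: each neuron now has ~zero mean, unit variance

      mu_ln = x.mean(axis=1, keepdims=True)   # shape (32, 1): layer norm averages over neurons instead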

  • @oheldad
    @oheldad Před 4 lety +6

    Hey there. I'm on my way to becoming a data scientist, and your videos help me a lot! Keep going, I'm sure I am not the only one you inspired :) thank you!!

    • @CodeEmporium
      @CodeEmporium  Před 4 lety +1

      Awesome! Glad these videos help! Good luck with your Data science ventures :)

    • @ccuuttww
      @ccuuttww Před 4 lety +2

      Your aim should not be to become a data scientist to fit other people's expectations; you should become a person who can deal with data and estimate any unknown parameter to your own standard

    • @oheldad
      @oheldad Před 4 lety

      @@ccuuttww Don't know why you decided that I'm fulfilling others' expectations of me - it's not true. I'm in the last semester of my electrical engineering degree, and decided to change path a little :)

    • @ccuuttww
      @ccuuttww Před 4 lety

      Because most people think in the following pattern: finish all the exam semesters, graduate with good marks, send out CVs en masse, and try to get a job titled "Data Scientist",
      then try to fit what they learned at university to their jobs like a trained monkey. However, you are not dealing with a real-world situation; you are just trying to deal with your customer or your boss. Since this topic never has a standard answer, you can only define it yourself, and your client only trusts your title.
      I feel this is really bad

  • @user-nx8ux5ls7q
    @user-nx8ux5ls7q Před 2 lety

    Also, can someone say how to make gamma and beta learnable? Gamma can be thought of as an additional weight attached to the activation, but how about beta? How do we train that?
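
    Both are trained like any other parameter: the backward pass gives a gradient for each. A tiny numpy sketch with made-up shapes (x_hat is the normalized activation from the forward pass, dout the upstream gradient):

      import numpy as np

      N, D = 4, 3
      rng = np.random.default_rng(0)
      x_hat = rng.standard_normal((N, D))     # normalized activations
      dout = rng.standard_normal((N, D))      # gradient flowing back from the next layer
      gamma, beta, lr = np.ones(D), np.zeros(D), 0.1

      dgamma = (dout * x_hat).sum(axis=0)     # gradient w.r.t. the scale (gamma)
      dbeta = dout.sum(axis=0)                # gradient w.r.t. the shift (beta)
      gamma -= lr * dgamma                    # both are updated like any other weight/bias
      beta -= lr * dbeta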

  • @sriharihumbarwadi5981
    @sriharihumbarwadi5981 Před 4 lety +1

    Can you please make a video on how batch normalization and L1/L2 regularization interact with each other?

  • @seyyedpooyahekmatiathar624

    Subtracting the mean and dividing by std is standardization. Normalization is when you change the range of the dataset to be [0,1].
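
    A small numpy illustration of that distinction (the BN paper [1] still calls the mean/variance version "normalization"):

      import numpy as np

      x = np.array([4.0, 5.0, 7.0, 10.0])

      standardized = (x - x.mean()) / x.std()         # zero mean, unit variance (what batch norm does)
      min_max = (x - x.min()) / (x.max() - x.min())   # rescaled into the range [0, 1]

      print(standardized.mean(), standardized.std())  # ≈ 0.0, 1.0
      print(min_max.min(), min_max.max())             # 0.0, 1.0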

  • @SunnySingh-tp6nt
    @SunnySingh-tp6nt Před 3 měsíci

    can I get these slides?

  • @mizzonimirko
    @mizzonimirko Před rokem

    I do not understand properly how this is going to be implemented. We actually perform those operations at the end of an epoch, right? At that point, the layer where I have applied it is normalized, right?

  • @abhishekp4818
    @abhishekp4818 Před 4 lety

    @CodeEmporium, could you please tell me why we need to normalize the outputs of an activation function when they are already within a small range (for example, sigmoid ranges from 0 to 1)?
    And if we do normalize them, then how do we compute the updates of its parameters during backpropagation?
    Please answer.

    • @boke6184
      @boke6184 Před 4 lety

      The activation function should be modifying the predictability of error or learning too

  • @ryanchen6147
    @ryanchen6147 Před 2 lety +2

    at 3:27, I think your axes should be the *weight* for the height feature and the *weight* for the age feature if that is a contour plot of the cost function

  • @angusbarr7952
    @angusbarr7952 Před 4 lety +16

    Hey! Just cited you in my undergrad project because your example finally made me understand batch norm. Thanks a lot!

  • @SetoAjiNugroho
    @SetoAjiNugroho Před 4 lety

    what about layer norm ?

  • @MaralSheikhzadeh
    @MaralSheikhzadeh Před 2 lety

    Thanks, this video helped me understand BN better. And I liked your sense of humor; it made watching more fun. :)

  • @pranavjangir8338
    @pranavjangir8338 Před 4 lety +1

    Isn't Batch Normalization also used to counter the exploding gradient problem? Would have loved some explanation on that too...

  • @hervebenganga8561
    @hervebenganga8561 Před 2 lety

    This is beautiful. Thank you

  • @anishjain8096
    @anishjain8096 Před 4 lety

    Hey brother, can you please tell me how on-the-fly data augmentation increases the image data set? In every blog and video they say it increases the data size, but how?

    • @CodeEmporium
      @CodeEmporium  Před 4 lety

      For images, you would need to make minor distortions (rotation, crop, scale, blur) in an image such that the result is a realistic input. This way, you have more training data for your model to generalize
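
      A sketch of what that looks like in code, assuming a recent torchvision is available (the transform choices and parameters are illustrative, not from the video):

        import torch
        from torchvision import transforms

        # Each pass sees a differently distorted copy of the same image, so the model
        # effectively trains on many more variations than the raw dataset contains.
        augment = transforms.Compose([
            transforms.RandomHorizontalFlip(p=0.5),
            transforms.RandomRotation(degrees=15),
            transforms.RandomResizedCrop(size=224, scale=(0.8, 1.0)),
            transforms.GaussianBlur(kernel_size=3),
        ])

        image = torch.rand(3, 256, 256)   # stand-in for a real image tensor
        augmented = augment(image)        # a new random variant on every call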

  • @ccuuttww
    @ccuuttww Před 4 lety +1

    I wonder, is it suitable to use the population estimator?
    I think nowadays most machine learning learners/students/fans
    spend very little time on statistics. After several years of study, I find that model selection and statistical theory make up the most important part,
    especially Bayesian learning, the most underrated topic today

  • @ayandogra2952
    @ayandogra2952 Před 3 lety

    Amazing work
    really liked it

  • @iliasaarab7922
    @iliasaarab7922 Před 3 lety

    Great explanation, thanks!

  • @rockzzstartzz2339
    @rockzzstartzz2339 Před 4 lety

    Why use beta and gamma?

  • @taghyeertaghyeer5974
    @taghyeertaghyeer5974 Před rokem +3

    Hello, thank you for your video.
    I am wondering about the claim that batch normalisation speeds up the training: you showed at 2:42 the contour plot of the loss as a function of height and age. However, the loss function contours should be plotted against the weights (the optimization is performed in the weights' space, and not the input space). In other words, why did you base your argument on the loss function with height and age being the variables (they should be held constant during optimization)?
    Thank you! Lana

    • @marcinstrzesak346
      @marcinstrzesak346 Před 10 měsíci

      For me, it also seemed quite confusing. I'm glad someone else noticed it too.

    • @atuldivekar
      @atuldivekar Před 6 měsíci

      The contour plot is being shown as a function of height and age to show the dependence of the loss on the input distribution, not the weights

  • @chandnimaria9748
    @chandnimaria9748 Před 10 měsíci

    Just what I was looking for, thanks.

  • @SaifMohamed-de8uo
    @SaifMohamed-de8uo Před 2 měsíci

    Great explanation thank you!

  • @superghettoindian01
    @superghettoindian01 Před rokem

    I see you are checking all these comments - so I will try to comment on all the videos I watch going forward and mention how I'm using them.
    Currently using this video as a supplement to Andrej Karpathy's makemore series, pt. 3.
    The other video has a more detailed implementation of batch normalization, but you do a great job of summarizing the key concepts. I hope one day you and Andrej can create a video together 😊.

    • @CodeEmporium
      @CodeEmporium  Před rokem +1

      Thanks a ton for the comment. Honestly, any critical feedback is appreciated. So thank you. It would certainly be a privilege to collaborate with Andrej. Maybe in the future :)

  • @user-wf2fq2vn5m
    @user-wf2fq2vn5m Před 3 lety

    Awesome explanation.

  • @God-vl5uz
    @God-vl5uz Před 2 měsíci

    Thank you!

  • @shaz-z506
    @shaz-z506 Před 4 lety

    Good video. Could you please make a video on capsule networks?

  • @uniquetobin4real
    @uniquetobin4real Před 4 lety

    The best I have seen so far

  • @thoughte2432
    @thoughte2432 Před 3 lety +4

    I found this a really good and intuitive explanation, thanks for that. But there was one thing that confused me: isn't the effect of batch normalization the smoothing of the loss function? I found it difficult to associate the loss function directly to the graph shown at 2:50.

    • @Paivren
      @Paivren Před rokem

      yes, the graph is a bit weird in the sense that the loss function is not a function of the features but of the model parameters.

  • @strateeg32
    @strateeg32 Před 2 lety

    Awesome thank you!

  • @hemaswaroop7970
    @hemaswaroop7970 Před 4 lety

    Thanks, Man!

  • @aminmw5258
    @aminmw5258 Před rokem

    Thank you bro.

  • @danieldeychakiwsky1928
    @danieldeychakiwsky1928 Před 4 lety +7

    Thanks for the video. I wanted to add that there's debate in the community over whether to normalize pre vs. post non-linearity within the layers, i.e., for a given neuron in some layer, do you normalize the result of the linear function that gets piped through non-linearity or do you pipe the linear combination through non-linearity and then apply normalization, in both cases, over the mini-batch.

    • @kennethleung4487
      @kennethleung4487 Před 3 lety +3

      Here's what I found from MachineLearningMastery:
      o Batch normalization may be used on inputs to the layer before or after the activation function in the previous layer
      o It may be more appropriate after the activation function for S-shaped functions like the hyperbolic tangent and the logistic function
      o It may be appropriate before the activation function for activations that may result in non-Gaussian distributions like the rectified linear activation function, the modern default for most network types
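
      For concreteness, the two orderings being discussed look like this in PyTorch (a sketch with made-up layer sizes; the original paper [1] normalizes the pre-activation, i.e. the first variant):

        import torch
        import torch.nn as nn

        pre_act = nn.Sequential(nn.Linear(128, 64), nn.BatchNorm1d(64), nn.ReLU())   # BN before the non-linearity
        post_act = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.BatchNorm1d(64))  # BN after the non-linearity

        x = torch.randn(32, 128)
        y1, y2 = pre_act(x), post_act(x)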

  • @enveraaa8414
    @enveraaa8414 Před 3 lety

    Bro you have made the perfect video

  • @erich_l4644
    @erich_l4644 Před 4 lety +1

    This was so well put together- why less than 10k views? Oh... it's batch normalization

  • @JapiSandhu
    @JapiSandhu Před 2 lety

    this is a great video

  • @manthanladva6547
    @manthanladva6547 Před 4 lety

    Thanks for the awesome video.
    Got many ideas about Batch Norm.

  • @kriz1718
    @kriz1718 Před 4 lety

    Very helpful!!

  • @akremgomri9085
    @akremgomri9085 Před 2 měsíci

    Very good explanation. However, there is something I didn't understand. Doesn't batch normalisation modify the input data so that m=0 and v=1, as explained in the beginning?? So how the heck did we move from normalisation being applied to inputs, to normalisation affecting the activation function? 😅😅

  • @PavanTripathi-rj7bd
    @PavanTripathi-rj7bd Před rokem

    great explanation

    • @CodeEmporium
      @CodeEmporium  Před rokem

      Thank you! Enjoy your stay on the channel :)

  • @QuickTechNow
    @QuickTechNow Před 13 dny

    Thanks

  • @sultanatasnimjahan5114
    @sultanatasnimjahan5114 Před 8 měsíci

    thanks

  • @nobelyhacker
    @nobelyhacker Před 3 lety

    Nice video, but I guess there is a little error at 6:57? You have to multiply the whole sum by 1/3, not only the first term

  • @priyankakaswan7528
    @priyankakaswan7528 Před 3 lety

    The real magic starts at 6:07; this video was exactly what I needed

  • @sanjaykrish8719
    @sanjaykrish8719 Před 4 lety

    Fantastic explanation using contour plots.

  • @lamnguyentrong275
    @lamnguyentrong275 Před 4 lety +3

    Wow, easy to understand, and a clear accent. Thank you, sir. You did a great job

  • @gyanendradas
    @gyanendradas Před 4 lety

    Can you make a video on all types of pooling layers?

    • @CodeEmporium
      @CodeEmporium  Před 4 lety +1

      Interesting. I'll look into this. Thanks for the idea

  • @pranaysingh3950
    @pranaysingh3950 Před 2 lety

    Thanks!

  • @its_azmii
    @its_azmii Před 4 lety

    Hey, can you link the graph that you used, please?

  • @samratkorupolu
    @samratkorupolu Před 3 lety

    wow, you explained pretty clearly

  • @ajayvishwakarma6943
    @ajayvishwakarma6943 Před 4 lety

    Thanks buddy

  • @aaronk839
    @aaronk839 Před 4 lety +26

    Good explanation until 7:17 after which, I think, you miss the point which makes the whole thing very confusing. You say: "Gamma should approximate to the true mean of the neuron activation and beta should approximate to the true variance of the neuron activation." Apart from the fact that this should be the other way around, as you acknowledge in the comments, you don't say what you mean by "true mean" and "true variance".
    I learned from Andrew Ng's video (czcams.com/video/tNIpEZLv_eg/video.html) that the actual reason for introducing two learnable parameters is that you actually don't necessarily want all batch data to be normalized to mean 0 and variance 1. Instead, shifting and scaling all normalized data at one neuron to obtain a different mean (beta) and variance (gamma) might be advantageous in order to exploit the non-linearity of your activation functions.
    Please don't skip over important parts like this one with sloppy explanations in future videos. This gives people the impression that they understand what's going on, when they actually don't.
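
    To spell that out, a small numpy sketch of the full transform in Algorithm 1 of [1]: the network can learn gamma = sqrt(var + eps) and beta = mu to undo the normalization entirely, or settle on any other mean and variance in between.

      import numpy as np

      rng = np.random.default_rng(0)
      x = rng.normal(loc=3.0, scale=2.0, size=(256, 1))   # activations of one neuron over a batch
      eps = 1e-5

      mu, var = x.mean(axis=0), x.var(axis=0)
      x_hat = (x - mu) / np.sqrt(var + eps)               # forced to mean 0, variance 1

      gamma, beta = np.sqrt(var + eps), mu                # one possible learned setting
      y = gamma * x_hat + beta
      print(np.allclose(y, x))                            # True: the normalization has been undone

      gamma, beta = 0.5, 1.0                              # any other setting -> mean 1.0, std 0.5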

    • @dragonman101
      @dragonman101 Před 3 lety +3

      Thank you very much for this explanation. The link and the correction are very helpful and do provide some clarity to a question I had.
      That being said, I don't think it's fair to call his explanation sloppy. He broke down complicated material in a fantastic and clear way for the most part. He even linked to research so we could do further reading, which is great because now I have a solid foundation to understand what I read in the papers. He should be encouraged to fix his few mistakes rather than slapped on the wrist.

    • @sachinkun21
      @sachinkun21 Před 2 lety

      Thanks a ton!! I was actually looking for this comment, as I had the same question as to why we even need to approximate!

  • @novinnouri764
    @novinnouri764 Před 2 lety

    Thanks!

  • @abheerchrome
    @abheerchrome Před 3 lety

    Great video bro, keep it up

  • @elyasmoshirpanahi7184

    Nice content

  • @themightyquinn100
    @themightyquinn100 Před rokem

    Wasn't there an episode where Peter was playing against Larry Bird?

  • @akhileshpandey123
    @akhileshpandey123 Před 3 lety

    Nice explanation :+1

  • @ai__76
    @ai__76 Před 3 lety

    Nice animations

  • @adosar7261
    @adosar7261 Před rokem

    And why not just normalize the whole training set instead of using batch normalization?

    • @CodeEmporium
      @CodeEmporium  Před rokem

      Batch normalization normalizes intermediate activations at different layers of the network, and those activations change every time the weights are updated, so their statistics can't be precomputed once for the whole training set. Computing them over the entire dataset at every step would mean passing all training examples through the network as a single batch. This is what we see in "batch gradient descent", but it isn't super common for large datasets because of memory constraints.
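
      A sketch of how the batch statistics still end up approximating the whole training set: most implementations keep an exponential moving average of the per-batch mean and variance and use that at test time (the momentum value below is illustrative):

        import numpy as np

        rng = np.random.default_rng(0)
        data = rng.normal(loc=2.0, scale=3.0, size=(10_000, 1))    # stands in for the training set

        running_mean, running_var, momentum = 0.0, 1.0, 0.1
        for batch in np.split(data, 100):                          # 100 mini-batches of 100 examples
            running_mean = (1 - momentum) * running_mean + momentum * batch.mean()
            running_var = (1 - momentum) * running_var + momentum * batch.var()

        print(running_mean, running_var)    # close to the full-dataset statistics below (≈ 2, ≈ 9)
        print(data.mean(), data.var())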

  • @sevfx
    @sevfx Před rokem

    Great explanation, but missing parentheses at 6:52 :p

  • @99dynasty
    @99dynasty Před 2 lety

    BatchNorm reparametrizes the underlying optimization problem to make it more stable (in the sense of loss Lipschitzness) and smooth (in the sense of “effective” β-smoothness of the loss).
    Not my words

  • @pupfer
    @pupfer Před 2 lety

    The only difficult part of batch norm, namely the backprop, isn't explained.
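
    For anyone who wants that part, here is a compact numpy sketch of the forward and backward pass, following the derivation in reference [6] (x and dout have shape (N, D); gamma and beta have shape (D,)):

      import numpy as np

      def batchnorm_forward(x, gamma, beta, eps=1e-5):
          mu = x.mean(axis=0)
          var = x.var(axis=0)
          x_hat = (x - mu) / np.sqrt(var + eps)
          out = gamma * x_hat + beta
          return out, (x, x_hat, mu, var, gamma, eps)

      def batchnorm_backward(dout, cache):
          x, x_hat, mu, var, gamma, eps = cache
          N = x.shape[0]
          inv_std = 1.0 / np.sqrt(var + eps)
          dbeta = dout.sum(axis=0)                     # gradient w.r.t. beta
          dgamma = (dout * x_hat).sum(axis=0)          # gradient w.r.t. gamma
          dx_hat = dout * gamma
          dvar = np.sum(dx_hat * (x - mu) * -0.5 * inv_std**3, axis=0)
          dmu = np.sum(-dx_hat * inv_std, axis=0) + dvar * np.mean(-2.0 * (x - mu), axis=0)
          dx = dx_hat * inv_std + dvar * 2.0 * (x - mu) / N + dmu / N
          return dx, dgamma, dbeta

      rng = np.random.default_rng(0)
      x = rng.normal(size=(4, 3))
      gamma, beta = np.ones(3), np.zeros(3)
      out, cache = batchnorm_forward(x, gamma, beta)
      dx, dgamma, dbeta = batchnorm_backward(np.ones_like(out), cache)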

  • @GauravSharma-ui4yd
    @GauravSharma-ui4yd Před 4 lety

    Awesome, keep going like this

    • @CodeEmporium
      @CodeEmporium  Před 4 lety +1

      Thanks for watching every video Gaurav :)

  • @Acampandoconfrikis
    @Acampandoconfrikis Před 3 lety

    Hey 🅱eter, did you make it to the NBA?

  • @lazarus8011
    @lazarus8011 Před 2 měsíci

    Good video
    here's a comment for the algorithm

  • @PierreH1968
    @PierreH1968 Před 3 lety

    Great explanation, very helpful!

  • @boke6184
    @boke6184 Před 4 lety

    This is good for ghost box

  • @eniolaajiboye4399
    @eniolaajiboye4399 Před 3 lety

    🤯

  • @xuantungnguyen9719
    @xuantungnguyen9719 Před 3 lety

    good visualization

  • @SAINIVEDH
    @SAINIVEDH Před 3 lety

    For RNNs, Batch Normalisation should be avoided; use Layer Normalisation instead
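
    A small PyTorch sketch of why that is the usual advice: LayerNorm statistics are computed per example (and per time step), so they do not depend on the rest of the batch, while BatchNorm in training mode mixes examples together. Shapes are illustrative.

      import torch
      import torch.nn as nn

      torch.manual_seed(0)
      x = torch.randn(8, 20, 64)                      # (batch, seq_len, features), e.g. RNN outputs

      ln = nn.LayerNorm(64)
      print(torch.allclose(ln(x)[:1], ln(x[:1])))     # True: per-example stats, batch size irrelevant

      bn = nn.BatchNorm1d(64).train()
      xb = x.permute(0, 2, 1)                         # (batch, features, seq_len) as BatchNorm1d expects
      print(torch.allclose(bn(xb)[:1], bn(xb[:1])))   # False: training-mode stats depend on the whole batch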

  • @alexdalton4535
    @alexdalton4535 Před 3 lety

    Why didn't Peter make it...

  • @sealivezentrum
    @sealivezentrum Před 3 lety +1

    fuck me, you explained way better than my prof did

  • @nyri0
    @nyri0 Před 2 lety

    Your visualizations are misleading. Normalization doesn't turn the shape on the left into the circle seen on the right. It will be less elongated but still keep a diagonal ellipse shape.
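
    That matches what the method actually does: [1] normalizes each feature independently instead of fully whitening, so correlations between features remain. A quick numpy check:

      import numpy as np

      rng = np.random.default_rng(0)
      cov = np.array([[3.0, 2.0],
                      [2.0, 3.0]])                    # two strongly correlated features
      x = rng.multivariate_normal(mean=[5.0, -2.0], cov=cov, size=5000)

      z = (x - x.mean(axis=0)) / x.std(axis=0)        # per-feature standardization, as in batch norm

      print(np.corrcoef(x.T)[0, 1])                   # ≈ 0.67 before
      print(np.corrcoef(z.T)[0, 1])                   # identical after: the diagonal tilt remains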

  • @roeeorland
    @roeeorland Před rokem

    Peter is most definitely not 1.9m
    That’s 6’3

  • @ahmedelsabagh6990
    @ahmedelsabagh6990 Před 3 lety

    55555 you get it :) HaHa

  • @rodi4850
    @rodi4850 Před 4 lety +2

    Sorry to say, but a very poor video. The intro was way too long, and explaining the math and why BN works was left to only 1-2 minutes.

    • @CodeEmporium
      @CodeEmporium  Před 4 lety +5

      Thanks for watching till the end. I tried going for a layered approach to the explanation - get the big picture. Then the applications. Then details. I wasn't sure how much more math was necessary. This was the main math in the paper, so I thought that was adequate. Always open to suggestions if you have any. If you've looked at my recent videos, you can tell the delivery is not consistent. Trying to see what works

    • @PhilbertLin
      @PhilbertLin Před 4 lety

      I think the intro with the samples in the first few minutes was a little drawn out but the majority of the video spent on intuition and visuals without math was nice. Didn’t go through the paper so can’t comment on how much more math detail is needed.