Which Activation Function Should I Use?

  • Date added: Aug 2, 2024
  • All neural networks use activation functions, but the reasons behind using them are never clear! Let's discuss what activation functions are, when they should be used, and what the difference between them is.
    Sample code from this video:
    github.com/llSourcell/Which-A...
    Please subscribe! And like. And comment. That's what keeps me going.
    More Learning resources:
    www.kdnuggets.com/2016/08/role...
    cs231n.github.io/neural-networ...
    www.quora.com/What-is-the-rol...
    stats.stackexchange.com/quest...
    en.wikibooks.org/wiki/Artific...
    stackoverflow.com/questions/9...
    papers.nips.cc/paper/874-how-...
    neuralnetworksanddeeplearning....
    / activation-functions-i...
    / mathematical-foundatio...
    Join us in the Wizards Slack channel:
    wizards.herokuapp.com/
    And please support me on Patreon:
    www.patreon.com/user?u=3191693
    Follow me:
    Twitter: / sirajraval
    Facebook: / sirajology
    Instagram: / sirajraval
    Signup for my newsletter for exciting updates in the field of AI:
    goo.gl/FZzJ5w
    Hit the Join button above to sign up to become a member of my channel for access to exclusive content!
    Join my AI community: chatgptschool.io/
    Sign up for my AI sports betting bot, WagerGPT! (500 spots available):
    www.wagergpt.co

Comments • 462

  • @Skythedragon
    @Skythedragon Před 7 lety +261

    Thanks, my biological neural network now has learned how to choose activation functions!

    • @SirajRaval
      @SirajRaval  Před 7 lety +23

      awesome

    • @GilangD21
      @GilangD21 Před 6 lety +1

      Hahahah

    • @rs-tarxvfz
      @rs-tarxvfz Před 4 lety

      Remember, the whole is not just the sum of its parts; the behaviour of the whole is different from that of its elements.

  • @TheCodingTrain
    @TheCodingTrain Před 7 lety +203

    Great video, super helpful!

  • @StephenRoseDuo
    @StephenRoseDuo Před 7 lety +35

    From experience I'd recommend in order, ELU (exponential linear units) >> leaky ReLU > ReLU > tanh, sigmoid. I agree that you basically never have an excuse to use tanh or sigmoid.

    • @gorkemvids4839
      @gorkemvids4839 Před 6 lety +1

      I'm using tanh, but I always treat saturated neurons as 0.95 or -0.95 while backpropagating so the gradient doesnt disappear.

    • @JorgetePanete
      @JorgetePanete Před 5 lety

      @@gorkemvids4839 doesn't*
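
    To make the ranking above concrete, here is a minimal NumPy sketch (not from the video's repo) of the activation functions being compared; the alpha defaults for leaky ReLU and ELU are common conventions, not values prescribed in the video.

      import numpy as np

      def sigmoid(x):
          return 1.0 / (1.0 + np.exp(-x))        # squashes to (0, 1); saturates for large |x|

      def tanh(x):
          return np.tanh(x)                      # squashes to (-1, 1); also saturates

      def relu(x):
          return np.maximum(0.0, x)              # zero for x < 0, identity for x >= 0

      def leaky_relu(x, alpha=0.01):
          return np.where(x > 0, x, alpha * x)   # small negative slope avoids "dead" units

      def elu(x, alpha=1.0):
          return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))  # smooth, saturates at -alpha

      x = np.linspace(-3, 3, 7)
      for f in (sigmoid, tanh, relu, leaky_relu, elu):
          print(f.__name__, np.round(f(x), 3))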

  • @BOSS-bk2jx
    @BOSS-bk2jx Před 6 lety +3

    I love you man, 4 f***** months passed and my stupid prof. could not explain it as you did, not even partially. keep up the good work.
    Thanks a lot

  • @drhf1214
    @drhf1214 Před 5 lety +1

    Amazing video! Thank you! I'd never heard of neural networks until I started my internship. This is really fascinating.

  • @cali4nicated
    @cali4nicated Před 4 lety +1

    Wow, man, this is a seriously amazing video. Very entertaining and informative at the same time. Keep up great work! I'm now watching all your other videos :)

  • @quant-trader-010
    @quant-trader-010 Před 2 lety +1

    I really like your videos as they strike the very sweet spot between being concise and precise!

  • @pouyan74
    @pouyan74 Před 4 lety +1

    Dude! DUUUDE! You are AMAZING! I've read multiple papers already, but now the stuff are really making sense to me!

  • @CristianMargiotta
    @CristianMargiotta Před 7 lety +1

    A valuable introduction to generative methods for establishing meaning in artificial intelligence. A great way of bringing things together and expressing them in one single, accessible language.
    Thanks Siraj Raval, great!

  • @gydo1942
    @gydo1942 Před 6 lety +1

    this guy needs more subs. Finally a good explanation. Thanks man!

  • @hussain5755
    @hussain5755 Před 7 lety +1

    Just watched your speech at TNW Conference 2017. I am really happy that you are growing every day. You are my motivation and my idol. Proud of you, love you.

  • @kalreensdancevelventures5512

    By far the best machine learning videos I've watched. Amazing work! Love the energy and vibe!

  • @MrJnsc
    @MrJnsc Před 6 lety

    Learning more from your videos than from all my college classes combined!

  • @rafiakhan8721
    @rafiakhan8721 Před rokem +1

    Really enjoyed the video as you add subtle humor in between.

  • @grainfrizz
    @grainfrizz Před 6 lety

    I gained a lot of understanding and got that "click" moment after you explained linear vs non linearity. Thanks man. Keep up w/ the dank memes. My dream is that some day, I'd see a collab video between you, Dan Shiffman, and 3Blue1Brown. Love lots from Philippines!

  • @slowcoding
    @slowcoding Před 5 lety

    Cool. Your lecture cleared the cloud in my brain. I now have better understanding about the whole picture of the activation function.

  • @supremehype3227
    @supremehype3227 Před 5 lety +1

    Excellent and entertaining at a high level of entropy reduction. A fan.

  • @plouismarie
    @plouismarie Před 7 lety

    @Siraj
    NN can potentially grow in so many directions, you will always have something to explain to us.
    As you used to say 'this is only the beginning'.
    And ohh maaan ! you're so clear when you explain NN ;)
    Please keep doing what you're doing again and again and again...and again !
    You are to NNs what Neil deGrasse Tyson is to astrophysics.
    Thanks for sharing the GitHub source that details each activation function.

  • @jb.1412
    @jb.1412 Před 7 lety

    Hard stuff made easy. Congrats to a great video! Keep it up, mate!

  • @gowriparameswaribellala4423

    Great explanation of activation functions. Now I need to tweak my model.

  • @waleedtahir2072
    @waleedtahir2072 Před 7 lety +1

    Dank memes and dank learning, both in the same video. Who would have thought. Thanks Raj!

  • @prateekraghuwanshi5645

    Super clear & concise. Amazing simplicity. You Rock !!!

  • @sedthh
    @sedthh Před 7 lety +19

    But isn't ReLU a linear function? You mentioned at the beginning that linear functions should be avoided, since both backpropagation and classifying data points that don't fit a single hyperplane are easier with non-linear functions.
    Or did I get the whole thing wrong?

    • @jeffwells641
      @jeffwells641 Před 6 lety +14

      It's not linear because every negative x maps to zero on the y-axis. "Linear" basically means "straight line". The ReLU line is bent, hard, at 0. So it's linear if you only look at x > 0 or x < 0, but the whole line is kinked in the middle, which makes it non-linear.

    • @anshu957
      @anshu957 Před 5 lety +4

      It is a piecewise linear function, which is essentially a nonlinear function. For more info, google "piecewise linear functions".

    • @10parth10
      @10parth10 Před 4 lety +1

      The sparsity of the activations adds to the non-linearity of the neural net.

    • @UnrecycleRubdish
      @UnrecycleRubdish Před 2 lety

      @@10parth10 that explanation helped. Thanks
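
    One quick way to see the point made in the replies above: a linear function f must satisfy f(a + b) = f(a) + f(b), and ReLU's kink at zero breaks that. A minimal NumPy check, with arbitrary test values:

      import numpy as np

      def relu(x):
          return np.maximum(0.0, x)

      a, b = 2.0, -3.0
      print(relu(a + b))         # relu(-1.0) = 0.0
      print(relu(a) + relu(b))   # 2.0 + 0.0 = 2.0, so relu(a + b) != relu(a) + relu(b)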

  • @captainwalter
    @captainwalter Před 4 lety

    hey Siraj- just wanted to say thanks again. Apparently you got carried away and got busted being sneaky w crediting. I still respect your hustle and hunger. I think your means justify your ends- if you didn't make the moves that you did to prop up the image etc, I probably wouldn't have found you and your resources. At the end of the day, you are in fact legit bc you really bridge the gap of 1) knowing what ur talking about (i hope) 2) empathizing w someone learning this stuff (needed to break it down) 3) raising awareness about low hanging fruit that ppl outside the realm might not be aware of. Thank you again!!!!

  • @gigeg7708
    @gigeg7708 Před 5 lety +1

    Super Siraj Raval!!!!! Great compilation Bro.

  • @TuyoIsaza
    @TuyoIsaza Před 5 lety +1

    Dude.... exactly what i needed.. Thanks again!

  • @yatinarora9650
    @yatinarora9650 Před 4 lety

    Now I understand wtf we are using these activation functions for. Till now I was just using them; now I know why. Thanks Siraj.

  • @WillTesler
    @WillTesler Před 6 lety

    Love this video so much. Helped me so much with my LSTM RNN network

  • @CrazySkillz15
    @CrazySkillz15 Před 5 lety

    Thank you so much for such informative content explained with such clarity after taking so much efforts. Appreciate it! :) :D

  • @ManajitPal95
    @ManajitPal95 Před 5 lety +5

    Update: There is another activation function called "elu" which is faster than "relu" when it comes to speed of training. Try it out guys! :D

  • @joshiyogendra
    @joshiyogendra Před 6 lety

    this guys makes learning so much fun!

  • @venkateshkolpakwar5757

    The presentation is good. I learned how to choose an activation function. Thanks for the video, it helped a lot.

  • @rahulsbhatt
    @rahulsbhatt Před 5 lety

    Entire video is a GEM 💎
    Totally makes sense to use ML

  • @akhilguptavibrantjava
    @akhilguptavibrantjava Před 6 lety

    Crystal clear explanation, just loved it

  • @killthesource4740
    @killthesource4740 Před 4 lety +2

    I have a question about the vanishing gradient problem when using sigmoid. Could sigmoid be a more useful activation function when using shortcut connections in the NN?
    For those who don't know what these are: in a normal neural network each layer is connected to the next, but not directly to layers further on (a neuron n11 in layer 1 is connected to a neuron n21 in layer 2, but n11 isn't connected to the neuron n31 in the third layer). A shortcut connection is when a connection exists between layers that are normally not connected (like a connection between neurons n11 and n31), thus bypassing the layer in between (n21). n11 is still also connected to n21.
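
    For readers who want to see the shortcut idea in code, here is a minimal sketch of a forward pass in which the output of layer 1 bypasses layer 2 and is added back in before layer 3; the layer sizes and the plain-NumPy style are my own illustration, not anything from the video.

      import numpy as np

      def sigmoid(x):
          return 1.0 / (1.0 + np.exp(-x))

      rng = np.random.default_rng(0)
      x  = rng.normal(size=(4,))     # input
      W1 = rng.normal(size=(3, 4))   # layer 1 weights
      W2 = rng.normal(size=(3, 3))   # layer 2 weights
      W3 = rng.normal(size=(2, 3))   # layer 3 weights

      h1 = sigmoid(W1 @ x)           # layer 1 output
      h2 = sigmoid(W2 @ h1)          # layer 2 output
      h3 = sigmoid(W3 @ (h2 + h1))   # shortcut: h1 skips layer 2 and is added back in
      print(h3)

    Because the gradient can also flow back through the added h1 term directly, the shortcut path is less affected when the sigmoid in the bypassed layer saturates.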

  • @jennycotan7080
    @jennycotan7080 Před 7 měsíci

    Sir, likes for your memetics and fun explanation! All the spice you add to this video might bring some tech kids like me to the realm of Machine Learning!
    (And today, a mysterious graph sheet with the plot of max(0,x), a.k.a. the ReLU function, appeared in my high-school maths notebook, between the pages about piecewise functions, after I got up and arrived at school.)

  • @nrewik
    @nrewik Před 7 lety

    Thanks Siraj. Awesome explanation.
    I'm new to deep learning. It would be great if you could make videos about regularization and cost functions.

  • @nicodaunt
    @nicodaunt Před 5 lety

    digging your vids and enthusiasm from Portland Oregon!

  • @fersilvil
    @fersilvil Před 7 lety +3

    If we use genetic algorithms (GA) we do not need differentiable activation functions; we can even build our own function. The issue is the backpropagation method, which limits the activation functions.

  • @MohammedAli-pg2fw
    @MohammedAli-pg2fw Před 5 lety

    Thanks @Siraj. What an amazing and easy-to-digest explanation.

  • @akompsupport
    @akompsupport Před 7 lety

    Hey Siraj, here is a great trick: show us a neural net that can perform inductive reasoning! Great videos as always, keep them coming! Learning so much!

  • @anjali7778
    @anjali7778 Před 4 lety

    Woah ! thanks man, you made things so clear !!!

  • @satyamskillz
    @satyamskillz Před 4 lety +1

    "I can't control the gradient" was the best part of the video.

  • @dipeshbhandari4746
    @dipeshbhandari4746 Před 4 lety

    Your channel is GOLD!

  • @JeromeFortias
    @JeromeFortias Před 6 lety

    just a great training I LOVE how you did it :-)

  • @sahand5277
    @sahand5277 Před 5 lety +4

    8:44 I liked this motto on the wall.

  • @hectoralvarorojas1918
    @hectoralvarorojas1918 Před 6 lety

    Hi Siraj:
    Your videos are great!
    CONGRATULATIONS!

  • @vijayabhaskar-j
    @vijayabhaskar-j Před 7 lety +7

    Great video! Also make a video on How to choose the number of hidden layers and number of nodes in each layer?

    • @SirajRaval
      @SirajRaval  Před 7 lety +4

      will do thx

    • @TheQuickUplifts
      @TheQuickUplifts Před 5 lety

      If I understand the subject right, you'll always only need one hidden layer, because of Cover's Theorem

  • @LongDanzi
    @LongDanzi Před 4 lety

    Thanks, that was super helpful!

  • @dyjiang1350
    @dyjiang1350 Před 6 lety

    This video is very easy to understand!

  • @bem7069
    @bem7069 Před 7 lety

    Thank you for posting this.

  • @RastaZak
    @RastaZak Před 6 lety

    I have a question... For Sigmoid activation functions with an output close to 1, would the vanishing gradient problem still cause no signal to flow through it? Or instead would it cause the output to be fully saturated permanently? Either way it would be an issue but i'm just trying to wrap my head around this.

  • @madhumithak3338
    @madhumithak3338 Před 3 lety

    your teaching way is so cool and crazy :)

  • @jindagi_ka_safar
    @jindagi_ka_safar Před 5 lety

    Great insight on Activation Functions , thanks

  • @guilhermeabreu3131
    @guilhermeabreu3131 Před 3 lety

    Excellent explanation!!! You're really funny and I loved the way you explain things. Thank you!!!

  • @drip888
    @drip888 Před rokem

    omg, this is the first time i am seeing his video and its quite entertaining

  • @mswai5020
    @mswai5020 Před 5 lety

    Loving the KEK :) Awesome Siraj :) Can you do a piece on CFR+ and its geopolitical implications?

  • @pure_virtual
    @pure_virtual Před 7 lety +8

    How do you detect dead ReLUs in your model though?

    • @Fr0zenFireV
      @Fr0zenFireV Před 6 lety

      By viewing the activation function values in each layer?

    • @TheOnlySaneAmerican
      @TheOnlySaneAmerican Před 3 lety

      After each epoch, check to see if any neurons have activations that are converging toward zero. The best way to do this would be to monitor the neurons over a series of epochs and calculate a delta or differential between training epochs.
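
    A minimal sketch of the monitoring idea described in the reply above, assuming you can collect a layer's pre-activations (called z here) for a batch of inputs after each epoch; a unit whose pre-activation is never positive on the batch is a candidate dead ReLU. The function name and the toy data are only illustrative.

      import numpy as np

      def dead_relu_fraction(z):
          """z: array of shape (num_samples, num_units) holding one layer's pre-activations."""
          never_active = np.all(z <= 0, axis=0)   # True where a unit never fires on this batch
          return never_active.mean()

      # toy batch: the third unit never has a positive pre-activation, so it looks dead
      z = np.array([[ 0.5, -1.0, -0.3],
                    [ 1.2,  0.1, -2.0],
                    [-0.4,  0.7, -0.1]])
      print(dead_relu_fraction(z))   # 0.333..., i.e. 1 of 3 units looks dead on this batch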

  • @robertodisco
    @robertodisco Před 3 lety

    Awesome video! Thanks!

  • @Omar-kw5ui
    @Omar-kw5ui Před 5 lety

    You covered in 8 minutes half of what my AI principles course covered about learning in three and a half hours. Nice.

  • @harveynorman8787
    @harveynorman8787 Před 5 lety

    This channel is gold! Thanks

  • @cenyingyang1611
    @cenyingyang1611 Před 2 lety

    Curious why ReLU avoids the vanishing gradient problem. When z is below 0, since y is always 0, the gradient seems to be 0, which means the gradient vanishes? Or do I misunderstand the vanishing gradient?
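
    A small numeric illustration of the usual answer (my own example, not from the video): during backprop the gradient is multiplied by one activation derivative per layer. The sigmoid derivative is at most 0.25, so that product shrinks geometrically with depth, while the ReLU derivative is exactly 1 for any positive input, so the product does not shrink along the active path. The zero gradient for z < 0 is real, but it is the separate "dead ReLU" issue rather than the saturation-driven vanishing gradient.

      depth = 20
      sigmoid_chain = 0.25 ** depth   # best case for sigmoid: derivative 0.25 at every layer
      relu_chain    = 1.0  ** depth   # ReLU on the active (z > 0) path
      print(sigmoid_chain)            # ~9.1e-13, the gradient has effectively vanished
      print(relu_chain)               # 1.0, the gradient passes through unchanged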

  • @Jotto999
    @Jotto999 Před 7 lety +38

    Still can't decide if I like the number of memes in these videos. It's humorous of course and I did grow up on the internet, but I'm trying to learn a viciously hard subject and they are somewhat distracting. I suppose it helps the less-intrinsically-motivated keep watching, and I can always read more about it elsewhere, as these videos are more like cursory summaries. Great channel.

    • @SirajRaval
      @SirajRaval  Před 7 lety +5

      this is a well thought out comment. so is the reply to it i see. making them more relevant and spare should help. ill do that

    • @SirajRaval
      @SirajRaval  Před 7 lety +2

      spare = sparse*

    • @austinmoran456
      @austinmoran456 Před 7 lety +5

      Siraj I agree with Jotto. I enjoy them, but at some critical points in the video I found myself replaying several times as the first time through I was a little distracted.

    • @Kaiz0kuSama
      @Kaiz0kuSama Před 7 lety +2

      I read papers and articles... but a 10 min video helped me more than all of that :D

    • @TuyoIsaza
      @TuyoIsaza Před 5 lety

      @@SirajRaval It keeps it fresh and helps me remember. I find I remember things you say by remembering the joke! Relu, relu, relu....

  • @toadfrommariokart64
    @toadfrommariokart64 Před 3 lety

    despised the stale memes. loved the explanation

  • @antonylawler3423
    @antonylawler3423 Před 7 lety

    Excellent, as usual.
    I think that the reason ReLU hasn't been popular before now is that it is mathematically inelegant, in that it can't be used in commutable functions, while a sigmoid function can.
    It does beg the question though - if ReLU is being used, do we need to use the backpropagation algorithm at all? Perhaps some simpler recursive algorithm could be used.

  • @paulbloemen7256
    @paulbloemen7256 Před 5 lety +1

    1. The (activation) value of a neuron should be between 0 and 1, right? ReLU has a leaky minimum around 0; shouldn't ReLU also have a (leaky) maximum around 1?
    2. Is there one best activation function, delivering the best neural network with the least amount of effort, such as the number of tests needed and the computing power?
    3. Should weights and biases be between 0 and 1, or between -1 and 1? Or some different values?
    4. Against vanishing and exploding gradients: could this be prevented with a (leaky) correction minimum and maximum for the weights and biases? There would be some symmetry then with the activation function suggested in the first paragraph.

  • @NilavraPathak
    @NilavraPathak Před 7 lety

    Is there any article I can refer to ... for citation purposes ... I found out this is the best combo for my LSTM from training ... but it would be good if I could get a paper that says to use ReLU ...

  • @midhunrajr372
    @midhunrajr372 Před 5 lety

    Awesome explanation. +1 for creating such a big shadow over the Earth.

  • @Zerksis79
    @Zerksis79 Před 7 lety

    Thanks for another great video!

  • @valentinduranddegevigney332

    Hey Siraj, thanks for the vid' :)
    I'm currently reading DeepMind's DNC paper and they use sigmoid (for the LSTMs and the interface), tanh (for the LSTMs) and oneplus (for the read/write strengths), but I have found no trace of ReLU.
    Since the paper is quite recent, I was wondering: do they have a good reason?

    • @user-xy1zi7hk4o
      @user-xy1zi7hk4o Před 7 lety +2

      LSTM already addresses the gradient vanishing/explosion problem by its inner design (using gates), so it's an exception not covered in that video. From my experience, ReLU won't do any good here. I'd be grateful if somebody can prove me wrong.

    • @valentinduranddegevigney332
      @valentinduranddegevigney332 Před 7 lety

      Yes, I didn't think about that. I guess this is the same for the interface since there are gates (alloc gate, write gate, erase gate, ...). Thank you :)

  • @nikksengaming933
    @nikksengaming933 Před 6 lety

    I love watching these videos, even if I don't understand 90% of what he is saying.

  • @harshmankodiya9397
    @harshmankodiya9397 Před 3 lety

    Thanks for simplifying

  • @mohamednoordeen6331
    @mohamednoordeen6331 Před 7 lety +2

    Very helpful video, thanks a lot. We introduce activation functions in order to introduce non-linearities. But how is ReLU, which looks linear, justified over other non-linear functions? Can you please give the correct intuition behind this? Thanks in advance :)

  • @akshaysreekumar1997
    @akshaysreekumar1997 Před 6 lety

    Awesome video. Can you explain a bit more about why we aren't using an activation function in the output layer?

  • @chaidaro
    @chaidaro Před 7 lety +1

    What is your thought on softplus?

  • @jflow5601
    @jflow5601 Před 4 lety

    Thanks, very nice explanation.

  • @abhayranade5815
    @abhayranade5815 Před 6 lety +1

    Excellent

  • @thedeliverguy879
    @thedeliverguy879 Před 7 lety +49

    I've been wondering what loss function to use D: Can you make a video for loss functions pls :)

    • @sinjaxsan
      @sinjaxsan Před 7 lety +27

      Crashed2DesktoP this is a little less generically answerable than which activation.
      For standard tasks there are a few loss functions available: binary cross-entropy and categorical cross-entropy for classification, mean squared error for regression. But more generally, the cost function encodes the nature of your problem. Once you go deeper and your problem fleshes out a bit, the exact loss you use might change to reflect your task. Custom losses might reflect auxiliary learning tasks, domain-specific weights, and many other things. Because of this, "which loss should I use" is quite close to asking "how should I encode my problem" and so can be a little trickier to answer beyond the well-studied settings.

    • @RoxanaNoe
      @RoxanaNoe Před 7 lety

      Sina Samangooei thanks for your answer. It is useful for me too. 😃

    • @SirajRaval
      @SirajRaval  Před 7 lety +8

      hmm. sina has a good answer but more vids similar to this coming

    • @rishabh4082
      @rishabh4082 Před 6 lety

      Log-likelihood cost function with a softmax output layer for classification.
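
    To make the standard choices in the reply above concrete, here is a minimal NumPy sketch of the three losses, averaged over a batch; the small epsilon clip is only there to avoid log(0) and is not part of the definitions.

      import numpy as np

      EPS = 1e-12

      def mse(y_true, y_pred):                     # regression
          return np.mean((y_true - y_pred) ** 2)

      def binary_cross_entropy(y_true, p):         # binary classification, p = P(class 1)
          p = np.clip(p, EPS, 1 - EPS)
          return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

      def categorical_cross_entropy(y_onehot, p):  # multi-class, each row of p sums to 1
          p = np.clip(p, EPS, 1.0)
          return -np.mean(np.sum(y_onehot * np.log(p), axis=1))

      print(mse(np.array([1.0, 2.0]), np.array([1.1, 1.8])))
      print(binary_cross_entropy(np.array([1, 0]), np.array([0.9, 0.2])))
      print(categorical_cross_entropy(np.eye(3)[[0, 2]],
                                      np.array([[0.7, 0.2, 0.1],
                                                [0.1, 0.1, 0.8]])))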

  • @anithapriya5601
    @anithapriya5601 Před 4 lety

    Superb video! !

  • @prateekgupta5945
    @prateekgupta5945 Před 6 lety

    Hey Siraj, if I want to visualize and understand a neural network using C/C++ data structures and syntaxes, how would I do it?

  • @drjoriv
    @drjoriv Před 7 lety

    Siraj Raval -> do you have any videos on continuous Hopfield networks, or at least an article to read? I had a hard time finding a good one.

  • @amitmauryathecoolcoder267

    Bro, I loved your content.

  • @gwadada6969
    @gwadada6969 Před 2 lety

    Awesome explanation.

  • @prasadphatak1503
    @prasadphatak1503 Před 3 lety

    Good Work!

  • @bazluhrman
    @bazluhrman Před 6 lety

    This is really good great teacher!

  • @killthesource4740
    @killthesource4740 Před 4 lety +1

    5:15 Let's say x = sum of input*weight. When using sigmoid you calculate sig(x) = 1/(1+e^-x), which is correct in your case. But the derivative in the code is shown as sig'(x) = x*(1-x), which isn't true. The derivative should be sig'(x) = sig(x)*(1-sig(x)). I think the way you meant for it to be used is by passing the output, sig(x), as the variable x to that function; in that case it would be correct, but it risks confusing others who don't know the real derivative of sigmoid. They are going to think that the derivative of sig is x*(1-x) (with x being the sum).
    I know it's a bit of a detail, but I'd recommend changing that in case someone gets their most confusing day from it lol.
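
    For clarity, here are both conventions side by side in a small sketch; which formula is right depends on whether the function receives the pre-activation x or the already-computed output sig(x), as the comment above points out.

      import numpy as np

      def sig(x):
          return 1.0 / (1.0 + np.exp(-x))

      def sig_deriv_from_input(x):      # true derivative as a function of the pre-activation x
          s = sig(x)
          return s * (1.0 - s)

      def sig_deriv_from_output(out):   # expects out = sig(x), as the video's code intends
          return out * (1.0 - out)

      x = 0.7
      print(sig_deriv_from_input(x))           # ~0.2217
      print(sig_deriv_from_output(sig(x)))     # same value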

  • @aiMonk
    @aiMonk Před 7 lety

    Which software do you use to create the neural network and activation function animations, like those from 1:15 to 2:03 and from 5:27 to 5:54?

  • @fadezzgameplay7077
    @fadezzgameplay7077 Před 4 lety

    Will two RTX 2080 OC cards be good with an i5 9400F for deep learning only?

  • @jchhjchh
    @jchhjchh Před 6 lety

    Hi, I am confused. Does ReLU kill the neuron only during the forward pass, or also during the backward pass?

  • @Djneckbeard
    @Djneckbeard Před 6 lety

    Great video!!

  • @mohamednoordeen6331
    @mohamednoordeen6331 Před 7 lety +1

    Initially we apply activation functions to squash the output of each neuron to the range (0,1) or (-1,1). But with ReLU, the output is unbounded above and can take arbitrarily large values. Can you please give the correct intuition behind this? Thanks in advance :)

  • @ilyassalhi
    @ilyassalhi Před 7 lety +3

    Siraj, ur videos inspired me to study machine learning. I've been learning python for the past month, and am looking to start playing around with more advanced stuff. Do you have any good book recommendations for machine or deep learning, or online resources that beginners should start with?

    • @SirajRaval
      @SirajRaval  Před 7 lety

      Awesome. Watch my playlist Learn Python for Data Science.

    • @lubnaaashaikh8901
      @lubnaaashaikh8901 Před 7 lety

      Siraj Raval Do you have videos on neural networks in MATLAB?

  • @UsmanAhmed-sq9bl
    @UsmanAhmed-sq9bl Před 7 lety +2

    Siraj, great video. What are your views on the Parametric Rectified Linear Unit (PReLU)?

  • @datasciencetutorials8537

    Thanks for this superb video

  • @samuelajayi3748
    @samuelajayi3748 Před 5 lety

    This was awesome

  • @user-xl9zr5is2b
    @user-xl9zr5is2b Před 7 lety

    this video is so helpful ! thx!

  • @evanchakrabarti3276
    @evanchakrabarti3276 Před 6 lety

    Thanks, super helpful video. I've been confused about softmax... I've been implementing a basic backprop network in Python and I've gotten stuck on it. I know it's a function that returns probabilities and makes the network's output vector sum to 1, but I don't know how to implement it or its derivative.
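
    A minimal, numerically stable softmax sketch for the question above; when softmax is paired with a cross-entropy loss, the gradient with respect to the pre-activations reduces to (probabilities minus the one-hot target), so the full Jacobian is rarely needed in a basic backprop implementation.

      import numpy as np

      def softmax(z):
          z = z - np.max(z)              # subtracting the max avoids overflow in exp
          e = np.exp(z)
          return e / np.sum(e)

      z = np.array([2.0, 1.0, 0.1])      # pre-activations of the output layer
      p = softmax(z)
      print(p, p.sum())                  # probabilities that sum to 1

      target = np.array([1.0, 0.0, 0.0]) # one-hot label
      grad_z = p - target                # gradient of cross-entropy loss w.r.t. z
      print(grad_z)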

  • @clark87
    @clark87 Před 5 lety

    siraj you are a good ai teacher

  • @firespark804
    @firespark804 Před 7 lety +17

    Another question is what the difference is if I use more hidden layers or more hidden neurons.

    • @davidfortini3205
      @davidfortini3205 Před 7 lety +1

      I think that at this moment there's no clear-cut approach to choosing the NN architecture.

    • @trainraider8
      @trainraider8 Před 6 lety

      More layers make learning very slow compared to more neurons. Before training, all the biases will overwhelm the inputs and make the output side of the network static. It takes a long time to get past that.

    • @gorkemvids4839
      @gorkemvids4839 Před 6 lety

      Maybe you should limit the starting biases so you can get past that phase quicker. I always initialize biases between 0 and 0.5.

    • @paras8361
      @paras8361 Před 6 lety

      2 Hidden layers are enough.

    • @kayrunjaavice1421
      @kayrunjaavice1421 Před 5 lety

      It depends on the situation: a simple text recognition kind of thing is fine with 2 layers, but something like a convolutional neural network may have to have 10. But for the majority of things in this day and age, 2 is plenty.

  • @timkellermann7669
    @timkellermann7669 Před 6 lety

    But if you use a ReLU, couldn't the values get too big to compute as they pass from layer to layer?