Which Activation Function Should I Use?
- Added 2 Aug 2024
- All neural networks use activation functions, but the reasons for using them are rarely made clear! Let's discuss what activation functions are, when they should be used, and how they differ.
Sample code from this video:
github.com/llSourcell/Which-A...
Please subscribe! And like. And comment. That's what keeps me going.
More Learning resources:
www.kdnuggets.com/2016/08/role...
cs231n.github.io/neural-networ...
www.quora.com/What-is-the-rol...
stats.stackexchange.com/quest...
en.wikibooks.org/wiki/Artific...
stackoverflow.com/questions/9...
papers.nips.cc/paper/874-how-...
neuralnetworksanddeeplearning....
/ activation-functions-i...
/ mathematical-foundatio...
Join us in the Wizards Slack channel:
wizards.herokuapp.com/
And please support me on Patreon:
www.patreon.com/user?u=3191693
Follow me:
Twitter: / sirajraval
Facebook: / sirajology Instagram: / sirajraval
Signup for my newsletter for exciting updates in the field of AI:
goo.gl/FZzJ5w
Hit the Join button above to sign up to become a member of my channel for access to exclusive content!
Join my AI community: chatgptschool.io/
Sign up for my AI sports betting bot, WagerGPT (500 spots available):
www.wagergpt.co
Thanks, my biological neural network now has learned how to choose activation functions!
awesome
Hahahah
Remember: the whole is not just the sum of its parts. The behaviour of the whole differs from that of its elements.
Great video, super helpful!
thx Dan love u
You are both awesome
You are both awesome
I absolutely love the energy you both have in your videos :)
Be soo cool if both did a collab video!
From experience I'd recommend in order, ELU (exponential linear units) >> leaky ReLU > ReLU > tanh, sigmoid. I agree that you basically never have an excuse to use tanh or sigmoid.
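The ordering recommended above is easy to try for yourself. A minimal NumPy sketch of the three rectifier variants (the α defaults of 0.01 and 1.0 are common conventions, not values from the video):

```python
import numpy as np

def relu(x):
    # Zero for negative inputs, identity for positive ones
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # A small negative slope keeps the gradient alive for x < 0
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    # Smooth exponential curve below zero, identity above
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x))        # negatives squashed to exactly 0
print(leaky_relu(x))  # negatives scaled by alpha
print(elu(x))         # negatives saturate smoothly toward -alpha
```

ELU and leaky ReLU both avoid the exact-zero region that can "kill" a plain ReLU unit, which is one reason for the ordering given above.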
I'm using tanh, but I always read saturated neurons as 0.95 or -0.95 while backpropagating so the gradient doesn't disappear.
I love you man, 4 f***** months passed and my stupid prof. could not explain it as you did, not even partially. keep up the good work.
Thanks a lot
Amazing video! Thank you! I'd never heard of neural networks until I started my internship. This is really fascinating.
Wow, man, this is a seriously amazing video. Very entertaining and informative at the same time. Keep up great work! I'm now watching all your other videos :)
I really like your videos as they strike the very sweet spot between being concise and precise!
Dude! DUUUDE! You are AMAZING! I've read multiple papers already, but now the stuff is really making sense to me!
Valuable introduction to generative methods for establishing sense in artificial intelligence. A great way of bringing things together and expressing them in one single, discrete language.
Thanks Siraj Raval, great!
this guy needs more subs. Finally a good explanation. Thanks man!
just watched your speech @TNW Conference 2017, I am really happy that you are growing every day, You are my motivation and my idol. proud of you love you
thx stevey love u
By far the best videos of Machine Learning Ive watched. Amazing work! Love the energy and Vibe!
Learning more from your videos than all my college classes together!
Really enjoyed the video as you add subtle humor in between.
I gained a lot of understanding and got that "click" moment after you explained linear vs non linearity. Thanks man. Keep up w/ the dank memes. My dream is that some day, I'd see a collab video between you, Dan Shiffman, and 3Blue1Brown. Love lots from Philippines!
Cool. Your lecture cleared the cloud in my brain. I now have better understanding about the whole picture of the activation function.
Excellent and entertaining at a high level of entropy reduction. A fan.
@Siraj
NN can potentially grow in so many directions, you will always have something to explain to us.
As you used to say 'this is only the beginning'.
And ohh maaan ! you're so clear when you explain NN ;)
Please keep doing what you're doing again and again and again...and again !
You are for NN what Neil deGrasse Tyson is for astrophysics.
thx for sharing the github source that detail each activation source
Hard stuff made easy. Congrats to a great video! Keep it up, mate!
Great explanation of activation functions. Now I need to tweak my model.
Dank memes and dank learning, both in the same video. Who would have thought. Thanks Raj!
Super clear & concise. Amazing simplicity. You Rock !!!
But isn't ReLU a linear function? You mentioned at the beginning that linear functions should be avoided, since with non-linear functions both calculating backpropagation and classifying data points that don't fit a single hyperplane are easier.
Or did I get the whole thing wrong?
It's not linear because any -X sits at zero on the Y axis. "Linear" basically means "straight line". The ReLU line is bent, hard, at 0. So it's linear if you're only looking at > 0 or < 0, but if you look at the whole line it's kinked in the middle, which makes it non-linear.
it is a piece-wise linear function which is essentially a nonlinear function. For more info, google "piece-wise linear functions".
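The point made in the two replies above can be checked in a couple of lines: a linear function f must satisfy f(a + b) = f(a) + f(b) for all inputs, and the kink at 0 means ReLU does not.

```python
def relu(x):
    # ReLU: identity for x > 0, zero otherwise
    return max(0.0, x)

# Linearity would require relu(a + b) == relu(a) + relu(b) for all a, b.
a, b = 3.0, -2.0
print(relu(a + b))        # relu(1.0) = 1.0
print(relu(a) + relu(b))  # 3.0 + 0.0 = 3.0  -> not equal, so not linear
```

Because the two sides disagree whenever a and b straddle zero, a network of ReLU units can represent functions that no purely linear network can.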
The sparsity of the activations adds to the non-linearity of the neural net.
@@10parth10 that explanation helped. Thanks
hey Siraj- just wanted to say thanks again. Apparently you got carried away and got busted being sneaky w crediting. I still respect your hustle and hunger. I think your means justify your ends- if you didn't make the moves that you did to prop up the image etc, I probably wouldn't have found you and your resources. At the end of the day, you are in fact legit bc you really bridge the gap of 1) knowing what ur talking about (i hope) 2) empathizing w someone learning this stuff (needed to break it down) 3) raising awareness about low hanging fruit that ppl outside the realm might not be aware of. Thank you again!!!!
Super Siraj Raval!!!!! Great compilation Bro.
Dude.... exactly what i needed.. Thanks again!
Now I understand why we are using this activation function. Until now I was just using them; now I know why. Thanks Siraj!
Love this video so much. Helped me so much with my LSTM RNN network
Thank you so much for such informative content explained with such clarity after taking so much efforts. Appreciate it! :) :D
Update: There is another activation function called "elu" which is faster than "relu" when it comes to speed of training. Try it out guys! :D
this guys makes learning so much fun!
presentation is good,learned how to choose the activation function and thanks for the video,it helped a lot
Entire video is a GEM 💎
Totally makes sense to use ML
Crystal clear explanation, just loved it
I have a question to the vanishing gradient problem when using sigmoid. Could sigmoid be a more useful activation function when using shortcut connections in the NN?
For those who don't know what these are: In a normal neural network each layer is connected with the other but it isn't directly connected to further layers. (like a Neuron n11 from layer1 is connected to a Neuron n21 in layer2. But n11 isn't connected to the Neuron n31 in the third layer). A shortcut connection would be when a connection exist between normally not connected Layers (like a connection between Neuron n11 and n31) and thus bypassing the layer in between (n21). It is still also connected to n21.
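The shortcut connection described above can be sketched in a few lines of NumPy (a sketch only; the layer sizes, sigmoid choice, and random weights are illustrative, not from the video):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=4)        # activations of layer 1 (n11..n14)
W1 = rng.normal(size=(4, 4))  # weights layer 1 -> layer 2
W2 = rng.normal(size=(4, 4))  # weights layer 2 -> layer 3

h2 = sigmoid(W1 @ x)             # normal path through layer 2
h3_plain = sigmoid(W2 @ h2)      # layer 3 without a shortcut
h3_skip = sigmoid(W2 @ h2 + x)   # shortcut: x bypasses layer 2 and is added in

print(h3_plain)
print(h3_skip)
```

Because the shortcut adds the earlier activations directly into the pre-activation of layer 3, the gradient has a path back to layer 1 that skips the in-between sigmoid, which is exactly why such connections help with vanishing gradients.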
Sir, likes for your memetics and fun explanation! All the spice you add to this video might bring some tech kids like me to the realm of Machine Learning!
(And today, a mysterious graph sheet with the plot of max(0,x), a.k.a. the ReLU function, appeared in my high school maths notebook, between the pages about piecewise functions, after I got up and arrived at school.)
Thanks Siraj. Awesome explanation.
I am new to deep learning. It would be great if you could make videos about regularization and cost functions.
digging your vids and enthusiasm from Portland Oregon!
If we use genetic algorithms (GA) we don't need differentiable activation functions; we can even build our own. The issue is the backpropagation method, which limits the choice of activation functions.
Thanks @Siraj. What amazing and easy to digest explanation.
Hey Siraj, here is a great trick: show us a neural net that can perform inductive reasoning! Great videos as always, keep them coming! Learning so much!
thx will do
Woah ! thanks man, you made things so clear !!!
I can't control the gradient, the Best part of the video.
Your channel is GOLD!
just a great training I LOVE how you did it :-)
8:44 I liked this motto on the wall.
Hi Siraj:
Your videos are great!
CONGRATULATIONS!
Great video! Also make a video on How to choose the number of hidden layers and number of nodes in each layer?
will do thx
If I understand the subject right, you'll always only need one hidden layer, because of Cover's Theorem
Thanks, that was super helpful!
This video is very easy to understand!
Thank you for posting this.
I have a question... For Sigmoid activation functions with an output close to 1, would the vanishing gradient problem still cause no signal to flow through it? Or instead would it cause the output to be fully saturated permanently? Either way it would be an issue but i'm just trying to wrap my head around this.
your teaching way is so cool and crazy :)
Great insight on Activation Functions , thanks
Excellent explanation!!! You're really funny and I loved the way you explain things. Thank you!!!
omg, this is the first time I am seeing his video and it's quite entertaining
Loving the KEK :) Awesome Siraj :) can you do a piece on CFR+ and it's geopolitical implications?
How do you detect dead ReLUs in your model though?
By viewing the activation function values in each layer?
After each epoch, check whether any neurons have activations that are converging toward zero. The best way to do this would be to monitor the neurons over a series of epochs and calculate a delta or differential between training epochs.
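That monitoring idea could look roughly like this for one layer and one batch (a sketch with random data; the layer shape, the deliberately large negative bias, and the "never fires on this batch" criterion are all assumptions for illustration):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

rng = np.random.default_rng(1)
X = rng.normal(size=(256, 10))   # a batch of 256 inputs, 10 features
W = rng.normal(size=(10, 32))    # one hidden layer: 32 units
b = np.full(32, -10.0)           # large negative bias -> many units go dead

acts = relu(X @ W + b)                # (batch, units) activation matrix
dead = np.all(acts == 0.0, axis=0)    # "dead" = unit never fires on the batch
print(f"{dead.sum()} of {dead.size} units are dead on this batch")
```

In practice you would track this mask across epochs, as the comment above suggests, since a unit that is zero on one batch may still fire on another.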
Awesome video! Thanks!
u covered half of what my AI principles course covered on learning in 3 and a half hrs in 8 mins. nice
This channel is gold! Thanks
Curious why does ReLU avoid vanishing gradient problem? When z is below 0, since y is always 0, the gradient seems to be 0, which means the gradient vanishes? Or do I misunderstand about the vanishing gradient?
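On the question above: ReLU's gradient is exactly 1 for positive inputs, so repeated multiplication across layers doesn't shrink it, whereas sigmoid's gradient is at most 0.25 and decays fast under the chain rule. A unit whose input stays below 0 really does get zero gradient; that is the separate "dying ReLU" problem the video mentions. A small sketch of the difference:

```python
import numpy as np

def sigmoid_grad(z):
    # Derivative of sigmoid: s * (1 - s), at most 0.25 (at z = 0)
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def relu_grad(z):
    # Derivative of ReLU: exactly 1 for z > 0, 0 for z < 0
    return np.where(z > 0, 1.0, 0.0)

# Chain rule across 10 layers: the per-layer gradients multiply.
z = 1.0
print(np.prod([sigmoid_grad(z)] * 10))  # shrinks toward 0 (vanishes)
print(np.prod([relu_grad(z)] * 10))     # stays 1.0 for positive inputs
```

So ReLU avoids the vanishing gradient for active units, at the cost that permanently inactive units stop learning entirely.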
Still can't decide if I like the number of memes in these videos. It's humorous of course and I did grow up on the internet, but I'm trying to learn a viciously hard subject and they are somewhat distracting. I suppose it helps the less-intrinsically-motivated keep watching, and I can always read more about it elsewhere, as these videos are more like cursory summaries. Great channel.
this is a well thought out comment. so is the reply to it i see. making them more relevant and sparse should help. ill do that
Siraj I agree with Jotto. I enjoy them, but at some critical points in the video I found myself replaying several times as the first time through I was a little distracted.
i read papers and articles... but a 10 min video helped me more than all of that :D
@@SirajRaval It keeps it fresh and help me remember. I find I remember things you say by remembering the joke! Relu, relu, relu....
despised the stale memes. loved the explanation
Excellent, as usual.
I think that the reason ReLU hasn't been popular prior to now is that it is mathematically inelegant, in that it can't be used in commutable functions, and a sigmoid function can.
It does beg the question, though: if ReLU is being used, do we need the backpropagation algorithm at all? Perhaps some simpler recursive algorithm could be used.
1. The (activation) value of a neuron should be between 0 and 1, right? ReLU has a leaky minimum around 0; shouldn't ReLU also have a (leaky) maximum around 1?
2. Is there one best activation function, delivering the best neural network with the least effort, in terms of the number of tests needed and compute power?
3. Should weights and biases be between 0 and 1, or between -1 and 1? Or some other range?
4. Against vanishing and exploding gradients: could this be prevented with a (leaky) correction minimum and maximum for the weights and biases? There would then be some symmetry with the activation function suggested in the first question.
Is there any article I can refer to ... for citation purposes ... I found that this is the best combo for my LSTM from training ... but it would be good if I could get a paper that says to use ReLU ...
Awesome explanation. +1 for creating such a big shadow over the Earth.
Thanks for another great video!
np
Hey Siraj, thanks for the vid' :)
I'm currently reading Deepmind's DNC's paper and they use sigmoid (for the LSTMs and the interface), tanh (for the LSTMs) and oneplus (for the read/write strengths) but I have found no trace of ReLu.
Since the paper is quite recent, I was wondering : do they have a good reason ?
LSTM already addresses the vanishing/exploding gradient problem by its inner design (using gates), so it's an exception not covered in this video. From my experience, ReLU won't do any good here. If somebody can prove me wrong, I'd be grateful.
Yes, I didn't think about that. I guess this is the same for the interface since there are gates (alloc gate, write gate, erase gate, ...). Thank you :)
I love watching these videos, even if I don't understand 90% of what he is saying.
Thanks for simplifying
Very helpful video, thanks a lot. We introduce activation functions to introduce non-linearities, but how does ReLU, which looks linear, do better than other non-linear functions? Can you please give the correct intuition behind this? Thanks in advance :)
Awesome video. Can you explain a bit more why we aren't using an activation function in the output layer?
What is your thought on softplus?
Thanks, very nice explanation.
Excellent
I've been wondering what loss function to use D: Can you make a video for loss functions pls :)
Crashed2DesktoP this is a little less generically answerable than which activation.
For standard tasks there are a few loss functions available: binary cross-entropy and categorical cross-entropy for classification, mean squared error for regression. But more generally, the cost function encodes the nature of your problem. Once you go deeper and your problem fleshes out a bit, the exact loss you use might change to reflect your task. Custom losses might reflect auxiliary learning tasks, domain-specific weights, and many other things. Because of this, "which loss should I use" is quite close to asking "how should I encode my problem", and so can be a little trickier to answer beyond the well-studied settings.
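The standard pairings named in the reply above can be written down directly (a sketch using the textbook formulas; the `eps` clipping constant is a common numerical safeguard, not from the video):

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean squared error: the usual choice for regression
    return np.mean((y_true - y_pred) ** 2)

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    # Binary classification, paired with a sigmoid output
    y_pred = np.clip(y_pred, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    # Multi-class classification, paired with a softmax output; y_true is one-hot
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

print(mse(np.array([1.0, 2.0]), np.array([1.5, 2.0])))                   # 0.125
print(binary_cross_entropy(np.array([1.0, 0.0]), np.array([0.9, 0.1])))  # -log(0.9)
```

Beyond these well-studied settings, as the reply says, the loss is really an encoding of the problem, and custom terms get added on top of these basic forms.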
Sina Samangooei thanks for your answer. It is useful for me too. 😃
hmm. sina has a good answer but more vids similar to this coming
Log-likelihood cost function with a softmax output layer for classification.
Superb video! !
Hey Siraj, if I want to visualize and understand a neural network using C/C++ data structures and syntaxes, how would I do it?
Siraj Raval -> do you have any videos on continuous Hopfield networks, or just an article for me to read? I had a hard time finding a good one.
Bro, I loved your content.
Awesome explanation.
Good Work!
This is really good great teacher!
5:15 Let's say x = sum of input*weight. When using sigmoid you calculate sig(x) = 1/(1+e^-x), which is correct in your case. But the derivative in the code is shown as sig'(x) = x*(1-x), which isn't true. The derivative should be sig'(x) = sig(x)*(1-sig(x)). I think the way you meant it to be used is by passing the output, sig(x), as the variable x to this function. In that case it would be correct, but I find it highly likely to confuse others who don't know the real derivative of sigmoid. They are going to think that the derivative of sig is x*(1-x) (with x being the sum).
I know it's a bit of a detail, but I'd recommend changing that in case it gives someone their most confusing day, lol.
Which software do you use to create neural network and activation function animation like @1:15 to @2:03 and @5:27 to @5:54
Are two RTX 2080 OC cards with an i5 9400F good for deep learning only?
Hi, I am confused. ReLU will kill the neuron only during the forward pass? Or also during the backward pass?
Great video!!
Initially we apply activation functions to squash the output of each neuron to the range (0,1) or (-1,1). But with ReLU the range is [0,x), and x can be arbitrarily large. Can you please give the correct intuition behind this? Thanks in advance :)
Siraj, your videos inspired me to study machine learning. I've been learning Python for the past month and am looking to start playing around with more advanced stuff. Do you have any good book recommendations for machine or deep learning, or online resources that beginners should start with?
awesome. watch my playlist Learn Python for Data Science
Siraj Raval Do you have videos on matlab using nn?
Siraj great video. Your views about Parametric Rectified Linear Unit (PReLU)?
Thanks for this superb video
This was awesome
this video is so helpful ! thx!
Thanks, super helpful video. I've been confused about softmax... I've been implementing a basic backprop network in Python and I've gotten stuck on it. I know it's a function that returns probabilities and makes the network's output vector sum to 1, but I don't know how to implement it or its derivative.
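For the question above, a numerically stable softmax is only a few lines, and when softmax is paired with cross-entropy loss the gradient with respect to the logits collapses to the simple form (p - y), so the full Jacobian never has to be materialised (a sketch, not the video's code):

```python
import numpy as np

def softmax(z):
    # Subtracting the max changes nothing mathematically but keeps
    # exp() from overflowing on large logits
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

z = np.array([2.0, 1.0, 0.1])
p = softmax(z)
print(p)        # probabilities, largest for the largest logit
print(p.sum())  # sums to 1.0

# With cross-entropy loss and a one-hot target y, the gradient of the
# loss with respect to the logits is simply (p - y).
y = np.array([1.0, 0.0, 0.0])
grad = p - y
print(grad)
```

That (p - y) shortcut is why most backprop implementations treat "softmax + cross-entropy" as a single output layer rather than differentiating softmax on its own.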
siraj you are a good ai teacher
Another question is what the difference is if I use more hidden layers or more hidden neurons.
I think that at this moment there's no clear-cut approach to choosing the NN architecture.
More layers makes learning very slow compared to more neurons. Before training, all the biases will overcome the inputs and make the output side of the network static. It takes a long time to get past that.
Maybe you should limit the starting biases so you can get past that phase quicker. I always apply biases between 0 and 0.5.
2 Hidden layers are enough.
Depends on the situation. A simple text recognition kind of thing is fine with 2 layers, but something like a convolutional neural network may need 10. For the majority of things in this day and age, though, 2 is plenty.
But if you use a ReLU, couldn't the value passed from layer to layer get too big to compute?