Live - Transformers In-depth Architecture Understanding - Attention Is All You Need
- Uploaded: 2 Sep 2020
- All Credits To Jay Alammar
Reference Link: jalammar.github.io/illustrated...
Research Paper: papers.nips.cc/paper/7181-att...
youtube channel : • Jay's Visual Intro to AI
Please donate if you want to support the channel through GPay UPID,
Gpay: krishnaik06@okicici
Discord Server Link: / discord
Telegram link: t.me/joinchat/N77M7xRvYUd403D...
Please join my channel as a member to get additional benefits like Data Science materials, members-only live streams, and more
/ @krishnaik06
Please also subscribe to my other channel
/ @krishnaikhindi
Connect with me here:
Twitter: / krishnaik06
Facebook: / krishnaik06
instagram: / krishnaik06
@ 40:00 Why do we consider 64? It is based on how many attention heads you want to apply. We used an embedding size of 512 for each word and want to apply 8 self-attention heads; therefore each head uses a (512/8 =) 64-dimensional Q, K, V vector. That way, when we concatenate all the attention heads afterward, we get back the same 512-dimensional word embedding, which is the input to the feed-forward layer.
Now, for instance, if you want 16 attention heads, you can use 32-dimensional Q, K, and V vectors. In my opinion, the initial word embedding size and the number of attention heads are the hyperparameters.
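The arithmetic in this comment can be sketched directly (a minimal NumPy sketch; the 10-token sentence and random values are just placeholders):

```python
import numpy as np

d_model, n_heads = 512, 8          # embedding size and head count from the paper
d_k = d_model // n_heads           # 64-dimensional Q, K, V per head
assert d_k == 64

# pretend per-head attention outputs for a 10-token sentence
heads = [np.random.randn(10, d_k) for _ in range(n_heads)]

# concatenating the heads recovers the original 512-dim representation,
# which is what feeds the position-wise feed-forward layer
z = np.concatenate(heads, axis=-1)
assert z.shape == (10, d_model)

# with 16 heads instead, each head would use 512 // 16 = 32 dimensions
assert 512 // 16 == 32
```

This is why the head dimension is derived from the embedding size and the head count rather than chosen independently.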
I cannot express enough appreciation for your videos, especially the NLP deep learning topics! They are extremely helpful and so easy to understand from scratch! Thank you very much!
You are a really good teacher who always checks whether the audience gets the concept or not. I also appreciate your patience and the way you rephrase things to give a better explanation.
I am very new to the world of AI. I was looking for easy videos to teach me about the different models. I cannot believe I was totally enthralled by this video for as long as you taught. You are a very good teacher. Thank you for publishing this video for free. Thanks to Jay as well for simplifying such a complex topic.
Krish, I really see the honesty in you, man; a lot of humility, a very humble person. At the beginning of this video, you gave credit several times to Jay, who created the amazing blog on Transformers. I really liked that. Stay like that.
For anyone having a doubt at 40:00 as to why we take the square root of 64: as per the research, it was shown mathematically to be the best way to keep the gradients stable! Also, note that the value 64, the size of the Query, Key, and Value vectors, is itself a hyperparameter that was found to work best. Hope this helps.
The embedding vector dimension is 512. We divide this across 8 heads: 512/8 = 64. Therefore the size of the query, key, and value vectors is 64; it is derived from the other two, not an independent hyperparameter.
Normalizing the data.
Another reason is that we generally want weights, inputs, etc. to follow the normal distribution N(0, 1). When we compute the dot product, it is a summation of 64 values, which mathematically increases the variance to 64 (i.e. standard deviation sqrt(64)); dividing by sqrt(64) normalizes it back to unit variance.
The paper states that:
"While for small values of dk the two mechanisms (attention functions: additive attention and dot-product attention; note: the paper uses dot-product attention (q*k))
perform similarly, additive attention outperforms dot product attention without scaling for larger values of dk. We suspect that for large values of dk, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients. To counteract this effect, we scale the dot products by 1/sqrt(dk)."
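The variance argument behind this quote can be checked numerically (a quick NumPy sketch, not from the video; the sample size is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64
n = 200_000

# components of q and k drawn from N(0, 1), as the paper's argument assumes
q = rng.standard_normal((n, d_k))
k = rng.standard_normal((n, d_k))

scores = (q * k).sum(axis=1)       # raw dot products: variance grows to d_k
scaled = scores / np.sqrt(d_k)     # dividing by sqrt(d_k) restores unit variance

print(scores.var())                # close to 64
print(scaled.var())                # close to 1.0
```

Without the scaling, those variance-64 scores routinely land softmax deep in its saturated region.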
This might help the guy who asked why we take the square root and also for other aspirants :
The scores get scaled down by getting divided by the square root of the dimension of query and key. This is to allow for more stable gradients, as multiplying values can have exploding effects.
Nice, I was wondering about the same thing. It all comes back to exploding or vanishing gradients; how could I forget that :D
Can this attention encoder-decoder be used for financial time series as well, i.e. multivariate time series?
Hello, I think the square root of the dimension is not chosen just empirically; it actually normalizes the vector length (or something similar). Under certain conditions (which I forget), the expected vector length scales with the square root of the dimension as the dimension grows, so dividing by sqrt(dk) scales it back down and prevents exploding dot-product scores.
@@apicasharma2499 Yes, although I have not used it myself, it can be used.
The normalizing should come from the softmax, or from using the triangular mask function to zero out part of the concatenated Q, K, and V matrix, to have good initialization weights, I think.
Very well covered GPT-3 topic. Very important from NLP point of view. Thank you for your efforts.
Krish is a hard working person, not for himself but for our country in the best way he could...We need more persons like him in our country
thank you, appreciate your time going through this material
Sir, Please release the video of Bert. Eagerly waiting for it.
I really admire you now. Just because you give the credit to the deserving at the beginning of the video.
That attitude will make you a great leader. All the best!!
You can skim through all the youtube videos explaining transformers, but nobody comes close to this video.
Thank you Sir🙏🙏🙏
Difficult to understand foreign accents. Desi away zindabad
A million tons of appreciation for making this video. Thank you so much for your amazing work.
Yes, this is the best video explaining these models so far; even non-computer-science people can understand what is happening. Great work!
Thank you. The combination of your teaching and Jay's blog pulls this topic together. I like the way you teach. Keep going.
Great session, Krish. Because of the research paper, I understood things very easily and clearly.
How did I miss the subscription to your channel? Thank you so much for this thorough explanation, and hats off to Jay Alammar.
Excellent blog from Jay, Thanks Krish for introducing this blog on ur channel !!
Very helpful! Thank you all contributors!
Thanks for explaining Jay's blog. To add to the explanation at 39:30, the reason for using sqrt(dk) is to prevent the problem of vanishing gradient as mentioned in the paper. Since we are applying softmax on Q*K and if we consider a high dimension of these matrices, it will produce a high value which will get transformed close to 1 after softmax and hence leads to a small update in gradient.
Thanks for this Harshit
Thanks for explanation but I guess it will be called as exploding gradient not vanishing gradient. Hope I am not wrong.
For those getting confused with 8 heads, all the words would be going to all the heads. It's not one word per head. The X matrix remains the same only the W matrix would change in case of multi-head attention.
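To make this concrete, here is a hypothetical NumPy sketch: every head receives the same X matrix, and only the projection matrices differ per head:

```python
import numpy as np

d_model, n_heads = 512, 8
d_k = d_model // n_heads
X = np.random.randn(6, d_model)    # one X matrix holding all 6 words, shared by every head

# each head has its own W_Q (and likewise W_K, W_V), but projects the full X
per_head_Q = []
for _ in range(n_heads):
    W_Q = np.random.randn(d_model, d_k) * 0.01
    per_head_Q.append(X @ W_Q)     # all 6 words appear in every head

assert all(Q.shape == (6, d_k) for Q in per_head_Q)
```

So no word is routed to a single head; each head just looks at the sentence through a different learned projection.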
Wonderful explanation of blog, thanks for introducing with jay. Your teaching style is awesome.
Really nice sir, looking forward to Bert Implementation 😊
Jay alammar blog is of course awesome. But you made it even more simpler while explaining. Thanks a lot
Very huge and tremendous effort, million thanks for your dedication
@31:45 If my understanding is correct, the reason we have 64 is that we divide 512 into 8 equal heads. Since we compute dot products to get the attention values, doing the dot product over the full 512-dimensional embedding would not only be computationally expensive, but would also give us only one relation between the words. Taking advantage of parallel computation, we divide 512 into 8 equal parts; this is why we call it multi-head attention. This way it is computationally fast and we also get 8 different relations between the words. (FYI, attention is basically a relation between the words.) Anyway, good work explaining the architecture, Krish.
Thanks for your fantastic LLM/Transformer series content, and I admire your positive attitude and support for the authors of these wonderful articles! 👏
Could you please help me to get started on llm series, could you pls share the playlist link
Wow, what an explanation of Transformers. Perfect for us; it aligns with the way we are taught at school.
Every time I get confused or distracted while listening to the Transformers, I have to watch the video again; this is my third time watching it, and now I understand it better.
After watching your lecture it's more clear to me
Thanks Krish
Sir, thanks a lot, I really enjoyed it. Your way of teaching is so humble and honest, and most importantly patient. Awesome video, sir, too good.
Very clear explanation. And Jay's blog is also amazing!!
I love your patience; you go around explaining things as many times as it takes until they get clear even for slow learners like me. BTW, residual connections are not there because some layers are unimportant and have to be skipped; they are there to solve the vanishing gradient problem.
Thanks Krish, Awesome session, keep doing the great work!
Thank you Krish. I learned so many things from your video.
Great to overcome confusions. I hope next to get hands on Bert.
Superb. Well done and thank you for this.
Very underrated video... this is a super awesome explanation. I'm watching and commenting a second time, after a month.
Very well explained Sir! Thank you.
Great effort Krish, Thanks
THank you Krish and Jay for this work.
Took more than 5 hours to understand this. Thanks Krish, wonderful explanation.
Great Effort. Very well explained
Thanks to jay alamaar sir and you for the great explanation.
Great presentation! I understand it fully now I think.
The video was so good; I understood each and every thing except the decoder side.
Hey Krish, I had a quick question about the explanation at 1:01:07 on positional encodings. How exactly do we create those embeddings? In the paper the authors used sine and cosine waves to produce them, and I could not understand the intuition behind this. Could you please help me understand this part? Thanks in advance.
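For anyone with the same question, the paper's sinusoidal encodings can be generated like this (a NumPy sketch; the rough intuition is that each dimension is a sinusoid of a different wavelength, so every position gets a unique, smoothly varying pattern):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]      # even dimension indices
    angle = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)                # sines in even dimensions
    pe[:, 1::2] = np.cos(angle)                # cosines in odd dimensions
    return pe

pe = positional_encoding(50, 512)              # one 512-dim vector per position
```

These vectors are simply added to the word embeddings, giving the model a notion of token order without any learned position parameters.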
Great Session!....looking forward to Transformer Based recommender system
great explanation.. I understood Transformers now..
Very well explained. Thank you sir.
sir the way you explained the topics is ultimate sir
Thank you so much sir for this superb session.
Hi Krish, great session.
I have a question: the Z we get after the self-attention block of the encoder, is it interpretable? That is, could we figure out, just by looking at Z, what results the multi-head self-attention block gives?
Kindly help me out with this.
Thank you, sir,
that's a nice explanation.
also thanks to Jay Alammar sir.
More this kind of videos on Research paper explanations and advanced concepts of deep learning and reinforcement learning sir.
Superb, you made things look so easy.
I don't know how to thank you and jay enough!
Hey Krish, thanks for the session. Great explanation! Could you please suggest if you have already uploaded session on Bert? And if not do you have still on plans? Would be very interesting to deep dive into practical application of Transformers.
Thanks for such free content!! You are awesome, sir!
superbly explained
The reason to divide by sqrt(dk) is to keep the inputs in the region where the activation still has a usable slope: near x = 0 the function behaves almost linearly (approaching y = 1/2 from either side for a sigmoid), while for large |x| it saturates. Look at the shape of the sigmoid function.
Thank you sir. It was awesome.
Awesome explanation.. when will you post BERT video ? waiting for it and if possible please cover GPT-2 as well.. Thanks a lot for this amazing playlist.
krish sir, it's amazing!!!!
In my opinion, at 40:00 the square root is taken for scaling: it normalizes larger values down to smaller ones so that the softmax of these values can be computed easily. dk is the dimension whose square root is taken to scale the values.
Good job Krish.
Thanks krish don!!!
Thank you Krish for making such a great video; I really appreciate your hard work. One thing I have not understood is where the loss gets calculated. Does it happen on the multiple heads or at the encoder-decoder attention layer? I assume that while training the model, the translations will not be accurate, so we get some loss that we try to minimize, but I don't understand where that comparison happens.
This video describes the inference side of the Transformer. Can you do a video on the training architecture? I suppose we would need to supply datasets in both languages for training.
Always helpful Sir!
Thanks for the wonderful explanation. For the decoder: at the 2nd time step we passed the word/letter 'I'; at the 3rd time step do we pass both 'I' and 'am', or only 'am'? Similarly, at the next time step do we pass 'I', 'am', and 'a', or just 'a'?
Hi Krish,
When you gonna make a video on "Bert" with practical implementation ??
Superb explanation
Pretty good Explanation Mate
Watching through this video, I can only conclude that the whole process is more of an art than a science.
Definitely!
My MS SE thesis completion totally depends on your videos. Just AWESOME!!!
Bro are you pursuing your ms?
@@pratheeeeeesh4839 yes
@@digitalmbk where brother?
@@pratheeeeeesh4839 GCUF Pakistan
AFAIK ResNet is not like dropout; instead it carries information from a previous layer forward to the n-th layer, and by doing this, vanishing gradients are less likely to occur.
You are amazing as always !
Great Explanation
Answer to why we divide by the square root of the dimension: basically, we are finding the similarity between the query and each key. There are different ways to measure similarity, like the dot product or the scaled dot product; here we take the scaled dot product to keep the values in a fixed range.
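The scaled dot product described above, as a minimal NumPy sketch (the shapes are illustrative, not from the video):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # scaled similarity of each query with each key
    # row-wise softmax turns scores into attention weights that sum to 1
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V                      # weighted sum of the value vectors

# 4 tokens, 64-dimensional Q, K, V as in the paper's per-head setup
Q = np.random.randn(4, 64)
K = np.random.randn(4, 64)
V = np.random.randn(4, 64)
Z = scaled_dot_product_attention(Q, K, V)
assert Z.shape == (4, 64)
```

The division by sqrt(d_k) is the only difference from plain dot-product attention, and it is exactly what keeps the softmax inputs in a sensible range.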
Well Explained Sir
great!!!!!!! Krish
I really want to thank you for your nice explanation; honestly, I was not able to understand this before watching the video.
The layer normalization computes (X + Z), where X is the input and Z is the result of the self-attention calculation. You mentioned that when self-attention doesn't perform well, the calculation is skipped and we jump to layer normalization, hence Z will be 'empty' (please correct me here if I'm wrong). In that case, layer normalization happens only on X (the input). Am I correct?
Thanks. Question: in step 1 (30:52), what if the randomly initialized weights all have the same value at the start? Then all the resulting vectors will have the same values.
Thanks a lot for the detailed explanation. I really appreciate your effort in creating these videos.
thank you🙏
You are the best 😇
Thanks a lot!
Thank you.
Thank you so much ..
awesome bro
Clear explanation.
Thanks!
For the doubt at 40:00, the attention technique used in the paper is dot-product attention (refer page 2, section 3.2.1, para 2).
So for larger values of d_k (dimensions of query, key and value), the dot product might grow very high in magnitude. Also, keep in mind that the layer following the attention is a Softmax. So for higher values of x, the softmax output will tend towards 1; hence, the resulting gradients (during backpropagation) would be very close to 0. This would eventually mean the model doesn't learn as the weights don't get updated.
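This saturation effect is easy to demonstrate (a small NumPy sketch; the logit values are made up for illustration):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

moderate = softmax(np.array([1.0, 0.5, -0.5]))    # scaled scores
huge = softmax(np.array([64.0, 32.0, -32.0]))     # unscaled dot products

# the diagonal of the softmax Jacobian is p * (1 - p): as p -> 1 it collapses
# to 0, so almost no gradient flows back through the saturated softmax
print((moderate * (1 - moderate)).max())   # a healthy gradient signal
print((huge * (1 - huge)).max())           # vanishingly small
```

With the huge logits, the softmax output is essentially one-hot, and the gradient signal is many orders of magnitude smaller, which is exactly the effect the 1/sqrt(dk) scaling counteracts.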
ty
Why do they multiply each value vector by the softmax score? Because they want to keep intact the values of the word(s) they want to focus on, and drown out irrelevant words (by multiplying them by tiny numbers like 0.001, for example)... they wanted to submerge whatever irrelevant words the sentence has.
At 58:49 it is said that if we increase the number of heads, more importance can be given to different words, so 'it' can also give more importance to 'street'. So between 'The animal' and 'street', which word will be prioritized more?
Thankyou ❤️
After the encoder, is there any repository-like store that holds all the encoder outputs and then passes them one by one to the decoder to get the decoded output one at a time?