Live - Transformers In-Depth Architecture Understanding - Attention Is All You Need

  • Added 2 Sep 2020
  • All Credits To Jay Alammar
    Reference Link: jalammar.github.io/illustrated...
    Research Paper: papers.nips.cc/paper/7181-att...
    YouTube channel: • Jay's Visual Intro to AI
    Please donate if you want to support the channel through the GPay UPI ID below:
    GPay: krishnaik06@okicici
    Discord Server Link: / discord
    Telegram link: t.me/joinchat/N77M7xRvYUd403D...
    Please join as a member of my channel to get additional benefits like Data Science materials, live streaming for members and many more
    / @krishnaik06
    Please do subscribe to my other channel too
    / @krishnaikhindi
    Connect with me here:
    Twitter: / krishnaik06
    Facebook: / krishnaik06
    Instagram: / krishnaik06

Comments • 223

  • @mohammadmasum4483
    @mohammadmasum4483 1 year ago +10

    @ 40:00 Why do we consider 64? It is based on how many attention heads you want to apply. We used an embedding size of 512 for each word and want to apply 8 self-attention heads; therefore for each head we use (512/8 =) 64-dimensional Q, K, V vectors. That way, when we concatenate all the attention heads afterward, we get back the same 512-dimensional word embeddings, which are the input to the feed-forward layer.
    Now, for instance, if you want 16 attention heads, you can use 32-dimensional Q, K and V vectors. In my opinion, the initial word embedding size and the number of attention heads are the hyperparameters.
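
A minimal numeric sketch of the dimension bookkeeping described in the comment above (512 and 8 are the values used in the paper; the variable names are only illustrative):

```python
# Head-dimension bookkeeping for multi-head attention (paper values).
d_model = 512                  # embedding size per token
num_heads = 8                  # number of attention heads
d_k = d_model // num_heads     # per-head Q/K/V dimension -> 64

# Each head produces a 64-dimensional output; concatenating the 8 heads
# restores a 512-dimensional vector per token for the feed-forward layer.
assert num_heads * d_k == d_model
print(d_k)  # 64
```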

  • @dandyyu0220
    @dandyyu0220 2 years ago +5

    I cannot express enough appreciation for your videos, especially the NLP deep learning related topics! They are extremely helpful and so easy to understand from scratch! Thank you very much!

  • @apppurchaser2268
    @apppurchaser2268 1 year ago

    You are a really good teacher who always checks whether your audience gets the concept or not. I also appreciate your patience and the way you try to rephrase things to give a better explanation.

  • @shrikanyaghatak
    @shrikanyaghatak 11 months ago +1

    I am very new to the world of AI. I was looking for easy videos to teach me about the different models. I didn't imagine I would be totally enthralled by this video for as long as you taught. You are a very good teacher. Thank you for publishing this video for free. Thanks to Jay as well for simplifying such a complex topic.

  • @ss-dy1tw
    @ss-dy1tw 3 years ago +1

    Krish, I really see the honesty in you, man; a lot of humility, a very humble person. At the beginning of this video you gave credit several times to Jay, who created the amazing blog on Transformers. I really liked that. Stay like that.

  • @suddhasatwaAtGoogle
    @suddhasatwaAtGoogle 2 years ago +36

    For anyone having a doubt at 40:00 as to why we take the square root of 64: as per the research, it was mathematically shown to be the best way to keep the gradients stable! Also, note that the value 64, which is the size of the Query, Key and Value vectors, is in itself a hyperparameter which was found to work best. Hope this helps.

    • @latikayadav3751
      @latikayadav3751 10 months ago

      The embedding vector dimension is 512. We divide this into 8 heads: 512/8 = 64. Therefore the size of the query, key and value vectors is 64, so the size is not a hyperparameter.

    • @afsalmuhammed4239
      @afsalmuhammed4239 10 months ago +1

      Normalizing the data.

    • @sg042
      @sg042 7 months ago

      Another reason is that we generally want the weights, inputs, etc. to follow a normal distribution N(0, 1); when we compute the dot product it is a sum of 64 such terms, which mathematically increases the variance to 64 (standard deviation sqrt(64)), giving roughly N(0, 64). Dividing by sqrt(64) therefore normalizes it back to unit variance.

    • @sartajbhuvaji
      @sartajbhuvaji 6 months ago

      The paper states that:
      "While for small values of dk the two mechanisms (attention functions: additive attention and dot-product attention; note: the paper uses dot-product attention, q·k) perform similarly, additive attention outperforms dot-product attention without scaling for larger values of dk. We suspect that for large values of dk, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients. To counteract this effect, we scale the dot products by 1/sqrt(dk)."
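
To make the scaling discussed in this thread concrete, here is a minimal NumPy sketch of scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V; the shapes and random inputs are illustrative only:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, the attention used in the paper."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # scaling keeps softmax out of its flat regions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V

# Illustrative shapes: 4 tokens, per-head dimension d_k = 64.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 64)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)    # (4, 64)
```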

  • @roshankumargupta46
    @roshankumargupta46 3 years ago +43

    This might help the guy who asked why we take the square root, and also other aspirants:
    The scores get scaled down by dividing by the square root of the dimension of the query and key. This allows for more stable gradients, as multiplying values can have exploding effects.

    • @tarunbhatia8652
      @tarunbhatia8652 3 years ago +1

      Nice, I was wondering about the same thing. It all started from exploding or vanishing gradients, how could I forget that :D

    • @apicasharma2499
      @apicasharma2499 2 years ago

      Can this attention encoder-decoder be used for financial time series as well, e.g. multivariate time series?

    • @matejkvassay7993
      @matejkvassay7993 2 years ago

      Hello, I think the square root of the dimension is not chosen just empirically; it's actually there to normalize the length of the vector, or something similar. Under some conditions I forgot, the vector length scales with the square root of the dimension as the dimension grows, so this way you scale it back down to 1 and thus prevent exploding dot-product scores.

    • @kunalkumar2717
      @kunalkumar2717 2 years ago

      @@apicasharma2499 Yes; although I have not used it myself, it can be used.

    • @generationgap416
      @generationgap416 1 year ago

      The normalization should come from the softmax, or from using a triangular masking function to zero out the lower part of the concatenated Q, K and V matrix, in order to have good initialization weights, I think.
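
A quick numerical check of the "stable gradients" argument in this thread (a sketch only): the dot product of two independent d_k-dimensional vectors with unit-variance components has variance d_k, so dividing by sqrt(d_k) brings it back to roughly unit variance.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64
q = rng.standard_normal((100_000, d_k))        # components ~ N(0, 1)
k = rng.standard_normal((100_000, d_k))

dots = (q * k).sum(axis=1)                     # raw dot products, one per row
print(round(dots.var(), 1))                    # ~64, i.e. d_k
print(round((dots / np.sqrt(d_k)).var(), 1))   # ~1.0 after scaling
```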

  • @TusharKale9
    @TusharKale9 3 years ago +1

    The GPT-3 topic is very well covered, and very important from an NLP point of view. Thank you for your efforts.

  • @Want_to_escape
    @Want_to_escape 3 years ago +12

    Krish is a hard-working person, not for himself but for our country, in the best way he can... We need more people like him in our country.

    • @lohithklpteja
      @lohithklpteja 1 month ago

      Alu kavale ya lu kavale ahhh ahhh ahhh ahhh dhing chiki chiki chiki dhingi chiki chiki chiki

  • @user-or7ji5hv8y
    @user-or7ji5hv8y 3 years ago +2

    thank you, appreciate your time going through this material

  • @jeeveshkataria6439
    @jeeveshkataria6439 3 years ago +22

    Sir, please release the video on BERT. Eagerly waiting for it.

  • @prasad5164
    @prasad5164 3 years ago

    I really admire you now, just because you give credit to the deserving at the beginning of the video.
    That attitude will make you a great leader. All the best!!

  • @anusikhpanda9816
    @anusikhpanda9816 3 years ago +27

    You can skim through all the YouTube videos explaining transformers, but nobody comes close to this video.
    Thank you Sir 🙏🙏🙏

    • @kiran5918
      @kiran5918 3 months ago

      Difficult to understand foreign accents. Desi away zindabad

  • @akhilgangavarapu9728
    @akhilgangavarapu9728 3 years ago +3

    A million tons of appreciation for making this video. Thank you so much for your amazing work.

  • @tshepisosoetsane4857
    @tshepisosoetsane4857 1 year ago

    Yes, this is the best video explaining these models so far; even non-computer-science people can understand what is happening. Great work.

  • @harshavardhanachyuta2055

    Thank you. The combination of your teaching and Jay's blog pulls this topic together. I like the way you teach. Keep going.

  • @hiteshyerekar9810
    @hiteshyerekar9810 3 years ago +2

    Great session, Krish. Because of the research paper I understood things very easily and clearly.

  • @Adil-qf1xe
    @Adil-qf1xe 1 year ago

    How did I miss the subscription to your channel? Thank you so much for this thorough explanation, and hats off to Jay Alammar.

  • @sarrae100
    @sarrae100 2 years ago

    Excellent blog from Jay. Thanks Krish for introducing this blog on your channel!!

  • @mequanentargaw
    @mequanentargaw 8 months ago

    Very helpful! Thank you to all contributors!

  • @harshitjain4923
    @harshitjain4923 3 years ago +12

    Thanks for explaining Jay's blog. To add to the explanation at 39:30, the reason for using sqrt(dk) is to prevent the vanishing-gradient problem mentioned in the paper. Since we apply softmax to Q*K^T, a high dimension for these matrices produces large values, which get transformed to something close to 1 after the softmax and hence lead to very small gradient updates.

    • @neelambujchaturvedi6886
      @neelambujchaturvedi6886 3 years ago

      Thanks for this Harshit

    • @shaktirajput4711
      @shaktirajput4711 2 years ago

      Thanks for the explanation, but I guess it would be called an exploding gradient, not a vanishing gradient. Hope I am not wrong.
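
A small sketch of the saturation effect described in this thread: with large unscaled logits the softmax output is nearly one-hot, and the diagonal of its Jacobian (which drives the backpropagated gradient) becomes vanishingly small; the numbers below are made up for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

unscaled = np.array([8.0, 16.0, 24.0])   # large logits, as with a big d_k and no scaling
scaled = unscaled / np.sqrt(64)          # after dividing by sqrt(d_k)

for logits in (unscaled, scaled):
    p = softmax(logits)
    jac_diag = p * (1 - p)               # diagonal entries of the softmax Jacobian
    print(p.round(4), jac_diag.round(4))
# Unscaled: p ~ [0, 0.0003, 0.9997] -> Jacobian entries ~ 0 (saturated, tiny gradients)
# Scaled:   p is spread out         -> Jacobian entries stay usefully large
```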

  • @faezakamran3793
    @faezakamran3793 1 year ago +3

    For those getting confused by the 8 heads: all the words go to all the heads; it's not one word per head. The X matrix remains the same; only the W matrices change in the case of multi-head attention.
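
A minimal sketch of the point above: every head sees the same full input matrix X, and only the per-head projection matrices differ (all shapes and the random initialization here are purely illustrative).

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 4, 512, 8
d_k = d_model // num_heads                     # 64

X = rng.standard_normal((seq_len, d_model))    # the same X goes into every head

heads = []
for _ in range(num_heads):
    # Each head has its own W_Q, W_K, W_V (randomly initialised for illustration).
    W_Q, W_K, W_V = (rng.standard_normal((d_model, d_k)) for _ in range(3))
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V        # every word is projected in every head
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    heads.append(weights @ V)                  # (seq_len, 64) per head

Z = np.concatenate(heads, axis=-1)             # (seq_len, 512), back to d_model
print(Z.shape)
```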

  • @ashishjindal2677
    @ashishjindal2677 2 years ago

    Wonderful explanation of the blog, thanks for introducing us to Jay. Your teaching style is awesome.

  • @MuhammadShahzad-dx5je
    @MuhammadShahzad-dx5je 3 years ago +7

    Really nice, sir, looking forward to the BERT implementation 😊

  • @madhu1987ful
    @madhu1987ful 2 years ago

    Jay Alammar's blog is of course awesome, but you made it even simpler while explaining. Thanks a lot.

  • @lshagh6045
    @lshagh6045 1 year ago

    A huge and tremendous effort; a million thanks for your dedication.

  • @junaidiqbal5018
    @junaidiqbal5018 1 year ago +1

    @31:45 If my understanding is correct, the reason we have 64 is that we divide 512 into 8 equal heads. Since we compute dot products to get the attention values, doing the dot product over the full 512-dimensional embedding would not only be computationally expensive, but we would also get only one relation between the words. Taking advantage of parallel computation, we divide 512 into 8 equal parts; this is why we call it multi-head attention. This way it is computationally faster and we also get 8 different relations between the words. (FYI, attention is basically a relation between the words.) Anyway, good work on explaining the architecture, Krish.

  • @nim-cast
    @nim-cast 9 months ago +6

    Thanks for your fantastic LLM/Transformer series content, and I admire your positive attitude and support for the authors of these wonderful articles! 👏

    • @sivakrishna5557
      @sivakrishna5557 7 hours ago

      Could you please help me get started on the LLM series? Could you please share the playlist link?

  • @kiran5918
    @kiran5918 4 months ago

    Wow, what an explanation of transformers... perfect for us... it aligns with the way we are taught at school…

  • @shanthan9.
    @shanthan9. 2 months ago

    Every time I get confused or distracted while listening to the Transformers, I have to watch the video again; this is my third time watching it, and now I understand it better.

  • @louerleseigneur4532
    @louerleseigneur4532 3 years ago

    After watching your lecture it's much clearer to me.
    Thanks Krish

  • @gurdeepsinghbhatia2875
    @gurdeepsinghbhatia2875 3 years ago +1

    Sir, thanks a lot, I really enjoyed it, sir. Your way of teaching is so humble and honest, and most importantly patient. Awesome video sir, too good.

  • @wentaowang8622
    @wentaowang8622 1 year ago

    Very clear explanation. And Jay's blog is also amazing!!

  • @underlecht
    @underlecht 3 years ago +2

    I love your patience, how many times you go around explaining things until they become clear even for guys as dumb as me. BTW, residual connections are not there because some layers are unimportant and we have to skip them; they are there to solve the vanishing-gradient problem.

  • @tarunbhatia8652
    @tarunbhatia8652 3 years ago

    Thanks Krish, awesome session, keep up the great work!

  • @abrarfahim2042
    @abrarfahim2042 2 years ago

    Thank you Krish. I learned so many things from your video.

  • @zohaibramzan6381
    @zohaibramzan6381 3 years ago +1

    Great for overcoming confusion. I hope to get hands-on with BERT next.

  • @michaelpadilla141
    @michaelpadilla141 2 years ago

    Superb. Well done and thank you for this.

  • @smilebig3884
    @smilebig3884 2 years ago

    Very underrated video... this is a super awesome explanation. I'm watching and commenting a second time, a month later.

  • @markr9640
    @markr9640 5 months ago

    Very well explained Sir! Thank you.

  • @jimharrington2087
    @jimharrington2087 3 years ago

    Great effort Krish, Thanks

  • @avijitbalabantaray5883

    Thank you Krish and Jay for this work.

  • @thepresistence5935
    @thepresistence5935 2 years ago

    It took me more than 5 hours to understand this. Thanks Krish, wonderful explanation.

  • @aqibfayyaz1619
    @aqibfayyaz1619 3 years ago

    Great Effort. Very well explained

  • @pavantripathi1890
    @pavantripathi1890 7 months ago

    Thanks to Jay Alammar sir and you for the great explanation.

  • @MrChristian331
    @MrChristian331 2 years ago

    Great presentation! I understand it fully now I think.

  • @RanjitSingh-rq1qx
    @RanjitSingh-rq1qx 5 months ago

    The video was so good, I understood each and every thing except the decoder side.

  • @neelambujchaturvedi6886
    @neelambujchaturvedi6886 3 years ago +2

    Hey Krish, I had a quick question related to the explanation at 1:01:07 about positional encodings. How exactly do we create those embeddings? In the paper the authors used sine and cosine waves to produce them, and I could not understand the intuition behind this. Could you please help me understand this part? Thanks in advance.
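
For reference, a minimal sketch of the sinusoidal positional encodings the question refers to, following the paper's formulas PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)); the function name and the illustrative shapes are not from the video.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(...)."""
    pos = np.arange(max_len)[:, None]              # (max_len, 1) positions
    i = np.arange(0, d_model, 2)[None, :]          # even embedding indices
    angle = pos / np.power(10000.0, i / d_model)   # one wavelength per pair of dims
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=512)
print(pe.shape)   # (50, 512); this matrix is added element-wise to the word embeddings
```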

  • @Deepakkumar-sn6tr
    @Deepakkumar-sn6tr 3 years ago

    Great session!... Looking forward to a Transformer-based recommender system.

  • @Ram-oj4gn
    @Ram-oj4gn 5 months ago

    Great explanation... I understand Transformers now.

  • @sujithsaikalakonda4863
    @sujithsaikalakonda4863 7 months ago

    Very well explained. Thank you sir.

  • @kameshyuvraj5693
    @kameshyuvraj5693 3 years ago

    Sir, the way you explained the topics is ultimate, sir.

  • @ganeshkshirsagar5806
    @ganeshkshirsagar5806 6 months ago

    Thank you so much sir for this superb session.

  • @parmeetsingh4580
    @parmeetsingh4580 3 years ago +1

    Hi Krish, great session.
    I have a question: the Z we get after the self-attention block of the encoder, is it interpretable? That is, could we figure out, just by looking at Z, what results the multi-head self-attention block gives?
    Kindly help me out with this.

  • @pranthadebonath7719
    @pranthadebonath7719 8 months ago

    Thank you, sir,
    that's a nice explanation.
    Also, thanks to Jay Alammar sir.

  • @toulikdas3915
    @toulikdas3915 3 years ago

    More videos of this kind on research paper explanations and advanced concepts of deep learning and reinforcement learning, sir.

  • @tapabratacse
    @tapabratacse 1 year ago

    Superb, you made things look so easy.

  • @elirhm5926
    @elirhm5926 2 years ago

    I don't know how to thank you and Jay enough!

  • @Schneeirbisify
    @Schneeirbisify 3 years ago +1

    Hey Krish, thanks for the session. Great explanation! Could you please let me know whether you have already uploaded a session on BERT? And if not, is it still in your plans? It would be very interesting to dive deep into practical applications of Transformers.

  • @utkarshsingh2675
    @utkarshsingh2675 1 year ago

    Thanks for such free content!!... You are awesome, sir!

  • @manikaggarwal9781
    @manikaggarwal9781 4 months ago

    Superbly explained.

  • @generationgap416
    @generationgap416 1 year ago

    The reason to divide by the square root of dk is to prevent a constant value of x: for values near x = 0, from the left or the right, f(x) approaches y = 1/2. Look at the shape of the sigmoid function.

  • @armingh9283
    @armingh9283 3 years ago

    Thank you sir. It was awesome.

  • @ruchisaboo29
    @ruchisaboo29 3 years ago +3

    Awesome explanation... when will you post the BERT video? Waiting for it, and if possible please cover GPT-2 as well. Thanks a lot for this amazing playlist.

  • @raghavsharma6430
    @raghavsharma6430 3 years ago

    Krish sir, it's amazing!!!!

  • @sweela1
    @sweela1 1 year ago +1

    In my opinion, at 40:00 the square root is taken for the purpose of scaling: larger values are transformed into smaller ones so that the softmax of these values can be calculated easily. dk is the dimension whose square root is taken to scale the values.

  • @joydattaraj5625
    @joydattaraj5625 3 years ago

    Good job Krish.

  • @Rider12374
    @Rider12374 6 days ago

    Thanks krish don!!!

  • @jaytube277
    @jaytube277 1 month ago

    Thank you Krish for making such a great video. I really appreciate your hard work. One thing I have not understood is where the loss gets calculated. Does it happen on the multiple heads or at the encoder-decoder attention layer? What I am assuming is that while we are training the model, the translations will not be accurate and we should get some loss which we try to minimize, but I am not understanding where that comparison happens.

  • @ranjanarch4890
    @ranjanarch4890 2 years ago

    This video describes Transformer inference. Can you do a video on the training architecture? I suppose we would need to provide datasets in both languages for training.

  • @happilytech1006
    @happilytech1006 2 years ago

    Always helpful Sir!

  • @sagaradoshi
    @sagaradoshi 2 years ago

    Thanks for the wonderful explanation. For the decoder, at the 2nd time step we passed the word/letter 'I'; then at the 3rd time step do we pass both the words 'I' and 'am', or is only the word 'am' passed? Similarly, at the next time step do we pass the words 'I', 'am' and 'a', or just the word/letter 'a'?

  • @121MrVital
    @121MrVital 3 years ago +5

    Hi Krish,
    When are you going to make a video on BERT with a practical implementation??

  • @dhirendra2.073
    @dhirendra2.073 2 years ago

    Superb explanation

  • @dataflex4440
    @dataflex4440 1 year ago

    Pretty good Explanation Mate

  • @bofloa
    @bofloa 1 year ago +1

    Watching through this video, I can only conclude that the whole process is more of an art than a science.

  • @digitalmbk
    @digitalmbk 3 years ago +2

    My MS SE thesis completion totally depends on your videos. Just AWESOME!!!

  • @desrucca
    @desrucca 1 year ago +1

    AFAIK a ResNet connection is not like dropout; instead it carries information from the previous layer forward to the n-th layer. By doing this, vanishing gradients are less likely to occur.
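
A tiny sketch of the residual ("Add & Norm") idea discussed above, assuming a generic placeholder sublayer; the point is that the input is added back rather than the sublayer being skipped outright.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def residual_block(x, sublayer):
    # x + sublayer(x): the input is carried forward unchanged alongside the
    # sublayer output, which keeps gradients flowing even when the sublayer
    # contributes little.
    return layer_norm(x + sublayer(x))

x = np.random.default_rng(0).standard_normal((4, 512))
out = residual_block(x, sublayer=lambda h: 0.1 * h)   # placeholder sublayer for illustration
print(out.shape)  # (4, 512)
```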

  • @mdmamunurrashid4112
    @mdmamunurrashid4112 11 months ago

    You are amazing as always !

  • @kiran082
    @kiran082 2 years ago

    Great Explanation

  • @MayankKumar-nn7lk
    @MayankKumar-nn7lk 3 years ago

    Answer to why we are dividing by the square root of the dimension: basically, we are finding the similarity between the query and each key. There are different ways to get the similarity, like the dot product or the scaled dot product; here we take the scaled dot product to keep the values in a fixed range.

  • @BINARYACE2419
    @BINARYACE2419 2 years ago

    Well Explained Sir

  • @captiandaasAI
    @captiandaasAI 10 months ago

    great!!!!!!! Krish

  • @hudaalfigi2742
    @hudaalfigi2742 2 years ago

    I really want to thank you for your nice explanation; actually, I was not able to understand it before watching this video.

  • @BalaguruGupta
    @BalaguruGupta 3 years ago

    The layer normalization does (X + Z), where X is the input and Z is the result of the self-attention calculation. You mentioned that when the self-attention doesn't perform well, the self-attention calculation is skipped and it jumps to layer normalization, hence the Z value will be 'EMPTY' (please correct me here if I'm wrong). In this case the layer normalization happens only on X (the input). Am I correct?

  • @learnvik
    @learnvik 6 months ago

    Thanks. Question: in step 1 (30:52), what if the randomly initialized weights have the same value at the start? Then all the resulting vectors will have the same values.

  • @prekshagampa5889
    @prekshagampa5889 1 year ago

    Thanks a lot for the detailed explanation. I really appreciate your effort in creating these videos.

  • @bruceWayne19993
    @bruceWayne19993 6 months ago

    thank you🙏

  • @mohammedbarkaoui5218
    @mohammedbarkaoui5218 1 year ago

    You are the best 😇

  • @AshishBamania95
    @AshishBamania95 2 years ago

    Thanks a lot!

  • @sreevanthat3224
    @sreevanthat3224 1 year ago

    Thank you.

  • @User-nq9ee
    @User-nq9ee 2 years ago

    Thank you so much ..

  • @apoorvneema7717
    @apoorvneema7717 9 months ago

    awesome bro

  • @lakshmigandla8781
    @lakshmigandla8781 4 months ago

    Clear explanation.

  • @muraki99
    @muraki99 8 months ago

    Thanks!

  • @adwait92
    @adwait92 2 years ago +3

    For the doubt at 40:00, the attention technique used in the paper is dot-product attention (refer page 2, section 3.2.1, para 2).
    So for larger values of d_k (dimensions of query, key and value), the dot product might grow very high in magnitude. Also, keep in mind that the layer following the attention is a Softmax. So for higher values of x, the softmax output will tend towards 1; hence, the resulting gradients (during backpropagation) would be very close to 0. This would eventually mean the model doesn't learn as the weights don't get updated.

  • @mayurpatilprince2936
    @mayurpatilprince2936 7 months ago +1

    Why do they multiply each value vector by the softmax score? Because they want to keep intact the values of the word(s) they want to focus on, and drown out irrelevant words (by multiplying them by tiny numbers like 0.001, for example)... they want to suppress whatever irrelevant words the sentence has...
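
A tiny numeric sketch of that weighting step (all numbers made up): a near-zero softmax weight effectively drowns out an irrelevant word's value vector in the weighted sum.

```python
import numpy as np

attn_weights = np.array([0.88, 0.11, 0.01])   # softmax scores for 3 words
values = np.array([[1.0, 2.0],                # value vector of a relevant word
                   [3.0, 1.0],
                   [9.0, 9.0]])               # irrelevant word, weight only 0.01

# Weighted sum of value vectors: the third row contributes just 0.09 per component.
z = attn_weights @ values
print(z)   # [1.3  1.96]
```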

  • @sayantikachatterjee5032
    @sayantikachatterjee5032 6 months ago

    At 58:49 it is said that if we increase the number of heads, it will give more importance to different words. So 'it' can also give more importance to 'street'. So between 'the animal' and 'street', which word will be prioritized more?

  • @shahveziqbal5206
    @shahveziqbal5206 2 years ago +1

    Thankyou ❤️

  • @ayushrathore8916
    @ayushrathore8916 3 years ago

    After the encoder, is there any repository-like structure which stores all the encoder outputs and then passes them one by one to the decoder to get the decoded output one by one?