NLP Demystified 15: Transformers From Scratch + Pre-training and Transfer Learning With BERT/GPT

Share
Embed
  • Published 13 Nov 2022
  • CORRECTION:
    00:34:47: that should be "each a dimension of 12x4"
    Course playlist: • Natural Language Proce...
    Transformers have revolutionized deep learning. In this module, we'll learn how they work in detail and build one from scratch. We'll then explore how to leverage state-of-the-art models for our projects through pre-training and transfer learning. We'll learn how to fine-tune models from Hugging Face and explore the capabilities of GPT from OpenAI. Along the way, we'll tackle a new task for this course: question answering.
    Colab notebook: colab.research.google.com/git...
    Timestamps
    00:00:00 Transformers from scratch
    00:01:05 Subword tokenization
    00:04:27 Subword tokenization with byte-pair encoding (BPE)
    00:06:53 The shortcomings of recurrent-based attention
    00:07:55 How Self-Attention works
    00:14:49 How Multi-Head Self-Attention works
    00:17:52 The advantages of multi-head self-attention
    00:18:20 Adding positional information
    00:20:30 Adding a non-linear layer
    00:22:02 Stacking encoder blocks
    00:22:30 Dealing with side effects using layer normalization and skip connections
    00:26:46 Input to the decoder block
    00:27:11 Masked Multi-Head Self-Attention
    00:29:38 The rest of the decoder block
    00:30:39 [DEMO] Coding a Transformer from scratch
    00:56:29 Transformer drawbacks
    00:57:14 Pre-Training and Transfer Learning
    00:59:36 The Transformer families
    01:01:05 How BERT works
    01:09:38 GPT: Language modelling at scale
    01:15:13 [DEMO] Pre-training and transfer learning with Hugging Face and OpenAI
    01:51:48 The Transformer is a "general-purpose differentiable computer"
    This video is part of Natural Language Processing Demystified --a free, accessible course on NLP.
    Visit www.nlpdemystified.org/ to learn more.

Comments • 139

  • @novantha1
    @novantha1 2 months ago +3

    What was provided: A high quality, easily digestible, and calm introduction to Transformers that could take almost anyone from zero to GPT in a single video.
    What I got: It will probably take me longer than I'd like to get good at martial arts.

  • @johnmakris9999
    @johnmakris9999 1 year ago +36

    Honestly, best explanation ever. I'm a data scientist (5 years' experience) and I was struggling to understand in depth how transformers are trained. Came across this video and boom, problem solved. Cheers mate. I'll propose that the whole company watch this video.

    • @futuremojo
      @futuremojo  1 year ago +3

      Love hearing that. Thanks, John!

  • @user-wr4yl7tx3w
    @user-wr4yl7tx3w 1 year ago +13

    Why did it take so long for YouTube to show this channel when I searched transformers? The YouTube algorithm really needs to get better. This is really quality content. Well structured and clearly explained.

    • @malikrumi1206
      @malikrumi1206 8 months ago

      YouTube only gives you a limited number of videos that are responsive to your request. This is deliberate, because they want to keep you on the site as long as possible. That is one of the metrics they use to charge their advertisers. If you found exactly what you wanted the first time, chances are good you will then leave the site.
      But if you watch one or two of the videos, and then come back a few days later with the same search, you will now see matching videos that you were not shown before. Remember, on ad-driven sites, *you* are the product being sold to advertisers.

  • @user-wr4yl7tx3w
    @user-wr4yl7tx3w 2 months ago +2

    This is really high-quality content. Why did it take so long for YouTube to recommend this?

  • @weeb9133
    @weeb9133 25 days ago

    Just completed the entire playlist. It was an absolute delight to watch; this last lecture was a favorite of mine because of how you explained it in the form of a story. Thank you so much for sharing this knowledge with us, and I hope to learn more from you :D

  • @mahmoudreda1083
    @mahmoudreda1083 1 year ago +10

    I want to express my sincere gratitude for your excellent teaching and guidance in this state-of-the-art NLP course. Thank you, Sir.

  • @kaustubhkapare807
    @kaustubhkapare807 9 months ago +2

    God knows how many times I've banged my head against the wall... just to understand it... through different videos... this is the best one so far... 🙏🏻

  • @id-ic7ou
    @id-ic7ou 1 year ago +9

    I spent 2 days trying to understand the paper "Attention Is All You Need", but lots of things were implicit in the article. Thank you for making it crystal clear. This is the best video I've seen about transformers.

    • @futuremojo
      @futuremojo  1 year ago

      Thanks! Really happy to hear that.

  • @michaelmike8637
    @michaelmike8637 1 year ago +3

    Thank you for your effort in creating this! The explanation and illustrations are amazing!

  • @marttilaine6778
    @marttilaine6778 1 year ago

    Thank you very much for this series; it was a wonderful explanation of NLP for me!

  • @BuddingAstroPhysicist

    Your tutorials are a lifesaver, thanks a lot for this.

  • @nilesh30300
    @nilesh30300 1 year ago

    Man... this is an awesome explanation by you. I can't thank you enough... Keep up the good work.

  • @anrichvanderwalt1108
    @anrichvanderwalt1108 1 year ago +1

    Definitely my go-to video for understanding how Transformers work, and for referring anyone else to! Thanks Nitin!

  • @chrisogonas
    @chrisogonas 1 year ago

    Superb! Well illustrated. Thanks

  • @DamianReloaded
    @DamianReloaded 1 year ago +1

    Wow, this is really a very good tutorial. Thank you very much for putting it up. Kudos!

  • @arrekusua
    @arrekusua 5 months ago

    Thank you so much for these videos!!
    Definitely one of the best videos on NLP out there!

  • @blindprogrammer
    @blindprogrammer 1 year ago +1

    This is the most awesome video on Transformers!! You earned my respect and a subscriber too. 🙏🙏

  • @JBoy340a
    @JBoy340a 1 year ago +2

    Fantastic explanation. Very detailed, slow paced, and straightforward.

  • @romitbarua7081
    @romitbarua7081 1 year ago

    This video is incredible! Many thanks!!

  • @HazemAzim
    @HazemAzim 1 year ago

    A legendary explanation of Transformers compared to the tens or hundreds of tutorial videos out there. Chapeau!

  • @srinathkumar1452
    @srinathkumar1452 1 year ago +2

    This is a remarkable piece of work. Beyond excellent!

  • @kazeemkz
    @kazeemkz 6 months ago

    Many thanks for the detailed explanation. Your video has been helpful.

  • @AxelTInd
    @AxelTInd 1 year ago

    Phenomenal video. Well-structured, concise, professional. You have a real talent for teaching!

  • @1abc1566
    @1abc1566 1 year ago +7

    I usually don’t comment on YouTube videos but couldn’t skip this. This is the BEST NLP course I’ve seen anywhere online. THANK YOU. ❤

  • @mazenlahham8029
    @mazenlahham8029 1 year ago +2

    WOW, your level of comprehension and presentation of your subject is the best I've ever seen. You are the best. Thank you very much ❤❤❤

  • @RajkumarDarbar
    @RajkumarDarbar 1 year ago +7

    Thank you, legend, for your exceptional teaching style!! 👏👏👏
    If someone is looking for a bit more explanation of how to pass the Q, K, and V matrices to the multi-head cross-attention layer in the decoder module:
    Specifically, the key vectors are obtained by multiplying the encoder outputs with a learnable weight matrix, which transforms the encoder outputs into a matrix with a shape of (sequence_length, d_model). The value vectors are obtained by applying another learnable weight matrix to the encoder outputs, resulting in a matrix of the same shape.
    The resulting key and value matrices can then be used as input to the multi-head cross-attention layer in the decoder module. The query vector, which is the input to the layer from the previous layer in the decoder, is also transformed using another learnable weight matrix to ensure compatibility with the key and value matrices.
    The attention mechanism then computes attention scores between the query vector and the key vectors, which are used to compute attention weights. The attention weights are used to compute a weighted sum of the value vectors, which is then used as input to the subsequent layers in the decoder.
    In summary, the key and value vectors are obtained by applying learnable weight matrices to the encoder outputs, and are used in the multi-head cross-attention mechanism of the decoder to compute attention scores and generate the output sequence.
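    A rough NumPy sketch of the shapes described above (illustrative only: the array names and sizes are made up, the projections keep d_model for simplicity, and there is no multi-head split or batching):
    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    d_model, src_len, tgt_len = 12, 5, 3
    enc_out = np.random.randn(src_len, d_model)    # encoder outputs
    dec_in = np.random.randn(tgt_len, d_model)     # decoder-side input to cross-attention

    W_q = np.random.randn(d_model, d_model)        # learnable query projection
    W_k = np.random.randn(d_model, d_model)        # learnable key projection
    W_v = np.random.randn(d_model, d_model)        # learnable value projection

    Q = dec_in @ W_q     # (tgt_len, d_model): queries come from the decoder
    K = enc_out @ W_k    # (src_len, d_model): keys come from the encoder
    V = enc_out @ W_v    # (src_len, d_model): values come from the encoder

    scores = Q @ K.T / np.sqrt(d_model)      # (tgt_len, src_len) attention scores
    weights = softmax(scores, axis=-1)       # attention weights over encoder positions
    output = weights @ V                     # (tgt_len, d_model), fed to the rest of the decoder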

    • @byotikram4495
      @byotikram4495 10 months ago

      So the dimensions of those learnable weight matrices (both K and V) would be (d_model × d_model)?

    • @RajkumarDarbar
      @RajkumarDarbar 10 months ago

      @@byotikram4495 yes, you got it right.

  • @sarat.6954
    @sarat.6954 1 year ago

    This video was perfect. Thank you.

  • @capyk5455
    @capyk5455 1 year ago +1

    This is fantastic, thank you once again for your work :)

  • @ilyas8523
    @ilyas8523 1 year ago

    Amazing explanation, thank you.

  • @AIShipped
    @AIShipped 1 year ago +7

    This is amazing, I can’t thank you enough. I only wish this was around sooner. Keep up the great work!

    • @AIShipped
      @AIShipped 1 year ago

      This is straight up how school/universities should teach

    • @futuremojo
      @futuremojo  1 year ago +1

      @@AIShipped Thank you! I'm glad you find it useful.

  • @ricardocorreia3687
    @ricardocorreia3687 1 year ago

    Man, you are a legend. Best explanation ever.

  • @jojolovekita7424
    @jojolovekita7424 9 months ago

    Excellent presentation of confusing stuff 😊😊😊 ALL YouTube videos explaining anything should be of this high caliber. Salamat Po! ❤

  • @AradAshrafi
    @AradAshrafi 7 months ago

    What an amazing tutorial. Thank you

  • @santuhazra1
    @santuhazra1 1 year ago

    Seems like the channel name got changed. Big fan of your work. 🙂 One of the best explanations of the transformer. Waiting for more advanced topics.

  • @user-nm5jl8gy1u
    @user-nm5jl8gy1u 1 year ago

    great lectures. great teacher.

  • @karlswanson6811
    @karlswanson6811 1 year ago

    Dude, this series is great. I do a lot of NLP in the clinical domain and I get asked a lot for a comprehensive starter for NLP for people that hop on projects. I tried to create a curriculum from things like SpaCy, NLTK, some Coursera DS courses, PyTorch/DL books, etc. but this is so well done and succinct, yet detailed when needed/if wanted. I think I will just refer people here from now on. Great work! And agree with the many comments I see about you having a radio voice lmao always a plus!

    • @futuremojo
      @futuremojo  1 year ago

      Thank you, Karl! I put a lot of thought into how to make it succinct yet detailed in the right places so I'm glad to hear it turned out well.

  • @jeffcav2119
    @jeffcav2119 1 year ago

    This video is AWESOME!!!

  • @wilfredomartel7781
    @wilfredomartel7781 1 year ago

    Amazing explanation ❤

  • @FlannelCamel
    @FlannelCamel 1 year ago

    Amazing work!

  • @SnoozeDog
    @SnoozeDog 6 months ago

    Fantastic sir

  • @87ggggyyyy
    @87ggggyyyy 10 months ago +1

    Great video, Philip torr

  • @mage2754
    @mage2754 1 year ago +1

    Thank you. I had problems visualising this concept before watching the video because not many explanations/reasons were given for why things were done the way they were.

  • @youmna4045
    @youmna4045 6 months ago

    There really aren't enough words to express how thankful I am for this awesome content. It's amazing that you've made it available to everyone for free.
    Thank you so much. May Allah (God) help you as you have helped others.

  • @jenilshyara6746
    @jenilshyara6746 9 months ago

    Explanation 🔥🔥🔥

  • @caiyu538
    @caiyu538 1 year ago +1

    Great. Great. Great

  • @xuantungnguyen9719
    @xuantungnguyen9719 5 months ago

    Like, what the hell. You made it so simple to learn. I kept consuming and taking notes, adding thoughts and perspective, feeling super productive. (I'm using Obsidian to link concepts.) About three years ago the best explanation I could get was probably from Andrew Ng, and I have to admit yours is so much better. My opinion might be biased since I've gone back and forth in NLP time after time, but looking at the comment section I'm pretty sure my opinion is validated.

  • @marcinbulka2829
    @marcinbulka2829 1 year ago

    Greatly explained. I think showing implementation examples like you did is the best way of explaining mathematical concepts. I'm not sure if I missed it, but I don't think your notebook explains how to calculate the loss when training a transformer, and I think it would be good to explain this.

    • @futuremojo
      @futuremojo  1 year ago

      You can see how loss is implemented in the previous video: czcams.com/video/tvIzBouq6lk/video.html

  • @amortalbeing
    @amortalbeing 8 months ago

    thanks a lot

  • @theindianrover2007
    @theindianrover2007 1 month ago

    Awesome

  • @wilsonbecker1881
    @wilsonbecker1881 6 months ago

    Best ever

  • @khushbootaneja6739
    @khushbootaneja6739 1 year ago +1

    Nice video

  • @wenmei8669
    @wenmei8669 8 months ago

    The material and your explanation are amazing!!! Thanks a lot. I am wondering if it is possible to get the slides for your presentation?

  • @chris1324_
    @chris1324_ 10 months ago

    Amazing videos; I'm currently working on my thesis, which aims to incorporate NLP techniques in different areas. Given the immense potential of transformers, using a transformer-like architecture is an easy bet. I've been trying to understand them thoroughly for a while, but not anymore. Thank you so much. I've cited your website. I hope that's okay with you. Let me know if you have a preferred citation :)

    • @futuremojo
      @futuremojo  9 months ago

      Thank you! :-) Citing the website is great.

  • @tilkesh
    @tilkesh 1 year ago +1

    Thx

  • @puzan7685
    @puzan7685 5 months ago

    Hello goddddddddddddd. Thank you so much

  • @daryladhityahenry
    @daryladhityahenry 6 months ago

    Hi! I'm currently on my first episode of this lesson. I'm really excited and hope to learn a lot. Will you create more tutorials on these kinds of topics? Or will these 15 videos kind of turn me into some expert (remember, "kinda" expert) in NLP and transformers, so I can pre-train and fine-tune models myself perfectly? (Assuming I have the capability to gather the data?)
    Thanks!

  • @aminekimo4606
    @aminekimo4606 11 months ago

    This presentation of the paper "Attention Is All You Need" is a hands-on key to the AI revolution.

  • @rabailkamboh8857
    @rabailkamboh8857 9 months ago

    best

  • @loicbaconnier9150
    @loicbaconnier9150 1 year ago +1

    Thanks for your work, but in your explanation of Wq in the code session (34 min) you say 'dimension of 3 by 4'. Is that the right dimension, please?

    • @futuremojo
      @futuremojo  1 year ago

      Thanks for the catch @loicbaconnier9150! Nope, the weights are 12x4, which then project the keys, queries, and values in each head down to 3x4. I added a correction caption and also a correction in the description.
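      To make those shapes concrete, here is a tiny NumPy sketch (illustrative values only, not the notebook's code): a 3-token sequence with 12-dimensional embeddings, projected by a 12x4 weight matrix into a 3x4 matrix of per-head queries.
      import numpy as np

      seq_len, d_model, head_dim = 3, 12, 4
      X = np.random.randn(seq_len, d_model)      # 3 tokens, 12-dimensional embeddings
      W_q = np.random.randn(d_model, head_dim)   # per-head query weights: 12x4

      Q = X @ W_q
      print(Q.shape)                             # (3, 4): the queries for one head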

  • @pictzone
    @pictzone 11 months ago

    Simply astounding presentation!! Just wondering, how many years did you have to study to get to such a level of deep understanding of this field? (all connected disciplines included)
    Asking because while I do get the overall ideas, understanding why certain things are done differently depending on your needs seems impossible unless you have a profound understanding of the concepts.
    I feel like I would essentially be a blind man following orders if I tried to build useful apps out of these techniques, only going by what experts suggest, because working through exactly why all these equations have the effects they do would take many years to truly figure out. Huge respect for you!

    • @futuremojo
      @futuremojo  11 months ago +1

      Thank you! Without false modesty, I wouldn't say I have a deep understanding of the field. I think very, very few people do. And there are tons of productive people doing great work with varying levels of understanding.
      I would keep two things in mind:
      1. A lot of times in this field, researchers come up with an idea based on some intuition/hunch or a rework of someone else's idea, and just try it. It's rarely the case that an idea is based on some detailed, logical rationalization beforehand. They're not throwing random stuff at a wall, but it's not completely informed either. Even today, researchers still don't know why, after training goes past a certain size and time threshold, LLMs suddenly start exhibiting advanced behaviors. If you read The Genius Makers, you'll see this field is largely empirical with persistent individuals nudging it forward one experiment at a time.
      2. Don't think you need to understand every last detail before building something cool. By definition, you'll never get there. And the experts themselves don't have a complete understanding! You can start building now and slowly pick up details as you go. Just start and the process itself will force you to learn.

    • @pictzone
      @pictzone 11 months ago +1

      @@futuremojo Hey man, your comment has given me so much courage and motivation, it's unbelievable. You've really renewed my interest in these kinds of things and made me realize it's ok to dive deep into advanced topics with a "blind faith" type of approach. Really appreciate your insights. You might think I'm exaggerating, but no. You've really told me exactly what I needed to hear. Thank you!

  • @MangaMania24
    @MangaMania24 1 year ago +1

    More content please!

    • @futuremojo
      @futuremojo  1 year ago

      Tell me more so I can generate ideas: what would you find useful? What are you trying to accomplish?

    • @MangaMania24
      @MangaMania24 1 year ago

      @@futuremojo Wow, that's a tough question :p
      I think I'll patiently wait for the content, can't think of a topic :p
      Thanks for your content though. I spent the last week watching your videos every day, and man, the confidence boost I've got from understanding how everything works!
      Great job 👏

    • @futuremojo
      @futuremojo  1 year ago

      @@MangaMania24 I'm glad it helped!

    • @maj46978
      @maj46978 1 year ago

      Please make a series of hands-on videos on large language models... I'm now 100% sure nobody on this earth can explain it like you ❤

  • @byotikram4495
    @byotikram4495 10 months ago

    Thanks for this awesome in-depth explanation of the transformer. I'm just curious about one aspect. In the explanation slides you use the sine and cosine functions, as mentioned in the paper, to generate the positional embeddings. But I haven't seen that in the implementation. So how will a random initialisation of the positional embeddings capture the position information of the sequences? I may have missed something. Please clarify this point only.

    • @futuremojo
      @futuremojo  10 months ago

      Positional embeddings are trainable. So even though the positional embeddings are initialized with random values, they are adjusted over time via backprop to better achieve the goal.

    • @byotikram4495
      @byotikram4495 10 months ago

      @@futuremojo But to capture the initial word ordering of the sequence before training starts, isn't it necessary to encode the position information and the relative distance between the tokens in the sequence?

    • @futuremojo
      @futuremojo  10 months ago

      ​@@byotikram4495 I think the confusion here is thinking that the neural network can tell the difference between position information that looks sequential to you and position information that looks random.
      Whether you're using sine/cosine wave values or positional embeddings, from the network's perspective, all it's seeing is input.
      All the network needs is information that helps it learn context. And so, the designers of the original transformer chose to sample values from sine/cosine waves to differentiate tokens by position.
      Here's the critical point: even after you add these wave values to the embedding, the untrained network has no idea what they mean. Rather, it learns over time that this particular embedding with this particular position information provides this context.
      So the word "rock" in position 1 might have an embedding that looks like "123", while the embedding for "rock" in position 2 might have an embedding that looks like "789". And the network learns what context each embedding provides to the overall sequence.
      Now, because all the network is seeing is particular embeddings, we are free to use other techniques to add position information as long as it's rich enough to differentiate tokens. In this case, positional embeddings work just as well while being simpler.
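      A small NumPy sketch of that point (hypothetical sizes and ids, not the notebook's code): whether the position information comes from fixed sine/cosine values or from a trainable lookup table, the network just sees token embedding + position embedding, and trainable position rows get adjusted by backprop like any other weights.
      import numpy as np

      vocab_size, max_len, d_model = 50, 10, 12
      token_emb = np.random.randn(vocab_size, d_model)   # trainable token embedding table
      pos_emb = np.random.randn(max_len, d_model)        # trainable positional embedding table

      token_ids = np.array([7, 3, 7])                    # e.g. "rock" appearing in positions 0 and 2
      positions = np.arange(len(token_ids))

      # Same token id in different positions yields different final input vectors.
      x = token_emb[token_ids] + pos_emb[positions]      # (3, 12) input to the first encoder block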

    • @byotikram4495
      @byotikram4495 10 months ago +1

      @@futuremojo OK, now it's clear. So basically the position information will be learned by the network based on its context over time through BP. What I initially thought was that once we add the position information sampled from the sine/cosine wave to the embedding, the resulting vectors would capture the relative position information of the tokens in the sequence, and also the distance between the tokens, at the start of training itself. That's why the confusion arose. Thank you for this thorough explanation. It means a lot.

  • @7900Nick
    @7900Nick 1 year ago +1

    Great tutorial and fantastic work! 💪
    I have 2 questions in relation to byte-pair encoding (BPE) of tokens.
    After the vocabulary is made, you said the original transformer maps tokens/words into embeddings with a dimension of 512.
    I suppose that each token's word embedding is initially initialized at random, which brings me to my questions: ☺
    1. How are a transformer's word embedding values actually updated (sorry if I missed it!)?
    2. Is a word's embedding still fixed in value once it is made, like in word2vec and GloVe, or is it constantly updated, even though a word should react differently depending on the context of a sentence?

    • @futuremojo
      @futuremojo  1 year ago +2

      Hi, thanks for the comment!
      Yes, if you're training a transformer from scratch, then the embeddings are usually initialized randomly. If there happens to be a BPE package that includes the right dimension embeddings, you can initialize your embedding layer with them, too. For example, we check out BPEMB (bpemb.h-its.org/) in the demo which has 100-dimension embeddings. If your use case is ok with that, then you can initialize your embedding layer with them.
      Regarding your two questions:
      1. The transformer's embeddings are updated via backpropagation. So once the loss is calculated, the weights in the encoder/decoder blocks AND the word and positional embeddings are updated. We cover backpropagation in detail here if you're interested:
      czcams.com/video/VS1mgwAS8EM/video.html
      2. When training from scratch, the embeddings are constantly updated. Now let's say the model is trained and you want to fine-tune it for a downstream task. At that point, you can choose to freeze the already trained layers and only train the fine-tuning layer (i.e. the embeddings won't change), or you can allow the whole model (including the embeddings) to adjust as well. We take the latter approach in our fine-tuning demo. We cover word vectors here if you're interested:
      czcams.com/video/IebL0RQF5lg/video.html
      Did that answer your question?

    • @7900Nick
      @7900Nick 1 year ago +1

      @@futuremojo Thank you very much for your thorough and lengthy response! I'll try to rephrase my question because I'm still not sure how the word embeddings are processed by the transformer.
      I could be wrong, but doesn't each subword in a vocabulary (e.g. 50k words) have its own word embedding of size 512, with the different values in that vector corresponding to linguistic features? 😊
      According to how I understood the explanation, the loss calculated using backpropagation only modifies the weights of the various attention-head layers inside the transformer and does not alter the values of the word embeddings.
      Am I totally wrong, or do the embeddings actually get updated as well?
      Based on different demos I've seen, people don't update the tokenizer of a specific transformer even after fine-tuning.
      Sure, I will write a comment and thank you for the course!🙌

    • @futuremojo
      @futuremojo  1 year ago +4

      @@7900Nick Thanks for the testimonial, Nick!
      "I could be wrong, but doesn't each subword in a vocabulary (e.g. 50k words) have its own word embedding of size 512, with the different values in that vector corresponding to linguistic features?"
      This is correct.
      "According to how I understood the explanation, loss calculated using backpropagation only modifies the weights of various head attention layers inside the transformer and does not alter the values of the word embeddings. Am I totally wrong or does the embeddings actually also be updated?"
      The embeddings *ARE* updated during PRE-training. So once the loss is calculated, the feed-forward layers, the attention layers, AND the embedding layers are updated to minimize the loss. This is how the model arrives at embeddings that capture linguistic properties of the words (in such a way that it helps with the training goal).
      I think the confusion may lie in (a) the tokenizer's role and (b) the options during fine-tuning.
      So let's say you decide to train a model from scratch starting with the tokenizer. You decide to use English Wikipedia as your corpus. You fit your tokenizer using BPE over the corpus and it creates a 50k-word internal vocabulary. Ok, now you have your tokenizer. At this point, there are NO embeddings in the picture. The tokenizer's only job is to take whatever text you give it, and break it down into tokens based on the corpus it was fit on. It does not contain any weights. In the demo, we showed BPE-MB which *happened to come with embeddings* but we don't use them.
      Next, you initialize your model which has its various layers including embedding layers. You set the embedding layer size based on the vocabulary size and the embedding dimension you want (so let's say 50,000 x 512). Every subword in the tokenizer's vocabulary has an integer ID, and this integer ID is used to index into the embedding layer to pull out the right embedding. The embeddings are part of the model, not the tokenizer.
      Alright, you then train the model end to end on whatever task, and everything is updated via backprop including all the embedding layers. The model is now pre-trained.
      Ok, now you want to fine-tune it. You have multiple options:
      1. You can train only the head (e.g. a classifier) and FREEZE the pre-trained part of the model. This means the attention layers, feed-forward layers, and embedding layers DO NOT change.
      2. You can train the head and allow the rest of the model to ALSO be trained. In practice, it usually means the attention layers, feed-forward layers, and embedding layers will be adjusted a little via backprop to further minimize the loss. This is the option we take with BERT in the demo (i.e. we didn't freeze anything).
      In both cases, the tokenizer (whose only job is to tokenize text and has no trainable parameters in it) is left alone.
      Let me know if that helps.
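      A minimal Keras-style sketch of the two fine-tuning options above (illustrative only: the "body" here is a stand-in embedding-plus-pooling model so the snippet runs on its own, where a real setup would load a pre-trained BERT from Hugging Face):
      import tensorflow as tf

      vocab_size, d_model, seq_len, num_classes = 30522, 64, 128, 2

      # Stand-in "pre-trained body"; in practice this would be a loaded BERT encoder.
      body = tf.keras.Sequential([
          tf.keras.layers.Embedding(vocab_size, d_model),
          tf.keras.layers.GlobalAveragePooling1D(),
      ])

      inputs = tf.keras.Input(shape=(seq_len,), dtype="int32")
      features = body(inputs)
      outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(features)  # new task head
      model = tf.keras.Model(inputs, outputs)

      body.trainable = False    # Option 1: freeze the body and train only the head
      # body.trainable = True   # Option 2: fine-tune everything, embedding layers included
      model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")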

    • @7900Nick
      @7900Nick 1 year ago

      @@futuremojo Mate, you are absolutely a gem. Nitin, you are a born teacher; thank you very much for your explanation. 👏
      Normally I reply to my messages much faster, but I have been quite busy lately with both family and work.
      You have really demystified a lot of my NLP knowledge!! 🤗
      But just to be sure, there is an embedding layer inside that transformer that corresponds to the index of the words that have been tokenized in the vocabulary, right?
      So, when people train from scratch, continue to pre-train (MLM), or fine-tune a transformer to a specific task, the word embedding of all the words in the vocabulary (50k) is updated inside an embedding layer of the new transformer model, correct?
      Therefore, the word embeddings of the old pre-trained model aren't used/touched when retraining a new transformer, just like your explanation of the BPE-MB case.
      Because these embedding layers will be updated inside the new transformer model, adding new words to a vocabulary from 50k => 60k is not a problem, since it is part of the training. ☺
      I apologize for bothering you again; as I said before, you have done an excellent job; this is simply the only point on which I am unsure.

    • @futuremojo
      @futuremojo  1 year ago +2

      @@7900Nick
      "But just to be sure, there is an embedding layer inside that transformer that corresponds to the index of the words that have been tokenized in the vocabulary, right?"
      Correct.
      "So, when people train from scratch, continue to pre-train (MLM), or fine-tune a transformer to a specific task, the word embedding of all the words in the vocabulary (50k) is updated inside an embedding layer of the new transformer model, correct?"
      You have the right idea but we need to be careful here with wording. When you train a transformer from scratch/pre-train it, then yes, the embeddings keep getting updated during training. When it's time to fine-tune it, it's still the same transformer model but with an additional model head attached to it. The head will vary depending on the task. At that point, you can choose to train **only** the head (which means the embeddings won't change), or you can choose to let the transformer's body weights update as well (which means the embeddings will change). You can even choose to unfreeze only the few top layers of the transformer. The point is: when fine-tuning, whether the embedding layers update is your choice.
      "Therefore, the word embeddings of the old pre-trained model aren't used/touched when retraining a new transformer, just like your explanation of the BPE-MB case.
      Because these embedding layers will be updated inside the new transformer model, adding new words to a vocabulary from 50k => 60k is not a problem, since it is part of the training."
      If you have a pre-trained model but you decide that you want to train your own from scratch (but using the same architecture), then yeah, the embeddings will also be trained at the same time. And yes, you can have a larger vocabulary. For example, this is the default config for BERT:
      huggingface.co/docs/transformers/main/en/model_doc/bert#transformers.BertConfig
      It has a vocabulary size of 30,522. If you fit a tokenizer on a corpus such that it ends up with a vocabulary of 40,000, then you can instantiate a BERT model with that larger vocabulary and train it from scratch.
      "this is simply the only point on which I am unsure."
      It's fine. It's good to be clear on things. If something doesn't click, it probably means there's a hole in the explanation.

  • @gol197884266
    @gol197884266 1 year ago +1

    A gem 😊

  • @gigiopincio5006
    @gigiopincio5006 1 year ago +1

    wow.

  • @panditamey1
    @panditamey1 1 year ago

    Fantastic video. I have a question about head_dim.
    Why is the embed_dim divided by num_heads? I haven't understood it completely.

    • @futuremojo
      @futuremojo  1 year ago

      Because each head operates in a lower dimensional space.
      In our example, the original embedding dimension is 12 and we have three heads. So by dividing 12 by 3, each head now operates in a lower dimensional space of 4.
      Let's say we didn't do that. Instead, let's say we had each head operate in the original 12-dimensional space. That would dramatically increase memory requirements and training time. Maybe that would result in slightly better performance but the tradeoff was probably not worth it. By having each head operate in a lower dimensional space, we get the benefits of multiple heads while keeping the compute and memory requirements the same.
      There's also nothing stopping us from making each head dimension different. We could make the first head dimension 5, the second head dimension 3, and the last head dimension 4 so that it still adds up to 12, but then you sacrifice convenience and clarity for no benefit (AFAIK).
      Let me know if that helps.
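      A small NumPy sketch of that split (made-up sizes, illustrative only): a 12-dimensional embedding divided across 3 heads, so each head attends in a 4-dimensional subspace, and the per-head outputs concatenate back to 12.
      import numpy as np

      seq_len, embed_dim, num_heads = 5, 12, 3
      head_dim = embed_dim // num_heads                # 12 // 3 = 4

      X = np.random.randn(seq_len, embed_dim)

      # One query projection per head, each mapping 12 dimensions down to 4.
      W_q_heads = [np.random.randn(embed_dim, head_dim) for _ in range(num_heads)]
      Q_heads = [X @ W for W in W_q_heads]             # three (5, 4) query matrices

      # After attention, the per-head results are concatenated back to the model dimension.
      concat = np.concatenate(Q_heads, axis=-1)
      print(concat.shape)                              # (5, 12): same overall width as one full-size head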

    • @panditamey1
      @panditamey1 1 year ago +1

      @@futuremojo Thank you for such a thorough explanation!!!

    • @panditamey1
      @panditamey1 1 year ago

      @nitin_punjabi I have one more question for you. I tried implementing the same thing but without using TensorFlow, and I was able to run it when there was no batch.
      However, after creating batched data, I ran into the "shapes not aligned" issue.
      1 def scaled_self_attention(query, key, value):
      2 key_dim = key.shape[1]
      ----> 3 QK = np.dot(query,key.T)
      Here is a snippet from the code.
      Have you run into this issue?

    • @futuremojo
      @futuremojo  1 year ago

      @@panditamey1 I haven't run into this issue, but that's likely because there are subtle behaviour differences between dot and matmul.
      I would first log the inputs you're getting into your function (include what the transposed keys look like) vs the Colab notebook inputs. Make sure they're the same or set up in such a way that they would lead to the same result.
      If so, I would Google behaviour differences between dot and matmul. My guess is your issue is most likely related to that.
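      For anyone hitting the same error, a short NumPy sketch of that difference (illustrative shapes only): once a batch dimension is added, np.dot(query, key.T) no longer computes batched attention scores, while np.matmul broadcasts over the batch if only the last two axes are transposed.
      import numpy as np

      batch, seq_len, key_dim = 2, 5, 4
      query = np.random.randn(batch, seq_len, key_dim)
      key = np.random.randn(batch, seq_len, key_dim)

      # Batched scores: transpose only the last two axes, then use matmul (or @).
      scores = np.matmul(query, key.transpose(0, 2, 1)) / np.sqrt(key_dim)
      print(scores.shape)    # (2, 5, 5)

      # By contrast, np.dot on 3-D arrays sums over the last axis of the first argument
      # and the second-to-last axis of the second, so key.T (shape (4, 5, 2)) triggers
      # "shapes not aligned", and key.transpose(0, 2, 1) yields a (2, 5, 2, 5) array
      # rather than batched attention scores.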

    • @panditamey1
      @panditamey1 1 year ago

      @@futuremojo Sure, thanks a lot!!

  • @ilyas8523
    @ilyas8523 11 months ago

    Hi, question: if I am building an Encoder for a regression problem where the output values are found at the end of the encoder, then how would I go about this? How should I change the feed-forward network to make this work? Should it take all of the embeddings at once? I am watching the video again so maybe I'll figure out the answer, but until then, some guidance or advice would be great. Thanks!
    To be clear, each input sequence of my data is about 1024 tokens long [text data], and the output I need to predict is an array of 2 numerical outputs [y1, y2].

    • @futuremojo
      @futuremojo  11 months ago +1

      If you want to stick with an Encoder solution, then one idea is to use two regressor heads on the CLS token. One regressor head outputs y1, the other regressor head outputs y2.
      If the numbers are bound (e.g. between 1 and 10 inclusive), you could even have two classifiers processing the CLS token.
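      A rough Keras-style sketch of the two-regressor-head idea (illustrative only: the stand-in encoder below is just an embedding plus pooling so the snippet runs on its own, where a real solution would take the pooled [CLS] vector from a pre-trained encoder):
      import tensorflow as tf

      vocab_size, d_model, seq_len = 30522, 64, 1024

      token_ids = tf.keras.Input(shape=(seq_len,), dtype="int32")
      x = tf.keras.layers.Embedding(vocab_size, d_model)(token_ids)   # stand-in encoder
      cls_vector = tf.keras.layers.GlobalAveragePooling1D()(x)        # stand-in for the [CLS] output

      y1 = tf.keras.layers.Dense(1, name="y1")(cls_vector)   # regressor head for the first target
      y2 = tf.keras.layers.Dense(1, name="y2")(cls_vector)   # regressor head for the second target

      model = tf.keras.Model(token_ids, [y1, y2])
      model.compile(optimizer="adam", loss="mse")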

    • @ilyas8523
      @ilyas8523 11 months ago

      @@futuremojo The targets are float values where the min is around -2.#### and the max around 4.#### in the training data, with no set boundaries. I'm not sure about the test data since I have no access to it (Kaggle Competition). So I'll probably go with two regressor heads. Time to learn how to do this; I look forward to it! Thank you once again for taking the time to teach us. Your lessons have been very useful in my journey.
      Edit: Do I have to fine-tune BERT for this?

  • @100deep1001
    @100deep1001 1 year ago

    just adding a comment so that the video reaches more people :)

  • @sourabhguptaurl
    @sourabhguptaurl 1 year ago

    Just wonderful. How do I pay you?

  • @MachineLearningZuu
    @MachineLearningZuu 2 months ago

    Ma bro just drop the "Best NLP Course" on Planet Earth and disappeared.

  • @amparoconsuelo9451
    @amparoconsuelo9451 1 year ago

    Will you please show how your NLP Demystified material looks as complete source code in Python or llama.cpp?

  • @iqranaveed2660
    @iqranaveed2660 1 year ago

    Sir, can you guide me on what to do after building the transformer from scratch? Your video is too good.

    • @futuremojo
      @futuremojo  1 year ago

      It depends on your goal.

    • @iqranaveed2660
      @iqranaveed2660 1 year ago

      @@futuremojo I want to do abstractive summarization. Please can you guide me on the further process?

    • @futuremojo
      @futuremojo  1 year ago +1

      @@iqranaveed2660 At this point, you can just use an LLM. GPT, Bard, Claude, etc. Input your text along with some instructions, get a summarization.

    • @iqranaveed2660
      @iqranaveed2660 1 year ago

      @@futuremojo Sir, please can you guide me on how to use the from-scratch transformer for summarization? I don't want to use a pretrained transformer. Please reply.

  • @nlpengineer1574
    @nlpengineer1574 1 year ago

    I hope you do the same lesson using PyTorch.
    I picked up some ideas but am still struggling with the code.
    Great explanation though.

    • @futuremojo
      @futuremojo  1 year ago +1

      Which part of the code are you struggling with?

    • @nlpengineer1574
      @nlpengineer1574 1 year ago

      @@futuremojo The theory is simple and easy, but when the coding starts I'm lost.
      - I don't understand what happens in the embedding layer, because it seems like the (W matrices * word_vector) products are embedded in this layer, while in theory they're not!
      - Secondly: what is vocab_size? By my understanding it should be the length of the sequence, but every implementation I read proves how wrong I am.
      - Why should we integer-divide embedding_size // n_heads to get d_k? 16:05
      ..
      Sorry if I sound rude, but this made me really frustrated during this week and I don't know what I'm missing here.. Thank you again.

    • @futuremojo
      @futuremojo  1 year ago +1

      @@nlpengineer1574
      1. Re: embedding layer, have you watched the video on word vectors (czcams.com/video/IebL0RQF5lg/video.html)? That should clear up any confusion regarding embeddings. In short, you can think of the embedding layer as a lookup table. Each word in your vocabulary maps to a row in this table. So the word "dog" might map to row 1, in which case, the embedding from row 1 is used as the embedding for the word "dog". These embeddings can be pre-trained or trained along with the model. The embedding weights are unrelated to the transformer weights. See the word vectors video for more info.
      2. vocab_size is exactly what it sounds like: it's the size of your vocabulary. It's not the length of the sequence. The vocabulary represents all the different character, words, or subwords your model handles. It can't be infinite because your embedding table has to be a fixed size. If you're wondering where the vocabulary comes from or how the size is determined, this is covered in the word vectors video.
      3. A single attention head works on the full embedding size, right? Ok, let's say you now have three heads. If we DON'T divide, then we essentially triple the computation cost because it's going to be 3 * embedding_size. By dividing embedding_size by n_heads, we can have multiple heads for roughly the same computation cost, and even though it means the embeddings in each head will now be smaller, it turns out it still works pretty well.
      Hope that helps.
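      A tiny NumPy sketch of the lookup-table idea from point 1 (toy vocabulary and sizes, illustrative only):
      import numpy as np

      vocab = {"<pad>": 0, "the": 1, "dog": 2, "barks": 3}       # vocab_size = 4 in this toy example
      vocab_size, embed_dim = len(vocab), 8

      embedding_table = np.random.randn(vocab_size, embed_dim)   # trained along with the model

      token_ids = [vocab[w] for w in ["the", "dog", "barks"]]
      sentence = embedding_table[token_ids]                      # (3, 8): one row looked up per token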

    • @nlpengineer1574
      @nlpengineer1574 1 year ago +1

      @@futuremojo Thank you, man, for your time. You are a true hero for me.
      I still had a problem with vocab_size != seq_length, but the way I think of it right now is that the embedding layer creates a blueprint of where we retrieve our vocabulary entries (vocab_size) and the size of the embedding we will give them (embed_size).
      You mention "and even though it means the embeddings in each head will now be smaller, it turns out it still works pretty well" -- here you completely address my concern, because if we divide the embeddings by num_heads we get a smaller embedding for each head, but since that's not a problem, the embedding size is somewhat arbitrary here.
      Anyway, I feel more confident right now about my understanding.
      Again, thank you for your time and patience.

  • @jihanfarouq6904
    @jihanfarouq6904 1 year ago

    This is amazing. I need those slides. Could you send them to me, please?

  • @user-qj3ig7qz3y
    @user-qj3ig7qz3y 5 months ago

    I don't understand why the input is always 512 tokens... how do I make the size bigger?

  • @user-pt7gs2ei1r
    @user-pt7gs2ei1r 1 year ago

    I want to kiss and hug you, and kiss and hug you, and ... till the end of the world, you are such a talented and great teacher!

  • @peace-it4rg
    @peace-it4rg 2 months ago

    Bro really made a transformer video with a transformer

  • @mostafaadel3452
    @mostafaadel3452 7 months ago

    Can you share the slides, please?

  • @TTTrouble
    @TTTrouble 1 year ago

    Oh my gosh I can’t tell you how many times in your explanation of transformers, I was like…..OH GOD NOT ANOTHER wrinkle of complexity. My brain hurts…
    I honestly feel like I understand the math strictly speaking, but so much of the architecture seems random or hard to understand why it works. I think you did a fantastic job explaining everything in a slow and methodical way, but alas I just find there’s something about this I can’t wrap my head around even after watching dozens of videos on it.
    How do you get 3 different matrices (the key, query, and value) from 1 input to somehow learn meaning? It's not clear to me why the dot product of the embedding and another word can represent higher-level meaning; that just sounds like magic (because obviously it works). Blargh, and that's before you break it out into multiple heads and say you can train 8 heads of attention from the single input vector. Like, how do the KQV matrices learn generalized meaning in a sentence from SGD? Why not add 100 heads or 1000 heads if going from 1 to 8 was useful?
    Blah, sorry, I'm rambling; it's frustrating that I'm not even sure I can articulate exactly what it is about self-attention that feels like cheating to me. Something about it is not clicking, though the rote math of it all makes well enough sense.
    All that aside thanks for all your hard work and sharing this for me to struggle through. It is much appreciated.

    • @ilyas8523
      @ilyas8523 1 year ago +1

      Remember that the embedding goes through a positional encoding layer, so the word "dog" can have many different vectors depending on its position and the other words in the sentence. The dot product is well suited for this purpose because it is related to the cosine similarity between two vectors. When the dot product of two vectors is high, it indicates that they are pointing in similar directions or have similar orientations. This implies that the query vector and key vector are more similar or relevant to each other.
      Edit: I am also not fully understanding everything, but the secret is to keep doing more and more research.

    • @TTTrouble
      @TTTrouble 1 year ago

      @@ilyas8523 haha agreed very much so. I must have literally watched and learned at least a dozen if not more explanations of the self attention mechanism and have talked with GPT4 to try to provide analogies and better ways to abstract out what’s happening.
      Refreshing some linear algebra with 3blue1brown videos helped as well. Also, writing out the expansion by hand from memory, first with an example sentence embedding using numbers and then generalizing the process to variables with word vectors and subscripts instead of numbers, was a very tedious process, but I think that's finally when certain aspects started to click.
      I still struggle to fathom how and why the KQV weight matrices generalize so well to any given sequence and seem to be the distillation of human reasoning, but slow as molasses, my brain is mulling through the theory of it all in endless wonder. If I see training as a black box and assume the described trained weight matrices are magically produced, I understand how the inference part works fairly robustly.
      I keep getting distracted by all the constant developments and whatnot but it does finally feel like I’m making some incremental progress thanks to sticking with really trying to have a conceptual understanding of the fundamentals. Anyhow forcing myself to articulate my doubts is helpful for me in its own way and not meant to waste anyone’s time 😅.
      Hope your journey into understanding all of this stuff is going well, and thanks for your input!

  • @chrs2436
    @chrs2436 3 months ago

    The code in the notebook doesn't work
    😮‍💨

  • @YHK_YT
    @YHK_YT 1 year ago

    Fard

    • @YHK_YT
      @YHK_YT 1 month ago

      I have no recollection of writing this

  • @efexzium
    @efexzium 1 year ago

    Listeria is not a rare word in Spanish

  • @prashlovessamosa
    @prashlovessamosa 1 month ago

    Where are you, buddy? Cook something, please.