Future Mojo
Canada
Joined 18 April 2022
Exploring emerging technology and making complex concepts accessible.
NLP Demystified 15: Transformers From Scratch + Pre-training and Transfer Learning With BERT/GPT
CORRECTION:
00:34:47: that should be "each a dimension of 12x4"
Course playlist: czcams.com/play/PLw3N0OFSAYSEC_XokEcX8uzJmEZSoNGuS.html
Transformers have revolutionized deep learning. In this module, we'll learn how they work in detail and build one from scratch. We'll then explore how to leverage state-of-the-art models for our projects through pre-training and transfer learning. We'll learn how to fine-tune models from Hugging Face and explore the capabilities of GPT from OpenAI. Along the way, we'll tackle a new task for this course: question answering.
Colab notebook: colab.research.google.com/github/futuremojo/nlp-demystified/blob/main/notebooks/nlpdemystified_transformers_and_pretraining.ipynb
Timestamps
00:00:00 Transformers from scratch
00:01:05 Subword tokenization
00:04:27 Subword tokenization with byte-pair encoding (BPE)
00:06:53 The shortcomings of recurrent-based attention
00:07:55 How Self-Attention works
00:14:49 How Multi-Head Self-Attention works
00:17:52 The advantages of multi-head self-attention
00:18:20 Adding positional information
00:20:30 Adding a non-linear layer
00:22:02 Stacking encoder blocks
00:22:30 Dealing with side effects using layer normalization and skip connections
00:26:46 Input to the decoder block
00:27:11 Masked Multi-Head Self-Attention
00:29:38 The rest of the decoder block
00:30:39 [DEMO] Coding a Transformer from scratch
00:56:29 Transformer drawbacks
00:57:14 Pre-Training and Transfer Learning
00:59:36 The Transformer families
01:01:05 How BERT works
01:09:38 GPT: Language modelling at scale
01:15:13 [DEMO] Pre-training and transfer learning with Hugging Face and OpenAI
01:51:48 The Transformer is a "general-purpose differentiable computer"
This video is part of Natural Language Processing Demystified, a free, accessible course on NLP.
Visit www.nlpdemystified.org/ to learn more.
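The self-attention and masked self-attention steps listed in the timestamps can be sketched in a few lines of plain Python. This is a minimal illustration and not the notebook's code: the function names `softmax` and `attention` are hypothetical, the toy input is made up, and the learned query/key/value projections and multiple heads of a real Transformer are left out, so the same vectors serve as queries, keys, and values.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values, causal=False):
    """Scaled dot-product attention over lists of vectors.

    With causal=True, position i may only attend to positions <= i,
    which is the masking idea behind a decoder's self-attention.
    Returns (outputs, attention_weights).
    """
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        scores = []
        for j, k in enumerate(keys):
            if causal and j > i:
                # Masked position: exp(-inf) is 0, so it gets zero weight.
                scores.append(float("-inf"))
            else:
                dot = sum(a * b for a, b in zip(q, k))
                scores.append(dot / math.sqrt(d_k))
        weights = softmax(scores)
        # Each output is the attention-weighted average of the value vectors.
        output = [
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights


# Toy sequence of three 2-dimensional token vectors (made-up values).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(x, x, x, causal=True)
for row in weights:
    print([round(w, 3) for w in row])
```

With `causal=True`, the first token can only attend to itself, so its weight row puts all the mass on position 0. Dropping the flag gives the unmasked attention used in an encoder block.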
Views: 66,374
Videos
NLP Demystified 14: Machine Translation With Sequence-to-Sequence and Attention
13K views, 1 year ago
Course playlist: czcams.com/play/PLw3N0OFSAYSEC_XokEcX8uzJmEZSoNGuS.html Whether it's translation, summarization, or even answering questions, a lot of NLP tasks come down to transforming one type of sequence into another. In this module, we'll learn to do that using encoders and decoders. We'll then look at the weaknesses of the standard approach, and enhance our model with Attention. In the d...
NLP Demystified 13: Recurrent Neural Networks and Language Models
9K views, 1 year ago
Course playlist: czcams.com/play/PLw3N0OFSAYSEC_XokEcX8uzJmEZSoNGuS.html We'll learn how to get computers to generate text through a technique called recurrence. We'll also look at the weaknesses of the bag-of-words approaches we've seen so far, how to capture the information in word order, and in the demo, we'll build a part-of-speech tagger and text-generating language model. Colab notebook: ...
NLP Demystified 12: Capturing Word Meaning with Embeddings
8K views, 2 years ago
Course playlist: czcams.com/play/PLw3N0OFSAYSEC_XokEcX8uzJmEZSoNGuS.html We'll learn a method to vectorize words such that words with similar meanings have closer vectors (aka "embeddings"). This was a breakthrough in NLP and boosted performance on a variety of NLP problems while addressing the shortcomings of previous approaches. We'll look at how to create these word embeddings and how to use...
NLP Demystified 11: Essential Training Techniques for Neural Networks
6K views, 2 years ago
Course playlist: czcams.com/play/PLw3N0OFSAYSEC_XokEcX8uzJmEZSoNGuS.html In our previous deep dive into neural networks, we looked at the core mechanisms behind how they learn. In this video, we'll explore all the additional details when it comes to effectively training them. We'll look at how to converge faster to a minimum, when to use certain activation functions, when and how to scale our f...
NLP Demystified 10: Neural Networks From Scratch
13K views, 2 years ago
Course playlist: czcams.com/play/PLw3N0OFSAYSEC_XokEcX8uzJmEZSoNGuS.html Neural Networks have led to incredible breakthroughs in all things AI, but at the core, they're pretty simple. In this video, we'll learn how neural networks work and how they "learn". By the end, you'll have a clear understanding of how neural networks work under the hood. We'll take a bottom-up approach starting with sim...
NLP Demystified 9: Automatically Finding Topics in Documents with Latent Dirichlet Allocation
9K views, 2 years ago
Course playlist: czcams.com/play/PLw3N0OFSAYSEC_XokEcX8uzJmEZSoNGuS.html What do you do when you need to make sense of a pile of documents and have no other information? In this video, we'll learn one approach to this problem using Latent Dirichlet Allocation. We'll cover how it works, then build a model with spaCy and Gensim to automatically discover topics present in a document and to search ...
NLP Demystified 8: Text Classification With Naive Bayes (+ precision and recall)
9K views, 2 years ago
Course playlist: czcams.com/play/PLw3N0OFSAYSEC_XokEcX8uzJmEZSoNGuS.html In this module, we'll apply everything we've learned so far to a core task in NLP: text classification. We'll learn: - how to derive Bayes' theorem - how the Naive Bayes classifier works under the hood - how to train a Naive Bayes classifier in scikit-learn and along the way, deal with issues that come up. - how things can...
NLP Demystified 7: Building Models (ML modelling overview, bias, variance, evaluation)
6K views, 2 years ago
Course playlist: czcams.com/play/PLw3N0OFSAYSEC_XokEcX8uzJmEZSoNGuS.html Through a high-level overview of modelling, we'll - clearly define "machine learning" - look at the different types of machine learning - learn how to evaluate model performance - learn what bias and variance are - see what to do about overfitting and underfitting - explore practical concerns for model deployment. If you'r...
NLP Demystified 6: TF-IDF and Simple Document Search
8K views, 2 years ago
Course playlist: czcams.com/play/PLw3N0OFSAYSEC_XokEcX8uzJmEZSoNGuS.html We look at the problems of the previous bag-of-words approach, then use an improved technique (TF-IDF) to overcome them. In the demo, we'll use spaCy and scikit-learn to build TF-IDF vectors and build a simple document search engine. Colab notebook: colab.research.google.com/github/futuremojo/nlp-demystified/blob/main/note...
NLP Demystified 5: Basic Bag-of-Words and Measuring Document Similarity
10K views, 2 years ago
Course playlist: czcams.com/play/PLw3N0OFSAYSEC_XokEcX8uzJmEZSoNGuS.html After preprocessing our text, we take our first step in turning text into numbers so our machines can start working with them. We'll explore: - a simple "bag-of-words" (BoW) approach. - learn how to use cosine similarity to measure document similarity. - the shortcomings of this BoW approach. In the demo, we'll use a combi...
NLP Demystified 4: Advanced Preprocessing (part-of-speech tagging, entity tagging, parsing)
11K views, 2 years ago
Course playlist: czcams.com/play/PLw3N0OFSAYSEC_XokEcX8uzJmEZSoNGuS.html We'll look at tagging our tokens with useful information including part-of-speech tags and named entity tags. We'll also explore different types of sentence parsing to help extract the meaning of a sentence. In the demo, we'll explore how to get these things done with spaCy and how to use the library's "matchers" and other...
NLP Demystified 3: Basic Preprocessing (case-folding, stop words, stemming, lemmatization)
10K views, 2 years ago
Course playlist: czcams.com/play/PLw3N0OFSAYSEC_XokEcX8uzJmEZSoNGuS.html Depending on our goal, we may preprocess text further. We'll cover case-folding, stop word removal, stemming, and lemmatization. We'll go over their use cases, their tradeoffs, and how to get them done using spaCy. Colab notebook: colab.research.google.com/github/futuremojo/nlp-demystified/blob/main/notebooks/nlpdemystifie...
NLP Demystified 2: Text Tokenization
13K views, 2 years ago
Course playlist: czcams.com/play/PLw3N0OFSAYSEC_XokEcX8uzJmEZSoNGuS.html The usual first step in NLP is to chop our documents into smaller pieces in a process called Tokenization. We'll look at the challenges involved and how to get it done. Colab notebook: colab.research.google.com/github/futuremojo/nlp-demystified/blob/main/notebooks/nlpdemystified_preprocessing.ipynb Timestamps: 00:00 Tokeni...
NLP Demystified 1: Introduction
23K views, 2 years ago
Course playlist: czcams.com/play/PLw3N0OFSAYSEC_XokEcX8uzJmEZSoNGuS.html In this introduction, we learn what makes NLP useful, what makes it challenging, and what we'll learn in this course. Timestamps: 00:00:00 Introduction 00:00:27 Applications of NLP 00:01:24 What makes NLP challenging 00:03:58 The evolution of NLP 00:05:16 What you'll get from this course 00:06:04 What we'll cover in this c...
First of all, thank you so much for this course. I have understood everything up to this point. I have a couple of questions though. I am still confused about the relationship between Latent Dirichlet Allocation and Collapsed Gibbs Sampling. You talked about Latent Dirichlet Allocation for a long time but then kind of shifted to CGS. I am having trouble understanding how the two relate. (I get the multiplying thing and finding out the probabilities and shooting the dart thing.) I'd love an explanation. Thanks
why aren't university teachers like you?
Concise and easily understandable. Thanks a lot for the series.
Just completed the entire playlist. It was an absolute delight to watch. This last lecture was a favorite of mine because you explained it in the form of a story. Thank you so much for sharing this knowledge with us and hope to learn more from you :D
Where are you buddy cook something please
Awesome
Ma bro just drop the "Best NLP Course" on Planet Earth and disappeared.
Thank you it was very well done!
What was provided: A high quality, easily digestible, and calm introduction to Transformers that could take almost anyone from zero to GPT in a single video. What I got: It will probably take me longer than I'd like to get good at martial arts.
This series is incredible! I can't believe we get to access such content for free online... what an era
This is really high quality content. Why did it take so long for CZcams to recommend this.
bro really made transformer video with transformer
Best explanation. Crisp and exhaustive.
It's really amazing work of NLP
the code in the notebook doesnt work 😮💨
My god this video is genius.
Very well done! Thanks.
Great work 👍
This is one of the best videos I have ever enjoyed while learning machine learning. Covering everything from conditional probability to the Naive Bayes demo in detail, and still concisely, is an art. Wow, this is an excellent playlist.
Excited for this course
Thank you !
Thank you !!!
thx so much
I'm a research Scholar and came across your channel. I was truly amazed at how you broke through the concepts and explained them.
Thank you!
I don't understand why the input is always 512 tokens. How can it be made bigger?
Like what the hell. You made it so simple to learn. I kept consuming and taking notes, adding thoughts, perspective, feeling super productive. (I'm using Obsidian to link concepts.) About three years ago the best explanation I could get was probably from Andrew Ng, and I have to admit yours is so much better. My opinion might be biased since I have been going back and forth in NLP time after time, but looking at the comment section I'm pretty sure my opinion is validated.
Thank you!
Hello goddddddddddddd. Thank you so much
Thank you so much for these videos!! Definitely one of the best videos on the NLP out there!
Excellent video
Best channel
I am a bit confused about the cosine similarity metric. I thought the cosine similarity range is from -1 to 1, instead of 0 to 1. I've seen a 0 to 1 threshold being used elsewhere as well, but I do notice that more popular embedding models generate negative vector elements, and naturally the normalized versions produce ranges from -1 to 1. Can you please clarify this? I've been struggling to wrap my head around this.
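On the range question in the comment above: cosine similarity lies between -1 and 1 in general, but when every vector element is non-negative (word counts, TF-IDF weights) the dot product cannot be negative, so the result stays between 0 and 1. The sketch below illustrates this with made-up vectors. The helper name `cosine_similarity` and all values are assumptions for illustration, not taken from the course notebooks.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Bag-of-words style count vectors: no negative elements,
# so the similarity cannot drop below 0.
counts_a = [2, 0, 1]
counts_b = [0, 3, 1]
print(cosine_similarity(counts_a, counts_b))

# Dense embedding-style vectors can have negative elements,
# so the similarity can be negative. These two point in
# opposite directions, giving a value of approximately -1.
emb_a = [0.5, -1.0]
emb_b = [-0.5, 1.0]
print(cosine_similarity(emb_a, emb_b))
```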
You kind of sound like Casually Explained
Many thanks for the detailed explanation. Your video has been helpful.
In SGNS, when you are talking about matrices of context and target embeddings (10000 * 300), what do these matrices have/contain before the training has started (collection of one hot encodings or arbitrary numbers)? At 17:00, I also did not understand how only taking the target word embeddings would be sufficient to capture similarity between words.
Best ever
omggg, kudos to your efforts!!!!! I really wish you have more subscribers
Hi! I'm currently on the first episode of this course. I'm really excited and hope to learn a lot. Will you create more tutorials on these kinds of topics? Or will these 15 videos alone turn me into some kind of expert (remember, "kinda" expert) in NLP and transformers, so I can pre-train a model myself and fine-tune it properly (assuming I have the capability to gather the data)? Thanks!
Very helpful set of videos. However, it is unclear how the weights determined for one set of input values X1 and the corresponding expected output value Y1 will hold for any other set of input values X2 and their corresponding output value Y2. In your example, the weights computed for inputs x1=2, x2=3 and expected output y=0 may be different for any other inputs and expected output.
There really aren't enough words to express how thankful I am for this awesome content. It's amazing that you've made it available to everyone for free. Thank you so much. May Allah (God) help you like you help others.
These are very helpful videos, thank you! There are still a few concepts that are unclear. You have mentioned that documents are segmented into a list of sentences, and each sentence is segmented into a list of tokens. This implies that the list of tokens is empty to begin with, and after tokenization, we end up with a list of tokens (token vocabulary?) specific to the corpus we provide. But later, when you start the tokenization using spaCy, you are loading some db. What is this doing? Shouldn't spaCy just be a program/tool that has some "advanced rules" to tokenize a document that we provide, and create a new token vocabulary from scratch, and not use its own db/list created from some unknown corpus as a starting point? And finally, why tokenize a sentence at a time? Is it because a document can be large? Could it have read in a fixed number of words at a time, say 100 words, and then tokenized them? A "sentence" should have no meaning for the tokenizer, is this right? Actually, how does a tokenizer even "know" when a sentence starts or ends? Thanks for any clarifications!
The db you are referring to is the statistical model that was trained on some annotated data (I forgot the name here). That is the thing that tokenizes the given document or sentences. spaCy is just a module that helps us tokenize our data according to that statistical model. All this is what I think, anyway. Just a beginner.
Fantastic sir
Thanks for your vids
What an amazing tutorial. Thank you
I'm from France and I just came across this superb playlist, which for me is the most complete one on YouTube! Thank you, a big thank you! It's hard to find courses of this quality.
Can you share the slides, please?
Hi there! Loved the series on NLP. Can you please share any link or resource on how to code up the accuracy function like you did with loss? I would like to calculate accuracy of the epochs.
What is this model's accuracy, or its BLEU score? How do I work it out, brother?
Hello. Thank you for such a detailed course. I have a question about using pre-trained language models. My language (Azerbaijani) is not yet available in the library. Are you covering this topic further or is it not worth wasting time on learning without this model?
Why is there no 'end of entity' tag? I'm sure some might say it's redundant and unnecessary because when you come to the 'o' you are *obviously* at the end of the entity. But it is just as possible that the 'o' is a mistake, especially in a long multi word name. An end tag would be both more explicit and eliminate any ambiguity. But maybe that's just me....
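To make the point in the comment above concrete, here is a minimal sketch of how entity boundaries are usually recovered from tags without an explicit end marker, assuming the common B-/I-/O convention. The function name `bio_to_spans` and the example tags are hypothetical. An entity is closed as soon as the next tag does not continue it, which is the implicit "end" the comment describes. Tagging schemes with an explicit last-token tag (often called BIOES or BILOU) also exist.

```python
def bio_to_spans(tags):
    """Return (start, end_exclusive, label) spans from a list of BIO tags."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        # An open entity continues only on an I- tag with the same label.
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            # Anything else (O, a new B-, a mismatched I-) ends the entity.
            spans.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        # A stray I- tag with no open entity is ignored in this sketch.
    if start is not None:
        # The sequence ended while an entity was still open.
        spans.append((start, len(tags), label))
    return spans


tags = ["B-PER", "I-PER", "O", "B-ORG", "B-LOC", "I-LOC"]
print(bio_to_spans(tags))
```

Because the end is inferred, a single wrong `O` in the middle of a long name silently splits or truncates the entity, which is the ambiguity the comment points out.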