Serrano.Academy
  • 54
  • 6,553,153
Direct Preference Optimization (DPO) - How to fine-tune LLMs directly without reinforcement learning
Direct Preference Optimization (DPO) is a method used for training Large Language Models (LLMs). DPO is a direct way to train the LLM without the need for reinforcement learning, which makes it more effective and more efficient.
Learn about it in this simple video!
This is the third one in a series of 4 videos dedicated to the reinforcement learning methods used for training LLMs.
Full Playlist: czcams.com/play/PLs8w1Cdi-zvYviYYw_V3qe6SINReGF5M-.html
Video 0 (Optional): Introduction to deep reinforcement learning czcams.com/video/SgC6AZss478/video.html
Video 1: Proximal Policy Optimization czcams.com/video/TjHH_--7l8g/video.html
Video 2: Reinforcement Learning with Human Feedback czcams.com/video/Z_JUqJBpVOk/video.html
Video 3 (This one!): Direct Preference Optimization
00:00 Introduction
01:08 RLHF vs DPO
07:19 The Bradley-Terry Model
11:25 KL Divergence
16:32 The Loss Function
14:36 Conclusion
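
Below is a minimal sketch of the DPO loss discussed at 16:32, assuming the per-response log-probabilities under the current and reference models have already been computed; all function and variable names are illustrative, not from the video.

    import torch
    import torch.nn.functional as F

    def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
        # Implicit rewards: beta * log(pi_theta / pi_ref) for each response
        chosen_reward = beta * (logp_chosen - ref_logp_chosen)
        rejected_reward = beta * (logp_rejected - ref_logp_rejected)
        # Bradley-Terry preference: maximize P(chosen beats rejected) = sigmoid(r_c - r_r)
        return -F.logsigmoid(chosen_reward - rejected_reward).mean()

    # Toy per-sequence log-probabilities:
    loss = dpo_loss(torch.tensor([-12.3]), torch.tensor([-15.1]),
                    torch.tensor([-12.0]), torch.tensor([-14.8]))
    print(loss)  # a scalar; backpropagating it fine-tunes the model directly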
Get the Grokking Machine Learning book!
manning.com/books/grokking-machine-learning
Discount code (40%): serranoyt
(Use the discount code on checkout)
Views: 2,247

Video

KL Divergence - How to tell how different two distributions are
3.3K views · 1 day ago
Correction (10:26). The probabilities are wrong. The correct ones are here: For Die 1: 0.4^4 * 0.2^2 * 0.1^1 * 0.1^1 * 0.2^2 For Die 2: 0.4^4 * 0.1^2 * 0.2^1 * 0.2^1 * 0.1^2 For Die 3: 0.1^4 * 0.2^2 * 0.4^1 * 0.2^1 * 0.1^2 Kullback Leibler (KL) divergence is a way to measure how far apart two distributions are. In this video, we learn KL-divergence in a simple way, using a probability game with...
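
To make the corrected numbers concrete, here is a small sketch computing the KL divergence between two of the dice, assuming the face probabilities read off the correction above (the helper name is illustrative):

    import math

    def kl_divergence(p, q):
        # D_KL(P || Q) = sum_i p_i * log(p_i / q_i); zero only when P and Q match
        return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

    die1 = [0.4, 0.2, 0.1, 0.1, 0.2]  # face probabilities of Die 1
    die2 = [0.4, 0.1, 0.2, 0.2, 0.1]  # face probabilities of Die 2
    print(kl_divergence(die1, die2))  # ≈ 0.14: the dice differ
    print(kl_divergence(die1, die1))  # 0.0: a die compared with itself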
Why do we divide by n-1 to estimate the variance? A visual tour through Bessel correction
11K views · 1 month ago
Correction: At 30:42 I write "X = Y". They're not equal, what I meant to say is "X and Y are identically distributed". The variance is a measure of how spread out a distribution is. In order to estimate the variance, one takes a sample of n points from the distribution, and calculate the average square deviation from the mean. However, this doesn't give a good estimate of the variance of the di...
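
A quick simulation sketch of the bias described above, assuming samples of a fair six-sided die, whose true variance is 35/12 ≈ 2.92:

    import random

    random.seed(0)
    n, trials = 5, 100_000
    biased = unbiased = 0.0
    for _ in range(trials):
        sample = [random.randint(1, 6) for _ in range(n)]
        mean = sum(sample) / n
        ss = sum((x - mean) ** 2 for x in sample)
        biased += ss / n          # dividing by n underestimates on average
        unbiased += ss / (n - 1)  # Bessel's correction removes the bias
    print(biased / trials)    # ≈ 2.33: too small
    print(unbiased / trials)  # ≈ 2.92: close to 35/12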
Reinforcement Learning with Human Feedback - How to train and fine-tune Transformer Models
8K views · 4 months ago
Reinforcement Learning with Human Feedback (RLHF) is a method used for training Large Language Models (LLMs). At the heart of RLHF lies a very powerful reinforcement learning method called Proximal Policy Optimization. Learn about it in this simple video! This is the second one in a series of 3 videos dedicated to the reinforcement learning methods used for training LLMs. Full Playlist: czcams.c...
Proximal Policy Optimization (PPO) - How to train Large Language Models
18K views · 5 months ago
Reinforcement Learning with Human Feedback (RLHF) is a method used for training Large Language Models (LLMs). At the heart of RLHF lies a very powerful reinforcement learning method called Proximal Policy Optimization. Learn about it in this simple video! This is the first one in a series of 3 videos dedicated to the reinforcement learning methods used for training LLMs. Full Playlist: czcams.c...
Stable Diffusion - How to build amazing images with AI
17K views · 6 months ago
This video is about Stable Diffusion, the AI method to build amazing images from a prompt. If you like this material, check out LLM University from Cohere! llm.university Get the Grokking Machine Learning book! manning.com/books/grokking-ma... Discount code (40%): serranoyt (Use the discount code on checkout) 0:00 Introduction 1:27 How does Stable Diffusion work? 2:55 Embeddings 12:55 Diffusion...
What are Transformer Models and how do they work?
103K views · 7 months ago
This is the last of a series of 3 videos where we demystify Transformer models and explain them with visuals and friendly examples. Video 1: The attention mechanism in high level czcams.com/video/OxCpWwDCDFQ/video.html Video 2: The attention mechanism with math czcams.com/video/UPtG_38Oq8o/video.html Video 3 (This one): Transformer models If you like this material, check out LLM University from...
The math behind Attention: Keys, Queries, and Values matrices
214K views · 10 months ago
This is the second of a series of 3 videos where we demystify Transformer models and explain them with visuals and friendly examples. Video 1: The attention mechanism in high level czcams.com/video/OxCpWwDCDFQ/video.html Video 2: The attention mechanism with math (this one) Video 3: Transformer models czcams.com/video/qaWMOYf4ri8/video.html If you like this material, check out LLM University fr...
The Attention Mechanism in Large Language Models
83K views · 11 months ago
Attention mechanisms are crucial to the huge boom LLMs have recently had. In this video you'll see a friendly pictorial explanation of how attention mechanisms work in Large Language Models. This is the first of a series of three videos on Transformer models. Video 1: The attention mechanism in high level (this one) Video 2: The attention mechanism with math: czcams.com/video/UPtG_38Oq8o/video....
The Binomial and Poisson Distributions
10K views · 1 year ago
If on average, 3 people enter a store every hour, what is the probability that over the next hour, 5 people will enter the store? The answer lies in the Poisson distribution. In this video you'll learn this distribution, starting from a much simpler one, the Binomial distribution. Euler number video: czcams.com/video/oikl9FCISqU/video.html Grokking Machine Learning book: bit.ly/grokkingML 40% d...
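
The question in the description can be checked directly; a short sketch, with the Binomial approximation included to mirror the video's build-up from the simpler distribution:

    import math

    # If on average 3 people enter per hour, the probability that exactly 5 enter:
    lam, k = 3, 5
    poisson = math.exp(-lam) * lam ** k / math.factorial(k)
    print(poisson)  # ≈ 0.1008

    # The Poisson arises as the limit of Binomial(n, lam/n) for large n:
    n = 10_000
    p = lam / n
    binomial = math.comb(n, k) * p ** k * (1 - p) ** (n - k)
    print(binomial)  # ≈ 0.1008 as well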
Euler's number, derivatives, and the bank at the end of the universe
3.7K views · 1 year ago
Euler's number, e, is defined as a limit. The function e to the x is (up to multiplying by a constant) the only function that is its own derivative. How are these two related? In this video you'll find an explanation for this phenomenon using banking interest rates, and a very particular bank, located at the end of the universe.
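
A quick numerical sketch of the limit behind e, in the spirit of the banking story: compounding a 100% annual rate n times a year yields (1 + 1/n)^n, which approaches e as n grows.

    for n in (1, 12, 365, 1_000_000):
        print(n, (1 + 1 / n) ** n)
    # 1 -> 2.0, 12 -> 2.6130, 365 -> 2.7146, 1000000 -> 2.7183 (approaching e)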
Decision trees - A friendly introduction
11K views · 1 year ago
A video about decision trees, and how to train them on a simple example. Accompanying blog post: medium.com/@luis.serrano/splitting-data-by-asking-questions-decision-trees-74afed9cd849 For a code implementation, check out this repo: github.com/luisguiserrano/manning/tree/master/Chapter_9_Decision_Trees Helper videos: - Gini index: czcams.com/video/u4IxOk2ijSs/video.html - Entropy and informatio...
How do you minimize a function when you can't take derivatives? CMA-ES and PSO
8K views · 1 year ago
What is Quantum Machine Learning?
11K views · 1 year ago
Denoising and Variational Autoencoders
23K views · 2 years ago
Eigenvectors and Generalized Eigenspaces
26K views · 2 years ago
Thompson sampling, one armed bandits, and the Beta distribution
21K views · 2 years ago
The Beta distribution in 12 minutes!
79K views · 3 years ago
A friendly introduction to deep reinforcement learning, Q-networks and policy gradients
94K views · 3 years ago
The Gini Impurity Index explained in 8 minutes!
38K views · 3 years ago
The covariance matrix
93K views · 3 years ago
Gaussian Mixture Models
68K views · 3 years ago
Singular Value Decomposition (SVD) and Image Compression
90K views · 3 years ago
ROC (Receiver Operating Characteristic) Curve in 10 minutes!
59K views · 3 years ago
Restricted Boltzmann Machines (RBM) - A friendly introduction
63K views · 3 years ago
A Friendly Introduction to Generative Adversarial Networks (GANs)
245K views · 4 years ago
You are much better at math than you think
7K views · 4 years ago
Training Latent Dirichlet Allocation: Gibbs Sampling (Part 2 of 2)
53K views · 4 years ago
Latent Dirichlet Allocation (Part 1 of 2)
128K views · 4 years ago
Book by Luis Serrano - "Grokking Machine Learning" (40% off promo code)
14K views · 4 years ago

Comments

  • @harsharangapatil2423

    Can you please add a video on the curse of dimensionality?

  • @tanggenius3371
    @tanggenius3371 1 day ago

    Thanks, the explanation is so intuitive. Finally understood the idea of attention.

  • @unclecode
    @unclecode 1 day ago

    Appreciate the great explanation. I have a question regarding the clipping formula at 36:42. You used the "min" function. For example, if the ratio is 0.4 and epsilon is 0.3, we should get 0.7 in this scenario. However, the formula you introduced returns 0.4. Shouldn't the formula be clipped_f(x) = max(1 - epsilon, min(f(x), 1 + epsilon))? Am I missing anything?
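
    A sketch of the clipping the commenter proposes (in the full PPO objective, the clipped ratio is further combined, via a min, with the unclipped surrogate term):

        def clipped_f(x, epsilon=0.3):
            # Clamp the probability ratio to [1 - epsilon, 1 + epsilon]
            return max(1.0 - epsilon, min(x, 1.0 + epsilon))

        print(clipped_f(0.4))  # 0.7: clipped up to the lower bound
        print(clipped_f(1.1))  # 1.1: inside the interval, unchanged
        print(clipped_f(1.5))  # 1.3: clipped down to the upper bound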

  • @WrongDescription
    @WrongDescription 2 days ago

    Best explanation on the internet!!

  • @camzbeats6993
    @camzbeats6993 2 days ago

    Top

  • @camzbeats6993
    @camzbeats6993 3 days ago

    Very intuitive, thank you. I like the example approach you take. 👏

  • @saralagrawal7449
    @saralagrawal7449 3 days ago

    Someone please ban this Be10x. It has become irritating.

  • @Cathiina
    @Cathiina 3 days ago

    Yess true. I only passed all my maths courses by learning by heart, never quite satisfied with even good grades, because I knew in my heart I understood nothing. Currently refreshing linear algebra in your Coursera course and WOW! It's addictive to actually learn what the rank of a matrix means. 😊☀️

  • @HoussamBIADI
    @HoussamBIADI 3 days ago

    Thank you for this amazing explanation <3

  • @mekuzeeyo
    @mekuzeeyo 4 days ago

    Great video as always. I have a question: in practice, which one works best, DPO or RLHF?

    • @SerranoAcademy
      @SerranoAcademy 3 days ago

      Thank you! From what I've heard, DPO works better, as it trains the network directly instead of using RL and two networks.

    • @mekuzeeyo
      @mekuzeeyo 3 days ago

      @@SerranoAcademy Thank you sir for the great work. Your Coursera courses have been awesome.

  • @hyperbitcoinizationpod

    And the entropy is the number of bits needed to convey the information.

  • @martadomingues1691
    @martadomingues1691 4 days ago

    Very good video, it helped clear some doubts I was having with this along with the Viterbi Algorithm. It's just too bad that the notation used was too different from class, but it did help me understand everything and make a connection between all of it. Thank you!

  • @Cathiina
    @Cathiina 5 days ago

    Hi Mr. Serrano! I am doing your coursera course at the moment on linear algebra for machine learning and I am having so much fun! You are a brilliant teacher, and I just wanted to say thank you! Wish more teachers would bring theoretical mathematics down to a more practical level. Obviously loving the very expensive fruit examples :)

    • @SerranoAcademy
      @SerranoAcademy 5 days ago

      Thank you so much @Cathiina, what an honor to be part of your learning journey, and I’m glad you like the expensive fruit examples! :)

  • @vigneshram5193
    @vigneshram5193 5 days ago

    Thank you Luis Serrano for this super explanatory video

  • @bin4ry_d3struct0r
    @bin4ry_d3struct0r 6 days ago

    Is there an industry standard for the KLD above which two distributions are considered significantly different (like how 0.05 is the standard for the p-value)?

    • @SerranoAcademy
      @SerranoAcademy 6 days ago

      Ohhh that’s a good question. I don’t think so, since normally you use it for minimization or comparison between them, but I’ll keep an eye, maybe it would make sense to have a standard for it.

  • @frankl1
    @frankl1 6 days ago

    Did anyone else expect something different from Softmax in the Bradley-Terry model, as I did? 😅

    • @SerranoAcademy
      @SerranoAcademy 6 days ago

      lol, I was expecting something different too initially 🤣

  • @frankl1
    @frankl1 6 days ago

    Really love the way you broke down the DPO loss; this direct way is more welcome by my brain :). Just one question on the video: I am wondering how important it is to choose the initial transformer carefully. I suspect that if it is very bad at the task, then we will have to change the initial response a lot, but because the loss function prevents changing too much in one iteration, we will need to perform a lot of tiny changes toward the good answer, making the training extremely long. Am I right?

    • @SerranoAcademy
      @SerranoAcademy 6 days ago

      Thank you, great question! This method is used for fine-tuning, not specifically for training. In other words, it's crucial that we start with a fully trained model. For training, you'd use normal backpropagation on the transformer, and lots of data. Once the LLM is trained and very trusted, then you use DPO (or RLHF) to fine-tune it (meaning, post-train it to get from good to great). So we should assume that the model is as trained as it can be, and that's why we trust the LLM and try to only change it marginally. If we were to use this method to train a model that's not fully trained... I'm not 100% sure it would work. It may or may not, but we'd still have to penalize the KL divergence much less. And also, human feedback gives a lot less data than scraping the whole internet, so I would still not use this as a training method, more as refining. Let me know if you have more questions!

    • @frankl1
      @frankl1 6 days ago

      @@SerranoAcademy Thanks for the answer, I understand it better now. I forgot that this design is for fine-tuning.

  • @rb4754
    @rb4754 6 days ago

    Very nice lecture on attention.

  • @mayyutyagi
    @mayyutyagi 6 days ago

    Now whenever I watch one of Serrano's videos, I like it first and then start watching, because I know the video is going to be outstanding as always.

  • @mayyutyagi
    @mayyutyagi 6 days ago

    Liked this video and subscribed to your channel today.

  • @mayyutyagi
    @mayyutyagi 6 days ago

    Amazing video... Thanks, sir, for the pictorial representation and for explaining this complex topic in such an easy way.

  • @AravindUkrd
    @AravindUkrd 7 days ago

    Thanks for the simplified explanation. Awesome as always. The book link in the description is not working.

    • @SerranoAcademy
      @SerranoAcademy 6 days ago

      Thank you so much! And thanks for letting me know, I’ll fix it

  • @johnzhu5735
    @johnzhu5735 7 days ago

    This was very helpful

  • @siddharthabhakta3261

    The best explanation & depiction of SVD.

  • @melihozcan8676
    @melihozcan8676 7 days ago

    Thanks for the excellent explanation! I used to know the KL Divergence, but now I understand it!

  • @saedsaify9944
    @saedsaify9944 7 days ago

    Great one; the simpler it looks, the harder it was to build!

  • @stephenlashley6313
    @stephenlashley6313 7 days ago

    This and your whole series on attention NNs is a thing of beauty! There are many ways of simplifying this, but you come the closest to showing that attention NNs and QC are identical, and QC is much better. In my opinion QC has never been done correctly; the gates are too confusing and poorly understood. QC is no longer in a simplified infant stage; it is mature in what it can do and matches all psychology observations. All problems in biology and NLP are sequences of strings.

  • @cloudshoring
    @cloudshoring 8 days ago

    awesome!

  • @bifidoc
    @bifidoc 8 days ago

    Thanks!

    • @SerranoAcademy
      @SerranoAcademy 8 days ago

      Thank you so much for your kind contribution @bifidoc!!! 💜🙏🏼

  • @user-xc8vy4cw9k
    @user-xc8vy4cw9k 8 days ago

    I would like to say thank you for the wonderful video. I want to learn reinforcement learning for my future study in the field of robotics. I have seen that you only have 4 videos about RL. I am hungry for more of your videos. I found that your videos are easier to understand because you explain well. Please add more RL videos. Thank you 🙏

    • @SerranoAcademy
      @SerranoAcademy 8 days ago

      Thank you for the suggestion! Definitely! Any ideas on what topics in RL to cover?

    • @user-xc8vy4cw9k
      @user-xc8vy4cw9k 6 days ago

      @@SerranoAcademy More videos in the field of robotics, please. Thank you. You may also guide me on how to approach the study of reinforcement learning.

  • @Omsip123
    @Omsip123 8 days ago

    So well explained

  • @guzh
    @guzh 8 days ago

    The DPO main equation should be the PPO main equation.

  • @epepchuy
    @epepchuy 9 days ago

    Excellent explanation!!!

  • @iantanwx
    @iantanwx 9 days ago

    The most intuitive explanation of QKV I've seen, as someone with only an elementary understanding of linear algebra.

  • @VerdonTrigance
    @VerdonTrigance 9 days ago

    It's kinda hard to remember all of these formulas, and it's demotivating me from further learning.

    • @javiergimenezmoya86
      @javiergimenezmoya86 9 days ago

      You don't have to remember those formulas. You only have to understand the logic behind them.

  • @IceMetalPunk
    @IceMetalPunk 9 days ago

    I'm a little confused about one thing: the reward function, even in the Bradley-Terry model, is based on the human-given scores for individual context-prediction pairs, right? And πθ is the probability from the current iteration of the network, and πRef is the probability from the original, untuned network? So then after that "mathematical manipulation", how does the human-given set of scores become represented by the network's predictions all of a sudden?

  • @user-xc8vy4cw9k
    @user-xc8vy4cw9k 9 days ago

    Thank you for the wonderful video. Please add more practical example videos for the application of reinforcement learning.

    • @SerranoAcademy
      @SerranoAcademy 9 days ago

      Thank you! Definitely! Here's a playlist of applications of RL to training large language models. czcams.com/play/PLs8w1Cdi-zvYviYYw_V3qe6SINReGF5M-.html

  • @laodrofotic7713
    @laodrofotic7713 9 days ago

    None of the videos I've seen on this subject actually explain where the QKV values come from! It's amazing that people jump on making videos while not understanding the concepts clearly! I guess YouTube must pay a lot of money! This video does a good job of explaining most things, but it never tells us where the actual QKV values come from or how the embeddings turn into them, and it got some things wrong in my opinion. The Q comes from the embeddings multiplied by WQ, which is a weight and a parameter of the model, but then the question is: where do WQ, WK, and WV come from?
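
    For what it's worth, a minimal sketch of where Q, K, and V come from in the standard formulation, with illustrative dimensions: WQ, WK, and WV are ordinary weight matrices, initialized randomly and learned by backpropagation along with every other parameter of the model.

        import torch

        torch.manual_seed(0)
        d_model, d_head, seq_len = 8, 4, 3

        # Learnable projections; trained like any other weights in the model
        WQ = torch.nn.Linear(d_model, d_head, bias=False)
        WK = torch.nn.Linear(d_model, d_head, bias=False)
        WV = torch.nn.Linear(d_model, d_head, bias=False)

        x = torch.randn(seq_len, d_model)  # token embeddings
        Q, K, V = WQ(x), WK(x), WV(x)      # queries, keys, values

        # Scaled dot-product attention
        scores = Q @ K.T / d_head ** 0.5
        weights = torch.softmax(scores, dim=-1)
        output = weights @ V
        print(output.shape)  # torch.Size([3, 4])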

  • @bendim94
    @bendim94 9 days ago

    How do you choose the number of features in the two matrices? I.e., how did you choose to have only 2 features?

  • @Priyanshuc2425
    @Priyanshuc2425 9 days ago

    Hey, I know this 👦. He is my maths teacher, who doesn't just teach but makes us visualize why we learn a topic and how it will be useful in the real world ❤

  • @Q793148210
    @Q793148210 9 days ago

    It was just so clear. 😃

  • @DienTran-zh6kj
    @DienTran-zh6kj 10 days ago

    I love his teaching, he makes complex things seem simple.

  • @shouvikdey7078
    @shouvikdey7078 10 days ago

    Love your videos. Please make more such videos on the mathematical description of generative models such as GANs, diffusion models, etc.

    • @SerranoAcademy
      @SerranoAcademy 9 days ago

      Thank you! I got some on GANs and Diffusion models, check them out! GANs: czcams.com/video/8L11aMN5KY8/video.html Stable diffusion: czcams.com/video/JmATtG0yA5E/video.html

  • @mohammadarafah7757
    @mohammadarafah7757 10 days ago

    We hope you'll describe the Wasserstein distance 😊

    • @SerranoAcademy
      @SerranoAcademy 9 days ago

      Ah good idea! I'll add it to the list, as well as earth-mover's distance. :)

    • @mohammadarafah7757
      @mohammadarafah7757 9 days ago

      @SerranoAcademy I also highly recommend describing Explainable AI (XAI), which depends on statistics.

  • @mehdiberchid1974
    @mehdiberchid1974 10 days ago

    thank u

  • @bernardorinconceron6139

    Thank you Luis. I'm sure I'll use this very soon.

  • @shahnawazalam55
    @shahnawazalam55 10 days ago

    That was as intuitive as butter.

  • @frankl1
    @frankl1 10 days ago

    Great video. One question I have: why would I use KL instead of CE? Are there situations in which one would be more suitable than the other?

    • @SerranoAcademy
      @SerranoAcademy 10 days ago

      That is a great question! KL(P,Q) is really CE(P,Q), except you subtract the entropy H(P). The reason for this is that if you compare a distribution with itself, you want to get zero. With CE you don't get zero; the CE of a distribution with itself is its entropy, which could potentially be very high.
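
      A small numeric sketch of that relationship, KL(P,Q) = CE(P,Q) - H(P), with illustrative helper names:

          import math

          def cross_entropy(p, q):
              # CE(P, Q) = -sum_i p_i * log(q_i)
              return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

          def entropy(p):
              # H(P) = CE(P, P)
              return -sum(pi * math.log(pi) for pi in p if pi > 0)

          def kl(p, q):
              # KL(P || Q) = CE(P, Q) - H(P)
              return cross_entropy(p, q) - entropy(p)

          p = [0.4, 0.2, 0.1, 0.1, 0.2]
          print(kl(p, p))             # 0.0: a distribution compared with itself
          print(cross_entropy(p, p))  # ≈ 1.47 = H(P), not zero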

  • @Ashishkumar-id1nn
    @Ashishkumar-id1nn 10 days ago

    Why did you take the average at 6:30?

    • @SerranoAcademy
      @SerranoAcademy 10 days ago

      Great question! I took the average because the product is p_i^(n q_i), so the log is n q_i log(p_i), and I want to get rid of that n. It's not strictly needed for the math, but I did it so that it gives exactly the KL divergence instead of n times it.
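
      Restated as equations (assuming outcome i appears n q_i times in a sample of size n, as in the video's setup):

          \frac{1}{n}\log \prod_i p_i^{n q_i} = \sum_i q_i \log p_i

      so the average log-likelihood ratio between the true die Q and another die P is exactly the KL divergence:

          \frac{1}{n}\log \frac{\prod_i q_i^{n q_i}}{\prod_i p_i^{n q_i}} = \sum_i q_i \log \frac{q_i}{p_i} = D_{\mathrm{KL}}(Q \| P)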

    • @Ashishkumar-id1nn
      @Ashishkumar-id1nn 10 days ago

      @@SerranoAcademy thanks for the clarification