Serrano.Academy
  • 54
  • 6,553,153
Direct Preference Optimization (DPO) - How to fine-tune LLMs directly without reinforcement learning
Direct Preference Optimization (DPO) is a method used for training Large Language Models (LLMs). DPO is a direct way to train the LLM without the need for reinforcement learning, which makes it more effective and more efficient.
Learn about it in this simple video!
This is the third one in a series of 4 videos dedicated to the reinforcement learning methods used for training LLMs.
Full Playlist: czcams.com/play/PLs8w1Cdi-zvYviYYw_V3qe6SINReGF5M-.html
Video 0 (Optional): Introduction to deep reinforcement learning czcams.com/video/SgC6AZss478/video.html
Video 1: Proximal Policy Optimization czcams.com/video/TjHH_--7l8g/video.html
Video 2: Reinforcement Learning with Human Feedback czcams.com/video/Z_JUqJBpVOk/video.html
Video 3 (This one!): Direct Preference Optimization
00:00 Introduction
01:08 RLHF vs DPO
07:19 The Bradley-Terry Model
11:25 KL Divergence
16:32 The Loss Function
14:36 Conclusion
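
Below is a minimal sketch of the DPO loss discussed at 16:32, assuming the per-response log-probabilities under the current and reference models have already been computed; all function and variable names are illustrative, not from the video.

    import torch
    import torch.nn.functional as F

    def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
        # Implicit rewards: beta * log(pi_theta / pi_ref) for each response
        chosen_reward = beta * (logp_chosen - ref_logp_chosen)
        rejected_reward = beta * (logp_rejected - ref_logp_rejected)
        # Bradley-Terry preference: maximize P(chosen beats rejected) = sigmoid(r_c - r_r)
        return -F.logsigmoid(chosen_reward - rejected_reward).mean()

    # Toy per-sequence log-probabilities:
    loss = dpo_loss(torch.tensor([-12.3]), torch.tensor([-15.1]),
                    torch.tensor([-12.0]), torch.tensor([-14.8]))
    print(loss)  # a scalar; backpropagating it fine-tunes the model directly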
Get the Grokking Machine Learning book!
manning.com/books/grokking-machine-learning
Discount code (40%): serranoyt
(Use the discount code on checkout)
Views: 2,247

Video

KL Divergence - How to tell how different two distributions are
3.3K views · 1 day ago
Correction (10:26). The probabilities are wrong. The correct ones are here: For Die 1: 0.4^4 * 0.2^2 * 0.1^1 * 0.1^1 * 0.2^2 For Die 2: 0.4^4 * 0.1^2 * 0.2^1 * 0.2^1 * 0.1^2 For Die 3: 0.1^4 * 0.2^2 * 0.4^1 * 0.2^1 * 0.1^2 Kullback Leibler (KL) divergence is a way to measure how far apart two distributions are. In this video, we learn KL-divergence in a simple way, using a probability game with...
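
To make the corrected numbers concrete, here is a small sketch computing the KL divergence between two of the dice, assuming the face probabilities read off the correction above (the helper name is illustrative):

    import math

    def kl_divergence(p, q):
        # D_KL(P || Q) = sum_i p_i * log(p_i / q_i); zero only when P and Q match
        return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

    die1 = [0.4, 0.2, 0.1, 0.1, 0.2]  # face probabilities of Die 1
    die2 = [0.4, 0.1, 0.2, 0.2, 0.1]  # face probabilities of Die 2
    print(kl_divergence(die1, die2))  # ≈ 0.14: the dice differ
    print(kl_divergence(die1, die1))  # 0.0: a die compared with itself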
Why do we divide by n-1 to estimate the variance? A visual tour through Bessel correction
11K views · 1 month ago
Correction: At 30:42 I write "X = Y". They're not equal, what I meant to say is "X and Y are identically distributed". The variance is a measure of how spread out a distribution is. In order to estimate the variance, one takes a sample of n points from the distribution, and calculate the average square deviation from the mean. However, this doesn't give a good estimate of the variance of the di...
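
A quick simulation sketch of the bias described above, assuming samples of a fair six-sided die, whose true variance is 35/12 ≈ 2.92:

    import random

    random.seed(0)
    n, trials = 5, 100_000
    biased = unbiased = 0.0
    for _ in range(trials):
        sample = [random.randint(1, 6) for _ in range(n)]
        mean = sum(sample) / n
        ss = sum((x - mean) ** 2 for x in sample)
        biased += ss / n          # dividing by n underestimates on average
        unbiased += ss / (n - 1)  # Bessel's correction removes the bias
    print(biased / trials)    # ≈ 2.33: too small
    print(unbiased / trials)  # ≈ 2.92: close to 35/12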
Reinforcement Learning with Human Feedback - How to train and fine-tune Transformer Models
8K views · 4 months ago
Reinforcement Learning with Human Feedback (RLHF) is a method used for training Large Language Models (LLMs). At the heart of RLHF lies a very powerful reinforcement learning method called Proximal Policy Optimization. Learn about it in this simple video! This is the second one in a series of 3 videos dedicated to the reinforcement learning methods used for training LLMs. Full Playlist: czcams.c...
Proximal Policy Optimization (PPO) - How to train Large Language Models
18K views · 5 months ago
Reinforcement Learning with Human Feedback (RLHF) is a method used for training Large Language Models (LLMs). At the heart of RLHF lies a very powerful reinforcement learning method called Proximal Policy Optimization. Learn about it in this simple video! This is the first one in a series of 3 videos dedicated to the reinforcement learning methods used for training LLMs. Full Playlist: czcams.c...
Stable Diffusion - How to build amazing images with AI
17K views · 6 months ago
This video is about Stable Diffusion, the AI method to build amazing images from a prompt. If you like this material, check out LLM University from Cohere! llm.university Get the Grokking Machine Learning book! manning.com/books/grokking-ma... Discount code (40%): serranoyt (Use the discount code on checkout) 0:00 Introduction 1:27 How does Stable Diffusion work? 2:55 Embeddings 12:55 Diffusion...
What are Transformer Models and how do they work?
103K views · 7 months ago
This is the last of a series of 3 videos where we demystify Transformer models and explain them with visuals and friendly examples. Video 1: The attention mechanism in high level czcams.com/video/OxCpWwDCDFQ/video.html Video 2: The attention mechanism with math czcams.com/video/UPtG_38Oq8o/video.html Video 3 (This one): Transformer models If you like this material, check out LLM University from...
The math behind Attention: Keys, Queries, and Values matrices
214K views · 10 months ago
This is the second of a series of 3 videos where we demystify Transformer models and explain them with visuals and friendly examples. Video 1: The attention mechanism in high level czcams.com/video/OxCpWwDCDFQ/video.html Video 2: The attention mechanism with math (this one) Video 3: Transformer models czcams.com/video/qaWMOYf4ri8/video.html If you like this material, check out LLM University fr...
The Attention Mechanism in Large Language Models
83K views · 11 months ago
Attention mechanisms are crucial to the huge boom LLMs have recently had. In this video you'll see a friendly pictorial explanation of how attention mechanisms work in Large Language Models. This is the first of a series of three videos on Transformer models. Video 1: The attention mechanism in high level (this one) Video 2: The attention mechanism with math: czcams.com/video/UPtG_38Oq8o/video....
The Binomial and Poisson Distributions
10K views · 1 year ago
If on average, 3 people enter a store every hour, what is the probability that over the next hour, 5 people will enter the store? The answer lies in the Poisson distribution. In this video you'll learn this distribution, starting from a much simpler one, the Binomial distribution. Euler number video: czcams.com/video/oikl9FCISqU/video.html Grokking Machine Learning book: bit.ly/grokkingML 40% d...
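
The question in the description can be checked directly; a short sketch, with the Binomial approximation included to mirror the video's build-up from the simpler distribution:

    import math

    # If on average 3 people enter per hour, the probability that exactly 5 enter:
    lam, k = 3, 5
    poisson = math.exp(-lam) * lam ** k / math.factorial(k)
    print(poisson)  # ≈ 0.1008

    # The Poisson arises as the limit of Binomial(n, lam/n) for large n:
    n = 10_000
    p = lam / n
    binomial = math.comb(n, k) * p ** k * (1 - p) ** (n - k)
    print(binomial)  # ≈ 0.1008 as well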
Euler's number, derivatives, and the bank at the end of the universe
3.7K views · 1 year ago
Euler's number, e, is defined as a limit. The function e to the x is (up to multiplying by a constant) the only function that is its own derivative. How are these two related? In this video you'll find an explanation for this phenomenon using banking interest rates, and a very particular bank, located at the end of the universe.
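
A quick numerical sketch of the limit behind e, in the spirit of the banking story: compounding a 100% annual rate n times a year yields (1 + 1/n)^n, which approaches e as n grows.

    for n in (1, 12, 365, 1_000_000):
        print(n, (1 + 1 / n) ** n)
    # 1 -> 2.0, 12 -> 2.6130, 365 -> 2.7146, 1000000 -> 2.7183 (approaching e)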
Decision trees - A friendly introduction
11K views · 1 year ago
A video about decision trees, and how to train them on a simple example. Accompanying blog post: medium.com/@luis.serrano/splitting-data-by-asking-questions-decision-trees-74afed9cd849 For a code implementation, check out this repo: github.com/luisguiserrano/manning/tree/master/Chapter_9_Decision_Trees Helper videos: - Gini index: czcams.com/video/u4IxOk2ijSs/video.html - Entropy and informatio...
How do you minimize a function when you can't take derivatives? CMA-ES and PSO
8K views · 1 year ago
What is Quantum Machine Learning?
11K views · 1 year ago
Denoising and Variational Autoencoders
23K views · 2 years ago
Eigenvectors and Generalized Eigenspaces
26K views · 2 years ago
Thompson sampling, one armed bandits, and the Beta distribution
21K views · 2 years ago
The Beta distribution in 12 minutes!
79K views · 3 years ago
A friendly introduction to deep reinforcement learning, Q-networks and policy gradients
94K views · 3 years ago
The Gini Impurity Index explained in 8 minutes!
38K views · 3 years ago
The covariance matrix
93K views · 3 years ago
Gaussian Mixture Models
68K views · 3 years ago
Singular Value Decomposition (SVD) and Image Compression
90K views · 3 years ago
ROC (Receiver Operating Characteristic) Curve in 10 minutes!
59K views · 3 years ago
Restricted Boltzmann Machines (RBM) - A friendly introduction
63K views · 3 years ago
A Friendly Introduction to Generative Adversarial Networks (GANs)
245K views · 4 years ago
You are much better at math than you think
7K views · 4 years ago
Training Latent Dirichlet Allocation: Gibbs Sampling (Part 2 of 2)
53K views · 4 years ago
Latent Dirichlet Allocation (Part 1 of 2)
128K views · 4 years ago
Book by Luis Serrano - "Grokking Machine Learning" (40% off promo code)
14K views · 4 years ago

Comments

  • @harsharangapatil2423

    Can you please add a video on the curse of dimensionality?

  • @tanggenius3371
    @tanggenius3371 1 day ago

    Thanks, the explanation is so intuitive. Finally understood the idea of attention.

  • @unclecode
    @unclecode 1 day ago

    Appreciate the great explanation. I have a question regarding the clipping formula at 36:42. You used the "min" function. For example, if the ratio is 0.4 and epsilon is 0.3, we should get 0.7 in this scenario. However, the formula you introduced returns 0.4. Shouldn't the formula be clipped_f(x) = max(1 - epsilon, min(f(x), 1 + epsilon))? Am I missing anything?
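
    A sketch of the clipping the commenter proposes (in the full PPO objective, the clipped ratio is further combined, via a min, with the unclipped surrogate term):

        def clipped_f(x, epsilon=0.3):
            # Clamp the probability ratio to [1 - epsilon, 1 + epsilon]
            return max(1.0 - epsilon, min(x, 1.0 + epsilon))

        print(clipped_f(0.4))  # 0.7: clipped up to the lower bound
        print(clipped_f(1.1))  # 1.1: inside the interval, unchanged
        print(clipped_f(1.5))  # 1.3: clipped down to the upper bound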

  • @WrongDescription
    @WrongDescription 2 days ago

    Best explanation on the internet!!

  • @camzbeats6993
    @camzbeats6993 2 days ago

    Top

  • @camzbeats6993
    @camzbeats6993 3 days ago

    Very intuitive, thank you. I like the example approach you take. 👏

  • @saralagrawal7449
    @saralagrawal7449 3 days ago

    Someone please ban this Be10x. It has become irritating.

  • @Cathiina
    @Cathiina 3 days ago

    Yess true. I only passed all my maths courses by learning by heart, never quite satisfied with even good grades, because I knew in my heart I understood nothing. Currently refreshing linear algebra in your Coursera course and WOW! It's addictive to actually learn what the rank of a matrix means. 😊☀️

  • @HoussamBIADI
    @HoussamBIADI 3 days ago

    Thank you for this amazing explanation <3

  • @mekuzeeyo
    @mekuzeeyo 4 days ago

    Great video as always. I have a question: in practice, which one works best, DPO or RLHF?

    • @SerranoAcademy
      @SerranoAcademy 3 days ago

      Thank you! From what I've heard, DPO works better, as it trains the network directly instead of using RL and two networks.

    • @mekuzeeyo
      @mekuzeeyo 3 days ago

      @@SerranoAcademy Thank you sir for the great work. Your Coursera courses have been awesome.

  • @hyperbitcoinizationpod

    And the entropy is the number of bits needed to convey the information.

  • @martadomingues1691
    @martadomingues1691 4 days ago

    Very good video, it helped clear some doubts I was having with this along with the Viterbi Algorithm. It's just too bad that the notation used was too different from class, but it did help me understand everything and make a connection between all of it. Thank you!

  • @Cathiina
    @Cathiina 5 days ago

    Hi Mr. Serrano! I am doing your coursera course at the moment on linear algebra for machine learning and I am having so much fun! You are a brilliant teacher, and I just wanted to say thank you! Wish more teachers would bring theoretical mathematics down to a more practical level. Obviously loving the very expensive fruit examples :)

    • @SerranoAcademy
      @SerranoAcademy 5 days ago

      Thank you so much @Cathiina, what an honor to be part of your learning journey, and I’m glad you like the expensive fruit examples! :)

  • @vigneshram5193
    @vigneshram5193 5 days ago

    Thank you Luis Serrano for this super explanatory video

  • @bin4ry_d3struct0r
    @bin4ry_d3struct0r 6 days ago

    Is there an industry standard for the KLD above which two distributions are considered significantly different (like how 0.05 is the standard for the p-value)?

    • @SerranoAcademy
      @SerranoAcademy 6 days ago

      Ohhh that’s a good question. I don’t think so, since normally you use it for minimization or comparison between them, but I’ll keep an eye, maybe it would make sense to have a standard for it.

  • @frankl1
    @frankl1 6 days ago

    Did anyone else expect something different from Softmax in the Bradley-Terry model, as I did? 😅

    • @SerranoAcademy
      @SerranoAcademy 6 days ago

      lol, I was expecting something different too initially 🤣

  • @frankl1
    @frankl1 6 days ago

    Really love the way you broke down the DPO loss; this direct way is more welcome by my brain :). Just one question on the video: I am wondering how important it is to choose the initial transformer carefully. I suspect that if it is very bad at the task, then we will have to change the initial response a lot, but because the loss function prevents changing too much in one iteration, we will need to perform a lot of tiny changes toward the good answer, making the training extremely long. Am I right?

    • @SerranoAcademy
      @SerranoAcademy 6 days ago

      Thank you, great question! This method is used for fine-tuning, not specifically for training. In other words, it's crucial that we start with a fully trained model. For training, you'd use normal backpropagation on the transformer, and lots of data. Once the LLM is trained and very trusted, then you use DPO (or RLHF) to fine-tune it (meaning, post-train it to get from good to great). So we should assume that the model is as trained as it can be, and that's why we trust the LLM and try to only change it marginally. If we were to use this method to train a model that's not fully trained... I'm not 100% sure it would work. It may or may not, but we'd still have to penalize the KL divergence much less. And also, human feedback gives a lot less data than scraping the whole internet, so I would still not use this as a training method, more as refining. Let me know if you have more questions!

    • @frankl1
      @frankl1 6 days ago

      @@SerranoAcademy Thanks for the answer, I understand it better now. I forgot that this design is for fine-tuning.

  • @rb4754
    @rb4754 6 days ago

    Very nice lecture on attention.

  • @mayyutyagi
    @mayyutyagi 6 days ago

    Now whenever I watch one of Serrano's videos, I like it first and then start watching, because I know the video is going to be outstanding as always.

  • @mayyutyagi
    @mayyutyagi 6 days ago

    Liked this video and subscribed to your channel today.

  • @mayyutyagi
    @mayyutyagi 6 days ago

    Amazing video... Thanks, sir, for the pictorial representation and for explaining this complex topic in such an easy way.

  • @AravindUkrd
    @AravindUkrd 7 days ago

    Thanks for the simplified explanation. Awesome as always. The book link in the description is not working.

    • @SerranoAcademy
      @SerranoAcademy 6 days ago

      Thank you so much! And thanks for letting me know, I’ll fix it

  • @johnzhu5735
    @johnzhu5735 7 days ago

    This was very helpful

  • @siddharthabhakta3261

    The best explanation & depiction of SVD.

  • @melihozcan8676
    @melihozcan8676 7 days ago

    Thanks for the excellent explanation! I used to know the KL Divergence, but now I understand it!

  • @saedsaify9944
    @saedsaify9944 7 days ago

    Great one; the simpler it looks, the harder it was to build!

  • @stephenlashley6313
    @stephenlashley6313 7 days ago

    This and your whole series on attention NNs is a thing of beauty! There are many ways of simplifying this, but you come the closest to showing that attention NNs and QC are identical, and QC is much better. In my opinion QC has never been done correctly; the gates are too confusing and poorly understood. QC is no longer in a simplified infant stage; it is mature in what it can do and matches all psychology observations. All problems in biology and NLP are sequences of strings.

  • @cloudshoring
    @cloudshoring 8 days ago

    awesome!

  • @bifidoc
    @bifidoc 8 days ago

    Thanks!

    • @SerranoAcademy
      @SerranoAcademy 8 days ago

      Thank you so much for your kind contribution @bifidoc!!! 💜🙏🏼

  • @user-xc8vy4cw9k
    @user-xc8vy4cw9k 8 days ago

    I would like to say thank you for the wonderful video. I want to learn reinforcement learning for my future study in the field of robotics. I have seen that you only have 4 videos about RL. I am hungry for more of your videos. I found that your videos are easier to understand because you explain well. Please add more RL videos. Thank you 🙏

    • @SerranoAcademy
      @SerranoAcademy 8 days ago

      Thank you for the suggestion! Definitely! Any ideas on what topics in RL to cover?

    • @user-xc8vy4cw9k
      @user-xc8vy4cw9k 6 days ago

      @@SerranoAcademy More videos in the field of robotics, please. Thank you. You may also guide me on how to approach the study of reinforcement learning.

  • @Omsip123
    @Omsip123 8 days ago

    So well explained

  • @guzh
    @guzh 8 days ago

    The DPO main equation should be the PPO main equation.

  • @epepchuy
    @epepchuy 9 days ago

    Excellent explanation!!!

  • @iantanwx
    @iantanwx 9 days ago

    The most intuitive explanation of QKV I've seen, as someone with only an elementary understanding of linear algebra.

  • @VerdonTrigance
    @VerdonTrigance 9 days ago

    It's kinda hard to remember all of these formulas, and it's demotivating me from further learning.

    • @javiergimenezmoya86
      @javiergimenezmoya86 9 days ago

      You don't have to remember those formulas. You only have to understand the logic behind them.

  • @IceMetalPunk
    @IceMetalPunk 9 days ago

    I'm a little confused about one thing: the reward function, even in the Bradley-Terry model, is based on the human-given scores for individual context-prediction pairs, right? And πθ is the probability from the current iteration of the network, and πRef is the probability from the original, untuned network? So then after that "mathematical manipulation", how does the human-given set of scores become represented by the network's predictions all of a sudden?

  • @user-xc8vy4cw9k
    @user-xc8vy4cw9k 9 days ago

    Thank you for the wonderful video. Please add more practical example videos for the application of reinforcement learning.

    • @SerranoAcademy
      @SerranoAcademy 9 days ago

      Thank you! Definitely! Here's a playlist of applications of RL to training large language models. czcams.com/play/PLs8w1Cdi-zvYviYYw_V3qe6SINReGF5M-.html

  • @laodrofotic7713
    @laodrofotic7713 9 days ago

    None of the videos I've seen on this subject actually explain where the QKV values come from! It's amazing that people jump on making videos while not understanding the concepts clearly! I guess YouTube must pay a lot of money! This video does a good job of explaining most things, but it never tells us where the actual QKV values come from or how the embeddings turn into them, and it got some things wrong in my opinion. The Q comes from the embeddings multiplied by WQ, which is a weight and a parameter of the model, but then the question is: where do WQ, WK, and WV come from?
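
    For what it's worth, a minimal sketch of where Q, K, and V come from in the standard formulation, with illustrative dimensions: WQ, WK, and WV are ordinary weight matrices, initialized randomly and learned by backpropagation along with every other parameter of the model.

        import torch

        torch.manual_seed(0)
        d_model, d_head, seq_len = 8, 4, 3

        # Learnable projections; trained like any other weights in the model
        WQ = torch.nn.Linear(d_model, d_head, bias=False)
        WK = torch.nn.Linear(d_model, d_head, bias=False)
        WV = torch.nn.Linear(d_model, d_head, bias=False)

        x = torch.randn(seq_len, d_model)  # token embeddings
        Q, K, V = WQ(x), WK(x), WV(x)      # queries, keys, values

        # Scaled dot-product attention
        scores = Q @ K.T / d_head ** 0.5
        weights = torch.softmax(scores, dim=-1)
        output = weights @ V
        print(output.shape)  # torch.Size([3, 4])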

  • @bendim94
    @bendim94 9 days ago

    How do you choose the number of features in the two matrices? I.e., how did you choose to have only 2 features?

  • @Priyanshuc2425
    @Priyanshuc2425 9 days ago

    Hey, I know this 👦. He is my maths teacher, who doesn't just teach but makes us visualize why we learn a topic and how it will be useful in the real world ❤

  • @Q793148210
    @Q793148210 9 days ago

    It was just so clear. 😃

  • @DienTran-zh6kj
    @DienTran-zh6kj 10 days ago

    I love his teaching, he makes complex things seem simple.

  • @shouvikdey7078
    @shouvikdey7078 10 days ago

    Love your videos. Please make more such videos on the mathematical description of generative models such as GANs, diffusion models, etc.

    • @SerranoAcademy
      @SerranoAcademy 9 days ago

      Thank you! I got some on GANs and Diffusion models, check them out! GANs: czcams.com/video/8L11aMN5KY8/video.html Stable diffusion: czcams.com/video/JmATtG0yA5E/video.html

  • @mohammadarafah7757
    @mohammadarafah7757 10 days ago

    We hope you'll describe the Wasserstein distance 😊

    • @SerranoAcademy
      @SerranoAcademy 9 days ago

      Ah good idea! I'll add it to the list, as well as earth-mover's distance. :)

    • @mohammadarafah7757
      @mohammadarafah7757 9 days ago

      @SerranoAcademy I also highly recommend describing Explainable AI (XAI), which depends on statistics.

  • @mehdiberchid1974
    @mehdiberchid1974 10 days ago

    thank u

  • @bernardorinconceron6139

    Thank you Luis. I'm sure I'll use this very soon.

  • @shahnawazalam55
    @shahnawazalam55 10 days ago

    That was as intuitive as butter.

  • @frankl1
    @frankl1 10 days ago

    Great video. One question I have: why would I use KL instead of CE? Are there situations in which one would be more suitable than the other?

    • @SerranoAcademy
      @SerranoAcademy 10 days ago

      That is a great question! KL(P,Q) is really CE(P,Q), except you subtract the entropy H(P). The reason for this is that if you compare a distribution with itself, you want to get zero. With CE you don't get zero; the CE of a distribution with itself is its entropy, which could potentially be very high.
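
      A small numeric sketch of that relationship, KL(P,Q) = CE(P,Q) - H(P), with illustrative helper names:

          import math

          def cross_entropy(p, q):
              # CE(P, Q) = -sum_i p_i * log(q_i)
              return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

          def entropy(p):
              # H(P) = CE(P, P)
              return -sum(pi * math.log(pi) for pi in p if pi > 0)

          def kl(p, q):
              # KL(P || Q) = CE(P, Q) - H(P)
              return cross_entropy(p, q) - entropy(p)

          p = [0.4, 0.2, 0.1, 0.1, 0.2]
          print(kl(p, p))             # 0.0: a distribution compared with itself
          print(cross_entropy(p, p))  # ≈ 1.47 = H(P), not zero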

  • @Ashishkumar-id1nn
    @Ashishkumar-id1nn 10 days ago

    Why did you take the average at 6:30?

    • @SerranoAcademy
      @SerranoAcademy 10 days ago

      Great question! I took the average because the product is p_i^(n q_i), so the log is n q_i log(p_i), and I want to get rid of that n. It's not strictly needed for the math, but I did it so that it gives exactly the KL divergence instead of n times it.
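
      Restated as equations (assuming outcome i appears n q_i times in a sample of size n, as in the video's setup):

          \frac{1}{n}\log \prod_i p_i^{n q_i} = \sum_i q_i \log p_i

      so the average log-likelihood ratio between the true die Q and another die P is exactly the KL divergence:

          \frac{1}{n}\log \frac{\prod_i q_i^{n q_i}}{\prod_i p_i^{n q_i}} = \sum_i q_i \log \frac{q_i}{p_i} = D_{\mathrm{KL}}(Q \| P)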

    • @Ashishkumar-id1nn
      @Ashishkumar-id1nn 10 days ago

      @@SerranoAcademy thanks for the clarification