Reinforcement Learning through Human Feedback - EXPLAINED! | RLHF

  • Added 28. 08. 2024
  • We talk about reinforcement learning through human feedback. ChatGPT, among other applications, makes use of this.
    ABOUT ME
    ⭕ Subscribe: www.youtube.co...
    📚 Medium Blog: / dataemporium
    💻 Github: github.com/ajh...
    👔 LinkedIn: / ajay-halthor-477974bb
    PLAYLISTS FROM MY CHANNEL
    ⭕ Reinforcement Learning: • Reinforcement Learning...
    ⭕ Natural Language Processing: • Natural Language Proce...
    ⭕ Transformers from Scratch: • Natural Language Proce...
    ⭕ ChatGPT Playlist: • ChatGPT
    ⭕ Convolutional Neural Networks: • Convolution Neural Net...
    ⭕ The Math You Should Know : • The Math You Should Know
    ⭕ Probability Theory for Machine Learning: • Probability Theory for...
    ⭕ Coding Machine Learning: • Code Machine Learning
    MATH COURSES (7 day free trial)
    📕 Mathematics for Machine Learning: imp.i384100.ne...
    📕 Calculus: imp.i384100.ne...
    📕 Statistics for Data Science: imp.i384100.ne...
    📕 Bayesian Statistics: imp.i384100.ne...
    📕 Linear Algebra: imp.i384100.ne...
    📕 Probability: imp.i384100.ne...
    OTHER RELATED COURSES (7 day free trial)
    📕 ⭐ Deep Learning Specialization: imp.i384100.ne...
    📕 Python for Everybody: imp.i384100.ne...
    📕 MLOps Course: imp.i384100.ne...
    📕 Natural Language Processing (NLP): imp.i384100.ne...
    📕 Machine Learning in Production: imp.i384100.ne...
    📕 Data Science Specialization: imp.i384100.ne...
    📕 Tensorflow: imp.i384100.ne...

Comments • 21

  • @RameshKumar-ng3nf
    @RameshKumar-ng3nf 3 months ago +2

    Brilliant Bro 👌. Excellent explanation. I never understood RLHF despite reading so many books and notes. Your examples are GREAT & simple to understand 👌
    I am new to your channel and have subscribed.

  • @neetpride5919
    @neetpride5919 8 months ago +4

    Great video! I have a few questions:
    1) Why do we need to manually train the reward model with human feedback if the point is to evaluate responses of another pretrained model? Can't we just cut out the reward model altogether, rate the responses directly using human feedback to generate a loss value for each response, and then backpropagate on that? Does it require less human input to train the reward model than to train the GPT model directly?
    2) When backpropagating the loss, do you need to do recurrent backpropagation for a number of steps equal to the length of the token output?
    3) Does the loss value apply equally to every token that is output? It seems like this would overly punish some words; e.g., if the question starts with "why", the response is likely to start with "because" regardless of what comes after. Does RLHF only work with sentence embeddings rather than word embeddings?

    • @0xabaki
      @0xabaki 6 months ago

      1) I think the point is to minimize the volume of human feedback, so humans only give enough responses to train a model that handles all future feedback. That way humans won't always have to give feedback; instead they lay the basis, and will probably come back to re-evaluate what the reward model is doing so it still acts human (see the sketch below).
      2) and 3) seem more specific to the architecture of ChatGPT than to PPO or RLHF. I would look into the other GPT-specific videos he made.
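
To make that first point concrete, here is a minimal sketch, not from the video, of how a reward model can be learned from a limited set of pairwise human comparisons and then reused to score future responses automatically. The class name, dimensions, and stand-in embeddings below are illustrative assumptions, not OpenAI's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical reward model: a scalar scorer over fixed-size response embeddings.
# All names and shapes here are placeholders for illustration.
class RewardModel(nn.Module):
    def __init__(self, embed_dim=768):
        super().__init__()
        self.scorer = nn.Linear(embed_dim, 1)  # maps a response embedding to one scalar reward

    def forward(self, response_embedding):
        return self.scorer(response_embedding).squeeze(-1)

reward_model = RewardModel()
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-4)

# Each human comparison labels one response "chosen" and one "rejected".
# Random tensors stand in for real response embeddings.
chosen = torch.randn(8, 768)
rejected = torch.randn(8, 768)

# Pairwise (Bradley-Terry style) loss: the chosen response should score higher.
loss = -F.logsigmoid(reward_model(chosen) - reward_model(rejected)).mean()

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Once trained this way, the reward model can score any number of new responses without further human labels, which is the economy of human effort the reply above describes.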

  • @theartofwar1750
    @theartofwar1750 5 months ago +2

    At 6:58, you have an error: PPO is not used to build the reward model.

    • @francisco444
      @francisco444 28 days ago

      That's correct. The PPO algorithm is used to fine-tune the SFT model against the reward model scores, in order to prevent the model from "cheating" and generating outputs that maximize the reward score but are no longer normal human-like text.
      PPO ensures the final RLHF model's outputs remain close to the original SFT model's outputs.
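
Concretely, a common formulation of that constraint adds a per-token KL penalty against the SFT model to the reward model's score. The sketch below is an illustration of that shaped reward under that assumption, not code from the video; beta, the sequence length, and all tensor values are stand-ins.

```python
import torch

beta = 0.02  # strength of the KL penalty keeping the policy close to the SFT model (stand-in value)

# Per-token log-probabilities of the sampled response under both models (stand-in values).
policy_logprobs = torch.randn(20)  # log pi_RL(token_t | prompt, earlier tokens)
sft_logprobs = torch.randn(20)     # log pi_SFT(token_t | prompt, earlier tokens)

reward_model_score = torch.tensor(1.3)  # scalar score for the whole response (stand-in value)

# Shaped reward: subtract the KL penalty per token and add the reward model's
# scalar score at the final token. PPO then maximizes this quantity, so the
# policy cannot drift into reward-hacking text far from the SFT model's outputs.
rewards = -beta * (policy_logprobs - sft_logprobs)
rewards[-1] = rewards[-1] + reward_model_score
```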

  • @thangarajr-qw6wy
    @thangarajr-qw6wy 2 months ago

    Can you explain (1) supervised fine-tuning (SFT), (2) reward model (RM) training, and (3) reinforcement learning via proximal policy optimization (PPO) on this reward model?

  • @sangeethashowrya0318
    @sangeethashowrya0318 4 months ago

    Sir, please make a video on function approximation in RL.

  • @manigoyal4872
    @manigoyal4872 8 months ago

    Acts as a randomizing factor depending on whom you are getting feedback from

  • @manigoyal4872
    @manigoyal4872 8 months ago

    What about the generation of rewards? Will there be another model to check the relevance and the precision of the answer, since we have a lot of data?

  • @0xabaki
    @0xabaki 6 months ago

    haha quiz time again:
    0) when the person knows me well
    1) D
    2) B if proper human feedback
    3) C

  • @TheresaLopez-r7t
    @TheresaLopez-r7t 1 day ago

    Rodriguez Jennifer Miller Nancy Lewis Timothy

  • @ayeshariaz3382
    @ayeshariaz3382 3 months ago

    Where can I get your slides?

  • @ArielOmerez
    @ArielOmerez 2 months ago +1

    B

  • @manigoyal4872
    @manigoyal4872 8 months ago +1

    Aren't we users the humans in the feedback loop for OpenAI?

    • @akzytr
      @akzytr 8 months ago +2

      Yeah, however OpenAI has the final say on what feedback goes through.

  • @SysknShall
    @SysknShall 7 days ago

    Rodriguez Donna Miller Deborah Hernandez Frank

  • @063harshsahu2
    @063harshsahu2 1 month ago

    You look Indian but your accent sounds British. Where are you from, bro?

  • @MichaelNeumann-n2v
    @MichaelNeumann-n2v 10 days ago

    Brown Jennifer Jones Dorothy Lopez Shirley

  • @ArielOmerez
    @ArielOmerez 2 months ago

    D

  • @aswinselva03
    @aswinselva03 2 months ago

    The video is informative and good, but please stop saying "quiz time" in such an annoying way.

  • @ArielOmerez
    @ArielOmerez 2 months ago

    C