ORPO: Monolithic Preference Optimization without Reference Model (Paper Explained)

  • Added 30 Apr 2024
  • Paper: arxiv.org/abs/2403.07691
    Abstract:
    While recent preference alignment algorithms for language models have demonstrated promising results, supervised fine-tuning (SFT) remains imperative for achieving successful convergence. In this paper, we study the crucial role of SFT within the context of preference alignment, emphasizing that a minor penalty for the disfavored generation style is sufficient for preference-aligned SFT. Building on this foundation, we introduce a straightforward and innovative reference model-free monolithic odds ratio preference optimization algorithm, ORPO, eliminating the necessity for an additional preference alignment phase. We demonstrate, both empirically and theoretically, that the odds ratio is a sensible choice for contrasting favored and disfavored styles during SFT across the diverse sizes from 125M to 7B. Specifically, fine-tuning Phi-2 (2.7B), Llama-2 (7B), and Mistral (7B) with ORPO on the UltraFeedback alone surpasses the performance of state-of-the-art language models with more than 7B and 13B parameters: achieving up to 12.20% on AlpacaEval2.0 (Figure 1), 66.19% on IFEval (instruction-level loose, Table 6), and 7.32 in MT-Bench (Figure 12). We release code and model checkpoints for Mistral-ORPO-α (7B) and Mistral-ORPO-β (7B).
    Authors: Jiwoo Hong, Noah Lee, James Thorne
    Links:
    Homepage: ykilcher.com
    Merch: ykilcher.com/merch
    YouTube: / yannickilcher
    Twitter: / ykilcher
    Discord: ykilcher.com/discord
    LinkedIn: / ykilcher
    If you want to support me, the best thing to do is to share out the content :)
    If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
    SubscribeStar: www.subscribestar.com/yannick...
    Patreon: / yannickilcher
    Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
    Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
    Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
    Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n
  • Science & Technology
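
    The abstract above is dense, so here is a minimal sketch of the ORPO objective it describes, assuming length-normalized sequence log-likelihoods as inputs (the variable names and the default lambda are illustrative, not the authors' code):

        import torch
        import torch.nn.functional as F

        def orpo_loss(chosen_logps, rejected_logps, lam=0.1):
            """Minimal sketch of L = L_SFT + lambda * L_OR.

            chosen_logps / rejected_logps: length-normalized sequence log-likelihoods
            log P(y|x), one value per example (shape [batch]).
            lam: weight of the odds-ratio term (illustrative default).
            """
            # odds(y|x) = P / (1 - P), so log odds = log P - log(1 - P)
            log_odds_chosen = chosen_logps - torch.log1p(-torch.exp(chosen_logps))
            log_odds_rejected = rejected_logps - torch.log1p(-torch.exp(rejected_logps))

            # L_OR = -log sigmoid( log odds(y_w|x) - log odds(y_l|x) )
            l_or = -F.logsigmoid(log_odds_chosen - log_odds_rejected)

            # L_SFT: the usual fine-tuning loss on the chosen response
            # (approximated here by its normalized negative log-likelihood)
            l_sft = -chosen_logps

            return (l_sft + lam * l_or).mean()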

Comments • 64

  • @r9999t
    @r9999t 1 month ago +28

    Glad you're back to technical content this time. Any AI YouTuber can give us the latest AI news, but you're just about the only one who can give technical insight into the stories.

  • @lone0017
    @lone0017 1 month ago +20

    6 videos in 7 days, I'm on holiday and this is such a perfectly timed treat.

  • @EternalKernel
    @EternalKernel 1 month ago +4

    Thank you for being awesome Yannic, I send people from the classes that I "TA" for to you because you're reliably strong with your analysis.

  • @peach412
    @peach412 1 month ago +17

    26:30 that 'really?' and the following struggle with basic math is WAAAAY too relatable

  • @tensorturtle1566
    @tensorturtle1566 1 month ago +12

    Great to see research from my homeland of South Korea represented!

    • @Dogo.R
      @Dogo.R 1 month ago +2

      Woo, allegiance to tribes!!... .. ..

    • @jawadmansoor6064
      @jawadmansoor6064 1 month ago

      do you know Seoul?

    • @cvabds
      @cvabds 1 month ago

      There is only one Korea

  • @user-bz5be9bj4k
    @user-bz5be9bj4k 21 days ago

    Really appreciate your explanation, very helpful. Now I see the alignment process as widening the upper part of the Y shape: from x, pushing y_w apart from y_l. Thanks!

  • @borisbondarenko314
    @borisbondarenko314 1 month ago +1

    I really like the more technical content from you. I usually read tech news on Telegram, and your ML News episodes are great but fairly plain and simple. Paper explanations like this are a real contribution to the DS community; such videos seed new ideas and deepen understanding of the field for those trying to dive in. Of course it's less popular because the material is more complex for the audience, but it's much more interesting. So thank you for this format.

  • @I-0-0-I
    @I-0-0-I 1 month ago

    Thanks for explaining basic terms along with the more complex stuff, for dilettantes like myself. Cheers.

  • @blender6426
    @blender6426 1 month ago +1

    Nice I was waiting for this after you mentioned ORPO in ML News :))

  • @justheuristic
    @justheuristic 1 month ago +13

    The main loss function (7) looks like it can be meaningfully simplified with school-level math.
    Lor = -log(sigm( log ( odds(y_w|x) / odds(y_l|x)))), where sigm(a) = 1/(1 + exp(-a)) = exp(a) / (1 + exp(a))
    Let's assume that both odds(y_w|x) and odds(y_l|x) are positive (because softmax)
    By plugging in the sigmoid, we get
    Lor = - log (exp(log(odds(y_w|x) / odds(y_l|x) )) / (1 + exp(log(odds(y_w|x) / odds(y_l|x)))) )
    Note that exp(log(odds(y_w|x) / odds(y_l|x))) = odds(y_w|x) / odds(y_l|x). We use this to simplify:
    Lor = - log( [odds(y_w|x) / odds(y_l|x)] / (1 + odds(y_w|x) / odds(y_l|x)) )
    Finally, multiply both numerator and denominator by odds(y_l|x) to get
    Lor = - log(odds(y_w|x) / (odds(y_w|x) + odds(y_l|x)) )
    Intuitively, this is the negative log of (odds of the good response) / (odds of the good response + odds of the bad response).
    If you minimize the average loss over multiple texts, it's the same as maximizing the odds that the model chooses the winning response in every pair (of winning + losing responses).
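
    For what it's worth, this simplification checks out numerically; here is a quick sanity check in plain Python (the probabilities are arbitrary toy values, not from the paper):

        import math

        def sigmoid(a):
            return 1.0 / (1.0 + math.exp(-a))

        def odds(p):
            return p / (1.0 - p)

        p_w, p_l = 0.7, 0.2  # toy likelihoods for the winning / losing responses

        # Original form of L_or: -log sigmoid( log( odds(y_w|x) / odds(y_l|x) ) )
        loss_original = -math.log(sigmoid(math.log(odds(p_w) / odds(p_l))))

        # Simplified form: -log( odds(y_w|x) / (odds(y_w|x) + odds(y_l|x)) )
        loss_simplified = -math.log(odds(p_w) / (odds(p_w) + odds(p_l)))

        print(loss_original, loss_simplified)  # agree up to floating-point error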

    • @peterszilvasi752
      @peterszilvasi752 1 month ago +1

      Good job! I suppose you mean `odds(y_l|x)` instead of `odds(y_l)` in the final equation.

    • @justheuristic
      @justheuristic 1 month ago

      @@peterszilvasi752 thanks! good catch :) /* fixed the previous comment */

    • @lucidraisin
      @lucidraisin 1 month ago +1

      very cool! thank you for this

  • @kaikapioka9711
    @kaikapioka9711 1 month ago +2

    Thx again yan! 🎉

  • @fearnworks
    @fearnworks 1 month ago +4

    You are on fire!

  • @pritioli8429
    @pritioli8429 12 days ago

    great explanation!

  • @gauranshsoni4011
    @gauranshsoni4011 1 month ago +1

    Keep them comin

  • @max0x7ba
    @max0x7ba 1 month ago

    That log of probability is also a power transform often used to narrow or widen a distribution.

  • @jellyfishnexus3132
    @jellyfishnexus3132 1 month ago +1

    Nice!

  • @Mordenor
    @Mordenor 1 month ago +1

    Thank you Mr Kilcher for delving into the paper ORPO: Monolithic Preference Optimization without Reference Model

  • @MyCiaoatutti
    @MyCiaoatutti 1 month ago +1

    "Specifically, 1 - p(y|x) in the denominators amplifies the gradients when the corresponding side of the likelihood p(y|x) is low". I think that (1 - p(y|x)) has two different meanings here: it is both the result of the differentiation (by coincidence) and the "corresponding side" of the likelihood, i.e., 1 - p(y|x). So when the paper says the "corresponding side" of p(y|x) is low, it means that 1 - p(y|x) is low.

  • @wwkk4964
    @wwkk4964 1 month ago

    What's going on, is it Yannic bonanza time of the year? Loving these addictive videos

  • @herp_derpingson
    @herp_derpingson 16 days ago

    18:47 I wish they had shown some loss curves of the training in the paper, unless I missed it. Whenever you divide things like that in the loss function, the loss curve goes crazy. It still trains, but it can go crazy because for some samples the denominator might be close to zero.
    .
    19:33 There is no ablation in the paper with no SFT, since the loss is L_sft + lambda * L_orpo. I think we will soon see a follow-up paper, "ORPO is all you need", which just drops the SFT. I think it will work great.
    .
    31:30 One of my colleagues tried the probability-ratio thing before. I don't remember what came out of it. Haven't checked in with him for a while.

  • @Zed_Oud
    @Zed_Oud 1 month ago +1

    27:57
    “the corresponding side”
    Maybe they mistakenly switched the w l givens in the denominators?

  • @jondo7680
    @jondo7680 28 days ago

    You should make a video just focusing on log and explaining its role in neural networks.

  • @syeshwanth6790
    @syeshwanth6790 1 month ago +1

    Where do y_w and y_l come from? Are they from the training dataset, or does the LLM being trained generate them, to then be labelled by humans or reward models as W and L?

  • @chrise8153
    @chrise8153 1 month ago

    Wow good timing to go on youtube

  • @mantasorantas5289
    @mantasorantas5289 1 month ago +1

    Would be interesting to see how it compares to KTO. I would guess that KTO outperforms it and is easier to implement, as you don't need pairs of inputs.

  • @yannickpezeu3419
    @yannickpezeu3419 1 month ago

    I liked the self deprecation at 32:00 haha

  • @simaogoncalves1957
    @simaogoncalves1957 22 days ago

    16:12 Not sure I follow the intuition behind supervised fine-tuning not being able to penalize the "wrong" token that is opposite to what we want the model to mimic. I'm confused because, in my view, the wrong but highly probable token contributes more to the loss, so it will be penalized more heavily than the more meaningless, random output tokens. Can someone clarify this for me?
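
    For intuition, here is a tiny toy example (made-up vocabulary and logits, not from the paper) of the token-level cross-entropy gradient. Within a single softmax, a probable wrong token is indeed pushed down, as noted above; the paper's point is rather that the rejected response never appears in the SFT batch at all, so its sequence-level likelihood is never directly penalized, which is the gap the odds-ratio term targets.

        import torch
        import torch.nn.functional as F

        # Toy vocabulary of 5 tokens; the model currently favors token 3,
        # while the target (chosen) token is 0.
        logits = torch.tensor([1.0, 0.0, 0.0, 3.0, 0.0], requires_grad=True)
        target = torch.tensor(0)

        loss = F.cross_entropy(logits.unsqueeze(0), target.unsqueeze(0))
        loss.backward()

        # Gradient wrt the logits is softmax(logits) - onehot(target):
        # the probable wrong token (index 3) gets the largest downward push,
        # but tokens of a rejected response that is not in the batch get no signal.
        print(logits.grad)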

  • @xxlvulkann6743
    @xxlvulkann6743 1 month ago +1

    great! now apply ORPO to a reward model and round we go!

  • @thunder89
    @thunder89 1 month ago

    The comparison at the end between OR and PR should also discuss the influence of the log sigmoid, right? And, more importantly, how the gradients for the winning and losing outputs would actually look with these simulated pairs... It feels a bit hand-wavy why the log sigmoid of the OR should be the target...
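
    One way to poke at that (my own toy numbers, not from the paper or the video): push a few (p_w, p_l) pairs through -log sigmoid of the log odds ratio versus the log probability ratio and look at the gradients with respect to the two log-likelihoods.

        import math
        import torch
        import torch.nn.functional as F

        def compare(p_w, p_l):
            # toy sequence likelihoods for the winning / losing responses
            logp_w = torch.tensor(math.log(p_w), requires_grad=True)
            logp_l = torch.tensor(math.log(p_l), requires_grad=True)

            log_odds = lambda lp: lp - torch.log1p(-lp.exp())  # log( P / (1 - P) )

            loss_or = -F.logsigmoid(log_odds(logp_w) - log_odds(logp_l))  # odds ratio
            loss_pr = -F.logsigmoid(logp_w - logp_l)                      # probability ratio

            g_or = torch.autograd.grad(loss_or, (logp_w, logp_l), retain_graph=True)
            g_pr = torch.autograd.grad(loss_pr, (logp_w, logp_l))
            print(f"p_w={p_w:.2f} p_l={p_l:.2f} | OR loss {loss_or.item():.3f}, "
                  f"grads {[round(g.item(), 3) for g in g_or]} | "
                  f"PR loss {loss_pr.item():.3f}, grads {[round(g.item(), 3) for g in g_pr]}")

        for p_w, p_l in [(0.6, 0.4), (0.9, 0.8), (0.05, 0.04)]:
            compare(p_w, p_l)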

  • @ArijitBiswasGooglePlus

    At the beginning, you referred to a paper from Meta. Which paper is it?

  • @Jason-lm2yq
    @Jason-lm2yq 1 month ago

    Can you do one on Kolmogorov-Arnold Networks from MIT?

  • @SLAM2977
    @SLAM2977 1 month ago +1

    There seems to be a conceptual problem: where are the preferences coming from, given that they are expressed over multiple responses to the same prompt? Suppose we wish to fine-tune a foundation model for chat; we would not have the preferences before having done SFT and gathered some responses to chat-template-formatted prompts. That would force us to do SFT first and then SFT + odds-ratio loss. Doable, but surely not a single-pass approach.

  • @lenant
    @lenant 6 days ago

    Thanks for the explanation! But what do they consider as y_l? What are these tokens whose probability should be lower, and how do they select them?

    • @lenant
      @lenant 6 days ago

      I see in the paper they use the datasets argilla/ultrafeedback-binarized-preferences-cleaned and Anthropic/hh-rlhf, but I don't quite understand how teacher forcing works here with 2 different sequences.

    • @lenant
      @lenant 6 days ago

      Reading more into the paper, I think I got it: they don't add L_or per token, but rather to the whole loss from SFT (gathered over the generated tokens), and L_or is calculated from probabilities over the whole chosen and rejected sequences.
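
      That matches my reading of the paper as well; a rough sketch of how the two sequence-level log-probabilities might be gathered (the function and tensor names here are illustrative, not taken from the paper's code):

          import torch

          def sequence_logprob(logits, labels, completion_mask):
              """Length-normalized log P(y|x) for a batch of prompt+response sequences.

              logits:          [batch, seq_len, vocab] from an ordinary forward pass
              labels:          [batch, seq_len] target ids (inputs shifted by one)
              completion_mask: [batch, seq_len] 1.0 on response tokens, 0.0 on prompt/padding
              """
              logps = torch.log_softmax(logits, dim=-1)
              token_logps = torch.gather(logps, 2, labels.unsqueeze(-1)).squeeze(-1)
              mask = completion_mask.float()
              # average over the response tokens only, per the paper's definition of log P(y|x)
              return (token_logps * mask).sum(-1) / mask.sum(-1)

          # Teacher forcing is then just two ordinary forward passes per example, e.g.
          # (hypothetical model object and tensors):
          # chosen_logps   = sequence_logprob(model(chosen_ids).logits, chosen_labels, chosen_mask)
          # rejected_logps = sequence_logprob(model(rejected_ids).logits, rejected_labels, rejected_mask)
          # The two resulting scalars per example then feed the L_SFT + lambda * L_OR objective.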

  • @rectomgris
    @rectomgris 1 month ago +1

    makes me think of PPO

  • @davidhauser7537
    @davidhauser7537 1 month ago

    Yannic, can you do the xLSTM paper?

  • @john_blues
    @john_blues 1 month ago

    I don't even know what the title of this video means 😵‍💫. But I'm going to watch anyway.

  • @drdca8263
    @drdca8263 1 month ago +2

    0:52 : I wish we had a different term for this other than “alignment”

    • @TheRyulord
      @TheRyulord 1 month ago +1

      "Preference tuning" is used to describe it pretty often

    • @drdca8263
      @drdca8263 1 month ago +1

      @@TheRyulord thanks!

  • @amber9040
    @amber9040 1 month ago +2

    I feel like AI models have gotten more stale and same-y ever since RLHF became the norm. Playing around with GPT-3 was wild times. Hopefully alignment moves in a direction with more diverse ranges of responses in the future, and less censorship in domains where it's not needed.

    • @dinoscheidt
      @dinoscheidt 1 month ago

      LLMs are what Machine Learning has always been: input output. Quality data makes the cake…. no matter how many fancy mixers you bring to the table.

  • @Embassy_of_Jupiter
    @Embassy_of_Jupiter 1 month ago

    why hat, indeed

  • @iworeushankaonce
    @iworeushankaonce 1 month ago

    *posts videos almost every day*
    *KAN paper dropped, disappears for 2 weeks*
    I hope you're alright, man 🫂🤗