Direct Preference Optimization: Your Language Model is Secretly a Reward Model

SFT (Supervised Fine-Tuning) Phase
Reward Modeling Phase
The SFT model is prompted with prompts $x$ to produce pairs of answers $(y_1, y_2)\sim\pi^{SFT}(y|x)$
These are then presented to human labelers, who express a preference for one answer, denoted $y_w \succ y_l \mid x$, where $y_w$ and $y_l$ denote the preferred and dispreferred completion, respectively.
The preferences are assumed to be generated by some latent reward model $r^*(x,y)$, to which we do not have access.
The Bradley–Terry (BT) model stipulates that the human preference distribution $p^*$ can be written as:

Equation 1:
$$p^*(y_1 \succ y_2 \mid x) = \frac{\exp\left(r^*(x, y_1)\right)}{\exp\left(r^*(x, y_1)\right) + \exp\left(r^*(x, y_2)\right)}$$
Assuming access to a static dataset of comparisons $\mathcal{D}=\{x^{(i)}, y_w^{(i)}, y_l^{(i)}\}_{i=1}^N$ sampled from $p^*$, we can parametrize a reward model $r_\phi(x,y)$ and estimate the parameters via maximum likelihood.
Framing the problem as binary classification, we obtain the negative log-likelihood loss:

Equation 2:
$$\mathcal{L}_R(r_\phi, \mathcal{D}) = -\mathbb{E}_{(x, y_w, y_l)\sim\mathcal{D}}\left[\log \sigma\left(r_\phi(x, y_w) - r_\phi(x, y_l)\right)\right]$$
where $\sigma$ is the logistic (sigmoid) function.
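Eq. 2 is just logistic regression on the difference of the scalar rewards for the two completions. A minimal sketch in plain Python (the function names are illustrative, not from the paper):

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def bt_nll(r_chosen: float, r_rejected: float) -> float:
    """Per-pair negative log-likelihood under the Bradley-Terry model:
    -log sigma(r_phi(x, y_w) - r_phi(x, y_l))."""
    return -math.log(sigmoid(r_chosen - r_rejected))
```

Note the loss depends only on the reward margin: scoring the preferred completion higher yields a small loss, and a tie costs exactly $\log 2$.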
RL Fine-Tuning Phase
During the RL phase, we use the learned reward function to provide feedback to the language model.
In particular, we formulate the following optimization problem:

Equation 3:
$$\max_{\pi_\theta}\ \mathbb{E}_{x\sim\mathcal{D},\, y\sim\pi_\theta(y\mid x)}\bigl[r_\phi(x, y)\bigr] - \beta\,\mathbb{D}_{\mathrm{KL}}\bigl[\pi_\theta(y\mid x)\,\|\,\pi_{ref}(y\mid x)\bigr]$$
where $\beta$ controls the deviation from the reference policy $\pi_{ref}$ (in practice, the initial SFT model).

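For intuition, the KL-constrained objective in Eq. 3 can be evaluated exactly on a toy discrete "vocabulary", where both the expected reward and the KL term are finite sums (a sketch for a fixed prompt $x$; the dict-based representation is an assumption for illustration):

```python
import math

def rl_objective(pi: dict, pi_ref: dict, r: dict, beta: float) -> float:
    """E_{y~pi}[r(y)] - beta * KL(pi || pi_ref) for discrete distributions
    over the same finite support."""
    expected_reward = sum(pi[y] * r[y] for y in pi)
    kl = sum(pi[y] * math.log(pi[y] / pi_ref[y]) for y in pi if pi[y] > 0)
    return expected_reward - beta * kl
```

With a large $\beta$ the KL penalty dominates and staying near $\pi_{ref}$ scores better; with a small $\beta$ the policy is pushed toward the high-reward completion.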
Unlike prior RLHF methods, which learn a reward and then optimize it via RL, our approach bypasses the reward modeling step and directly optimizes a language model using preference data.
Our key insight is to leverage an analytical mapping from reward functions to optimal policies, which enables us to transform a loss function over reward functions into a loss function over policies.
Starting from Eq. 3, it is straightforward to show that the optimal solution to the KL-constrained reward maximization objective takes the form:

Equation 4:
$$\pi_r(y\mid x) = \frac{1}{Z(x)}\,\pi_{ref}(y\mid x)\exp\left(\frac{1}{\beta}r(x, y)\right)$$
where $Z(x) = \sum_y \pi_{ref}(y\mid x)\exp\left(\frac{1}{\beta}r(x, y)\right)$ is the partition function.

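The closed form in Eq. 4 can be checked on the same kind of toy discrete example: tilt $\pi_{ref}$ by $\exp(r/\beta)$ and normalize (a sketch; names are illustrative):

```python
import math

def optimal_policy(pi_ref: dict, r: dict, beta: float) -> dict:
    """Eq. 4 for a finite support: pi_r(y) = pi_ref(y) * exp(r(y)/beta) / Z(x)."""
    unnorm = {y: pi_ref[y] * math.exp(r[y] / beta) for y in pi_ref}
    Z = sum(unnorm.values())  # partition function Z(x)
    return {y: w / Z for y, w in unnorm.items()}
```

As $\beta \to \infty$ the exponential tilt disappears and $\pi_r \to \pi_{ref}$; as $\beta \to 0$ the policy concentrates on the highest-reward completion.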
We can rearrange Eq. 4 to express the reward function in terms of its corresponding optimal policy $\pi_r$, the reference policy $\pi_{ref}$, and the unknown partition function $Z(\cdot)$.

Equation 5:
$$r(x, y) = \beta \log\frac{\pi_r(y\mid x)}{\pi_{ref}(y\mid x)} + \beta \log Z(x)$$
Substituting the reparameterization in Eq. 5 for $r^*(x,y)$ into the preference model in Eq. 1, the partition function cancels, and we can express the human preference probability in terms of only the optimal policy $\pi^*$ and reference policy $\pi_{ref}$.

Equation 6:
$$p^*(y_1 \succ y_2 \mid x) = \frac{1}{1 + \exp\left(\beta\log\dfrac{\pi^*(y_2\mid x)}{\pi_{ref}(y_2\mid x)} - \beta\log\dfrac{\pi^*(y_1\mid x)}{\pi_{ref}(y_1\mid x)}\right)}$$
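Because $Z(x)$ is shared by both completions for the same prompt, only log-probability ratios survive in Eq. 6. A sketch in plain Python operating directly on log-probabilities (the argument names are illustrative):

```python
import math

def preference_prob(logp_w: float, logp_l: float,
                    ref_logp_w: float, ref_logp_l: float,
                    beta: float) -> float:
    """Eq. 6: p(y_w preferred over y_l | x) computed from policy and
    reference log-probs only; the partition function Z(x) has cancelled."""
    margin = beta * (logp_w - ref_logp_w) - beta * (logp_l - ref_logp_l)
    return 1.0 / (1.0 + math.exp(-margin))
```

The training loss is then the negative log of this probability averaged over the preference dataset, so the policy itself plays the role of the reward model.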