Direct Preference Optimization: Your Language Model is Secretly a Reward Model

SFT (Supervised Fine-Tuning) Phase
Reward Modeling Phase
The SFT model is prompted with prompts $x$ to produce pairs of answers $(y_1, y_2)\sim\pi^{SFT}(y|x)$
These are then presented to human labelers, who express a preference for one answer, denoted $y_w \succ y_l \mid x$, where $y_w$ and $y_l$ denote the preferred and dispreferred completion, respectively.
The preferences are assumed to be generated by some latent reward model $r^*(x,y)$, to which we do not have access.
The Bradley–Terry (BT) model stipulates that the human preference distribution $p^*$ can be written as:

Equation 1:
$$p^*(y_1 \succ y_2 \mid x) = \frac{\exp\left(r^*(x, y_1)\right)}{\exp\left(r^*(x, y_1)\right) + \exp\left(r^*(x, y_2)\right)}$$
Assuming access to a static dataset of comparisons $\mathcal{D}=\{x^{(i)}, y_w^{(i)}, y_l^{(i)}\}_{i=1}^N$ sampled from $p^*$, we can parametrize a reward model $r_\phi(x,y)$ and estimate the parameters via maximum likelihood.
Framing the problem as binary classification, we obtain the negative log-likelihood loss:

Equation 2:
$$\mathcal{L}_R(r_\phi, \mathcal{D}) = -\mathbb{E}_{(x, y_w, y_l)\sim\mathcal{D}}\left[\log \sigma\left(r_\phi(x, y_w) - r_\phi(x, y_l)\right)\right]$$
where $\sigma$ is the logistic (sigmoid) function.
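Eq. 2 is just logistic regression on the difference of the scalar rewards for the two completions. A minimal sketch in plain Python (the function names are illustrative, not from the paper):

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def bt_nll(r_chosen: float, r_rejected: float) -> float:
    """Per-pair negative log-likelihood under the Bradley-Terry model:
    -log sigma(r_phi(x, y_w) - r_phi(x, y_l))."""
    return -math.log(sigmoid(r_chosen - r_rejected))
```

Note the loss depends only on the reward margin: scoring the preferred completion higher yields a small loss, and a tie costs exactly $\log 2$.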
RL Fine-Tuning Phase
During the RL phase, we use the learned reward function to provide feedback to the language model.
In particular, we formulate the following optimization problem:

Equation 3:
$$\max_{\pi_\theta}\ \mathbb{E}_{x\sim\mathcal{D},\, y\sim\pi_\theta(y\mid x)}\bigl[r_\phi(x, y)\bigr] - \beta\,\mathbb{D}_{\mathrm{KL}}\bigl[\pi_\theta(y\mid x)\,\|\,\pi_{ref}(y\mid x)\bigr]$$
where $\beta$ controls the deviation from the reference policy $\pi_{ref}$ (in practice, the initial SFT model).

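For intuition, the KL-constrained objective in Eq. 3 can be evaluated exactly on a toy discrete "vocabulary", where both the expected reward and the KL term are finite sums (a sketch for a fixed prompt $x$; the dict-based representation is an assumption for illustration):

```python
import math

def rl_objective(pi: dict, pi_ref: dict, r: dict, beta: float) -> float:
    """E_{y~pi}[r(y)] - beta * KL(pi || pi_ref) for discrete distributions
    over the same finite support."""
    expected_reward = sum(pi[y] * r[y] for y in pi)
    kl = sum(pi[y] * math.log(pi[y] / pi_ref[y]) for y in pi if pi[y] > 0)
    return expected_reward - beta * kl
```

With a large $\beta$ the KL penalty dominates and staying near $\pi_{ref}$ scores better; with a small $\beta$ the policy is pushed toward the high-reward completion.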
Unlike prior RLHF methods, which learn a reward and then optimize it via RL, our approach bypasses the reward modeling step and directly optimizes a language model using preference data.
Our key insight is to leverage an analytical mapping from reward functions to optimal policies, which enables us to transform a loss function over reward functions into a loss function over policies.
Starting from Eq. 3, it is straightforward to show that the optimal solution to the KL-constrained reward maximization objective takes the form:

Equation 4:
$$\pi_r(y\mid x) = \frac{1}{Z(x)}\,\pi_{ref}(y\mid x)\exp\left(\frac{1}{\beta}r(x, y)\right)$$
where $Z(x) = \sum_y \pi_{ref}(y\mid x)\exp\left(\frac{1}{\beta}r(x, y)\right)$ is the partition function.

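The closed form in Eq. 4 can be checked on the same kind of toy discrete example: tilt $\pi_{ref}$ by $\exp(r/\beta)$ and normalize (a sketch; names are illustrative):

```python
import math

def optimal_policy(pi_ref: dict, r: dict, beta: float) -> dict:
    """Eq. 4 for a finite support: pi_r(y) = pi_ref(y) * exp(r(y)/beta) / Z(x)."""
    unnorm = {y: pi_ref[y] * math.exp(r[y] / beta) for y in pi_ref}
    Z = sum(unnorm.values())  # partition function Z(x)
    return {y: w / Z for y, w in unnorm.items()}
```

As $\beta \to \infty$ the exponential tilt disappears and $\pi_r \to \pi_{ref}$; as $\beta \to 0$ the policy concentrates on the highest-reward completion.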
We can rearrange Eq. 4 to express the reward function in terms of its corresponding optimal policy $\pi_r$, the reference policy $\pi_{ref}$, and the unknown partition function $Z(\cdot)$.

Equation 5:
$$r(x, y) = \beta \log\frac{\pi_r(y\mid x)}{\pi_{ref}(y\mid x)} + \beta \log Z(x)$$
Substituting the reparameterization in Eq. 5 for $r^*(x,y)$ into the preference model in Eq. 1, the partition function cancels, and we can express the human preference probability in terms of only the optimal policy $\pi^*$ and reference policy $\pi_{ref}$.

Equation 6:
$$p^*(y_1 \succ y_2 \mid x) = \frac{1}{1 + \exp\left(\beta\log\dfrac{\pi^*(y_2\mid x)}{\pi_{ref}(y_2\mid x)} - \beta\log\dfrac{\pi^*(y_1\mid x)}{\pi_{ref}(y_1\mid x)}\right)}$$
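Because $Z(x)$ is shared by both completions for the same prompt, only log-probability ratios survive in Eq. 6. A sketch in plain Python operating directly on log-probabilities (the argument names are illustrative):

```python
import math

def preference_prob(logp_w: float, logp_l: float,
                    ref_logp_w: float, ref_logp_l: float,
                    beta: float) -> float:
    """Eq. 6: p(y_w preferred over y_l | x) computed from policy and
    reference log-probs only; the partition function Z(x) has cancelled."""
    margin = beta * (logp_w - ref_logp_w) - beta * (logp_l - ref_logp_l)
    return 1.0 / (1.0 + math.exp(-margin))
```

The training loss is then the negative log of this probability averaged over the preference dataset, so the policy itself plays the role of the reward model.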