Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Introduction


Related Work (RLHF)

  1. SFT (Supervised Fine-Tuning) Phase

  2. Reward Modeling Phase

  3. RL Fine-Tuning Phase
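
For reference, the standard objectives behind phases 2 and 3 (as in the RLHF literature the DPO paper builds on) can be written as follows. Phase 2 fits a reward model $r_\phi$ to human preference pairs $(y_w, y_l)$ under the Bradley-Terry model, and phase 3 maximizes that reward under a KL penalty against the SFT reference policy $\pi_{\text{ref}}$:

```latex
% Phase 2: reward modeling (Bradley-Terry loss on preference pairs)
\mathcal{L}_R(r_\phi) =
  -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}
  \left[ \log \sigma\big( r_\phi(x, y_w) - r_\phi(x, y_l) \big) \right]

% Phase 3: KL-constrained RL fine-tuning
\max_{\pi_\theta} \;
  \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}
  \left[ r_\phi(x, y) \right]
  - \beta \, \mathbb{D}_{\mathrm{KL}}
  \left[ \pi_\theta(y \mid x) \,\|\, \pi_{\text{ref}}(y \mid x) \right]
```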


Direct Preference Optimization (DPO)
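
DPO skips the explicit reward model and RL phases: it optimizes the policy directly on preference pairs, using the log-probability ratio against the reference policy as an implicit reward. A minimal per-example sketch of the loss, assuming each argument is the summed log-probability of a full response (the function name and `beta` value are illustrative, not from the paper's code):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log sigmoid of the scaled margin between
    the policy's and the reference's chosen-vs-rejected log-ratios."""
    pi_logratio = policy_chosen_logp - policy_rejected_logp
    ref_logratio = ref_chosen_logp - ref_rejected_logp
    margin = beta * (pi_logratio - ref_logratio)
    # -log(sigmoid(margin)) written in a numerically direct form
    return math.log(1.0 + math.exp(-margin))

# When the policy equals the reference, the margin is 0 and the
# loss is log 2; it shrinks as the policy favors the chosen response.
```

The gradient of this loss upweights the chosen response and downweights the rejected one, scaled by how wrong the implicit reward currently is.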