DPO Meets PPO: Token-Level Reinforced Optimization to Unify the RLHF Paradigms
ArXiv ID: 2404.18922
Authors: Han Zhong, Zikang Shan, Guhao Feng (Peking Univ); Wei Xiong (Princeton); Microsoft Research
Affiliations: Peking University, Princeton University, Microsoft Research
Published: 2024-04-29 (last updated: 2025-05-21)
Venue: ICML 2025
Code: GitHub
Core Innovation
This ICML 2025 paper proposes Reinforced Token Optimization (RTO), the first method to successfully unify the two major RLHF paradigms, DPO and PPO. Traditional RLHF pipelines rely on either offline DPO (simple, but limited in performance) or online PPO (powerful, but complex)...
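To make the unification concrete, below is a minimal sketch of the token-level reward idea: RTO derives a per-token reward from the log-probability ratio between a DPO-trained policy and a frozen reference policy, which can then be fed to a PPO-style optimizer. The function name, the `beta` coefficient, and the toy log-probabilities are illustrative assumptions, not the paper's exact implementation.

```python
# Sketch (illustrative, not the authors' code): per-token reward from a
# DPO-style log-ratio, r_t = beta * (log pi_dpo(a_t|s_t) - log pi_ref(a_t|s_t)).

def token_level_rewards(dpo_logprobs, ref_logprobs, beta=0.1):
    """Compute one reward per generated token from two policies' log-probs."""
    assert len(dpo_logprobs) == len(ref_logprobs)
    return [beta * (d - r) for d, r in zip(dpo_logprobs, ref_logprobs)]

# Toy example: a response of 3 tokens.
dpo_lp = [-0.5, -1.2, -0.8]   # log-probs under the DPO-trained policy
ref_lp = [-0.9, -1.0, -0.8]   # log-probs under the frozen reference policy
rewards = token_level_rewards(dpo_lp, ref_lp, beta=0.1)
print(rewards)
```

A token the DPO policy prefers over the reference (first token above) earns a positive reward, a disfavored token a negative one, giving PPO a dense, token-level learning signal instead of a single sequence-level score.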