GSPO: The Qwen Team Redefines GRPO with Sequence-Level Optimization, and MoE Training Is Finally Stable
ArXiv ID: 2507.18071
Authors: Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, Jingren Zhou, Junyang Lin
Affiliation: Alibaba Group (Qwen Team)
Published: 2025-07-24
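Introduction: GRPO's Fatal Flaw

Since the release of DeepSeek-R1, GRPO (Group Relative Policy Optimization) has become the de facto standard for reinforcement learning training of LLMs. It removes the expensive critic network used in PPO and instead estimates advantages from group-relative rewards, which greatly reduces compute cost. But GRPO has a widely overlooked fundamental...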