Policy of Thoughts: Scaling LLM Reasoning via Test-time Policy Evolution
ArXiv ID: 2601.20379作者: Zhengbo Jiao, Hongyu Xian, Qinglong Wang, Yunpu Ma, Zhebo Wang, Zifan Zhang, Dezhang Kong, Meng Han发布日期: 2026-01-28内容级别: Deep Dive
摘要现有测试时计算扩展方法将反馈仅作为外部过滤机制,无法真正改进模型的推理策略。本文提出思维策略(Policy of Thoughts, PoT),将推理重构为实例级在线优化过程。PoT通过蒙特卡洛树搜索(MCTS)生成多样候选解,然后利用群组相对策略优化(GRPO)更新瞬时LoRA适配器,实现测试时的实时策略进化。核心理念源自波普尔...