GRPO-VPS: Enhancing Group Relative Policy Optimization with Verifiable Process Supervision for Effective Reasoning
Published in ICLR 2026 Workshop on LLM Reasoning, 2026
GRPO-VPS uses the segment-wise gain in the model's confidence in the ground-truth answer as a dense reward, making reasoning optimization more targeted and sample-efficient.
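As a rough illustration of what a "segment-wise confidence gain" could look like, the sketch below takes a list of confidences in the ground-truth answer, one per reasoning prefix, and returns the per-segment differences. This is only one plausible reading of the summary above, not the paper's implementation: the function name `segment_confidence_gains`, the input layout, and the example numbers are all hypothetical, and how the confidences are obtained from the model is left out entirely.

```python
from typing import List


def segment_confidence_gains(confidences: List[float]) -> List[float]:
    """Return the change in ground-truth confidence contributed by each segment.

    Assumption (hypothetical layout): confidences[0] is the confidence in the
    ground-truth answer given the prompt alone, and confidences[k] is the
    confidence after the first k reasoning segments.
    """
    if len(confidences) < 2:
        return []  # no segments, so no per-segment gains
    # Gain of segment k = confidence after segment k minus confidence before it.
    return [after - before for before, after in zip(confidences, confidences[1:])]


# Made-up confidences for a response split into four segments.
confidences = [0.10, 0.15, 0.60, 0.55, 0.90]
gains = segment_confidence_gains(confidences)

for index, gain in enumerate(gains, start=1):
    print(f"segment {index}: gain = {gain:+.2f}")
```

Under this reading, a segment that raises confidence in the correct answer gets a positive reward and one that lowers it gets a negative reward, so every segment receives its own signal rather than sharing a single outcome-level reward.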
Recommended citation: Wang, Jingyi, Lei Zhu, Tengjin Weng, Song-Li Wu, Haochen Tan, Jierun Chen, Chaofan Tao, Haoli Bai, Lu Hou, Lifeng Shang, and Xiao-Ping Zhang. (2026). "GRPO-VPS: Enhancing Group Relative Policy Optimization with Verifiable Process Supervision for Effective Reasoning." ICLR 2026 Workshop on LLM Reasoning.
