SPPO: Sequence-Level PPO for Long-Horizon Reasoning Tasks
arXiv CS.AI · April 13, 2026
Sequence-Level PPO (SPPO) addresses the limitations of standard token-level PPO in long-horizon LLM reasoning tasks by reformulating the process as a Sequence-Level Contextual Bandit problem. This approach uses a decoupled scalar value function to derive low-variance advantage signals, offering improved sample efficiency and stability without the high computational overhead of critic-free alternatives.
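
To make the bandit framing concrete, here is a minimal sketch of one way a sequence-level advantage and a clipped PPO objective could be computed. It is an illustration under stated assumptions, not the paper's implementation: the function names, the single per-sequence importance ratio, the squared-error value loss, and the clip range of 0.2 are all hypothetical choices made for this example.

```python
import math


def sequence_advantage(reward, value_estimate):
    """Scalar advantage for a whole response: A = R - V(prompt).

    Assumption: the value function scores the prompt (the bandit context)
    once, rather than predicting a value at every token.
    """
    return reward - value_estimate


def clipped_sequence_loss(new_token_logps, old_token_logps, advantage, clip_eps=0.2):
    """PPO clipped surrogate with one importance ratio per sequence (assumed form)."""
    # A sequence log-probability is the sum of its token log-probabilities.
    log_ratio = sum(new_token_logps) - sum(old_token_logps)
    ratio = math.exp(log_ratio)
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    # PPO maximizes the smaller of the two terms; negate to get a loss.
    return -min(ratio * advantage, clipped * advantage)


def value_loss(reward, value_estimate):
    """Squared-error regression of the scalar value toward the observed reward."""
    return 0.5 * (reward - value_estimate) ** 2


if __name__ == "__main__":
    adv = sequence_advantage(reward=1.0, value_estimate=0.25)
    old_logps = [-0.5, -0.6]
    new_logps = [-0.1, -0.2]
    print("advantage:", adv)
    print("policy loss:", clipped_sequence_loss(new_logps, old_logps, adv))
    print("value loss:", value_loss(1.0, 0.25))
```

In this sketch every token in a response shares the same scalar advantage, so the signal depends on a single value estimate per prompt instead of a per-token critic trajectory.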