Research · arXiv cs.AI · 4/13/2026
SPPO: Sequence-Level PPO for Long-Horizon Reasoning Tasks
Sequence-Level PPO (SPPO) addresses the limitations of standard token-level PPO in long-horizon LLM reasoning tasks by reformulating the task as a sequence-level contextual bandit problem. It uses a decoupled scalar value function to derive low-variance advantage signals, which the authors report improves sample efficiency and stability without the high computational overhead of critic-free alternatives.
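The summary gives no equations, so the following is only a minimal sketch of what a sequence-level advantage could look like, not the paper's actual objective. It assumes the advantage is the scalar response reward minus a scalar value estimate for the prompt, that this single advantage is shared by every token in the response, and that the usual PPO clipped surrogate is applied per token. The function names, the `clip_eps` default, and the squared-error value loss are all hypothetical choices made for illustration.

```python
import math


def sequence_advantage(reward, value_estimate):
    """Assumed form: one scalar advantage per response, A = R - V(prompt)."""
    return reward - value_estimate


def clipped_sequence_loss(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO clipped surrogate where every token shares one scalar advantage.

    logp_new / logp_old are per-token log-probabilities of the sampled
    response under the current and the behaviour policy. Applying the
    ratio per token is an assumption, not something the summary states.
    """
    per_token = []
    for lp_new, lp_old in zip(logp_new, logp_old):
        ratio = math.exp(lp_new - lp_old)
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Pessimistic (minimum) surrogate, negated so it can be minimised.
        per_token.append(-min(ratio * advantage, clipped * advantage))
    return sum(per_token) / len(per_token)


def value_loss(value_estimate, reward):
    """Assumed critic target: regress the scalar value onto the reward."""
    return (value_estimate - reward) ** 2


# Toy usage: a 3-token response with reward 1.0 and a value estimate of 0.4.
adv = sequence_advantage(reward=1.0, value_estimate=0.4)
on_policy = clipped_sequence_loss([-1.0, -2.0, -0.5], [-1.0, -2.0, -0.5], adv)
shifted = clipped_sequence_loss([0.0, 0.0, 0.0], [-1.0, -1.0, -1.0], adv)
critic = value_loss(0.4, 1.0)
```

Because the value function in this sketch outputs one number per prompt rather than one per token, no temporal credit assignment across the response is needed, which is the property the bandit framing is meant to capture.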