
SPPO: Sequence-Level PPO for Long-Horizon Reasoning Tasks

arXiv CS.AI·April 13, 2026

Sequence-Level PPO (SPPO) addresses the limitations of standard token-level PPO in long-horizon LLM reasoning tasks by reformulating the reasoning process as a Sequence-Level Contextual Bandit problem. It uses a decoupled scalar value function to derive low-variance advantage signals, improving sample efficiency and stability without the high computational overhead of critic-free alternatives.
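To make the contextual-bandit framing concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the paper's implementation: it assumes the advantage is the sequence reward minus a single scalar value estimate for the prompt, and that this one advantage is shared by every token in a standard PPO clipped surrogate. The function names, the `clip_eps` default, and the squared-error value loss are all hypothetical choices made for this example.

```python
import math


def sequence_advantage(reward, value_estimate):
    # Assumption: one scalar advantage per response, A = R - V(prompt).
    return reward - value_estimate


def clipped_sequence_loss(new_logps, old_logps, advantage, clip_eps=0.2):
    # Standard PPO clipped surrogate, except that every token in the
    # response shares the same sequence-level advantage (assumed form).
    per_token = []
    for new_lp, old_lp in zip(new_logps, old_logps):
        ratio = math.exp(new_lp - old_lp)
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        per_token.append(-min(ratio * advantage, clipped * advantage))
    return sum(per_token) / len(per_token)


def value_loss(value_estimate, reward):
    # Assumption: the scalar value function is regressed onto the
    # observed sequence reward with a squared error.
    return (value_estimate - reward) ** 2


# Hypothetical numbers: a 3-token response that earned reward 1.0,
# with the value function predicting 0.4 for the prompt.
adv = sequence_advantage(reward=1.0, value_estimate=0.4)
old = [-1.2, -0.7, -2.0]
new = [-1.1, -0.7, -1.9]
print(clipped_sequence_loss(new, old, adv))
print(value_loss(0.4, 1.0))
```

In this sketch the value function is queried once per prompt rather than once per token, so no per-token bootstrapping is needed, and a single sampled response per prompt is enough to form the advantage.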

LLMs · reasoning tasks · reinforcement learning · PPO · Chain-of-Thought
