
reasoning tasks

1 item

RESEARCH · arXiv CS.AI · 4/13/2026

SPPO: Sequence-Level PPO for Long-Horizon Reasoning Tasks

Sequence-Level PPO (SPPO) addresses the limitations of standard token-level PPO in long-horizon LLM reasoning tasks by reformulating the process as a Sequence-Level Contextual Bandit problem. This approach uses a decoupled scalar value function to derive low-variance advantage signals, offering improved sample efficiency and stability without the high computational overhead of critic-free alternatives.
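As a rough illustration of the idea in the summary, the sketch below treats a whole response as one bandit action: a single scalar advantage (reward minus a value estimate for the prompt) is combined with a PPO-style clipped ratio computed over the full sequence. The function names, the `clip_eps` default, and the exact form of the objective are assumptions made for illustration; the summary does not spell out SPPO's actual loss.

```python
import math


def sequence_advantage(reward, value_estimate):
    """One scalar advantage per sequence: A = R - V(prompt).

    Assumption: the scalar value function scores the prompt only,
    as in a contextual bandit, rather than every token position.
    """
    return reward - value_estimate


def clipped_sequence_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO-style clipped surrogate using a single sequence-level ratio.

    logp_new and logp_old are lists of per-token log-probabilities under
    the current and behaviour policies; their sums are the sequence
    log-probabilities. (Hypothetical sketch, not the paper's exact loss.)
    """
    ratio = math.exp(sum(logp_new) - sum(logp_old))
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    # Pessimistic bound: take the smaller of the two surrogate values.
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    adv = sequence_advantage(reward=1.0, value_estimate=0.25)
    old = [-1.0, -0.5, -2.0]
    new = [-0.9, -0.4, -1.7]  # sequence log-prob increased by 0.5
    print(adv, clipped_sequence_objective(new, old, adv))
```

Because the advantage is one number per response, every token in the sequence shares the same learning signal, which is what distinguishes this framing from token-level advantage estimation.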
