
Design Conditions for Intra-Group Learning of Sequence-Level Rewards: Token Gradient Cancellation

arXiv cs.LG · April 16, 2026

This paper presents a necessary condition for designing intra-group learning algorithms in reinforcement learning: the objective must maintain gradient exchangeability across token updates, so that reward-irrelevant drift is prevented. It proposes minimal transformations that restore this cancellation structure, which stabilizes training and improves sample efficiency.

reinforcement learning, large language models, gradient dynamics, model optimization
