
Self-Distilled Policy Gradient

arXiv CS.LG · June 4, 2026

This paper introduces Self-Distilled Policy Gradient (SDPG), a novel framework that enhances sparse-reward reinforcement learning through on-policy self-distillation. SDPG integrates group-relative verifier advantages, exact full-vocabulary self-distillation, and KL regularization, demonstrating improved stability and performance over existing baselines.
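Two of the ingredients named above can be illustrated in isolation. The following is a minimal sketch, not the paper's implementation: it assumes "group-relative" means centering and scaling verifier rewards within a group of samples drawn for the same prompt, and that "exact full-vocabulary" means summing a KL term over every vocabulary entry rather than estimating it from sampled tokens. The function names, the KL direction, and the `eps` constant are all hypothetical choices made for illustration; the summary does not specify them.

```python
import math
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Center and scale rewards within one group of samples for the same prompt.

    Assumption: mean/std normalization, as in common group-relative schemes.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + eps) for r in rewards]


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def full_vocab_kl(p_logits: List[float], q_logits: List[float]) -> float:
    """Exact KL(p || q) at one token position, summed over the whole vocabulary.

    Assumption: the direction KL(p || q) is chosen arbitrarily for this sketch.
    Terms with p_i == 0 contribute nothing and are skipped.
    """
    p = softmax(p_logits)
    q = softmax(q_logits)
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0.0)


# Hypothetical group of four samples scored by a binary verifier.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# Hypothetical logits over a three-token vocabulary.
kl_value = full_vocab_kl([2.0, 0.5, -1.0], [1.0, 1.0, 0.0])
```

With a sparse binary verifier, the group normalization gives correct samples a positive advantage and incorrect ones a negative advantage even when absolute rewards are rare. How SDPG weights these terms against its KL regularizer is not stated in the summary and is left out here.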
