RESEARCH

Self-Distilled Policy Gradient

arXiv cs.LG · June 4, 2026

This paper introduces Self-Distilled Policy Gradient (SDPG), a framework that applies on-policy self-distillation to sparse-reward reinforcement learning. SDPG combines three components: group-relative verifier advantages, exact full-vocabulary self-distillation, and KL regularization. The authors report improved training stability and performance over existing baselines.
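
The summary names its components without defining them, so the following is only a rough illustration of two of the terms, not the paper's method. It assumes that "group-relative" means standardizing verifier rewards across several samples drawn for the same prompt, and that "full-vocabulary" means a KL divergence summed over every vocabulary entry rather than evaluated only at the sampled token. The function names, the `eps` constant, and the direction of the KL are hypothetical choices made for this sketch.

```python
import math


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples for the same prompt.

    Assumption: mean/std normalization; the paper may define this differently.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + eps) for r in rewards]


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def full_vocab_kl(p_logits, q_logits):
    """KL(p || q) summed over every vocabulary entry at one token position.

    Assumption: which distribution plays p and which plays q is not stated
    in the summary; this argument order is illustrative only.
    """
    log_p = log_softmax(p_logits)
    log_q = log_softmax(q_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


# Four sampled answers to one prompt, scored 0/1 by a verifier.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Toy three-token vocabulary: two slightly different logit vectors.
divergence = full_vocab_kl([2.0, 0.5, -1.0], [1.5, 0.7, -0.5])
```

With sparse 0/1 rewards, a group in which every sample gets the same score yields all-zero advantages under this normalization, which is one reason a dense auxiliary signal such as a distillation term can be useful.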

language models · deep learning · reinforcement learning · Policy Gradient · Self-Distillation
