
Self-Distillation Zero: Self-Revision Turns Binary Rewards into Dense Supervision

arXiv cs.CL · April 15, 2026

Self-Distillation Zero (SD-Zero) is a post-training method designed to be more sample-efficient than traditional reinforcement learning, without requiring an external teacher or high-quality demonstrations. A single model acts as both Generator and Reviser: the Reviser's improved responses and token distributions provide dense supervision for the Generator through on-policy self-distillation.
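To make "dense supervision" concrete, here is a minimal toy sketch in plain Python, not the paper's implementation. It assumes the distillation signal is a per-token KL divergence from the Reviser's token distribution to the Generator's; the KL direction, the function names, and the hard-coded logits are all illustrative assumptions, since the summary does not specify the loss. The point is only the contrast between a single binary reward per response and one loss term per token position.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_kl(teacher_probs, student_probs, eps=1e-12):
    """KL(teacher || student) for one token position (assumed direction)."""
    return sum(
        p * (math.log(p + eps) - math.log(q + eps))
        for p, q in zip(teacher_probs, student_probs)
    )


def dense_distillation_losses(reviser_logits, generator_logits):
    """Return one loss value per token position (hypothetical helper)."""
    return [
        token_kl(softmax(r), softmax(g))
        for r, g in zip(reviser_logits, generator_logits)
    ]


# Hypothetical logits over a 4-token vocabulary at 3 positions of one response.
# In a real setup both would come from the same model under different prompts.
generator_logits = [
    [2.0, 0.5, 0.1, 0.1],
    [0.3, 0.2, 1.5, 0.1],
    [1.0, 1.0, 1.0, 1.0],
]
reviser_logits = [
    [2.0, 0.5, 0.1, 0.1],  # Reviser agrees with the Generator here
    [0.3, 2.5, 0.2, 0.1],  # Reviser prefers a different token
    [0.1, 0.1, 0.1, 3.0],  # Reviser is confident where the Generator is not
]

binary_reward = 0  # sparse signal: one scalar for the whole response
losses = dense_distillation_losses(reviser_logits, generator_logits)

print("binary reward:", binary_reward)
print("per-token losses:", [round(x, 4) for x in losses])
```

In this toy example the binary reward says only that the response failed, while the per-token losses are zero where the two distributions agree and positive where the Reviser's distribution diverges, which indicates which positions the Generator should change.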

reinforcement learning, post-training, dense supervision, self-distillation, large language models
