RESEARCH
Multi-Rollout On-Policy Distillation via Peer Successes and Failures
arXiv cs.LG · May 14, 2026
The paper introduces Multi-Rollout On-Policy Distillation (MOPD), a framework for post-training large language models that uses the student's local rollout group to construct more informative teacher signals. MOPD conditions the teacher on both successful and failed peer rollouts: successes supply valid reasoning patterns, and failures show plausible mistakes to avoid.
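To make the idea of conditioning a teacher on peer rollouts concrete, here is a minimal sketch in Python. It is not the paper's implementation: the `Rollout` type, the `build_teacher_context` helper, the reward threshold, the peer cap, and the prompt wording are all hypothetical choices made for illustration. The summary above only states that the teacher sees both successful and failed peers from the same rollout group.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    """One student sample for a prompt (hypothetical structure)."""
    text: str
    reward: float  # assumed scalar score, e.g. 1.0 if the answer was verified correct


def build_teacher_context(
    prompt: str,
    group: List[Rollout],
    target_index: int,
    threshold: float = 0.5,  # assumed cutoff separating success from failure
    max_peers: int = 2,      # assumed cap on peers of each kind
) -> str:
    """Assemble a teacher context from the peers of one target rollout.

    The target rollout itself is excluded, so the teacher is conditioned only
    on what the other samples in the same group did.
    """
    peers = [r for i, r in enumerate(group) if i != target_index]
    successes = [r for r in peers if r.reward >= threshold][:max_peers]
    failures = [r for r in peers if r.reward < threshold][:max_peers]

    parts = [f"Problem:\n{prompt}"]
    for r in successes:
        parts.append(f"Successful peer attempt:\n{r.text}")
    for r in failures:
        parts.append(f"Failed peer attempt (avoid this mistake):\n{r.text}")
    return "\n\n".join(parts)
```

In a distillation loop, the resulting string would presumably be prepended to the teacher's input before scoring the target rollout's tokens. How MOPD actually formats this context and selects peers is not stated in the summary.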