
Multi-Rollout On-Policy Distillation via Peer Successes and Failures

arXiv CS.LG · May 14, 2026

The paper introduces Multi-Rollout On-Policy Distillation (MOPD), a framework for post-training large language models that uses the student's local rollout group to construct more informative teacher signals. MOPD conditions the teacher on both successful and failed peer rollouts: successes supply valid reasoning patterns, and failures show plausible mistakes to avoid.
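
As a rough illustration of the conditioning step described above, the sketch below splits a rollout group into peer successes and failures and assembles them into a teacher context for one target rollout. Everything here is an assumption made for illustration, not the paper's implementation: the `Rollout` type, the `build_teacher_context` function, the binary `correct` label, the `max_peers` cap, the prompt wording, and the choice to exclude the target rollout from its own peer set.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    text: str      # sampled student response
    correct: bool  # verifier outcome (assumed binary for this sketch)


def build_teacher_context(prompt: str, group: List[Rollout],
                          target_index: int, max_peers: int = 2) -> str:
    """Assemble a teacher prompt from the peers of one rollout (hypothetical format)."""
    # Peers are every rollout in the group except the one being scored.
    peers = [r for i, r in enumerate(group) if i != target_index]
    successes = [r.text for r in peers if r.correct][:max_peers]
    failures = [r.text for r in peers if not r.correct][:max_peers]

    parts = [f"Problem:\n{prompt}"]
    if successes:
        parts.append("Peer attempts that succeeded:\n"
                     + "\n".join(f"- {s}" for s in successes))
    if failures:
        parts.append("Peer attempts that failed:\n"
                     + "\n".join(f"- {f}" for f in failures))
    parts.append("Student attempt to score:\n" + group[target_index].text)
    return "\n\n".join(parts)


group = [Rollout("2+2=4", True), Rollout("2+2=5", False), Rollout("2+2=22", False)]
context = build_teacher_context("What is 2+2?", group, target_index=2)
```

In a distillation loop, a teacher model would presumably score the target rollout's tokens under this enriched context; that step depends on details the summary does not give and is left out here.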
