
Multi-Rollout On-Policy Distillation via Peer Successes and Failures

arXiv cs.LG · May 14, 2026

The paper introduces Multi-Rollout On-Policy Distillation (MOPD), a framework for post-training large language models that uses the student's local rollout group to construct more informative teacher signals. MOPD conditions the teacher on both successful and failed peer rollouts: successes supply valid reasoning patterns, while failures expose plausible mistakes to avoid.
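
As a rough illustration of the conditioning step, the sketch below assembles a teacher-side context for one student rollout from the other rollouts in its group. It is not the paper's implementation: the `Rollout` type, the `success` flag (imagined as coming from some verifier), the `build_teacher_context` helper, the prompt wording, and the choice to exclude the target rollout from its own context are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    text: str      # a response sampled from the student
    success: bool  # hypothetical correctness label, e.g. from a verifier


def build_teacher_context(prompt: str, group: list[Rollout], target: int) -> str:
    """Build the context a teacher might see when scoring rollout `target`.

    Peers are every rollout in the group except the target itself
    (an assumption of this sketch), split by their success label.
    """
    peers = [r for i, r in enumerate(group) if i != target]
    successes = [r.text for r in peers if r.success]
    failures = [r.text for r in peers if not r.success]

    parts = [f"Problem:\n{prompt}"]
    if successes:
        parts.append("Peer attempts that succeeded:\n"
                     + "\n".join(f"- {s}" for s in successes))
    if failures:
        parts.append("Peer attempts that failed:\n"
                     + "\n".join(f"- {f}" for f in failures))
    return "\n\n".join(parts)


# Toy group of three rollouts for one prompt; the last one is the target.
group = [
    Rollout("2 + 2 = 4", True),
    Rollout("2 + 2 = 5", False),
    Rollout("2 + 2 = 22", False),
]
ctx = build_teacher_context("What is 2 + 2?", group, target=2)
print(ctx)
```

In a distillation loop, a context like `ctx` would be given to the teacher before it scores the target rollout's tokens. How MOPD actually formats the peer rollouts and turns the teacher's output into a training signal is not stated in this summary.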

distillation, reinforcement learning, AI training, machine learning, large language models
