RESEARCH
How Transformers Learn to Plan via Multi-Token Prediction
arXiv cs.LG · April 15, 2026
This paper investigates how multi-token prediction (MTP) enables Transformers to learn to plan, outperforming standard next-token prediction (NTP). Empirically, MTP consistently improves performance on reasoning tasks. Theoretically, the paper argues that MTP induces a two-stage reverse reasoning process via gradient decoupling.
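To make the NTP/MTP distinction concrete, here is a minimal sketch of how the training targets differ. It is illustrative only: the summary does not give the paper's exact formulation, so the helper names (`ntp_targets`, `mtp_targets`, `mtp_loss`), the one-head-per-offset layout, and the uniform averaging of per-head losses are all assumptions.

```python
import math


def ntp_targets(tokens):
    """Pair each prefix with the single token that follows it (standard NTP)."""
    return [(tokens[: t + 1], tokens[t + 1]) for t in range(len(tokens) - 1)]


def mtp_targets(tokens, k):
    """Pair each prefix with up to k future tokens (fewer near the sequence end)."""
    return [
        (tokens[: t + 1], tokens[t + 1 : t + 1 + k])
        for t in range(len(tokens) - 1)
    ]


def mtp_loss(head_probs, future):
    """Mean cross-entropy over the heads that have a target.

    Assumption: head_probs[i] is a probability list over the vocabulary
    produced by a hypothetical head predicting the token at offset i + 1.
    """
    losses = [-math.log(head_probs[i][tok]) for i, tok in enumerate(future)]
    return sum(losses) / len(losses)


tokens = [3, 1, 4, 1, 5]
print(ntp_targets(tokens)[0])     # prefix [3] paired with one next token
print(mtp_targets(tokens, 2)[0])  # prefix [3] paired with two future tokens
```

With `k = 1`, `mtp_targets` carries the same information as `ntp_targets`, so under these assumptions NTP is the special case of MTP with a single head.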