RESEARCH · arXiv cs.LG · 4/15/2026
How Transformers Learn to Plan via Multi-Token Prediction
This paper investigates how multi-token prediction (MTP) enables Transformers to learn to plan, where it outperforms standard next-token prediction (NTP). Empirically, MTP consistently improves performance on reasoning tasks; theoretically, it induces a two-stage reverse reasoning process via gradient decoupling.
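To make the difference between the two objectives concrete, here is a minimal sketch of one common way to set up MTP training targets and loss. The summary does not spell out the paper's exact objective, so everything here is an assumption for illustration: the names `mtp_targets` and `mtp_loss`, the parameter `k` (number of future tokens predicted per position), and the uniform averaging over prediction heads are hypothetical.

```python
import math

def mtp_targets(tokens, k):
    """For each position t, return the next k tokens as prediction targets.

    With k=1 this reduces to ordinary next-token prediction (NTP).
    Positions that do not have k future tokens are skipped (an assumption).
    """
    return [tokens[t + 1 : t + 1 + k] for t in range(len(tokens) - k)]

def mtp_loss(head_probs, tokens, k):
    """Average cross-entropy over positions and over the k prediction heads.

    head_probs[t][j] is a dict mapping token -> predicted probability for
    the token at offset j+1 from position t (a hypothetical layout).
    """
    total, count = 0.0, 0
    for t, future in enumerate(mtp_targets(tokens, k)):
        for j, tok in enumerate(future):
            total += -math.log(head_probs[t][j][tok])
            count += 1
    return total / count

# Usage: a 4-token sequence with k=2 yields two training positions,
# each supervised on its next two tokens.
tokens = [1, 2, 3, 4]
targets = mtp_targets(tokens, k=2)
probs = [[{2: 0.5}, {3: 0.5}], [{3: 0.5}, {4: 0.5}]]
loss = mtp_loss(probs, tokens, k=2)
```

In this sketch, each position is supervised on several future tokens at once rather than only the immediate next one, which is the basic contrast with NTP that the summary describes.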