
How Transformers Learn to Plan via Multi-Token Prediction

arXiv cs.LG · April 15, 2026

This paper investigates how multi-token prediction (MTP) enables Transformers to learn to plan, outperforming standard next-token prediction (NTP). Empirically, MTP consistently improves performance on reasoning tasks. Theoretically, it induces a two-stage reverse reasoning process via gradient decoupling.
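As a rough illustration of how the two objectives differ, the sketch below computes a multi-token loss as the average cross-entropy over k prediction heads, each targeting a token further ahead. The summary does not give the paper's exact formulation, so the function names, the `head_logits[t][j]` layout, and the uniform averaging over heads are all assumptions made for illustration.

```python
import math

def cross_entropy(logits, target):
    """Negative log-likelihood of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def mtp_loss(head_logits, tokens, k):
    """Average cross-entropy over k future-token heads (hypothetical layout).

    head_logits[t][j] holds the logits that head j emits at position t
    for the token at position t + 1 + j. With k = 1 this is the NTP loss.
    """
    total, count = 0.0, 0
    for t in range(len(tokens)):
        for j in range(k):
            target_pos = t + 1 + j
            if target_pos >= len(tokens):
                break  # no target exists this far ahead
            total += cross_entropy(head_logits[t][j], tokens[target_pos])
            count += 1
    return total / count

# Toy usage: vocabulary of 3, sequence of 4 tokens, uniform logits everywhere.
tokens = [0, 2, 1, 0]
uniform = [[[0.0, 0.0, 0.0] for _ in range(2)] for _ in tokens]
ntp = mtp_loss(uniform, tokens, k=1)  # next-token objective only
mtp = mtp_loss(uniform, tokens, k=2)  # also predicts two tokens ahead
```

With k = 1 only the immediate next token contributes to the loss. Larger k adds supervision from tokens further ahead, which is the sense in which MTP asks the model to look beyond the next step.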

Next-token prediction · Planning · Multi-token prediction · Reasoning · Transformers
