
The Long Delay to Arithmetic Generalization: When Learned Representations Outrun Behavior

arXiv CS.LG · April 16, 2026

This research investigates the 'grokking' phenomenon in transformers, finding that the long delay to generalization in arithmetic models stems from a decoder bottleneck. The encoder acquires relevant structural knowledge early, but the decoder struggles to access it, a hypothesis supported by causal interventions like transplanting encoders.
