Data Scaling as Progressive Coverage of a Predictive Contribution Spectrum

arXiv cs.CL · May 21, 2026

This research investigates whether real-data scaling laws are governed by progressive coverage of a latent predictive contribution spectrum rather than by token frequency alone. Using a suffix automaton and a global-KL predictive contribution spectrum, the study finds a strong correlation between the spectrum's tail slope and the data-scaling exponent of GPT learners, and reports that the effective truncation rank scales logarithmically.
