Data Scaling as Progressive Coverage of a Predictive Contribution Spectrum
arXiv cs.CL · May 21, 2026
This research investigates whether real-data scaling laws are governed by progressive coverage of a latent predictive contribution spectrum rather than by token frequency alone. Using a suffix automaton and a global-KL predictive contribution spectrum, the study finds a strong correlation between the spectrum's tail slope and the data-scaling exponent of GPT learners, and reports that the effective truncation rank scales logarithmically.
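To make "tail slope" concrete, the sketch below shows one common way such a slope could be estimated: a least-squares fit of log value against log rank over the tail of a sorted spectrum. This is an illustration only, not the paper's procedure. The function name `tail_slope`, the `tail_start` cutoff, and the synthetic power-law spectrum are all assumptions introduced here; the summary does not say how the slope is measured.

```python
import numpy as np


def tail_slope(spectrum, tail_start):
    """Hypothetical helper: slope of log(value) vs. log(rank) for ranks >= tail_start.

    Assumes the tail of the spectrum is roughly a power law, value ~ rank**slope.
    """
    # Sort contributions in descending order so that rank 1 is the largest.
    values = np.sort(np.asarray(spectrum, dtype=float))[::-1]
    ranks = np.arange(1, len(values) + 1, dtype=float)

    # Keep only the tail, and drop non-positive values before taking logs.
    mask = (ranks >= tail_start) & (values > 0)

    # polyfit returns coefficients from highest degree down: [slope, intercept].
    slope, _intercept = np.polyfit(np.log(ranks[mask]), np.log(values[mask]), 1)
    return slope


# Synthetic spectrum (an assumption for illustration): exact power law, exponent -1.5.
ranks = np.arange(1, 1001, dtype=float)
synthetic_spectrum = ranks ** -1.5
estimated = tail_slope(synthetic_spectrum, tail_start=10)
```

On the synthetic input the fit recovers the exponent used to build it. On real data the estimate would depend on the choice of `tail_start`, so it is worth checking a few cutoffs before comparing the slope with anything else.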