
Data Mixing for Large Language Models Pretraining: A Survey and Outlook

arXiv CS.CL · April 21, 2026

This paper provides a comprehensive survey on data mixing for Large Language Model (LLM) pretraining, a crucial factor for training efficiency and downstream generalization. It formalizes data mixture optimization as a bilevel problem and introduces a fine-grained taxonomy for existing methods.
