
GEM: Geometric Entropy Mixing for Optimal LLM Data Curation

arXiv CS.LG · May 27, 2026

This paper introduces GEM (Geometric Entropy Mixing), a framework that reformulates LLM data curation as a variational problem on the hypersphere. GEM optimizes the data composition for LLM pre-training, overcoming categorization flaws and discovering balanced semantic structures.
