RESEARCH · arXiv CS.LG · 13d ago
GEM: Geometric Entropy Mixing for Optimal LLM Data Curation
This paper introduces GEM (Geometric Entropy Mixing), a framework for LLM data curation that reformulates the choice of pre-training data composition as a variational problem on the hypersphere. The approach is presented as overcoming categorization flaws and discovering balanced semantic structures in the data.