RESEARCH
Open-World Evaluations for Measuring Frontier AI Capabilities
arXiv CS.AI · May 21, 2026
This paper advocates for "open-world evaluations" as a complement to traditional benchmarks for measuring frontier AI capabilities. It introduces CRUX, a project for running such evaluations regularly on long-horizon, real-world tasks, and illustrates the approach with an AI agent that successfully published an iOS app.