RESEARCH · arXiv CS.AI · 19d ago
Open-World Evaluations for Measuring Frontier AI Capabilities
This paper advocates for "open-world evaluations" as a complement to traditional benchmarks for measuring frontier AI capabilities. It introduces CRUX, a project for conducting regular, long-horizon assessments on real-world tasks, exemplified by an AI agent successfully publishing an iOS app.