open-world evaluations — AI articles, news & research

RESEARCH · arXiv CS.AI · 19d ago

Open-World Evaluations for Measuring Frontier AI Capabilities

This paper advocates "open-world evaluations" as a complement to traditional benchmarks for measuring frontier AI capabilities. It introduces CRUX, a project for conducting such evaluations regularly on long-horizon, real-world tasks, and illustrates the approach with an AI agent that successfully published an iOS app.

AI capabilities · CRUX project · open-world evaluations · frontier AI