← heapsort

Open-World Evaluations for Measuring Frontier AI Capabilities

arXiv CS.AI · May 21, 2026

This paper advocates for "open-world evaluations" as a complement to traditional benchmarks for measuring frontier AI capabilities. It introduces CRUX, a project for conducting such evaluations regularly on long-horizon, real-world tasks. One example is an AI agent that successfully published an iOS app.
