AgentAtlas: Beyond Outcome Leaderboards for LLM Agents
AgentAtlas addresses the fragmentation of benchmarks for evaluating large language model (LLM) agents: existing benchmarks each emphasize a different unit of measurement. It introduces four components, among them a six-state control-decision taxonomy, a nine-category trajectory-failure taxonomy, and a methodology for measuring model capability based on prompt supervision.