diagnostic benchmark — AI articles, news & research

RESEARCHarXiv CS.AI·4/15/2026

The Long-Horizon Task Mirage? Diagnosing Where and Why Agentic Systems Break

This research addresses the breakdown of LLM agents in long-horizon tasks, which require extended, interdependent action sequences. It introduces HORIZON, a cross-domain diagnostic benchmark designed to systematically construct tasks and analyze failure behaviors, evaluating state-of-the-art agents and proposing an LLM-as-a-Judge pipeline for scalable failure attribution.

Agentic Systems Long-horizon tasks LLM Agents failure diagnosis