RESEARCH · arXiv CS.AI · 4/15/2026
The Long-Horizon Task Mirage? Diagnosing Where and Why Agentic Systems Break
This research examines why LLM agents break down on long-horizon tasks, which require extended, interdependent action sequences. It introduces HORIZON, a cross-domain diagnostic benchmark for systematically constructing tasks and analyzing failure behaviors, evaluates state-of-the-art agents, and proposes an LLM-as-a-Judge pipeline for scalable failure attribution.