
The Long-Horizon Task Mirage? Diagnosing Where and Why Agentic Systems Break

arXiv cs.AI · April 15, 2026

This research examines why LLM agents break down on long-horizon tasks, which require extended, interdependent action sequences. It introduces HORIZON, a cross-domain diagnostic benchmark for systematically constructing tasks and analyzing failure behaviors, evaluates state-of-the-art agents, and proposes an LLM-as-a-Judge pipeline for scalable failure attribution.

Agentic Systems · Long-horizon tasks · LLM Agents · failure diagnosis · diagnostic benchmark
