Long-horizon tasks

3 items

RESEARCH · arXiv CS.AI · 20d ago

DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows

DecisionBench is a new benchmark for emergent delegation in long-horizon agentic workflows. It comprises a fixed task suite, a peer-model pool, and a multi-axis metric suite for evaluating delegation quality and cost.

Long-horizon tasks · workflow automation · Benchmarking · delegation

RESEARCH · arXiv CS.AI · 4/15/2026

The Long-Horizon Task Mirage? Diagnosing Where and Why Agentic Systems Break

This research addresses why LLM agents break down on long-horizon tasks, which require extended, interdependent action sequences. It introduces HORIZON, a cross-domain diagnostic benchmark for systematically constructing tasks and analyzing failure behaviors, evaluates state-of-the-art agents, and proposes an LLM-as-a-Judge pipeline for scalable failure attribution.

Agentic Systems · Long-horizon tasks · LLM Agents · failure diagnosis

RESEARCH · arXiv CS.LG · 8d ago

LongDS-Bench: On the Failure of Long-Horizon Agentic Data Analysis

This research introduces LongDS, a new benchmark for evaluating AI agents on long-horizon, multi-turn data analysis tasks, featuring 68 tasks drawn from real-world Kaggle notebooks. It finds that state-of-the-art models achieve only 48.45% accuracy and that performance drops significantly in later turns, highlighting a critical failure to track evolving analytical context.

Long-horizon tasks · Kaggle · AI Benchmarks · data analysis