LongDS-Bench: On the Failure of Long-Horizon Agentic Data Analysis

arXiv CS.LG · June 1, 2026

This research introduces LongDS, a benchmark for evaluating AI agents on long-horizon, multi-turn data analysis, comprising 68 tasks drawn from real-world Kaggle notebooks. State-of-the-art models achieve only 48.45% accuracy, and performance drops significantly in later turns, pointing to a critical failure to track evolving analytical context.
