Kaggle

4 items

RESEARCH · arXiv CS.LG · 8d ago

LongDS-Bench: On the Failure of Long-Horizon Agentic Data Analysis

This paper introduces LongDS, a benchmark for evaluating AI agents on long-horizon, multi-turn data analysis, built from 68 tasks drawn from real-world Kaggle notebooks. State-of-the-art models reach only 48.45% accuracy, and performance drops significantly in later turns, which points to a critical failure to track the evolving analytical context.
