long-running tasks — AI articles, news & research

RESEARCH · arXiv CS.AI · 4d ago

SentinelBench: A Benchmark for Long-Running Monitoring Agents

SentinelBench is a new open-source benchmark for AI agents that perform long-running monitoring tasks. It aims to measure progress on tasks that require sustained attention rather than continuous action, and comprises 100 tasks across 10 synthetic web environments.

monitoring · Benchmarking · long-running tasks · AI agents