RESEARCH · arXiv CS.AI · 4d ago
SentinelBench: A Benchmark for Long-Running Monitoring Agents
SentinelBench is a new open-source benchmark for long-running monitoring agents. It aims to measure progress on tasks that require sustained attention rather than continuous action, spanning 100 tasks across 10 synthetic web environments.