long-running tasks: articles, news, and AI research

RESEARCH · arXiv CS.AI · 4 days ago

SentinelBench: A Benchmark for Long-Running Monitoring Agents

SentinelBench is a new open-source benchmark for long-running monitoring tasks performed by AI agents. It aims to measure progress on tasks that require sustained attention rather than continuous action, across 100 tasks in 10 synthetic web environments.

monitoring · Benchmarking · long-running tasks · AI agents