reward hacking: AI articles, news, and research

RESEARCH · arXiv CS.AI · 27d ago

Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack

This paper presents BenchJack, an automated system for auditing AI agent benchmarks to identify "reward hacking," where agents maximize scores without actually performing the task. It derives a taxonomy of failure patterns and uses a generative-adversarial pipeline to improve benchmark robustness.

red-teaming · reward hacking · security · Benchmarks