RESEARCH27

Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack

arXiv CS.AI·May 14, 2026

This paper introduces BenchJack, an automated system designed to audit AI agent benchmarks for "reward hacking," where agents maximize scores without performing the intended task. It derives a taxonomy of recurring flaw patterns and uses an iterative generative-adversarial pipeline to improve benchmark robustness.

red-teaming reward hacking security Benchmarks AI agents

Read original ↗