reward hacking — AI articles, news & research

RESEARCH · arXiv CS.AI · 27d ago

Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack

This paper introduces BenchJack, an automated system designed to audit AI agent benchmarks for "reward hacking," where agents maximize scores without performing the intended task. It derives a taxonomy of recurring flaw patterns and uses an iterative generative-adversarial pipeline to improve benchmark robustness.

red-teaming · reward hacking · security · benchmarks