RESEARCH27
Escaping the Agreement Trap: Defensibility Signals for Evaluating Rule-Governed AI
arXiv CS.AI · April 25, 2026
This paper proposes a framework for evaluating rule-governed AI, particularly in content moderation, that moves beyond simple agreement metrics. It introduces three signals, the Defensibility Index (DI), the Ambiguity Index (AI), and the Probabilistic Defensibility Signal (PDS), to assess policy-grounded correctness and reasoning stability, and it uses LLM traces to verify logical derivability from the governing rules.