RESEARCH

Escaping the Agreement Trap: Defensibility Signals for Evaluating Rule-Governed AI

arXiv cs.AI · April 25, 2026

This paper proposes a framework for evaluating rule-governed AI, particularly in content moderation, that moves beyond simple agreement metrics. It introduces three signals, the Defensibility Index (DI), the Ambiguity Index (AI), and the Probabilistic Defensibility Signal (PDS), to assess policy-grounded correctness and reasoning stability, and it uses LLM traces to verify that decisions are logically derivable from the governing rules.

LLMs · content moderation · AI ethics · AI evaluation
