← heapsort-ai

content moderation

23 items

RESEARCHarXiv CS.AI·4/25/2026

Escaping the Agreement Trap: Defensibility Signals for Evaluating Rule-Governed AI

This paper proposes a new framework for evaluating rule-governed AI, particularly in content moderation, by moving beyond simple agreement metrics. It introduces the Defensibility Index (DI), Ambiguity Index (AI), and Probabilistic Defensibility Signal (PDS) to assess policy-grounded correctness and reasoning stability, using LLM traces to verify logical derivability from governing rules.

27