Article · DEV.to AI · 5d ago
Calibration set size for LLM-as-judge: when 50 traces is enough and when 200 is mandatory
The size of the human-labeled calibration set needed to validate an LLM-as-judge depends on label balance. Fifty stratified traces suffice for balanced binary criteria, but 200 or more are mandatory for rare-but-expensive categories such as safety violations, because the variance of the kappa estimate is dominated by the number of minority-class examples.
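The effect of label balance on kappa's variance can be checked with a small simulation. The sketch below is illustrative only and rests on assumptions that are not in the article: it uses Cohen's kappa, draws traces at random rather than stratifying them, and models the judge as agreeing with the human label 90% of the time regardless of class. The names `cohen_kappa` and `kappa_spread`, and the `accuracy`, `trials`, and `seed` parameters, are hypothetical.

```python
import random
import statistics


def cohen_kappa(human, judge):
    """Cohen's kappa for two equal-length lists of 0/1 labels."""
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    p_human = sum(human) / n
    p_judge = sum(judge) / n
    # Agreement expected by chance, from each rater's marginal positive rate.
    expected = p_human * p_judge + (1 - p_human) * (1 - p_judge)
    if expected == 1.0:
        # Both raters used a single class: kappa is undefined.
        # Returning 0.0 is a convention chosen for this sketch.
        return 0.0
    return (observed - expected) / (1 - expected)


def kappa_spread(n, prevalence, accuracy=0.9, trials=2000, seed=0):
    """Standard deviation of kappa over repeated simulated calibration sets.

    Assumption: the judge matches the human label with probability
    `accuracy`, independent of the class.
    """
    rng = random.Random(seed)
    kappas = []
    for _ in range(trials):
        human = [1 if rng.random() < prevalence else 0 for _ in range(n)]
        judge = [h if rng.random() < accuracy else 1 - h for h in human]
        kappas.append(cohen_kappa(human, judge))
    return statistics.stdev(kappas)


if __name__ == "__main__":
    # Balanced criterion at 50 traces, rare criterion at 50 and at 200.
    for n, prevalence in [(50, 0.5), (50, 0.05), (200, 0.05)]:
        spread = kappa_spread(n, prevalence)
        print(f"n={n:>3} prevalence={prevalence:.2f} kappa SD={spread:.3f}")
```

Under these assumptions, the spread of kappa at 5% prevalence with 50 traces should come out noticeably wider than in the balanced case, and raising the set to 200 traces should narrow it again. The exact figures depend on the assumed judge accuracy and on whether sampling is stratified.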