
Calibration set size for LLM-as-judge: when 50 traces is enough and when 200 is mandatory

DEV.to AI · June 4, 2026

How large a human-labeled calibration set must be to validate an LLM-as-judge depends on label balance. Fifty stratified traces suffice for balanced binary criteria, but rare-but-expensive categories such as safety violations need 200 or more, because the variance of Cohen's kappa is dominated by the number of minority-class examples.
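The effect of label balance can be illustrated with a rough sketch. It assumes a hypothetical judge that agrees with the human label at the same rate (90%) on both classes, and it uses Cohen's simple large-sample approximation for the standard error of kappa. The function name `kappa_and_se`, the 90% agreement rate, and the 5% prevalence are illustrative assumptions, not figures from the article.

```python
import math


def kappa_and_se(prevalence, agreement, n):
    """Return (kappa, approximate SE) for a binary criterion.

    Assumptions (illustrative only):
    - `prevalence` is the share of traces humans label positive.
    - The judge matches the human label with probability `agreement`
      on both classes.
    - SE uses Cohen's large-sample approximation:
      sqrt(p_o * (1 - p_o) / (n * (1 - p_e)^2)).
    """
    # Share of traces the judge marks positive under the model above.
    judge_pos = prevalence * agreement + (1 - prevalence) * (1 - agreement)
    p_o = agreement  # observed agreement
    # Agreement expected by chance from the two sets of marginals.
    p_e = prevalence * judge_pos + (1 - prevalence) * (1 - judge_pos)
    kappa = (p_o - p_e) / (1 - p_e)
    se = math.sqrt(p_o * (1 - p_o) / (n * (1 - p_e) ** 2))
    return kappa, se


scenarios = [
    ("balanced, n=50", 0.50, 50),
    ("5% positive, n=50", 0.05, 50),
    ("5% positive, n=200", 0.05, 200),
]
for label, prevalence, n in scenarios:
    kappa, se = kappa_and_se(prevalence, 0.9, n)
    minority = n * min(prevalence, 1 - prevalence)
    print(f"{label}: ~{minority:.1f} minority examples, "
          f"kappa={kappa:.2f}, SE~{se:.2f}")
```

Under these assumptions, 50 traces at 5% prevalence contain only about 2.5 minority-class examples on average, and the approximate standard error of kappa is roughly three times that of the balanced case. Raising the set to 200 traces halves that error, though it still stays above the balanced 50-trace figure.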

LLM-as-judge Calibration evaluation sample size Cohen's Kappa
