ARTICLE · DEV.to AI · 5d ago

Calibration set size for LLM-as-judge: when 50 traces is enough and when 200 is mandatory

How large a human-labeled calibration set you need to validate an LLM-as-judge depends on label balance. Fifty stratified traces suffice for balanced binary criteria, but rare-but-expensive categories such as safety violations require 200 or more, because the variance of the kappa estimate is dominated by the number of minority-class examples.
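
The effect of label balance on kappa's spread can be illustrated with a small simulation. This is a minimal sketch rather than the article's method: it assumes Cohen's kappa, a hypothetical judge that matches the human label 90% of the time regardless of class, and simple random sampling at the stated prevalence (not stratified sampling). The names `cohen_kappa` and `kappa_spread` are made up for illustration.

```python
import random
import statistics


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length lists of 0/1 labels."""
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    pa, pb = sum(a) / n, sum(b) / n
    # Chance agreement from each rater's marginal label rates.
    p_exp = pa * pb + (1 - pa) * (1 - pb)
    if p_exp == 1.0:
        return None  # undefined: both raters used a single label
    return (p_obs - p_exp) / (1 - p_exp)


def kappa_spread(n, prevalence, judge_accuracy=0.9, trials=2000, seed=0):
    """Mean and standard deviation of kappa over simulated calibration sets.

    Assumption: the judge copies the human label with probability
    `judge_accuracy` and flips it otherwise, independent of class.
    """
    rng = random.Random(seed)
    kappas = []
    for _ in range(trials):
        human = [1 if rng.random() < prevalence else 0 for _ in range(n)]
        judge = [h if rng.random() < judge_accuracy else 1 - h for h in human]
        k = cohen_kappa(human, judge)
        if k is not None:  # skip draws where kappa is undefined
            kappas.append(k)
    return statistics.mean(kappas), statistics.stdev(kappas)


# Compare a balanced criterion with a rare one at two set sizes.
for n, prevalence in [(50, 0.5), (50, 0.05), (200, 0.05)]:
    mean_k, sd_k = kappa_spread(n, prevalence)
    print(f"n={n:3d} prevalence={prevalence:.2f} "
          f"mean kappa={mean_k:.2f} sd={sd_k:.2f}")
```

Under these assumptions, the standard deviation of kappa at 5% prevalence is noticeably larger than at 50% prevalence for the same 50 traces, and it shrinks again when the set grows to 200. Stratifying to oversample the minority class, as the summary suggests, is another way to get more minority-class examples into the set.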

Tags: LLM-as-judge, Calibration, evaluation, sample size