ARTICLEDEV.to AI·3h ago
More eval traces will not stabilize your kappa. Stratify the ones you have
The content discusses the instability of LLM-as-judge agreement (Cohen's kappa) which swung weekly despite no rubric changes. Increasing sample size did not stabilize it; the solution was to stratify existing samples by score class and failure dimensions, which dramatically reduced variance, showing composition, not volume, was key.
62