
More eval traces will not stabilize your kappa. Stratify the ones you have

DEV.to AI · June 9, 2026

LLM-as-judge agreement, measured with Cohen's kappa, swung from week to week even though the rubric had not changed. Increasing the sample size did not stabilize it. Stratifying the existing samples by score class and failure dimension dramatically reduced the variance, which shows that the composition of the sample, not its volume, was the key factor.
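The idea can be sketched in a few lines of Python. This is a minimal illustration, not the article's implementation: the trace fields `human`, `judge`, and `dimension`, the helper names, and the per-stratum quota are all assumptions made for the example. It computes Cohen's kappa from two label lists and draws a fixed number of traces from each (score class, failure dimension) stratum, so the label mix stays the same from one evaluation run to the next.

```python
import random
from collections import Counter, defaultdict


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length sequences of labels."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(a)
    # Fraction of items on which the two raters agree.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def stratified_sample(traces, key, per_stratum, rng):
    """Draw up to `per_stratum` traces from each stratum given by key(trace)."""
    strata = defaultdict(list)
    for trace in traces:
        strata[key(trace)].append(trace)
    sample = []
    for name in sorted(strata):  # sorted for a reproducible stratum order
        group = strata[name]
        sample.extend(rng.sample(group, min(per_stratum, len(group))))
    return sample


def stratum_key(trace):
    # Hypothetical fields: human score class and failure dimension.
    return (trace["human"], trace["dimension"])


def stratified_kappa(traces, per_stratum, seed=0):
    """Kappa between human and judge labels on a fixed-composition sample."""
    sample = stratified_sample(traces, stratum_key, per_stratum, random.Random(seed))
    humans = [t["human"] for t in sample]
    judges = [t["judge"] for t in sample]
    return cohen_kappa(humans, judges)
```

Kappa depends on how often each label occurs, so a sample whose score mix drifts between runs can move kappa even when the judge behaves the same way. Holding the quota per stratum fixed removes that source of variation; the right quota and stratum definition depend on the data at hand.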
