
UnpredictaBench: A Benchmark for Evaluating Distributional Randomness in LLMs

arXiv cs.CL · June 8, 2026

UnpredictaBench is a new benchmark that evaluates whether large language models can reproduce a true underlying distribution when sampled repeatedly, rather than collapsing toward a single answer. It comprises 448 problems and a KS@N metric that tests sampled outcomes against a range of target distributions.
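The summary does not define KS@N. As a rough illustration only, the sketch below assumes that "KS" refers to the Kolmogorov-Smirnov statistic and that N is the number of samples drawn per problem; the function names, the uniform target, and the simulated "model outputs" are all hypothetical and are not taken from the benchmark.

```python
import random


def ks_statistic(samples, cdf):
    """One-sample Kolmogorov-Smirnov statistic.

    Returns the largest gap between the empirical CDF of `samples`
    and the target CDF `cdf` (a value between 0 and 1).
    """
    xs = sorted(samples)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF steps from i/n to (i+1)/n at x, so check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


def uniform_cdf(x):
    """CDF of the Uniform(0, 1) target used in this illustration."""
    return min(1.0, max(0.0, x))


# Assumption: N = 100 samples per problem (the real N is not given in the summary).
n = 100
rng = random.Random(0)

# Stand-in for a model that samples well: values spread across [0, 1).
diverse = [rng.random() for _ in range(n)]

# Stand-in for a collapsed model: the same answer every time.
collapsed = [0.5] * n

d_diverse = ks_statistic(diverse, uniform_cdf)
d_collapsed = ks_statistic(collapsed, uniform_cdf)

print(f"diverse:   KS = {d_diverse:.3f}")
print(f"collapsed: KS = {d_collapsed:.3f}")
```

Under these assumptions, a model that always returns the same value scores far worse than one whose outputs are spread over the target range, which is the kind of collapse the benchmark is described as measuring.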

AI models LLMs evaluation Benchmarking randomness
