Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents?

arXiv cs.CL · May 20, 2026

Deep research agents automate complex information-seeking tasks, so their evaluation must be both scalable and reliable. Using LLMs as judges to supervise these agents raises reliability concerns of its own, which calls for a meta-evaluation aimed specifically at the judges.

REFLECT, meta-evaluation, evaluation, research agents, LLM judges