Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents?
arXiv CS.CL · May 20, 2026
Deep research agents automate complex information-seeking tasks, so their evaluation must be both scalable and reliable. Using LLMs as judges for this supervision raises reliability concerns of its own, which calls for a meta-evaluation aimed specifically at these judges.