
Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents?

arXiv CS.CL · May 20, 2026

Deep research agents automate complex information-seeking tasks, so they need evaluation that is both scalable and reliable. Using LLMs as judges to supervise them raises concerns about the judges' own reliability, which calls for a meta-evaluation designed specifically for these judges.
